Data Classification

What is data?


Data is a collection of objects, each described by its attributes (also called features).

[Figure: machine learning data]

Data Type


Data types are an important concept in statistics. You need to understand them to apply statistical measurements to data correctly and to draw valid conclusions about it. This section covers the data types you need to know to do proper exploratory data analysis (EDA), which is one of the most underestimated parts of a machine learning project.

We need to know which data type we are dealing with to choose the right visualization method. Think of data types as a way to categorize different types of variables.

[Figure: data types]

I. Categorical Data

Categorical data represents characteristics, such as a person's gender or language. Categorical data can also be coded with numbers (for example, 1 for female and 0 for male). Note that those numbers have no mathematical meaning.

1. Nominal Data

Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as "labels". Note that nominal data has no order, so changing the order of its values does not change the meaning. You can see two examples of nominal features below:

[Figure: two examples of nominal features]
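
As a minimal sketch of what "no order" means in practice, the snippet below counts category frequencies for a hypothetical `languages` feature (the values are made up for illustration). Reordering the observations leaves the counts and the most frequent category unchanged.

```python
from collections import Counter

# Hypothetical nominal feature: the language spoken by each person.
languages = ["English", "French", "German", "English", "French", "English"]

# Counting categories and taking the most frequent one (the mode)
# are meaningful operations on nominal data.
counts = Counter(languages)
mode = counts.most_common(1)[0][0]

# Reordering the observations changes nothing about the counts,
# because the categories carry no order.
reordered_counts = Counter(sorted(languages, reverse=True))

print(counts)
print("mode:", mode)
print("same counts after reordering:", counts == reordered_counts)
```

Operations such as a mean or a median are not defined here, because the labels cannot be ranked or added.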

2. Binary Data

Binary data is nominal data with only two possible values.

  • Symmetric binary: both outcomes are equally important. Ex: gender.
  • Asymmetric binary: the outcomes are not equally important. Ex: a medical test (positive, negative), such as a pregnancy test. Assign 1 to the most important outcome.

[Figure: binary data]
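
One place where the symmetric/asymmetric distinction shows up is when comparing two records made of binary attributes. The sketch below is an illustration that is not part of the original text: it contrasts the simple matching coefficient (which counts 0-0 matches) with the Jaccard coefficient (which ignores them). The patient vectors and function names are hypothetical.

```python
def match_counts(a, b):
    """Count 1-1 matches, 0-0 matches, and mismatches of two 0/1 vectors."""
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    f00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    mismatches = len(a) - f11 - f00
    return f11, f00, mismatches


def simple_matching(a, b):
    """Similarity for symmetric binary attributes: both matches count."""
    f11, f00, mismatches = match_counts(a, b)
    return (f11 + f00) / (f11 + f00 + mismatches)


def jaccard(a, b):
    """Similarity for asymmetric binary attributes: 0-0 matches are ignored."""
    f11, _, mismatches = match_counts(a, b)
    denominator = f11 + mismatches
    # Assumption: two all-zero vectors are given similarity 0.0 here.
    return f11 / denominator if denominator else 0.0


# Hypothetical test results for two patients (1 = positive, 0 = negative).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 0, 0, 0, 0]

print("simple matching:", simple_matching(patient_a, patient_b))
print("jaccard:", jaccard(patient_a, patient_b))
```

With these vectors the simple matching coefficient is 5/6 while the Jaccard coefficient is 0.5, because the four shared negatives are treated as uninformative in the asymmetric case.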

3. Ordinal Data

Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters. You can see an example below:

[Figure: example of an ordinal feature]

Ordinal scales are usually used to measure non-numeric features such as happiness, customer satisfaction, and so on.
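
A small sketch of working with an ordinal feature, assuming a hypothetical four-level satisfaction scale. Because the levels are ordered, sorting and taking a median are meaningful, even though the levels are not numbers.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
SCALE = ["unhappy", "neutral", "happy", "very happy"]
RANK = {level: position for position, level in enumerate(SCALE)}

# Made-up survey responses.
responses = ["happy", "unhappy", "very happy", "neutral", "happy"]

# Sorting uses the rank of each level, not alphabetical order.
ranked = sorted(responses, key=RANK.__getitem__)

# With an odd number of responses, the median is the middle element.
median_response = ranked[len(ranked) // 2]

print(ranked)
print("median:", median_response)
```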

II. Numerical Data

1. Interval Data

Interval values represent ordered units with equal spacing between them. We therefore speak of interval data when a variable contains numeric values that are ordered and where we know the exact differences between the values. An example would be a feature that contains the temperature of a given place, as you can see below:

[Figure: example of an interval feature (temperature)]

The problem with interval data is that it has no "true zero". In our example, 0 degrees on the Celsius or Fahrenheit scale is an arbitrary reference point and does not mean "no temperature". With interval data we can add and subtract, but we cannot meaningfully multiply, divide, or calculate ratios. Because there is no true zero, statistics that rely on ratios, such as the coefficient of variation, cannot be applied.
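
To see why ratios break down, here is a minimal sketch using two made-up temperatures. The "twice as warm" ratio depends entirely on the scale chosen, while the difference between the readings converts consistently.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


def celsius_to_kelvin(celsius):
    return celsius + 273.15


# Two hypothetical readings in degrees Celsius.
cold_c, warm_c = 10.0, 20.0

# The ratio changes with the scale, so it carries no physical meaning
# on the Celsius or Fahrenheit scales.
ratio_c = warm_c / cold_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# The difference is meaningful: it only rescales by the unit factor 9/5.
diff_c = warm_c - cold_c
diff_f = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cold_c)

print(f"ratio in Celsius:    {ratio_c:.2f}")
print(f"ratio in Fahrenheit: {ratio_f:.2f}")
print(f"ratio in Kelvin:     {ratio_k:.2f}")
print(f"difference: {diff_c} C = {diff_f} F")
```

The ratio is 2.0 in Celsius, 1.36 in Fahrenheit, and about 1.04 in Kelvin. Only the Kelvin scale has a true zero, so only that last ratio is physically meaningful.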

2. Ratio Data

Ratio values are also ordered units with equal spacing between them. Ratio values are the same as interval values, except that they do have an absolute zero, so multiplication, division, and ratios are meaningful. Good examples are height, weight, and length.

[Figure: examples of ratio features]
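
As a short sketch of what the absolute zero buys us, the ratio between two made-up weights is the same whether they are expressed in kilograms or pounds. The conversion factor below is approximate.

```python
import math

KG_TO_LB = 2.20462  # approximate conversion factor

# Two hypothetical weights in kilograms.
weight_a_kg, weight_b_kg = 40.0, 80.0

# Because zero means "no weight" in both units, the ratio is unit-independent.
ratio_kg = weight_b_kg / weight_a_kg
ratio_lb = (weight_b_kg * KG_TO_LB) / (weight_a_kg * KG_TO_LB)

print("ratio in kg:", ratio_kg)
print("ratio in lb:", ratio_lb)
print("same ratio:", math.isclose(ratio_kg, ratio_lb))
```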

Discrete and Continuous Attributes

1. Discrete

We speak of discrete data if its values are distinct and separate. In other words: We speak of discrete data if the data can only take on certain values. This type of data can't be measured but it can be counted. It basically represents information that can be categorized into a classification. An example is the number of heads in 100 coin flips.

A discrete attribute has a finite or countably infinite set of values, which are often represented as integers.
Ex: number of employees, the set of words in a document collection.

Note: Binary attributes are a special case of discrete attributes.
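
The coin-flip example can be simulated in a few lines. This is only a sketch: the seed is an arbitrary choice to make the run repeatable. The result is always a whole number between 0 and 100, which is what makes it discrete.

```python
import random

# Fixed seed (arbitrary) so that repeated runs give the same flips.
rng = random.Random(0)

# Simulate 100 flips of a fair coin.
flips = [rng.choice("HT") for _ in range(100)]

# The number of heads is counted, not measured.
heads = flips.count("H")

print("heads in 100 flips:", heads)
```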

2. Continuous

Continuous data represents measurements, so its values can't be counted but they can be measured. An example would be the height of a person, which you can describe using intervals on the real number line.

Summary:

Visualisation methods: To visualise nominal data, you can use a pie chart or a bar chart. To visualise continuous data, you can use a histogram or a box plot.
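
The numbers behind a histogram and a box plot can be computed without a plotting library. The sketch below uses made-up heights and an assumed bin width of 10 cm. A plotting library would draw the same bins and quartiles graphically.

```python
import statistics
from collections import Counter

# Hypothetical continuous feature: heights in centimetres.
heights_cm = [158.2, 162.5, 165.0, 167.3, 170.1,
              171.8, 174.4, 176.0, 180.9, 185.6]

# Histogram: count how many observations fall into each fixed-width bin.
bin_width = 10  # assumed bin width
bins = Counter(int(h // bin_width) * bin_width for h in heights_cm)

for start in sorted(bins):
    print(f"{start}-{start + bin_width} cm: {'#' * bins[start]}")

# Box plot: the box is drawn from the three quartiles,
# and the whiskers are based on the spread of the data.
q1, median, q3 = statistics.quantiles(heights_cm, n=4)
print("min:", min(heights_cm), "max:", max(heights_cm))
print("quartiles:", q1, median, q3)
```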

In data science, you can use one-hot encoding to transform nominal data into numeric features.
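
A minimal, library-free sketch of one-hot encoding follows. The `one_hot_encode` helper and the `colors` values are hypothetical. Each category becomes its own 0/1 column, so no artificial order is introduced.

```python
def one_hot_encode(values):
    """Return the category list and one 0/1 row per input value."""
    # Sorting only fixes the column order; it does not imply a ranking.
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1  # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows


# Hypothetical nominal feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```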

In data science, you can use label encoding to transform ordinal data into a numeric feature, mapping each level to an integer that preserves the order.
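
A small sketch of label encoding for an ordinal feature, assuming a hypothetical size scale. The `label_encode` helper is illustrative and takes the order explicitly, so the integer codes reflect the ranking of the levels.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
SIZE_ORDER = ["small", "medium", "large"]


def label_encode(values, order):
    """Map each level to its position in the given order."""
    codes = {level: position for position, level in enumerate(order)}
    return [codes[value] for value in values]


sizes = ["medium", "small", "large", "medium"]
encoded_sizes = label_encode(sizes, SIZE_ORDER)

print(encoded_sizes)
```

Passing the order explicitly matters: assigning codes alphabetically would put "large" before "medium" before "small" and lose the ranking.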