Raw data often contains irrelevant, missing, or incorrect parts, and data cleaning is the step that handles these quality issues.
It "cleans" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data Quality Issues
- Missing values
- Duplicate data
- Inconsistent / Invalid data
Missing values arise when some entries in the dataset are absent. They can be handled in various ways.
Some of them are:
- Ignore the tuples: This approach is suitable only when the dataset is quite large and a tuple is missing multiple values.
- Fill in the missing values: There are various ways to do this. You can fill in the missing values manually, with the attribute mean, or with the most probable value.
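The two options above can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: the sample records, the `None` marker for a missing value, and the helper names `drop_sparse` and `fill_missing` are all assumptions made for the example, and "most probable value" is approximated by the most frequent value.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Pune"},
    {"age": None, "city": None},
]

def drop_sparse(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing values."""
    return [r for r in rows
            if sum(v is None for v in r.values()) <= max_missing]

def fill_missing(rows, column, value):
    """Return copies of the rows with missing entries in column set to value."""
    return [{**r, column: value if r[column] is None else r[column]}
            for r in rows]

# The last row is missing both values, so it is dropped.
kept = drop_sparse(rows)

# Attribute mean for the numeric column, most frequent value for the
# categorical one (used here as a stand-in for "most probable value").
age_mean = mean([r["age"] for r in kept if r["age"] is not None])
city_mode = mode([r["city"] for r in kept if r["city"] is not None])

cleaned = fill_missing(fill_missing(kept, "age", age_mean), "city", city_mode)
```

Dropping is applied first so that a tuple with almost no information does not influence the mean or the mode used for filling.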
Duplicate data occurs when your dataset contains redundant data objects.
Duplicate data can be corrected by deleting the older records or by merging the duplicate records.
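Both corrections can be sketched as below. The record layout is hypothetical: the example assumes an `id` field that identifies the underlying object and an `updated` counter where a larger number means a newer record. Real data may need a fuzzier notion of "same object".

```python
# Hypothetical customer records; "id" identifies the real-world object
# and "updated" is a version counter (larger means newer).
records = [
    {"id": 1, "email": "a@example.com", "phone": None, "updated": 1},
    {"id": 2, "email": "b@example.com", "phone": "555-0101", "updated": 1},
    {"id": 1, "email": None, "phone": "555-0100", "updated": 2},
]

def keep_latest(records):
    """Delete old data: keep only the newest record for each id."""
    latest = {}
    for rec in records:
        current = latest.get(rec["id"])
        if current is None or rec["updated"] > current["updated"]:
            latest[rec["id"]] = rec
    return list(latest.values())

def merge_duplicates(records):
    """Merge duplicates: newer non-missing fields override older ones."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        target = merged.setdefault(rec["id"], {})
        target.update({k: v for k, v in rec.items() if v is not None})
    return list(merged.values())

latest = keep_latest(records)
merged = merge_duplicates(records)
```

Note the trade-off: `keep_latest` loses the email address of customer 1 because the newer record lacks it, while `merge_duplicates` keeps both the email and the phone number.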
Inconsistent / Invalid data
Invalid data is an impossible value for a feature.
- Ex: an age of -6
- An income of -10000
- A 7-digit zip code in India, where zip codes have exactly 6 digits
These errors occur primarily due to data entry errors.
To solve this:
- Use external knowledge bases to get the right values.
- Apply reasoning and domain knowledge to come up with a reasonable estimate.
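Before invalid values can be corrected, they have to be found. One way to encode the domain knowledge from the examples above is a small set of validity rules, sketched here. The field names and the `RULES` table are assumptions for illustration; the rules themselves (no negative age or income, a 6-digit Indian zip code) come from the examples.

```python
import re

# Assumed rules, taken from the examples above: age and income cannot be
# negative, and an Indian zip code has exactly 6 digits.
RULES = {
    "age": lambda v: isinstance(v, int) and v >= 0,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
    "zip_code": lambda v: isinstance(v, str)
                          and re.fullmatch(r"[0-9]{6}", v) is not None,
}

def invalid_fields(record):
    """Return the names of the fields whose values break a rule."""
    return [name for name, is_valid in RULES.items()
            if name in record and not is_valid(record[name])]

print(invalid_fields({"age": -6, "income": -10000, "zip_code": "1100011"}))
# prints ['age', 'income', 'zip_code']
print(invalid_fields({"age": 42, "income": 55000, "zip_code": "110001"}))
# prints []
```

The flagged fields can then be corrected from an external source or estimated with domain knowledge, as described above.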
Noisy data is meaningless data that can’t be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc.
To solve this:
- Filter out the noise component
- This may result in partial loss of data if not done carefully.
It can also be handled with the following methods:
- Binning Method: This method works on sorted data in order to smooth it. The data is divided into segments of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or each value can be replaced by the nearest segment boundary value.
- Regression: Here data is smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
- Clustering: This approach groups similar data into clusters. Outliers fall outside the clusters, although some may go undetected.
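The binning method described above can be sketched as follows, covering both smoothing by segment means and smoothing by segment boundaries. The sample values and function names are assumptions for the example, and ties between the two boundaries are resolved toward the lower one, which is an arbitrary choice.

```python
def equal_size_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum.

    Each bin is sorted, so b[0] is the minimum and b[-1] the maximum.
    Ties go to the lower boundary.
    """
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_size_bins(prices, 3)

print(smooth_by_means(bins))
# prints [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))
# prints [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```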
An outlier is a data object that differs considerably from the general behavior of the data.
Points to note:
- Algorithms like Linear Regression, K-Nearest Neighbor, and AdaBoost are sensitive to noise and outliers.
- Outliers can significantly skew the distribution of your data.
- Outliers can be identified using summary statistics and plots of the data.
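As one example of identifying outliers with summary statistics, the sketch below uses the interquartile range (IQR). The 1.5 × IQR threshold is a common convention and not something prescribed here, and the sample ages are made up. `statistics.quantiles` requires Python 3.8 or later.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    spread = q3 - q1                     # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical ages with one suspicious entry.
ages = [22, 25, 27, 29, 30, 31, 33, 35, 120]
print(iqr_outliers(ages))
# prints [120]
```

Whether a flagged value is an error or a genuine extreme still needs a human decision; the statistic only points at candidates.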
How to Clean Data?
- Handle missing values:
  - Ignore the tuple
  - Fill in the missing value manually
  - Use a global constant to fill in the missing value
  - Use the attribute mean to fill in the missing value
  - Use the attribute mean for all samples belonging to the same class as the given tuple
  - Use the most probable value to fill in the missing value
- Handle noisy data:
  - Binning: Binning methods smooth a sorted data value by consulting its "neighborhood".
  - Regression: Data can be smoothed by fitting the data to a function, such as with regression.
  - Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters."
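The regression option for noisy data can be sketched with a simple least-squares line fit, where each noisy value is replaced by the value the fitted line predicts. This is a minimal sketch with one independent variable; the sample points and the function names are assumptions for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def smooth(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)   # roughly 1.97 and 0.09
smoothed = smooth(xs, ys)
```

The smoothed values lie exactly on the fitted line, so this only makes sense when a linear relationship between the two attributes is a reasonable assumption.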