Data Reduction

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.

Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form. The basic concept is the reduction of multitudinous amounts of data down to the meaningful parts.

When information is derived from instrument readings there may also be a transformation from analog to digital form. When the data are already in digital form the 'reduction' of the data typically involves some editing, scaling, encoding, sorting, collating, and producing tabular summaries. When the observations are discrete but the underlying phenomenon is continuous then smoothing and interpolation are often needed. Often the data reduction is undertaken in the presence of reading or measurement errors.

Need for data reduction

A database or data warehouse may store terabytes of data, and complex data analysis or mining may take a very long time to run on the complete data set. Data reduction is needed to make such analysis practical.

Data Reduction Strategies

  • Data cube aggregation
  • Attribute Subset Selection
  • Numerosity reduction, e.g., fit data into models
  • Dimensionality reduction (data compression)
  • Discretization and concept hierarchy generation

I. Data cube aggregation

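Aggregation operations are applied to the data to construct a data cube. For example, quarterly sales figures can be summed into annual totals, giving a smaller data set that still answers questions asked at the annual level.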
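The roll-up idea can be sketched in a few lines of Python. This is only an illustration: the sales records, the column layout (year, quarter, branch, amount), and the helper name aggregate are assumptions made for the example, not part of any particular data warehouse product.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, branch, amount).
sales = [
    (2023, "Q1", "A", 224.0),
    (2023, "Q2", "A", 408.0),
    (2023, "Q3", "A", 350.0),
    (2023, "Q4", "A", 586.0),
    (2024, "Q1", "A", 300.0),
    (2024, "Q2", "A", 416.0),
    (2024, "Q1", "B", 120.0),
    (2024, "Q2", "B", 180.0),
]


def aggregate(records, key_indices, value_index=3):
    """Sum the value column over every combination of the chosen key columns."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[i] for i in key_indices)
        totals[key] += record[value_index]
    return dict(totals)


# Roll up from (year, quarter, branch) to (year, branch): quarters are summed away.
by_year_branch = aggregate(sales, key_indices=(0, 2))

# Roll up further to year only: branches are summed away as well.
by_year = aggregate(sales, key_indices=(0,))
```

Each roll-up keeps fewer rows than the level below it, which is the sense in which aggregation reduces the data: eight quarterly records become three (year, branch) totals and then two yearly totals.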

II. Attribute Subset Selection

1. Feature selection (i.e., attribute subset selection):

  • Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
  • Reduce the number of attributes appearing in the discovered patterns, making the patterns easier to understand

2. Heuristic methods (needed because d attributes have 2^d possible subsets, so exhaustive search is impractical):

  • Step-wise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent step, the best of the remaining original attributes is added to the set.
  • Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
  • Combining forward selection and backward elimination: This combines the two approaches above so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
  • Decision-tree induction: It constructs a flowchart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. When decision-tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree forms the reduced subset of attributes.
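Step-wise forward selection, as described in the list above, can be sketched as a greedy loop. The notes do not say how "best" is measured, so the scoring function here (training accuracy of a majority-vote lookup table, named lookup_accuracy) and the stopping rule (stop when no attribute improves the score) are assumptions chosen only to keep the example self-contained. The toy data are also hypothetical.

```python
from collections import Counter, defaultdict


def lookup_accuracy(rows, labels, attrs):
    """Training accuracy of a majority-vote lookup table keyed on `attrs`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[a] for a in attrs)][label] += 1
    # Within each group the majority label is predicted for every record.
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(rows)


def forward_selection(rows, labels, score=lookup_accuracy, min_gain=1e-9):
    """Greedily add the attribute that raises the score most; stop when none does."""
    remaining = list(range(len(rows[0])))
    selected = []
    best_score = score(rows, labels, selected)  # score of the empty set
    while remaining:
        candidates = [(score(rows, labels, selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score - best_score < min_gain:
            break  # no remaining attribute improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected, best_score


# Hypothetical data: attribute 0 determines the class, attribute 1 is noise,
# attribute 2 is constant.
rows = [
    ("sunny", "x", 1),
    ("sunny", "y", 1),
    ("rainy", "x", 1),
    ("rainy", "y", 1),
    ("sunny", "x", 1),
    ("rainy", "y", 1),
]
labels = ["yes", "yes", "no", "no", "yes", "no"]

selected, best_score = forward_selection(rows, labels)
```

On this toy data the loop picks attribute 0 and then stops, because neither remaining attribute improves the score. Backward elimination would run the same loop in reverse, starting from all attributes and removing the one whose removal hurts the score least.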

III. Numerosity Reduction

  • Reduce data volume by choosing alternative, smaller forms of data representation
  • Parametric methods: Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
    Example: log-linear models, which obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces
  • Non-parametric methods: Store a reduced representation of the data without assuming a model. Major families: histograms, clustering, sampling
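The two families above can be contrasted with a small Python sketch. The parametric half uses a least-squares straight line rather than a log-linear model, purely because it is the simplest model to show; the non-parametric half builds an equal-width histogram. The function names and the sample values are assumptions for illustration, and the histogram helper assumes the values are not all identical.

```python
def fit_line(xs, ys):
    """Parametric: least-squares fit y ~ slope * x + intercept.

    Only the two parameters need to be stored in place of the points.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def equal_width_histogram(values, num_buckets):
    """Non-parametric: replace the values with counts over equal-width buckets."""
    low, high = min(values), max(values)
    width = (high - low) / num_buckets  # assumes high > low
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that the maximum value falls into the last bucket.
        index = min(int((v - low) / width), num_buckets - 1)
        counts[index] += 1
    return low, width, counts


# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])

# Hypothetical price list reduced to three bucket counts.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 15, 15, 18, 20, 21, 25, 25, 28, 30, 30]
low, width, counts = equal_width_histogram(prices, num_buckets=3)
```

In the first case five points are replaced by two numbers; in the second, twenty values are replaced by a starting point, a bucket width, and three counts. The histogram loses the individual values but makes no assumption about their distribution.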

IV. Dimensionality Reduction

This strategy reduces the size of the data by applying encoding mechanisms. The reduction can be lossless or lossy: if the original data can be reconstructed exactly from the compressed data, the reduction is lossless; if only an approximation can be reconstructed, it is lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
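The difference between lossless and lossy reduction can be illustrated with a minimal Haar wavelet sketch. It uses the simple unnormalized form (pairwise averages and half-differences) and assumes the input length is a power of two; the function names, the threshold, and the sample vector are illustrative choices, not a reference implementation.

```python
def haar_forward(values):
    """Full Haar decomposition; len(values) must be a power of two."""
    coeffs = [float(v) for v in values]
    length = len(coeffs)
    while length > 1:
        half = length // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:length] = averages + details
        length = half
    return coeffs


def haar_inverse(coeffs):
    """Rebuild the signal from the coefficients produced by haar_forward."""
    values = list(coeffs)
    length = 1
    while length < len(values):
        restored = []
        for i in range(length):
            average, detail = values[i], values[length + i]
            restored.extend([average + detail, average - detail])
        values[: 2 * length] = restored
        length *= 2
    return values


def truncate(coeffs, threshold):
    """Lossy step: zero every detail coefficient smaller than the threshold."""
    return [coeffs[0]] + [c if abs(c) >= threshold else 0.0 for c in coeffs[1:]]


data = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_forward(data)

# Keeping every coefficient allows exact reconstruction (lossless).
exact = haar_inverse(coeffs)

# Dropping small coefficients gives only an approximation (lossy).
approx = haar_inverse(truncate(coeffs, threshold=1.0))
max_error = max(abs(a - d) for a, d in zip(approx, data))
```

With all coefficients kept, the inverse transform returns the original vector. After truncation, the zeroed coefficients can be stored sparsely, so less data is kept, but the reconstruction differs from the original by up to max_error.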