Monday, January 03, 2022

Ruminating on high dimensional data

In simple terms, the dimensions of a dataset are its attributes (or features), and its dimensionality is the number of those attributes. This concept is not new; it is quite common in the data warehousing world, as explained here

Many datasets, such as those in healthcare, signal processing, and bioinformatics, have a large number of features (variables/attributes). When the number of dimensions is staggeringly high, ML calculations become extremely difficult. The number of features can even exceed the number of observations (records in a dataset): for example, microarrays, which measure gene expression, can contain hundreds of samples/records, but each record can contain measurements for tens of thousands of genes.

In such high-dimensional data, we experience something called the "curse of dimensionality": all records appear sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient. The volume of the feature space grows exponentially with the number of dimensions, so a fixed number of records covers it more and more sparsely, and the predictive performance of an ML model trained on those records tends to degrade.
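One way to see this sparsity effect is to measure how distances behave as dimensions are added. The sketch below is only an illustration under stated assumptions: the points are drawn uniformly at random from the unit hypercube, and `relative_contrast` is a hypothetical helper name, not something from a library. It reports how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest for distances from one random
    query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


# Synthetic data only: compare the contrast at several dimensionalities.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(500, d), 3))
```

With this kind of synthetic data the contrast typically shrinks as the number of dimensions grows, meaning the "nearest" and "farthest" records become almost equally far away, which is one reason distance-based methods struggle in high dimensions.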

A typical rule of thumb is that there should be at least 5 training examples for each dimension in the dataset. Another interesting excerpt from Wikipedia is given below:   

In machine learning and insofar as predictive performance is concerned, the curse of dimensionality is used interchangeably with the peaking phenomenon, which is also known as Hughes phenomenon. This phenomenon states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased but beyond a certain dimensionality it starts deteriorating instead of improving steadily.
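The rule of thumb of at least 5 training examples per dimension can be turned into a quick sanity check. This is a minimal sketch: `meets_rule_of_thumb` is a hypothetical helper, and the sample and feature counts used below are made-up numbers chosen only to match the "hundreds of samples, tens of thousands of genes" scale of the microarray example.

```python
def meets_rule_of_thumb(n_samples, n_features, samples_per_feature=5):
    """Return True if there are at least samples_per_feature training
    examples for every feature (dimension) in the dataset."""
    return n_samples >= samples_per_feature * n_features


# Illustrative numbers only (not taken from a real dataset).
print(meets_rule_of_thumb(n_samples=300, n_features=20_000))  # microarray-like
print(meets_rule_of_thumb(n_samples=500, n_features=100))     # modest tabular data
```

A dataset that fails this check is not unusable, but it is a hint that some form of dimension reduction may be worth considering before training.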

To handle high-dimensional datasets, data scientists typically apply dimension reduction techniques such as feature selection and feature projection. More information about dimension reduction can be found here
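As a rough illustration of feature selection, the sketch below drops features whose values barely change across records, since a constant column carries no information for prediction. This is just one simple selection criterion (a variance threshold), assumed here for illustration; `select_by_variance` and the tiny dataset are hypothetical.

```python
from statistics import pvariance


def select_by_variance(rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds threshold.

    Returns the kept column indices and the reduced rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    kept = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in kept] for row in rows]
    return kept, reduced


# Toy data: column 0 is constant, columns 1 and 2 vary.
rows = [
    [1.0, 0.0, 5.2],
    [1.0, 1.0, 4.8],
    [1.0, 0.0, 5.1],
    [1.0, 1.0, 4.9],
]
kept, reduced = select_by_variance(rows)
print(kept)  # prints [1, 2]
```

Feature projection works differently: instead of keeping a subset of the original columns, it builds a smaller set of new features from combinations of them (principal component analysis is a common example).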
