Thursday, December 30, 2021

ML Data Preparation - Impute missing values

 In any AI project, data plumbing/engineering takes up 60-80% of the effort. Before training a model on the input dataset, we have to cleanse and standarize the dataset.

One common challenge is around missing values (features) in the dataset. We can either ignore these records or try to "impute" the value. The word "impute" means to assign a value to something by inference. 

Since most AI models cannot work with blanks or NaN values (for numerical features), it is important to impute values of missing features in a recordset.  

Given below are some of the techniques used to impute the value of missing features in the dataset. 

  • Numerical features: We can use the MEAN value of the feature or the MEDIAN value. 
  • Categorical features: For categorical features, we can use the most frequent value in the training dataset or we can use a constant like 'NA' or 'Not Available'. 

No comments:

Post a Comment