In machine learning workflows, the standard practice is to split datasets into training and testing subsets before applying most preprocessing transformations to prevent data leakage.
However, certain preliminary data cleaning operations may be performed safely on the entire dataset beforehand, as they do not depend on statistical summaries or introduce information from the test set into the training process.
Given below are examples of preprocessing steps that can safely be done before splitting:
- Removing duplicate rows.
- Fixing data types, e.g. parsing date strings into dates.
- Removing invalid or impossible values, e.g. age > 150.
- Trimming whitespace from strings.
Thus, as long as a cleaning step does not rely on statistics computed from the data (for example, to impute missing values), it can be performed before the train/test split.
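As a sketch, the cleaning steps listed above could look like this with pandas, using a small hypothetical DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical raw dataset showing the kinds of issues listed above
df = pd.DataFrame({
    "name": ["  Alice ", "Bob", "Bob", "Carol"],
    "age": [34, 29, 29, 172],  # 172 is an impossible value
    "signup_date": ["2023-01-05", "2023-02-10", "2023-02-10", "2023-03-01"],
})

# 1. Remove exact duplicate rows
df = df.drop_duplicates()

# 2. Fix data types, e.g. parse date strings into datetimes
df["signup_date"] = pd.to_datetime(df["signup_date"])

# 3. Remove impossible values, e.g. age > 150
df = df[df["age"] <= 150]

# 4. Trim whitespace from string columns
df["name"] = df["name"].str.strip()
```

None of these steps use a statistic computed over the rows, which is why they are safe to run on the full dataset before splitting.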
Operations that involve data-derived statistics, such as imputation with means or medians, standardization, frequency-based encoding, or percentile-based outlier removal, must be fitted exclusively on the training set. This kind of preprocessing should only be done after splitting; otherwise you end up with something called 'data leakage'.
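A minimal sketch of the correct order of operations, assuming scikit-learn and a made-up one-feature dataset: split first, then fit the statistic-based transformers on the training data only and reuse those fitted statistics on the test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix (with a missing value) and target
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# Split first, so the test rows never influence any fitted statistic
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Fit the imputer and scaler on the training data only...
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
X_train = scaler.fit_transform(imputer.fit_transform(X_train))

# ...and apply the training-set statistics to the test data
X_test = scaler.transform(imputer.transform(X_test))
```

In practice you can wrap such steps in a scikit-learn Pipeline so that fitting on the training set and transforming the test set happen in the right order automatically.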
So what exactly is data leakage? You can understand it with the following analogy.
- Imagine you're studying for an exam.
- You’re supposed to practice using your textbook (training data) and then take the exam (test data) to see how well you’ve learned.
- Now imagine someone secretly shows you some of the exam questions while you’re studying.
- When you take the test, you score really high — but not because you truly understood the material. You just recognized the questions. That’s data leakage!!!
In simple terms:
- The training data is what the model learns from.
- The test data is supposed to check how well it learned.
- If information from the test data sneaks into training, the model gets an unfair advantage.
- It looks like it performs very well.
- But when you give it completely new data in the real world, performance drops.
- So data leakage makes the model look smarter than it actually is — and that’s dangerous because it won’t work as well in real-life situations.
Here is an example where imputation is done before splitting, causing leakage.
- Suppose you are building a model to predict house prices, and the dataset contains missing values in the feature “Lot Size.”
- You calculate the mean lot size using the entire dataset (including both training and test data) and use that value to fill in all missing entries.
- After performing this imputation, you split the data into training and test sets.
- This creates data leakage because the imputed values were influenced by information from the test set.
- As a result, the model’s evaluation may appear more accurate than it truly is, since the training process indirectly incorporated knowledge from unseen data.
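The sketch below contrasts the leaky approach from this example with the correct order of operations, using pandas and scikit-learn on a made-up "LotSize"/"Price" dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical house-price data with missing "LotSize" values
df = pd.DataFrame({
    "LotSize": [5000, np.nan, 7200, 6100, np.nan, 8800, 4900, 7600],
    "Price":   [210_000, 185_000, 305_000, 260_000, 190_000, 340_000, 200_000, 310_000],
})

# LEAKY: the mean is computed from the full dataset (train + test) before splitting
df_leaky = df.copy()
df_leaky["LotSize"] = df_leaky["LotSize"].fillna(df_leaky["LotSize"].mean())
train_leaky, test_leaky = train_test_split(df_leaky, test_size=0.25, random_state=0)

# CORRECT: split first, then impute using the training-set mean only
train, test = train_test_split(df, test_size=0.25, random_state=0)
train_mean = train["LotSize"].mean()
train = train.assign(LotSize=train["LotSize"].fillna(train_mean))
test = test.assign(LotSize=test["LotSize"].fillna(train_mean))  # reuse the training mean
```

In the leaky version, the filled-in values carry information from rows that later end up in the test set, which is exactly the leakage described above.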
Another example of data leakage is target leakage explained here - https://www.narendranaidu.com/2026/02/ruminating-on-target-leakage-in-ml.html