Tuesday, February 17, 2026

Split before data pre-processing or after?

In machine learning workflows, the standard practice is to split datasets into training and testing subsets before applying most preprocessing transformations to prevent data leakage. 

However, certain preliminary data cleaning operations may be performed safely on the entire dataset beforehand, as they do not depend on statistical summaries or introduce information from the test set into the training process. 

Given below are examples of preprocessing that can be done before splitting. 

  • Removing duplicate rows.
  • Fixing data types - e.g. parsing date strings into proper date objects.
  • Removing bad data or impossible values - e.g. rows with age > 150.
  • Trimming whitespace from strings.
Thus, as long as a preprocessing step does not rely on statistics computed from the full dataset (for example, a mean used to impute missing values), you can safely perform it before the train/test split, as shown in the sketch below.
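
The snippet below is a minimal pandas sketch of this kind of leakage-free cleaning. The DataFrame and its column names (signup_date, age, city, price) are made up purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw data; the column names are made up for illustration
df = pd.DataFrame({
    "signup_date": ["2026-01-05", "2026-01-05", "bad-date", "2026-02-01"],
    "age":         [34, 34, 29, 700],           # 700 is an impossible value
    "city":        [" Pune ", " Pune ", "Mumbai", "Delhi "],
    "price":       [250.0, 250.0, 310.0, 410.0],
})

# 1. Remove duplicate rows
df = df.drop_duplicates()

# 2. Fix data types - unparseable date strings become NaT
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# 3. Remove impossible values, e.g. age > 150
df = df[df["age"].between(0, 150)].copy()

# 4. Trim whitespace from string columns
df["city"] = df["city"].str.strip()

# Only now split into training and test sets
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
```
None of these steps uses a statistic computed across rows, so doing them before the split does not let test-set information leak into training.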

Operations that rely on data-derived statistics, such as imputation with means or medians, standardization, one-hot encoding based on category frequencies, or percentile-based outlier removal, must be fitted exclusively on the training set and then applied to both subsets. This kind of preprocessing should therefore be done only after splitting; otherwise you end up with what is called 'data leakage'.
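
Below is a minimal scikit-learn sketch of the correct pattern, using synthetic data: the imputer and scaler are fitted on the training set only, and the same fitted objects are then used to transform the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features with ~10% missing values (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=100)

# Split FIRST, before any statistics are computed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit statistics-based transformers on the training set only...
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train_imputed = imputer.transform(X_train)
scaler = StandardScaler().fit(X_train_imputed)

# ...then apply the SAME fitted transformers to both subsets
X_train_prepared = scaler.transform(X_train_imputed)
X_test_prepared = scaler.transform(imputer.transform(X_test))
```
In practice, wrapping such steps in a scikit-learn Pipeline keeps the fit-on-train / transform-both discipline intact automatically, even inside cross-validation.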

So what exactly is data leakage? You can understand it with the following analogy. 
  • Imagine you're studying for an exam.
  • You’re supposed to practice using your textbook (training data) and then take the exam (test data) to see how well you’ve learned.
  • Now imagine someone secretly shows you some of the exam questions while you’re studying.
  • When you take the test, you score really high — but not because you truly understood the material. You just recognized the questions. That’s data leakage!!!
In simple terms:
  • The training data is what the model learns from.
  • The test data is supposed to check how well it learned.
  • If information from the test data sneaks into training, the model gets an unfair advantage.
  • It looks like it performs very well.
  • But when you give it completely new data in the real world, performance drops.
  • So data leakage makes the model look smarter than it actually is — and that’s dangerous because it won’t work as well in real-life situations.
Example where imputation is done before splitting. 
  • Suppose you are building a model to predict house prices, and the dataset contains missing values in the feature “Lot Size.”
  • You calculate the mean lot size using the entire dataset (including both training and test data) and use that value to fill in all missing entries.
  • After performing this imputation, you split the data into training and test sets.
  • This creates data leakage because the imputed values were influenced by information from the test set.
  • As a result, the model’s evaluation may appear more accurate than it truly is, since the training process indirectly incorporated knowledge from unseen data.
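The contrast between the leaky order and the correct order looks roughly like this. The tiny DataFrame and the "LotSize"/"Price" column names are hypothetical stand-ins for the house-price example above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny hypothetical housing dataset with missing values in "LotSize"
df = pd.DataFrame({
    "LotSize": [5000, None, 7200, 6100, None, 8300, 4500, 9100],
    "Price":   [250000, 310000, 405000, 330000, 280000, 450000, 240000, 500000],
})

# LEAKY order: the mean is computed over ALL rows, including future test rows
df_leaky = df.copy()
df_leaky["LotSize"] = df_leaky["LotSize"].fillna(df_leaky["LotSize"].mean())
train_leaky, test_leaky = train_test_split(df_leaky, test_size=0.25, random_state=42)

# CORRECT order: split first, compute the mean on the training rows only,
# and reuse that training-set mean to fill the test rows
train, test = train_test_split(df, test_size=0.25, random_state=42)
train_mean = train["LotSize"].mean()
train = train.assign(LotSize=train["LotSize"].fillna(train_mean))
test = test.assign(LotSize=test["LotSize"].fillna(train_mean))
```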
Another example of data leakage is target leakage, which is explained here - https://www.narendranaidu.com/2026/02/ruminating-on-target-leakage-in-ml.html

Ruminating on target leakage in ML models

Target leakage is a type of data leakage where the training data includes information that is directly tied to the outcome (target variable) but would not be available at prediction time. Your model "cheats" during training, looks amazing on paper, but fails on new data. It often sneaks in through feature engineering or data collection, leading to overly optimistic performance estimates.

Examples:

  • You're building a model to spot who'll get a sinus infection. Your dataset has a feature "took_antibiotics." Sounds useful, right? Wrong—patients take antibiotics after getting sick, so this feature leaks the target. Drop it!
  • Predicting if employees will quit. Including "retention_bonus_offered" leaks info because bonuses come after quit signals, not before. The model learns from a reaction to churn, not its causes.
  • In credit card fraud prediction, using "chargeback_filed" as a feature is leakage gold. Chargebacks happen post-fraud, so the model peeks at the future.

Golden rule to avoid target leakage: always ask, "Would this feature's value be available at the moment the prediction is made?" If not, remove it.
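
As a rough illustration of applying that rule, the sketch below drops a feature that only exists after the outcome is known. The column names ("took_antibiotics", "sinus_infection", etc.) are hypothetical and echo the first example above.

```python
import pandas as pd

# Hypothetical patient data; "took_antibiotics" is only recorded AFTER the
# diagnosis, so it would not be available at prediction time (target leakage).
df = pd.DataFrame({
    "age":              [34, 51, 29],
    "congestion_days":  [2, 7, 1],
    "took_antibiotics": [0, 1, 0],   # reaction to the outcome, not a cause
    "sinus_infection":  [0, 1, 0],   # target variable
})

# Features that fail the "available before the prediction?" test
leaky_features = ["took_antibiotics"]

X = df.drop(columns=leaky_features + ["sinus_infection"])
y = df["sinus_infection"]
# X now contains only information that exists when the prediction is made
```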