Target leakage is a type of data leakage where the training data includes information directly tied to the outcome (the target variable) that wouldn't be available at prediction time. Your model "cheats" during training and looks amazing on paper, but fails on new data. Leakage often sneaks in through feature engineering or data collection, and it usually shows up as validation scores that are too good to be true.
Examples:
- You're building a model to spot who'll get a sinus infection. Your dataset has a feature "took_antibiotics." Sounds useful, right? Wrong—patients take antibiotics after getting sick, so this feature leaks the target. Drop it!
- You're predicting whether employees will quit. Including "retention_bonus_offered" leaks info because bonuses are typically offered after an employee shows signs of leaving, not before. The model learns from a reaction to churn, not its causes.
- In credit card fraud prediction, using "chargeback_filed" as a feature is a textbook leak. Chargebacks are filed after the fraud has happened, so the model peeks at the future.
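To make the sinus infection example concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the `recent_cold` feature, the probabilities, and the one-feature rule standing in for a trained model. It compares how many infected patients each feature catches in historical records versus at real prediction time.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def make_patients(n, at_prediction_time):
    """Generate hypothetical (features, got_infection) rows."""
    rows = []
    for _ in range(n):
        got_infection = random.random() < 0.3
        # Honest feature: known before the outcome, only weakly predictive.
        recent_cold = random.random() < (0.6 if got_infection else 0.3)
        if at_prediction_time:
            # Nobody has been diagnosed or treated yet.
            took_antibiotics = False
        else:
            # Historical records: most infected patients were later treated.
            took_antibiotics = got_infection and random.random() < 0.95
        features = {"recent_cold": recent_cold, "took_antibiotics": took_antibiotics}
        rows.append((features, got_infection))
    return rows


def recall(rows, feature):
    """Share of infected patients flagged by 'predict infection if feature is True'."""
    infected = [feats for feats, got_infection in rows if got_infection]
    return sum(feats[feature] for feats in infected) / len(infected)


historical = make_patients(5000, at_prediction_time=False)
live = make_patients(5000, at_prediction_time=True)

leaky_hist = recall(historical, "took_antibiotics")
leaky_live = recall(live, "took_antibiotics")
honest_hist = recall(historical, "recent_cold")
honest_live = recall(live, "recent_cold")

print(f"took_antibiotics: historical recall {leaky_hist:.2f}, live recall {leaky_live:.2f}")
print(f"recent_cold:      historical recall {honest_hist:.2f}, live recall {honest_live:.2f}")
```

By construction, the leaky feature catches nearly every infected patient in the historical data and none of them live, while the weaker but honest feature performs about the same in both settings.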
Golden rule to avoid target leakage: always ask, "Would this feature's value be known at the moment the prediction is made?" If not, remove it.
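If you know when each feature's value gets recorded, the golden rule can be roughly sketched as a timestamp check. The helper `available_features`, the timestamps, and the extra feature names below are hypothetical, made up for the fraud example; real pipelines will track availability differently.

```python
from datetime import datetime


def available_features(recorded_at, prediction_time):
    """Return the features whose values were recorded at or before prediction_time."""
    return [name for name, ts in recorded_at.items() if ts <= prediction_time]


# Hypothetical fraud example: the transaction is scored the moment it happens.
prediction_time = datetime(2024, 3, 1, 12, 0)

# Assumed times at which each feature's value becomes known for this transaction.
recorded_at = {
    "transaction_amount": datetime(2024, 3, 1, 12, 0),
    "merchant_category": datetime(2024, 3, 1, 12, 0),
    "chargeback_filed": datetime(2024, 3, 20, 9, 30),  # weeks after the transaction
}

safe = available_features(recorded_at, prediction_time)
leaky = sorted(set(recorded_at) - set(safe))

print("keep:", safe)
print("drop:", leaky)
```

Here "chargeback_filed" ends up in the drop list because its value only exists after the moment of prediction.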