Problem statement: The model shows very high accuracy during evaluation but performs poorly in production because features in the training data unintentionally include information from the target or test set (data leakage), leading to over-optimistic offline metrics.
1. How data leakage happens in practice
Most common causes (see which ones match your code):
- Preprocessing done on full data before split (illustrated right after this list)
  - Scaling, encoding, imputing, or feature selection done on the full dataset, then train/test split.
- Target used in feature engineering
  - Using `groupby(...).mean()` of the target, or similar, on the full data.
  - Target encoding done on the whole dataset.
- Time leakage
  - For time series, shuffling and splitting randomly, so future data leaks into model training.
- Post-event features
  - Features that are only known after the label happens (e.g., "refund_issued", "payment_overdue_flag") included as input.
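Here is what the first (and most common) pattern looks like side by side; a minimal illustration, assuming a numeric feature matrix X and labels y like in the steps below:
# ❌ LEAKY: the scaler's mean/std are computed on ALL rows,
# so the test rows influence how the train rows are scaled
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_scaled = StandardScaler().fit_transform(X)  # fit on the full data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# SAFE: split first, then fit the scaler on the training rows only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)        # statistics learned from train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)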
We’ll now set up a safe template that prevents all of this.
2. Step-by-step solution strategy
Step 1 – Always split data before any preprocessing
from sklearn.model_selection import train_test_split
# df is your pandas DataFrame
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # stratify for classification
)
At this point:
- No scaling, encoding, imputation, or feature selection has been done yet.
- You now treat `X_train, y_train` as the only data the model is allowed to "see".
Step 2 – Use a Pipeline + ColumnTransformer for preprocessing
This ensures:
- All preprocessing is fitted only on train data.
- During `.fit(X_train, y_train)`, sklearn automatically fits the transformers only on train.
- During `.predict(X_test)`, it uses the stored parameters (means, encoders, etc.) without refitting.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression # or any model
from sklearn.metrics import accuracy_score, roc_auc_score
Assume you know which columns are numeric vs categorical:
numeric_features = ['age', 'income', 'balance'] # example
categorical_features = ['gender', 'region', 'channel'] # example
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')), # fit on train only
('scaler', StandardScaler()) # fit on train only
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')), # fit on train only
('onehot', OneHotEncoder(handle_unknown='ignore')) # categories learned from train only
])
preprocess = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
]
)
clf = Pipeline(steps=[
('preprocess', preprocess),
('model', LogisticRegression(max_iter=1000))
])
Train and evaluate:
# Fit ONLY on train
clf.fit(X_train, y_train)
# Evaluate on test
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC-AUC:", roc_auc_score(y_test, y_proba))
This structure alone removes many leakage patterns.
Step 3 – Do cross-validation safely
Don’t manually preprocess first and then cross-validate.
Instead, cross-validate the whole Pipeline:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(
clf, # full pipeline
X_train,
y_train,
cv=5,
scoring='roc_auc'
)
print("CV ROC-AUC (mean):", cv_scores.mean())
print("CV ROC-AUC (per fold):", cv_scores)
Now each fold:
- Fits preprocessing and model only on the training part of that fold.
- Evaluates on the validation part → no leakage across folds.
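If you want to see roughly what cross_val_score does for you, here is a hand-written equivalent (a sketch using StratifiedKFold and clone, so every fold gets a freshly fitted copy of the whole pipeline):
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    fold_clf = clone(clf)  # fresh, unfitted copy of the full pipeline
    fold_clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])  # preprocessing re-fitted per fold
    val_proba = fold_clf.predict_proba(X_train.iloc[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_train.iloc[val_idx], val_proba))
print("Manual CV ROC-AUC mean:", np.mean(fold_scores))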
Step 4 – Fix target-based feature engineering (if you have it)
Bad (leaky) pattern you might have:
# ❌ LEAKY EXAMPLE
df['mean_target_per_region'] = df.groupby('region')['target'].transform('mean')
X = df.drop(columns=['target'])
y = df['target']
# Then split...
Why this is bad:
- The mean of `target` for each region is computed using all data (train + test).
- Test information leaks into train via this feature.
Fix:
Compute such target-based features only within train, and if needed for test, compute them using train statistics only.
Example: simple train-only encoding (no CV):
# Split first
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Safe target mean encoding
train_with_target = X_train.copy()
train_with_target['target'] = y_train
# Compute mean target per region on TRAIN only
mean_target_by_region = (
train_with_target
.groupby('region')['target']
.mean()
)
# Map into train/test using TRAIN means only
X_train['mean_target_per_region'] = X_train['region'].map(mean_target_by_region)
X_test['mean_target_per_region'] = X_test['region'].map(mean_target_by_region)
# Handle unseen regions in test (if any)
X_test['mean_target_per_region'] = X_test['mean_target_per_region'].fillna(mean_target_by_region.mean())
Then you can include 'mean_target_per_region' in numeric_features of your pipeline.
For production-grade target encoding, you’d wrap this in a custom transformer or use category_encoders with CV target encoding — but the key rule is:
All statistics that use the target must be computed only on training data, never using test.
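As an illustration of that rule, here is a minimal sketch of such a custom transformer (the class name TargetMeanEncoder and its unsmoothed logic are illustrative, not a library API). Because the per-category means are learned in fit, putting it inside a Pipeline means it only ever sees the training part of the data:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TargetMeanEncoder(BaseEstimator, TransformerMixin):
    """Learns the mean target per category on fit (train only), maps it on transform."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y):
        y = pd.Series(np.asarray(y), index=X.index)
        self.global_mean_ = y.mean()                      # fallback for unseen categories
        self.mapping_ = y.groupby(X[self.column]).mean()  # per-category mean, train only
        return self

    def transform(self, X):
        X = X.copy()
        X[self.column + '_target_mean'] = (
            X[self.column].map(self.mapping_).fillna(self.global_mean_)
        )
        return X
Used as a step before the ColumnTransformer in your Pipeline, its fit only ever receives the training fold (also during cross-validation), so the encoding cannot leak validation or test targets; remember to add the new column to numeric_features.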
Step 5 – Remove features that literally can’t exist at prediction time
Check your columns for “future info”:
- Flags like `is_refunded`, `closed_reason`, `days_until_cancel`, `resolved_at` when your target is "will cancel?" or "will churn?"
- Any field filled after the event happens.
Quick sanity check:
leaky_columns = [
'refund_issued_at',
'closed_at',
'actual_outcome', # etc – adjust with your real column names
]
X_train = X_train.drop(columns=[col for col in leaky_columns if col in X_train.columns])
X_test = X_test.drop(columns=[col for col in leaky_columns if col in X_test.columns])
Rule of thumb:
If in real life you don’t know a value at prediction time, it must not be a feature.
Step 6 – Basic leakage diagnostic (optional but helpful)
If you want to quickly check for “suspiciously strong” correlations in train:
# Works only for numeric features
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
corrs = {}
for col in numeric_cols:
    corr = np.corrcoef(X_train[col], y_train)[0, 1]
    corrs[col] = abs(corr)
sorted_corrs = sorted(corrs.items(), key=lambda x: -x[1])
print("Top suspiciously correlated features:")
for col, c in sorted_corrs[:10]:
    print(f"{col}: {c:.4f}")
If you see any feature with correlation extremely close to 1 or -1, it’s suspicious:
- Either it’s directly derived from the target.
- Or it’s essentially encoding the label.
Those need investigation.
Step 7 – For time series: use time-based split, not random
If your data has a timestamp and production will always predict on future rows:
df = df.sort_values('event_time') # your time column
# Example: last 20% rows as test (future)
test_size = int(0.2 * len(df))
train_df = df.iloc[:-test_size]
test_df = df.iloc[-test_size:]
X_train = train_df.drop(columns=['target'])
y_train = train_df['target']
X_test = test_df.drop(columns=['target'])
y_test = test_df['target']
Then use the same Pipeline pattern as before.
This avoids leakage where the model accidentally learns from future data when predicting the past.
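If you also want cross-validation on time-ordered data, sklearn's TimeSeriesSplit produces forward-chaining folds (train on the past, validate on the future). A minimal sketch, assuming X_train / y_train are already sorted by event_time as above:
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
tscv = TimeSeriesSplit(n_splits=5)
cv_scores = cross_val_score(
    clf,                # the full pipeline from Step 2
    X_train, y_train,   # rows must stay in chronological order
    cv=tscv,            # each fold trains on earlier rows, validates on later ones
    scoring='roc_auc'
)
print("Time-aware CV ROC-AUC:", cv_scores.mean())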
3. Minimal end-to-end example (you can copy and adapt)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# 1. Load your data
# Example structure:
# df = pd.read_csv("your_data.csv")
# Assume 'target' is the label
# Dummy example (replace with your real df)
np.random.seed(0)
df = pd.DataFrame({
'age': np.random.randint(18, 70, 500),
'income': np.random.normal(50000, 15000, 500),
'balance': np.random.normal(10000, 5000, 500),
'gender': np.random.choice(['M', 'F'], 500),
'region': np.random.choice(['North', 'South', 'East', 'West'], 500),
'channel': np.random.choice(['Online', 'Branch', 'Phone'], 500),
'target': np.random.randint(0, 2, 500)
})
# 2. Split BEFORE any preprocessing
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 3. Define preprocessors & pipeline
numeric_features = ['age', 'income', 'balance']
categorical_features = ['gender', 'region', 'channel']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
]
)
clf = Pipeline(steps=[
('preprocess', preprocess),
('model', LogisticRegression(max_iter=1000))
])
# -------------------------
# 4. Cross-validation (safe)
# -------------------------
cv_scores = cross_val_score(
clf, X_train, y_train, cv=5, scoring='roc_auc'
)
print("CV ROC-AUC mean:", cv_scores.mean())
# -------------------------
# 5. Train final model & evaluate on hold-out test
# -------------------------
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC-AUC:", roc_auc_score(y_test, y_proba))
Conclusion — Fixing Data Leakage Between Train & Test
Data leakage is sneaky because your offline metrics look great, but the model collapses in production. The root cause is always the same:
The model “sees” information during training that it would never see at real prediction time.
To avoid that, follow these rules:
1. Always split first, preprocess later
Do train_test_split before any scaling, encoding, imputation, or feature selection. Use Pipelines + ColumnTransformer so all preprocessing is fitted only on the training data.
2. Keep target out of features
Never compute target-based stats (like mean target per group) on the full dataset. If you use target encodings, compute them only on the training set, then map them to validation/test.
3. Remove future-only or post-event columns
Drop any feature that logically wouldn’t exist at prediction time (refund status, closed_at, outcome flags, etc.).
4. Use correct splitting strategy
For time series: use time-based splits, not random ones. For classification: consider stratified splits and cross-validation, but always on the full Pipeline, not on preprocessed data.
5. Compare CV vs Test performance honestly
If CV metrics are high but test / production metrics are much lower, treat that as a red flag for leakage or drift, not as a “lucky” result.
If you consistently enforce these patterns, your train/CV metrics will be more realistic, your test performance will match production better, and debugging ML models becomes much more predictable.
