Abstract: This article presents a systematic analysis of the Kaggle Home Credit Default Risk competition solution, detailing the complete machine learning pipeline from data preprocessing through feature engineering to model ensemble techniques. We examine the architectural decisions, implementation strategies, and performance optimization methods that achieved competitive results in this large-scale credit risk prediction task. The methodology encompasses data quality assessment, sophisticated feature extraction from relational databases, gradient boosting model optimization, and stacking ensemble strategies. Our analysis provides actionable insights for practitioners working on similar structured data prediction problems in financial risk assessment.
Keywords: Credit Risk Modeling, Gradient Boosting, Feature Engineering, Model Stacking, LightGBM, XGBoost, CatBoost, Machine Learning Pipeline
1. Introduction and Business Context#
1.1 Problem Domain and Motivation#
The democratization of financial services has emerged as a critical challenge in global economic development. Traditional credit assessment mechanisms rely heavily on conventional credit histories, stable employment records, and collateral assets—criteria that systematically exclude approximately 1.7 billion adults worldwide who lack access to formal banking services (World Bank Global Findex Database, 2017).
Home Credit Group, an international non-banking financial institution operating across 10+ countries, addresses this market gap by serving the “credit invisible” population—individuals without established credit histories who are typically rejected by conventional banking institutions. The fundamental business challenge involves making accurate default predictions within a constrained timeframe (approximately 5 minutes) using alternative data sources.
1.2 The Credit Risk Prediction Task#
Task Formulation: Binary classification for probability of default (PD)
\[
Y_i = \begin{cases}
1, & \text{if client } i \text{ defaults (90+ days past due)} \\\\
0, & \text{if client } i \text{ repays as scheduled}
\end{cases}
\]
Evaluation Metric: Receiver Operating Characteristic - Area Under Curve (ROC-AUC)
The AUC metric is selected for its robustness to class imbalance and its focus on ranking capability rather than absolute probability calibration:
\[
\text{AUC} = \int_0^1 \text{TPR}(\tau) \, d(\text{FPR}(\tau))
\]
where \(\text{TPR}\) denotes True Positive Rate and \(\text{FPR}\) denotes False Positive Rate at threshold \(\tau\).
Metric Interpretation:
- AUC = 0.50: Random prediction (no discriminative power)
- AUC ∈ [0.60, 0.70]: Poor performance
- AUC ∈ [0.70, 0.80]: Acceptable performance
- AUC ∈ [0.80, 0.90]: Good performance
- AUC > 0.90: Excellent performance (requires overfitting verification)
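The ranking interpretation is worth making concrete: AUC equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. A minimal sketch with hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores for six applicants
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.35, 0.30, 0.25, 0.80, 0.90])

# Of the 3 x 3 = 9 (default, non-default) pairs, 8 are ranked correctly,
# so AUC = 8/9 ≈ 0.889
auc = roc_auc_score(y_true, y_score)
```

Because only the ordering matters, any monotone rescaling of the scores leaves AUC unchanged, which is why probability calibration is not required for this metric.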
1.3 Alternative Data Sources in Credit Assessment#
The competition dataset exemplifies the paradigm shift toward alternative credit scoring, utilizing non-traditional data modalities:
| Traditional Data | Alternative Proxy | Data Provider |
|---|---|---|
| Credit Score | Mobile phone recharge patterns, call duration | Telecommunication operators |
| Income Verification | POS transaction records, rental payment history | Payment processors, property platforms |
| Bank Statements | Installment payment history, credit card statements | Consumer finance companies |
| Employment Verification | E-commerce activity, social media engagement | Internet platforms |
1.4 Competition Outcomes and Methodological Impact#
The 2018 Home Credit Default Risk competition attracted 7,194 participating teams globally. Top-performing solutions demonstrated significant methodological innovations:
- Feature Engineering: Evolution from basic aggregations to sophisticated temporal window features and trend analysis
- Ensemble Architectures: Systematic application of multi-level stacking strategies
- Data Preprocessing: Advanced techniques for missing value imputation and outlier treatment
These methodologies have demonstrated transferability to related domains including insurance fraud detection, marketing response prediction, and customer churn modeling.
2. Dataset Architecture and Schema Analysis#
2.1 Data Volume and Relational Structure#
The dataset comprises seven interconnected relational tables with an aggregate volume exceeding 50 million records, representing a complex relational database schema typical of enterprise financial systems.

Table Summary Statistics:
| Table | Row Count | Storage Size | Description | Primary Key | Foreign Key |
|---|---|---|---|---|---|
| application_{train,test} | 307,511 / 48,744 | 45MB / 7MB | Primary application table | SK_ID_CURR | — |
| bureau | 1,716,428 | 172MB | Credit bureau records | SK_ID_BUREAU | SK_ID_CURR |
| bureau_balance | 27,299,925 | 574MB | Monthly bureau status | — | SK_ID_BUREAU |
| previous_application | 1,670,214 | 150MB | Historical applications | SK_ID_PREV | SK_ID_CURR |
| installments_payments | 13,605,401 | 730MB | Installment payment records | — | SK_ID_PREV |
| POS_CASH_balance | 10,001,358 | 970MB | POS cash loan statements | — | SK_ID_PREV |
| credit_card_balance | 3,840,312 | 400MB | Credit card statements | — | SK_ID_PREV |
2.2 Entity Relationship Model#
The database schema follows a hierarchical relational structure with three primary identifier domains:
```text
SK_ID_CURR:   Client-level identifier (primary entity key)
SK_ID_PREV:   Previous application identifier (transaction-level)
SK_ID_BUREAU: External credit bureau record identifier
```
Relationship Topology:
```text
application [1]
 ├──<N>── bureau [1] ──<N>── bureau_balance
 └──<N>── previous_application [1]
            ├──<N>── installments_payments
            ├──<N>── POS_CASH_balance
            └──<N>── credit_card_balance
```
The one-to-many (1:N) cardinality relationships necessitate aggregation operations during feature engineering to transform temporal and multi-record observations into static feature vectors suitable for machine learning models.
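As a sketch of that transformation (hypothetical data and column prefix), aggregated child-table features are merged back onto the client table with a left join, so clients lacking history simply receive missing values:

```python
import pandas as pd

# Hypothetical client table and 1:N bureau records
applications = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_CREDIT_SUM': [5000.0, 3000.0, 2000.0],
})

# Aggregate to one row per client, then left-join onto the client table
bureau_agg = (bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM']
              .agg(['count', 'sum'])
              .add_prefix('bureau_credit_'))
features = applications.merge(bureau_agg, left_on='SK_ID_CURR',
                              right_index=True, how='left')
# Client 3 has no bureau history, so its aggregates are NaN
```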
2.3 Detailed Field Specifications#
2.3.1 Application Table (Primary Entity)#
The application table serves as the central entity containing target labels for the training partition.
Demographic and Application Features:
```text
SK_ID_CURR: Integer (primary identifier)
TARGET: Binary (0=non-default, 1=default) - training set only
CODE_GENDER: Categorical (M/F/XNA)
FLAG_OWN_CAR: Binary (Y/N)
FLAG_OWN_REALTY: Binary (Y/N)
CNT_CHILDREN: Integer (count of children)
AMT_INCOME_TOTAL: Float (annual income in local currency)
AMT_CREDIT: Float (loan amount requested)
AMT_ANNUITY: Float (monthly installment amount)
AMT_GOODS_PRICE: Float (price of goods being financed)
```
Temporal Features (encoded as days relative to application date, negative values indicate past):
```text
DAYS_BIRTH: Integer (age in days, e.g., -10000 ≈ 27.4 years)
DAYS_EMPLOYED: Integer (employment duration, special value 365243 indicates unemployed)
DAYS_REGISTRATION: Integer (registration change recency)
DAYS_ID_PUBLISH: Integer (identity document issuance recency)
DAYS_LAST_PHONE_CHANGE: Integer (mobile phone change recency)
```
External Score Features (highly predictive normalized scores):
```text
EXT_SOURCE_1: Float [0,1] (normalized external score 1)
EXT_SOURCE_2: Float [0,1] (normalized external score 2)
EXT_SOURCE_3: Float [0,1] (normalized external score 3)
# Sourced from third-party credit bureaus
```
2.3.2 Bureau Table (External Credit History)#
Records of client’s credit relationships with external financial institutions.
```text
SK_ID_CURR: Integer (foreign key to application)
SK_ID_BUREAU: Integer (unique bureau record identifier)
CREDIT_ACTIVE: Categorical (Active/Closed/Sold/Demand/Bad debt)
CREDIT_CURRENCY: Categorical (currency code)
DAYS_CREDIT: Integer (days since credit application)
CREDIT_DAY_OVERDUE: Integer (current days past due)
DAYS_CREDIT_ENDDATE: Integer (remaining duration to maturity)
DAYS_ENDDATE_FACT: Integer (actual closure date)
AMT_CREDIT_MAX_OVERDUE: Float (maximum historical overdue amount)
CNT_CREDIT_PROLONG: Integer (count of credit prolongations)
AMT_CREDIT_SUM: Float (total credit exposure)
AMT_CREDIT_SUM_DEBT: Float (outstanding debt)
AMT_CREDIT_SUM_LIMIT: Float (credit limit)
AMT_CREDIT_SUM_OVERDUE: Float (current overdue amount)
CREDIT_TYPE: Categorical (loan type: consumer, mortgage, etc.)
DAYS_CREDIT_UPDATE: Integer (recency of bureau update)
AMT_ANNUITY: Float (monthly payment obligation)
```
2.3.3 Bureau Balance Table (Temporal Bureau Status)#
Monthly status snapshots for each bureau record enabling trend analysis.
```text
SK_ID_BUREAU: Integer (foreign key to bureau)
MONTHS_BALANCE: Integer (relative month index, -1=last month, -2=two months ago)
STATUS: Categorical encoding:
    '0': Current (no delinquency)
    '1': 1-29 days past due
    '2': 30-59 days past due
    '3': 60-89 days past due
    '4': 90-119 days past due
    '5': 120-149 days past due
    'C': Closed (paid off)
    'X': Status unknown
```
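One common way to exploit these monthly codes (not necessarily the competition's exact recipe) is to map each code to a numeric delinquency severity and take the worst status per bureau record. The mapping below, including treating 'C' and 'X' as zero severity, is an assumption:

```python
import pandas as pd

# Hypothetical monthly snapshots for one bureau record
bb = pd.DataFrame({
    'SK_ID_BUREAU': [101, 101, 101, 101],
    'MONTHS_BALANCE': [-4, -3, -2, -1],
    'STATUS': ['0', '1', 'C', 'X'],
})

# Map status codes to DPD severity buckets; 'C'/'X' carry no delinquency signal (assumed)
severity = {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, 'C': 0, 'X': 0}
bb['DPD_LEVEL'] = bb['STATUS'].map(severity)

# Worst delinquency level ever observed for each bureau record
worst_status = bb.groupby('SK_ID_BUREAU')['DPD_LEVEL'].max()
```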
2.3.4 Previous Application Table#
Historical applications within the Home Credit system.
```text
SK_ID_CURR: Integer (foreign key)
SK_ID_PREV: Integer (unique previous application identifier)
NAME_CONTRACT_TYPE: Categorical (Cash/Consumer/Revolving loans)
AMT_ANNUITY: Float (proposed monthly payment)
AMT_APPLICATION: Float (requested amount)
AMT_CREDIT: Float (approved amount)
AMT_DOWN_PAYMENT: Float (down payment amount)
RATE_DOWN_PAYMENT: Float (down payment ratio)
RATE_INTEREST_PRIMARY: Float (primary interest rate)
RATE_INTEREST_PRIVILEGED: Float (preferential interest rate)
NAME_CONTRACT_STATUS: Categorical (Approved/Refused/Canceled/Unused)
DAYS_DECISION: Integer (days since decision)
CODE_REJECT_REASON: Categorical (rejection reason if applicable)
NAME_CLIENT_TYPE: Categorical (New/Repeat customer)
CNT_PAYMENT: Integer (proposed term in months)
```
2.3.5 Installments Payments Table#
Granular repayment transaction records.
```text
SK_ID_CURR: Integer (foreign key)
SK_ID_PREV: Integer (foreign key to previous_application)
NUM_INSTALMENT_VERSION: Integer (version of installment schedule)
NUM_INSTALMENT_NUMBER: Integer (installment sequence number)
DAYS_INSTALMENT: Integer (scheduled payment date)
DAYS_ENTRY_PAYMENT: Integer (actual payment date)
AMT_INSTALMENT: Float (scheduled amount)
AMT_PAYMENT: Float (actual amount paid)
```
Derived Metrics:
- Days Past Due (DPD): \(\text{DPD} = \text{DAYS\_ENTRY\_PAYMENT} - \text{DAYS\_INSTALMENT}\)
- Payment Deviation: \(\Delta_{\text{AMT}} = \text{AMT\_PAYMENT} - \text{AMT\_INSTALMENT}\)
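Both metrics are simple row-wise differences; a toy computation in pandas (illustrative values):

```python
import pandas as pd

inst = pd.DataFrame({
    'DAYS_INSTALMENT':    [-60, -30],
    'DAYS_ENTRY_PAYMENT': [-58, -31],
    'AMT_INSTALMENT':     [1000.0, 1000.0],
    'AMT_PAYMENT':        [1000.0,  900.0],
})

# Positive DPD means the payment arrived late; negative AMT_DIFF means underpayment
inst['DPD'] = inst['DAYS_ENTRY_PAYMENT'] - inst['DAYS_INSTALMENT']
inst['AMT_DIFF'] = inst['AMT_PAYMENT'] - inst['AMT_INSTALMENT']
# inst['DPD']      -> [2, -1]
# inst['AMT_DIFF'] -> [0.0, -100.0]
```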
2.4 Data Quality Profile#
Class Distribution:
```text
Class 0 (Non-default): 282,686 observations (91.93%)
Class 1 (Default):      24,825 observations (8.07%)
Imbalance Ratio:        11.4:1
```
Missing Data Summary:
- EXT_SOURCE_1: 56.38% missing
- EXT_SOURCE_3: 19.83% missing
- AMT_ANNUITY: 0.003% missing
- OCCUPATION_TYPE: 31.35% missing
Anomalous Encodings:
- DAYS_EMPLOYED = 365,243 (~1,000 years): Sentinel value for unemployment
- CODE_GENDER = ‘XNA’: Unspecified gender category
- AMT_INCOME_TOTAL: Extreme outlier at 117,000,000 (potential data error)
3. System Architecture and Pipeline Design#
3.1 Framework Selection: Steppy Pipeline Architecture#
The solution implements the Steppy framework, a lightweight machine learning pipeline library designed for modular, reproducible data science workflows. Steppy adopts design principles from workflow orchestration systems such as Apache Airflow and Spotify’s Luigi, adapted for the machine learning domain.

Motivation for Pipeline Frameworks:
Traditional imperative machine learning code suffers from several architectural limitations:
```python
# Anti-pattern: tightly coupled workflow
data = load_data()
data = clean_data(data)
data = extract_features(data)
X_train, X_test, y_train, y_test = split_data(data)
model = train_model(X_train, y_train)
predictions = model.predict(X_test)
```
Identified Deficiencies:
- Coupling: Modifications to one stage require understanding of downstream dependencies
- Reproducibility: Intermediate results cannot be cached or versioned
- Parallelization: Sequential execution prevents computational resource optimization
- Experiment Tracking: Parameter variations are difficult to systematically compare
Steppy’s Declarative Approach:
```python
import pandas as pd

from steppy.base import BaseTransformer


class DataLoader(BaseTransformer):
    """Loads raw data from persistent storage."""

    def transform(self, filepath):
        data = pd.read_csv(filepath)
        return {'data': data}


class DataCleaningTransformer(BaseTransformer):
    """Applies data quality transformations."""

    def transform(self, data):
        cleaned = self._handle_outliers(data)
        cleaned = self._impute_missing(cleaned)
        return {'cleaned_data': cleaned}

    def _handle_outliers(self, df):
        # Implementation omitted
        pass

    def _impute_missing(self, df):
        # Implementation omitted
        pass


class FeatureExtractionTransformer(BaseTransformer):
    """Engineers features from cleaned data."""

    def transform(self, cleaned_data):
        features = self._aggregate_features(cleaned_data)
        return {'features': features}

    def _aggregate_features(self, data):
        # Implementation omitted
        pass
```
Design Principles:
- Standardized Interfaces: All components inherit from `BaseTransformer` with `fit()` and `transform()` methods
- Explicit Data Flow: Dictionary-based I/O with named keys for traceability
- Composability: Steps connect via `Step` and `Adapter` abstractions
- Persistence: Intermediate artifacts support caching and checkpointing
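These principles can be illustrated with a toy dictionary-based runner in the same spirit. The `run_pipeline` helper and step classes below are illustrative sketches, not part of steppy's actual API:

```python
# Toy dictionary-based step chaining; run_pipeline and the step classes are
# illustrative, NOT steppy's actual API
def run_pipeline(steps, initial_inputs):
    """Feed each step the accumulated named outputs of earlier steps."""
    state = dict(initial_inputs)
    for step in steps:
        outputs = step.transform(**{k: state[k] for k in step.inputs})
        state.update(outputs)  # named keys make the data flow explicit
    return state


class Doubler:
    inputs = ['data']

    def transform(self, data):
        return {'doubled': [x * 2 for x in data]}


class Summer:
    inputs = ['doubled']

    def transform(self, doubled):
        return {'total': sum(doubled)}


state = run_pipeline([Doubler(), Summer()], {'data': [1, 2, 3]})
# state['total'] -> 12
```

Because every intermediate result lives under a named key, any step's output can be cached, inspected, or re-used by later steps without re-running the whole chain.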
3.2 End-to-End Pipeline Flow#

Stage 1: Data Ingestion and Cleaning
```python
def build_data_ingestion_pipeline(config):
    """Constructs the data loading and cleaning pipeline stage."""
    # Load all seven tables
    raw_data = DataLoader(config.data_paths).transform()

    # Apply table-specific cleaning transformers
    cleaning_transformers = {
        'application': ApplicationCleaning(),
        'bureau': BureauCleaning(),
        'bureau_balance': BureauBalanceCleaning(),
        'previous_application': PreviousApplicationCleaning(),
        'installments_payments': InstallmentPaymentsCleaning(),
        'pos_cash_balance': PosCashBalanceCleaning(),
        'credit_card_balance': CreditCardBalanceCleaning()
    }

    cleaned_data = {
        table: transformer.transform(raw_data[table])
        for table, transformer in cleaning_transformers.items()
    }
    return cleaned_data
```
Stage 2: Feature Engineering
```python
def build_feature_engineering_pipeline(cleaned_data):
    """Constructs the feature extraction pipeline stage."""
    # Table-specific feature extraction
    bureau_features = BureauFeatureExtractor().transform(
        cleaned_data['bureau'],
        cleaned_data['bureau_balance']
    )
    prev_app_features = PreviousApplicationFeatureExtractor().transform(
        cleaned_data['previous_application']
    )
    installment_features = InstallmentFeatureExtractor().transform(
        cleaned_data['installments_payments']
    )

    # Feature consolidation
    all_features = FeatureConcatenator().transform([
        cleaned_data['application'],
        bureau_features,
        prev_app_features,
        installment_features
    ])

    # Categorical encoding
    encoded_features = CategoricalEncoder().transform(
        all_features,
        method='target_encoding'
    )

    return {
        'features': encoded_features,
        'target': cleaned_data['application']['TARGET'],
        'feature_names': encoded_features.columns.tolist()
    }
```
Stage 3: Model Training
```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def build_model_training_pipeline(feature_data, config):
    """Constructs the model training pipeline stage."""
    # Train/validation split
    X_train, X_valid, y_train, y_valid = train_test_split(
        feature_data['features'],
        feature_data['target'],
        test_size=0.2,
        stratify=feature_data['target'],
        random_state=config.random_seed
    )

    # Model initialization and training
    model = GradientBoostingModel(config.model_params)
    model.fit(X_train, y_train, validation_data=(X_valid, y_valid))

    # Performance evaluation: AUC needs probability scores, not hard labels
    validation_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

    return {
        'model': model,
        'validation_auc': validation_auc,
        'feature_importance': model.feature_importances_
    }
```
Stage 4: Ensemble Construction (Stacking)
```python
import numpy as np


def build_stacking_ensemble(base_models, meta_learner, X, y, X_test):
    """
    Implements a two-level stacking ensemble architecture.

    Level 1: Base learners generate out-of-fold predictions
    Level 2: Meta-learner trains on base model outputs
    """
    # Generate OOF predictions
    oof_predictions = {}
    test_predictions = {}
    for name, model in base_models.items():
        oof_pred, test_pred = generate_oof_predictions(
            model, X, y, X_test, n_folds=5
        )
        oof_predictions[name] = oof_pred
        test_predictions[name] = test_pred

    # Train meta-learner on stacked base-model outputs
    meta_features = np.column_stack([
        oof_predictions[name] for name in base_models.keys()
    ])
    meta_learner.fit(meta_features, y)

    # Generate final predictions
    meta_test_features = np.column_stack([
        test_predictions[name] for name in base_models.keys()
    ])
    final_predictions = meta_learner.predict_proba(meta_test_features)[:, 1]
    return final_predictions
```
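The `generate_oof_predictions` helper referenced above is not shown in this excerpt. A minimal sketch of the standard out-of-fold procedure, assuming scikit-learn-style models and NumPy arrays (not the competition's exact implementation):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold


def generate_oof_predictions(model, X, y, X_test, n_folds=5, seed=0):
    """Each training row is scored by a fold model that never saw it;
    test-set predictions are averaged across the fold models."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for train_idx, valid_idx in skf.split(X, y):
        fold_model = clone(model)
        fold_model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = fold_model.predict_proba(X[valid_idx])[:, 1]
        test_pred += fold_model.predict_proba(X_test)[:, 1] / n_folds
    return oof, test_pred
```

Training the meta-learner only on out-of-fold scores is what prevents target leakage: the level-2 inputs for each row come from models that never observed that row's label.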
3.3 Project Directory Organization#
```text
open-solution-home-credit/
├── src/                          # Source code modules
│   ├── __init__.py
│   ├── pipeline_manager.py       # Orchestration layer
│   ├── pipelines.py              # Pipeline definitions (9 model configurations)
│   ├── pipeline_blocks.py        # Step factory methods
│   ├── feature_extraction.py     # Feature transformers (20+ implementations)
│   ├── data_cleaning.py          # Data quality transformers (7 table-specific)
│   ├── models.py                 # Model wrappers (LGB/XGB/CTB/NN/RF/LR)
│   ├── pipeline_config.py        # Configuration constants and hyperparameters
│   ├── hyperparameter_tuning.py  # Optimization strategies
│   ├── callbacks.py              # Training monitoring callbacks
│   ├── utils.py                  # Utility functions
│   └── neptune_hacks.py          # Offline experiment tracking support
├── configs/                      # Configuration files
│   └── neptune.yaml              # Main configuration (paths/hyperparameters)
├── data/                         # Data directory
│   ├── raw/                      # Original competition data
│   └── workdir/                  # Intermediate processing artifacts
├── notebooks/                    # Exploratory data analysis
├── blog/                         # Documentation
│   └── images/                   # Visualization assets
├── main.py                       # CLI entry point
├── requirements.txt              # Dependency specification
└── README.md                     # Project documentation
```
3.4 Configuration Management#
The project implements a hybrid configuration strategy combining YAML for experiment parameters and Python modules for code-level constants.
Experiment Configuration (configs/neptune.yaml):
```yaml
parameters:
  # Data paths
  train_filepath: /data/application_train.csv
  test_filepath: /data/application_test.csv

  # Model selection
  pipeline_name: lightGBM

  # Feature toggles
  use_application: true
  use_bureau: true
  use_bureau_balance: true
  use_previous_application: true
  use_installments_payments: true
  use_pos_cash_balance: true
  use_credit_card_balance: true

  # LightGBM hyperparameters
  lgbm__objective: binary
  lgbm__metric: auc
  lgbm__num_leaves: 35
  lgbm__learning_rate: 0.02
  lgbm__n_estimators: 5000
  lgbm__min_child_samples: 70
  lgbm__subsample: 1.0
  lgbm__colsample_bytree: 0.03
  lgbm__reg_lambda: 100.0
  lgbm__reg_alpha: 0.0

  # Cross-validation configuration
  n_cv_splits: 5
  validation_size: 0.2
  stratified_cv: true
  shuffle: true
  random_seed: 90210
```
Code-Level Configuration (src/pipeline_config.py):
```python
"""Constants and aggregation recipes for feature engineering."""
import numpy as np

# Reproducibility constants
RANDOM_SEED = 90210
DEV_SAMPLE_SIZE = 1000

# Column type definitions
CATEGORICAL_COLUMNS = [
    'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
    'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
    'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
    'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
    'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'
]

NUMERICAL_COLUMNS = [
    'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
    'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
    'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3'
]

# Aggregation recipes for feature extraction
BUREAU_AGGREGATION_RECIPES = [
    (['SK_ID_CURR'], [
        ('SK_ID_BUREAU', 'count'),
        ('AMT_CREDIT_SUM', ['sum', 'mean', 'max', 'std']),
        ('AMT_CREDIT_SUM_DEBT', ['sum', 'mean']),
        ('AMT_CREDIT_SUM_OVERDUE', ['sum', 'mean', 'max']),
        ('DAYS_CREDIT', ['min', 'max', 'mean']),
        ('CREDIT_DAY_OVERDUE', ['sum', 'max', 'mean']),
        ('CNT_CREDIT_PROLONG', 'sum')
    ])
]

PREVIOUS_APPLICATION_AGGREGATION_RECIPES = [
    (['SK_ID_CURR'], [
        ('SK_ID_PREV', 'count'),
        ('AMT_APPLICATION', ['sum', 'mean', 'max']),
        ('AMT_CREDIT', ['sum', 'mean', 'max']),
        ('AMT_DOWN_PAYMENT', ['sum', 'mean']),
        ('RATE_INTEREST_PRIMARY', ['mean', 'max']),
        ('DAYS_DECISION', ['min', 'max', 'mean'])
    ])
]
```
This bifurcated configuration approach provides:
- Accessibility: YAML enables rapid experimentation without code modification
- Structure: Python modules hold nested constants (such as the aggregation recipes) that YAML cannot express cleanly, and mistakes surface at import time rather than mid-run
- Override Capability: Command-line and environment variable overrides supported
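A sketch of how such a YAML file might be loaded with an environment-variable override layer, assuming PyYAML is available; the `load_config` helper and `HC__` prefix are hypothetical, not project code:

```python
# Hypothetical loader: YAML supplies defaults, environment variables win.
import os

import yaml


def load_config(yaml_text, env=os.environ):
    params = yaml.safe_load(yaml_text)['parameters']
    # Any HC__<PARAM> environment variable overrides the YAML value,
    # coerced to the type of the YAML default
    for key in list(params):
        override = env.get(f'HC__{key.upper()}')
        if override is not None:
            params[key] = type(params[key])(override)
    return params


cfg = load_config(
    "parameters:\n  lgbm__learning_rate: 0.02\n  n_cv_splits: 5\n",
    env={'HC__N_CV_SPLITS': '10'},
)
# cfg['n_cv_splits'] -> 10 (overridden); learning rate keeps its YAML default
```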
4. Exploratory Data Analysis and Quality Assessment#
4.1 EDA Methodology Framework#
Exploratory Data Analysis (EDA) in this context follows a systematic investigative framework designed to answer five fundamental questions:
- Data Quality: What anomalies, missing values, or encoding inconsistencies exist?
- Distributional Properties: What are the central tendencies, dispersion, and shapes of feature distributions?
- Business Insights: Do population segments exhibit differential behaviors?
- Predictive Signals: Which features demonstrate statistical association with the target variable?
- Feature Engineering Direction: What transformations or aggregations might improve predictive power?
4.2 Critical Findings#
4.2.1 Class Imbalance Analysis#
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load training data
train_df = pd.read_csv('data/application_train.csv')

# Class distribution analysis
target_distribution = train_df['TARGET'].value_counts()
print("Class Distribution:")
print(target_distribution)
print("\nClass Proportions:")
print(train_df['TARGET'].value_counts(normalize=True))

# Output:
# Class Distribution:
# 0    282686
# 1     24825
# Name: TARGET, dtype: int64
#
# Class Proportions:
# 0    0.919271
# 1    0.080729
# Name: TARGET, dtype: float64
```
Interpretation: The 11.4:1 imbalance ratio necessitates careful metric selection. Accuracy would be misleading (91.9% accuracy achievable by predicting majority class), motivating the use of AUC and precision-recall metrics.
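The accuracy trap is easy to demonstrate on synthetic labels drawn at roughly the same default rate: a degenerate majority-class predictor scores high accuracy while a constant score earns exactly chance-level AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic labels mirroring the ~8% default rate
y = (rng.random(10_000) < 0.08).astype(int)

# Always predict the majority class: high accuracy, zero discrimination
majority_pred = np.zeros_like(y)
acc = accuracy_score(y, majority_pred)        # ~0.92

# A constant score cannot rank defaulters above non-defaulters
auc = roc_auc_score(y, np.full(len(y), 0.5))  # exactly 0.5
```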
4.2.2 External Score Predictive Power#
```python
# Analyze EXT_SOURCE features
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, col in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    # Distribution comparison by target class
    sns.kdeplot(
        data=train_df[train_df['TARGET'] == 0][col].dropna(),
        label='Non-default',
        ax=axes[idx],
        fill=True,
        alpha=0.5
    )
    sns.kdeplot(
        data=train_df[train_df['TARGET'] == 1][col].dropna(),
        label='Default',
        ax=axes[idx],
        fill=True,
        alpha=0.5
    )
    axes[idx].set_title(f'{col} Distribution by Target')
    axes[idx].legend()

plt.tight_layout()
plt.savefig('images/ext_source_kde.png', dpi=150)
```
Key Observations:
- Defaulting clients demonstrate consistently lower external scores
- EXT_SOURCE_1 exhibits the strongest discriminative power (highest Information Value)
- Missing data rates vary: EXT_SOURCE_1 (56.4%), EXT_SOURCE_2 (0.2%), EXT_SOURCE_3 (19.8%)
4.2.3 Income Distribution Analysis#
```python
import numpy as np

# Income distribution with log transformation
train_df['AMT_INCOME_TOTAL_LOG'] = np.log1p(train_df['AMT_INCOME_TOTAL'])

# Descriptive statistics
print(train_df['AMT_INCOME_TOTAL'].describe())

# Detect extreme outliers
q99 = train_df['AMT_INCOME_TOTAL'].quantile(0.99)
extreme_outliers = train_df[train_df['AMT_INCOME_TOTAL'] > q99 * 10]
print(f"\nExtreme outliers (>10x 99th percentile): {len(extreme_outliers)}")
```
Statistical Summary:
```text
count    3.075110e+05
mean     1.687979e+05
std      2.371894e+05
min      2.565000e+04
25%      1.125000e+05
50%      1.471500e+05
75%      2.025000e+05
max      1.170000e+08   # Data quality concern
```
Implications: The right-skewed distribution (coefficient of skewness ≈ 3.2) motivates log-transformation. The extreme maximum (117M vs. mean 168K) suggests potential data entry errors requiring treatment.
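The effect of the log transform on skewness can be checked on a synthetic heavy-tailed sample; the log-normal draw below merely stands in for the income column, it is not the competition data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Log-normal draw as a stand-in for the right-skewed income distribution
income = rng.lognormal(mean=12, sigma=0.5, size=100_000)

raw_skew = skew(income)             # strongly right-skewed
log_skew = skew(np.log1p(income))   # near-symmetric after log1p
```

Tree-based models are invariant to monotone transforms, so the log mainly benefits linear and neural components of the ensemble and any distance-based diagnostics.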
4.2.4 Temporal Feature Analysis: Age and Default Risk#
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
| # Age calculation and risk profiling
train_df['AGE_YEARS'] = -train_df['DAYS_BIRTH'] / 365.25
# Binned analysis
train_df['AGE_BIN'] = pd.cut(
train_df['AGE_YEARS'],
bins=[0, 25, 30, 35, 40, 45, 50, 60, 100],
labels=['<25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-60', '60+']
)
default_by_age = train_df.groupby('AGE_BIN')['TARGET'].agg(['mean', 'count'])
print(default_by_age)
# Visualization
plt.figure(figsize=(10, 6))
default_by_age['mean'].plot(kind='bar', color='steelblue')
plt.title('Default Rate by Age Cohort')
plt.xlabel('Age Group')
plt.ylabel('Default Rate')
plt.axhline(y=train_df['TARGET'].mean(), color='r', linestyle='--', label='Overall Average')
plt.legend()
plt.tight_layout()
plt.savefig('images/default_rate_by_age.png', dpi=150)
|
Findings: Default rates exhibit an inverse relationship with age, with clients under 25 showing approximately 2.5x higher default rates than those aged 40-50. This aligns with established credit risk theory regarding income stability and financial experience.
4.2.5 Employment Status Encoding Anomaly#
```python
# Investigate the DAYS_EMPLOYED anomaly
anomaly_count = (train_df['DAYS_EMPLOYED'] == 365243).sum()
anomaly_rate = anomaly_count / len(train_df)
print(f"Anomalous DAYS_EMPLOYED (365243): {anomaly_count} ({anomaly_rate:.2%})")

# Compare default rates
train_df['EMPLOYMENT_STATUS'] = np.where(
    train_df['DAYS_EMPLOYED'] == 365243,
    'Unemployed/Unknown',
    'Employed'
)
employment_risk = train_df.groupby('EMPLOYMENT_STATUS')['TARGET'].mean()
print("\nDefault Rate by Employment Status:")
print(employment_risk)
```
Results:
```text
Anomalous DAYS_EMPLOYED (365243): 55,374 (18.01%)

Default Rate by Employment Status:
Employed              0.0753
Unemployed/Unknown    0.1047
```
Interpretation: The value 365,243 functions as a sentinel encoding (equivalent to ~1,000 years), indicating unemployment or data unavailability. The elevated default rate (10.5% vs. 7.5%) among this cohort confirms the encoding’s business relevance.
4.3 Data Preprocessing Strategy#
Based on EDA findings, we implement a systematic preprocessing pipeline:
```python
from typing import Dict

import numpy as np
import pandas as pd


class ApplicationDataCleaner(BaseTransformer):
    """Implements data quality transformations for the primary application table."""

    def transform(self, df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        df_cleaned = df.copy()

        # 1. Sentinel value treatment
        df_cleaned['DAYS_EMPLOYED'] = df_cleaned['DAYS_EMPLOYED'].replace(365243, np.nan)
        df_cleaned['CODE_GENDER'] = df_cleaned['CODE_GENDER'].replace('XNA', np.nan)

        # 2. Infinity value handling
        df_cleaned = df_cleaned.replace([np.inf, -np.inf], np.nan)

        # 3. Categorical missing value imputation
        categorical_columns = df_cleaned.select_dtypes(include=['object']).columns
        df_cleaned[categorical_columns] = df_cleaned[categorical_columns].fillna('Unknown')

        # 4. Numerical features: missing values preserved --
        #    gradient boosting models handle them natively

        return {'application_cleaned': df_cleaned}
```
5. Feature Engineering Methodology#
5.1 The Aggregation Problem#
The fundamental feature engineering challenge in this dataset stems from the relational structure: clients possess multiple historical records across subsidiary tables (bureau, previous applications, payment histories), while predictive modeling requires a fixed-dimensional feature vector for each client.
Illustrative Example:
```text
Client A - Bureau Records:
├─ Record 1: SK_ID_BUREAU=101, AMT_CREDIT_SUM=5000, DAYS_CREDIT=-365
├─ Record 2: SK_ID_BUREAU=102, AMT_CREDIT_SUM=3000, DAYS_CREDIT=-180
├─ Record 3: SK_ID_BUREAU=103, AMT_CREDIT_SUM=8000, DAYS_CREDIT=-90
└─ Record 4: SK_ID_BUREAU=104, AMT_CREDIT_SUM=2000, DAYS_CREDIT=-30

Required Transformation (Single Row):
- bureau_count:    4
- bureau_amt_sum:  18000
- bureau_amt_mean: 4500
- bureau_amt_max:  8000
- bureau_days_min: -365
- bureau_days_max: -30
```
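The example above maps directly onto a pandas named aggregation; running it reproduces the single-row transformation exactly:

```python
import pandas as pd

# Client A's four bureau records from the example above
records = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 1],
    'AMT_CREDIT_SUM': [5000, 3000, 8000, 2000],
    'DAYS_CREDIT': [-365, -180, -90, -30],
})

row = records.groupby('SK_ID_CURR').agg(
    bureau_count=('AMT_CREDIT_SUM', 'count'),
    bureau_amt_sum=('AMT_CREDIT_SUM', 'sum'),
    bureau_amt_mean=('AMT_CREDIT_SUM', 'mean'),
    bureau_amt_max=('AMT_CREDIT_SUM', 'max'),
    bureau_days_min=('DAYS_CREDIT', 'min'),
    bureau_days_max=('DAYS_CREDIT', 'max'),
)
# row.loc[1]: count=4, sum=18000, mean=4500, max=8000, days_min=-365, days_max=-30
```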
5.2 Aggregation Methodology#

Aggregation Operators:
| Operator | Mathematical Definition | Use Case | Business Interpretation |
|---|---|---|---|
| COUNT | \(n = \lvert \{r_1, r_2, \dots, r_n\} \rvert\) | Record frequency | Number of loans, applications |
| SUM | \(\Sigma = \sum_{i=1}^{n} x_i\) | Total exposure | Cumulative debt, total payments |
| MEAN | \(\mu = \frac{1}{n}\sum_{i=1}^{n} x_i\) | Central tendency | Average loan amount |
| MEDIAN | \(\tilde{x} = Q_2(x)\) | Robust central tendency | Median income (outlier-resistant) |
| MAX/MIN | \(\max(x), \min(x)\) | Extremes | Largest loan, earliest record |
| STD | \(\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}\) | Variability | Income stability, payment consistency |
| NUNIQUE | \(\lvert \{x_1, x_2, \dots\} \rvert\) | Cardinality | Number of distinct lenders |
5.3.1 Bureau Feature Engineering#
```python
from typing import Dict

import pandas as pd


class BureauFeatureExtractor(BaseTransformer):
    """Extracts aggregated features from credit bureau records."""

    def transform(self, bureau: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        # Primary aggregations
        bureau_agg = bureau.groupby('SK_ID_CURR').agg({
            # Exposure metrics
            'SK_ID_BUREAU': 'count',
            'AMT_CREDIT_SUM': ['sum', 'mean', 'max', 'min', 'std'],
            'AMT_CREDIT_SUM_DEBT': ['sum', 'mean', 'max'],
            'AMT_CREDIT_SUM_OVERDUE': ['sum', 'mean', 'max'],
            # Delinquency metrics
            'CNT_CREDIT_PROLONG': ['sum', 'mean'],
            'CREDIT_DAY_OVERDUE': ['sum', 'max', 'mean'],
            # Temporal metrics
            'DAYS_CREDIT': ['min', 'max', 'mean'],
            'DAYS_CREDIT_ENDDATE': ['min', 'max'],
            'DAYS_CREDIT_UPDATE': ['min', 'max'],
        })

        # Flatten multi-level columns
        bureau_agg.columns = [
            '_'.join(col).strip()
            for col in bureau_agg.columns.values
        ]

        # Active credit subset analysis
        active_mask = bureau['CREDIT_ACTIVE'] == 'Active'
        active_loans = bureau[active_mask].groupby('SK_ID_CURR').agg({
            'AMT_CREDIT_SUM': ['sum', 'count'],
            'AMT_CREDIT_SUM_DEBT': 'sum',
        })
        active_loans.columns = [
            'bureau_active_' + '_'.join(col)
            for col in active_loans.columns
        ]

        # Combine feature sets
        features = bureau_agg.join(active_loans, how='left')
        return {'bureau_features': features}
```
Generated Feature Examples:
```python
{
    'SK_ID_BUREAU_count': 5,                    # Total credit relationships
    'AMT_CREDIT_SUM_sum': 45000,                # Total credit exposure
    'AMT_CREDIT_SUM_mean': 9000,                # Average loan size
    'AMT_CREDIT_SUM_max': 20000,                # Maximum single exposure
    'DAYS_CREDIT_min': -730,                    # Oldest relationship
    'DAYS_CREDIT_max': -30,                     # Most recent relationship
    'bureau_active_AMT_CREDIT_SUM_sum': 15000,  # Active exposure
    'bureau_active_AMT_CREDIT_SUM_count': 2,    # Number of active accounts
}
```
5.3.2 Previous Application Feature Engineering#
```python
from typing import Dict

import pandas as pd


class PreviousApplicationFeatureExtractor(BaseTransformer):
    """Extracts features from historical Home Credit applications."""

    def transform(self, prev_app: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        # Core aggregations
        prev_agg = prev_app.groupby('SK_ID_CURR').agg({
            # Application frequency
            'SK_ID_PREV': 'count',
            # Approval metrics
            'NAME_CONTRACT_STATUS': [
                lambda x: (x == 'Approved').sum(),
                lambda x: (x == 'Refused').sum(),
                lambda x: (x == 'Canceled').sum()
            ],
            # Financial metrics
            'AMT_APPLICATION': ['sum', 'mean', 'max', 'min'],
            'AMT_CREDIT': ['sum', 'mean', 'max'],
            'AMT_DOWN_PAYMENT': ['sum', 'mean'],
            'AMT_ANNUITY': ['mean', 'max'],
            # Pricing metrics
            'RATE_INTEREST_PRIMARY': ['mean', 'max'],
            'RATE_DOWN_PAYMENT': ['mean', 'max'],
            # Temporal metrics
            'DAYS_DECISION': ['min', 'max', 'mean'],
        })

        # Derived metrics, computed before flattening via tuple indexing;
        # pandas names anonymous lambdas '<lambda_0>', '<lambda_1>', ...
        total_apps = prev_agg[('SK_ID_PREV', 'count')]
        approved_apps = prev_agg[('NAME_CONTRACT_STATUS', '<lambda_0>')]
        prev_agg['approval_rate'] = approved_apps / total_apps
        prev_agg['credit_to_application_ratio'] = (
            prev_agg[('AMT_CREDIT', 'sum')] /
            prev_agg[('AMT_APPLICATION', 'sum')]
        )

        # Flatten column structure; strip('_') removes the trailing underscore
        # that single-level derived columns such as 'approval_rate' pick up
        prev_agg.columns = [
            '_'.join(col).strip('_') if isinstance(col, tuple) else col
            for col in prev_agg.columns
        ]
        return {'previous_application_features': prev_agg}
```
Key Derived Features:
- approval_rate: Historical approval probability
- credit_to_application_ratio: Approved amount relative to requested
- avg_down_payment_rate: Typical down payment behavior
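As a sanity check on these derived ratios, here is a minimal pure-Python version of the same arithmetic; the toy records and field names are invented for illustration, not taken from the dataset schema:

```python
# Toy previous-application records for one client (invented values)
records = [
    {'status': 'Approved', 'amt_application': 10000, 'amt_credit': 9000},
    {'status': 'Approved', 'amt_application': 5000,  'amt_credit': 5000},
    {'status': 'Refused',  'amt_application': 20000, 'amt_credit': 0},
    {'status': 'Canceled', 'amt_application': 8000,  'amt_credit': 0},
]

total_apps = len(records)
approved = sum(r['status'] == 'Approved' for r in records)

# approval_rate: share of past applications that were approved
approval_rate = approved / total_apps

# credit_to_application_ratio: granted credit relative to requested amounts
credit_to_application_ratio = (
    sum(r['amt_credit'] for r in records)
    / sum(r['amt_application'] for r in records)
)

print(approval_rate)                           # 0.5
print(round(credit_to_application_ratio, 4))   # 0.3256
```

A ratio well below 1 signals that this client historically received much less credit than requested, which is itself a risk indicator.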
5.3.3 Installment Payment Feature Engineering#
```python
class InstallmentFeatureExtractor(BaseTransformer):
    """
    Extracts repayment behavior features from installment records.
    """
    def transform(self, installments: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        # Work on a copy so the caller's DataFrame is not mutated
        installments = installments.copy()
        # Calculate derived metrics
        installments['DPD'] = (
            installments['DAYS_ENTRY_PAYMENT'] -
            installments['DAYS_INSTALMENT']
        )
        installments['AMT_DIFF'] = (
            installments['AMT_PAYMENT'] -
            installments['AMT_INSTALMENT']
        )
        # Aggregations
        install_agg = installments.groupby('SK_ID_CURR').agg({
            # Volume metrics
            'NUM_INSTALMENT_VERSION': 'count',
            # Delinquency metrics
            'DPD': ['mean', 'max', 'sum', lambda x: (x > 0).sum()],
            # Payment amount metrics
            'AMT_INSTALMENT': ['sum', 'mean', 'max'],
            'AMT_PAYMENT': ['sum', 'mean', 'max'],
            'AMT_DIFF': [
                'mean', 'sum', 'max', 'min',
                lambda x: (x > 0).sum()
            ],
        })
        # Flatten columns
        install_agg.columns = [
            '_'.join(col).strip()
            for col in install_agg.columns.values
        ]
        return {'installment_features': install_agg}
```
Critical Derived Metrics:
- DPD_mean: Average days past due
- DPD_max: Worst delinquency instance
- AMT_DIFF_mean: Average payment deviation (overpayment/underpayment)
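Both derived columns reduce to simple differences of the raw fields; a toy illustration with invented values (days are negative offsets relative to the current application, as in the dataset):

```python
# One installment row (invented values)
days_instalment = -30      # scheduled due date
days_entry_payment = -25   # actual payment date

# DPD > 0 means the payment was late; DPD < 0 means it was early
dpd = days_entry_payment - days_instalment

amt_instalment = 1000.0    # amount due
amt_payment = 950.0        # amount actually paid

# AMT_DIFF > 0 means overpayment; AMT_DIFF < 0 means underpayment
amt_diff = amt_payment - amt_instalment

print(dpd)       # 5  (paid five days late)
print(amt_diff)  # -50.0  (underpaid by 50)
```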
5.4 Temporal Window Features#

Hypothesis: Recent behavioral patterns carry stronger predictive signal than historical averages.
```python
class TemporalWindowFeatureExtractor(BaseTransformer):
    """
    Extracts time-windowed aggregations for trend analysis.
    """
    def transform(
        self,
        data: pd.DataFrame,
        time_col: str = 'MONTHS_BALANCE'
    ) -> Dict[str, pd.DataFrame]:
        window_sizes = [3, 6, 12, 24]  # months
        all_features = {}
        for window in window_sizes:
            # Subset to recent history (MONTHS_BALANCE counts back from 0)
            recent_mask = data[time_col] >= -window
            recent_data = data[recent_mask]
            # Window-specific aggregations
            window_agg = recent_data.groupby('SK_ID_CURR').agg({
                'AMT_BALANCE': ['mean', 'max', 'sum'],
                'SK_ID_PREV': 'count',
            })
            # Flatten the MultiIndex and append a window suffix
            window_agg.columns = [
                f"{'_'.join(col)}_last_{window}m"
                for col in window_agg.columns
            ]
            all_features[f'window_{window}m'] = window_agg
        return all_features
```
5.5 Categorical Variable Encoding#
Encoding Strategy Selection:
| Method | Appropriate For | Advantages | Disadvantages |
|---|---|---|---|
| Label Encoding | Ordinal categories (education level) | Simple, low dimensionality | Introduces false ordinality for nominal data |
| One-Hot Encoding | Low-cardinality nominal data (gender) | No ordinality assumption | Dimensionality explosion |
| Target Encoding | High-cardinality data (occupation, region) | Captures target relationship | Risk of overfitting |
| Frequency Encoding | High-cardinality identifiers | Simple, captures prevalence | Information loss |
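The overfitting risk noted for target encoding is usually controlled by smoothing the per-category mean toward the global prior. A minimal sketch of the common `(n·mean + m·prior)/(n + m)` shrinkage rule; the category counts and rates below are invented, and this is a simplification of what encoding libraries actually do:

```python
def smoothed_target_encode(category_stats, prior, smoothing=10.0):
    """Map category -> encoded value, shrinking rare categories toward the prior.

    category_stats: {category: (count, target_mean)}
    """
    return {
        cat: (n * mean + smoothing * prior) / (n + smoothing)
        for cat, (n, mean) in category_stats.items()
    }

# Global default rate (prior) and per-category stats (invented)
prior = 0.08
stats = {
    'Laborers': (5000, 0.10),   # large category: stays near its own mean
    'IT staff': (10, 0.00),     # rare category: shrinks toward the prior
}

encoded = smoothed_target_encode(stats, prior)
print(round(encoded['Laborers'], 4))  # 0.1  (barely shrunk)
print(round(encoded['IT staff'], 4))  # 0.04 (halfway to the prior)
```

The larger the smoothing constant, the more a rare category's encoding is dominated by the global default rate rather than its own noisy mean.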
Implementation:
```python
from category_encoders import TargetEncoder, OneHotEncoder

class CategoricalEncodingPipeline(BaseTransformer):
    """
    Applies appropriate encoding strategies by variable type.
    """
    def __init__(self):
        self.encoders = {}

    def fit(self, X: pd.DataFrame, y: pd.Series):
        # Target encoding for high-cardinality features
        high_cardinality = [
            'OCCUPATION_TYPE', 'ORGANIZATION_TYPE',
            'NAME_FAMILY_STATUS'
        ]
        for col in high_cardinality:
            encoder = TargetEncoder(cols=[col], smoothing=10.0)
            encoder.fit(X[[col]], y)
            self.encoders[col] = encoder
        return self

    def transform(self, X: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        X_encoded = X.copy()
        for col, encoder in self.encoders.items():
            X_encoded[col] = encoder.transform(X[[col]])
        return {'features_encoded': X_encoded}
```
5.6 Feature Selection#
Objective: Reduce dimensionality, eliminate noise, improve training efficiency
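Mutual information, the selection criterion used in this project, can be computed by hand on a toy contingency: a feature that perfectly determines a balanced binary label has MI equal to the label entropy (1 bit), while an independent feature has MI 0. A pure-Python sketch with invented toy vectors:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts for X
    py = Counter(ys)             # marginal counts for Y
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

y  = [0, 0, 1, 1]
x1 = [0, 0, 1, 1]   # perfectly informative: MI = H(Y) = 1 bit
x2 = [0, 1, 0, 1]   # independent of y:     MI = 0 bits

print(mutual_information(x1, y))  # 1.0
print(mutual_information(x2, y))  # 0.0
```

Selecting the top-K features by such a score keeps the variables that individually carry the most information about default, though it cannot see interactions between features.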
```python
from sklearn.feature_selection import mutual_info_classif, SelectKBest

class FeatureSelectionTransformer(BaseTransformer):
    """
    Selects top-K features based on mutual information.
    """
    def __init__(self, k: int = 500):
        self.k = k
        self.selector = None

    def fit(self, X: pd.DataFrame, y: pd.Series):
        # Note: mutual_info_classif does not accept NaN values,
        # so missing values must be imputed before this step.
        self.selector = SelectKBest(
            score_func=mutual_info_classif,
            k=self.k
        )
        self.selector.fit(X, y)
        self.selected_features = X.columns[
            self.selector.get_support()
        ].tolist()
        return self

    def transform(self, X: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        X_selected = X[self.selected_features]
        return {
            'features': X_selected,
            'feature_names': self.selected_features
        }
```
6. Model Selection, Training, and Evaluation#
6.1 Gradient Boosting Decision Trees#
Theoretical Foundation:
Gradient Boosting constructs an additive ensemble of weak learners (typically decision trees) through functional gradient descent:
\[F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)\]
where $h_m(x)$ is the weak learner fitted to the pseudo-residuals:
\[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\]
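For the binary log-loss used here, \(L(y, F) = -[y\log\sigma(F) + (1-y)\log(1-\sigma(F))]\), the pseudo-residual works out to \(r_i = y_i - \sigma(F(x_i))\): the gap between the label and the current predicted probability. A one-step sketch with invented toy labels (not the library implementation):

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

y = [1, 0, 1, 1]  # toy binary labels

# F_0: constant log-odds of the base rate (the usual initialization)
p0 = sum(y) / len(y)               # 0.75
F = [log(p0 / (1 - p0))] * len(y)  # log-odds of 0.75 for every instance

# Pseudo-residuals for log-loss: r_i = y_i - sigmoid(F(x_i))
residuals = [yi - sigmoid(Fi) for yi, Fi in zip(y, F)]
print([round(r, 2) for r in residuals])  # [0.25, -0.75, 0.25, 0.25]

# The next weak learner h_m would be fit to these residuals, then
# added with shrinkage: F_m = F_{m-1} + nu * h_m
```

Each boosting round thus pushes predictions toward whichever instances the current ensemble gets most wrong.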
Advantages for Tabular Data:
- Automatic Feature Interactions: Tree splits inherently model feature combinations
- Missing Value Handling: Native support without imputation requirements
- Non-linear Capacity: Captures complex decision boundaries
- Interpretability: Feature importance and partial dependence analysis
6.2 Comparative Analysis: LightGBM, XGBoost, CatBoost#

Algorithmic Characteristics:
| Characteristic | LightGBM | XGBoost | CatBoost |
|---|---|---|---|
| Tree Growth | Leaf-wise | Level-wise | Symmetric (oblivious) |
| Split Finding | Histogram-based | Histogram + Exact | Histogram-based |
| Key Optimizations | GOSS, EFB | Cache-aware access | Ordered boosting |
| Categorical Support | Limited | Manual encoding | Native support |
| Training Speed | Fastest | Moderate | Moderate |
| Memory Efficiency | Best | Moderate | Good |
Gradient-based One-Side Sampling (GOSS) [LightGBM]:
Retains instances with large gradients (high error) while randomly sampling instances with small gradients, maintaining data distribution while accelerating training.
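The GOSS idea can be sketched directly: keep the top-`a` fraction of instances by gradient magnitude, uniformly sample a `b` fraction of the rest, and up-weight the sampled small-gradient instances by `(1-a)/b` so gradient statistics remain approximately unbiased. A minimal sketch, not the LightGBM implementation; all values are invented:

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=42):
    """Return (indices, weights) for a GOSS-style subsample."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(a * n)
    rest_n = int(b * n)

    top = order[:top_n]                          # large-gradient instances, all kept
    rng = random.Random(seed)
    sampled = rng.sample(order[top_n:], rest_n)  # small-gradient subsample

    weights = {i: 1.0 for i in top}
    # Re-weight sampled instances to represent the discarded small-gradient mass
    weights.update({i: (1 - a) / b for i in sampled})
    return top + sampled, weights

rng0 = random.Random(0)
grads = [rng0.uniform(-1, 1) for _ in range(1000)]  # toy per-instance gradients
idx, w = goss_sample(grads)
print(len(idx))               # 300  (20% kept + 10% sampled)
print(round(w[idx[-1]], 6))   # 8.0  = (1 - 0.2) / 0.1
```

Training then proceeds on only 30% of the data per iteration while the reweighting keeps split-gain estimates close to those on the full dataset.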
Exclusive Feature Bundling (EFB) [LightGBM]:
Bundles mutually exclusive features (rarely non-zero simultaneously) to reduce dimensionality without information loss.
Ordered Boosting [CatBoost]:
Eliminates prediction shift by using ordered permutation of training data, providing unbiased gradient estimates.
Selection Guidelines:
- Rapid Experimentation: LightGBM (10x training speed advantage)
- Maximum Accuracy: XGBoost (marginal but consistent gains)
- Rich Categorical Data: CatBoost (native categorical handling)
6.3 LightGBM Implementation#

Hyperparameter Specification:
```python
import lightgbm as lgb

# Model configuration
LGBM_PARAMS = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    # Tree structure
    'num_leaves': 35,
    'max_depth': -1,
    'min_child_samples': 70,
    # Learning dynamics
    'learning_rate': 0.02,
    'n_estimators': 5000,   # alias of num_iterations; early stopping cuts this short
    # Regularization
    'reg_lambda': 100.0,
    'reg_alpha': 0.0,
    # Sampling
    'subsample': 1.0,
    'colsample_bytree': 0.03,
    'verbose': -1,
    'random_state': 42
}
# Categorical columns are declared on the Dataset / in fit(), not in the
# params dict, e.g. lgb.Dataset(X, label=y, categorical_feature=cat_cols)
```
Training Procedure:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Data preparation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Dataset construction
train_dataset = lgb.Dataset(X_train, label=y_train)
valid_dataset = lgb.Dataset(X_valid, label=y_valid, reference=train_dataset)

# Model training (the iteration budget comes from the params; early
# stopping typically halts well before the cap)
model = lgb.train(
    LGBM_PARAMS,
    train_dataset,
    valid_sets=[train_dataset, valid_dataset],
    valid_names=['train', 'valid'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=100)
    ]
)

# Performance evaluation
y_pred = model.predict(X_valid, num_iteration=model.best_iteration)
validation_auc = roc_auc_score(y_valid, y_pred)
print(f'Validation AUC: {validation_auc:.4f}')

# Feature importance analysis
importance_df = pd.DataFrame({
    'feature': model.feature_name(),
    'importance_gain': model.feature_importance(importance_type='gain'),
    'importance_split': model.feature_importance(importance_type='split')
}).sort_values('importance_gain', ascending=False)
print("\nTop 20 Features by Gain:")
print(importance_df.head(20))
```
Hyperparameter Tuning Guidelines:
- num_leaves: \(2^{\text{max\_depth}}\) provides a baseline; reduce to control overfitting
- learning_rate: 0.01–0.1 range; lower values require more iterations
- reg_lambda: increase (1→100) for noisy datasets
- colsample_bytree: reduce (1.0→0.3) for high-dimensional feature sets
6.4 XGBoost Implementation#
```python
import xgboost as xgb

XGB_PARAMS = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'tree_method': 'hist',
    'seed': 42
}

# DMatrix construction (enable_categorical requires pandas 'category' dtypes)
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dvalid = xgb.DMatrix(X_valid, label=y_valid, enable_categorical=True)

# Training
eval_results = {}
model = xgb.train(
    XGB_PARAMS,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dvalid, 'valid')],
    evals_result=eval_results,
    early_stopping_rounds=100,
    verbose_eval=100
)

# Evaluation, using only the trees up to the best early-stopping iteration
y_pred = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
auc = roc_auc_score(y_valid, y_pred)
```
6.5 CatBoost Implementation#
```python
from catboost import CatBoostClassifier, Pool

# Identify categorical features by column index
categorical_features = [
    i for i, col in enumerate(X_train.columns)
    if X_train[col].dtype == 'object'
]

# Data pools
train_pool = Pool(X_train, y_train, cat_features=categorical_features)
valid_pool = Pool(X_valid, y_valid, cat_features=categorical_features)

# Model configuration
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3.0,
    early_stopping_rounds=100,
    verbose=100,
    random_seed=42
)

# Training
model.fit(train_pool, eval_set=valid_pool)

# Evaluation
y_pred = model.predict_proba(valid_pool)[:, 1]
auc = roc_auc_score(y_valid, y_pred)
```
6.6 Cross-Validation and Out-of-Fold Prediction#
Rationale for Cross-Validation:
- Stability Assessment: Reduces variance from single train/test split
- Overfitting Prevention: Validates generalization capability
- OOF Generation: Produces unbiased predictions for ensemble construction
Stratified K-Fold Implementation:
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5
kf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=42)

oof_predictions = np.zeros(len(X_train))
test_predictions = np.zeros(len(X_test))
fold_scores = []

for fold, (train_idx, valid_idx) in enumerate(kf.split(X_train, y_train)):
    print(f'\nFold {fold + 1}/{N_FOLDS}')
    # Data partitioning
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[valid_idx]
    # Model training
    model = lgb.LGBMClassifier(**LGBM_PARAMS)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(100), lgb.log_evaluation(0)]
    )
    # Out-of-fold predictions
    oof_predictions[valid_idx] = model.predict_proba(X_val)[:, 1]
    # Test set predictions (averaged across folds)
    test_predictions += model.predict_proba(X_test)[:, 1] / N_FOLDS
    # Fold-level evaluation
    fold_auc = roc_auc_score(y_val, oof_predictions[valid_idx])
    fold_scores.append(fold_auc)
    print(f'Fold AUC: {fold_auc:.4f}')

# Aggregate performance
overall_auc = roc_auc_score(y_train, oof_predictions)
print(f'\nOverall OOF AUC: {overall_auc:.4f} (+/- {np.std(fold_scores):.4f})')
```
7. Ensemble Learning and Model Fusion#
7.1 Ensemble Learning Theory#
Limitations of Single Models:
Individual models exhibit specific failure modes:
- LightGBM: Prone to overfitting on sparse features
- XGBoost: Computationally intensive training
- CatBoost: Slight accuracy trade-off for robustness
Ensemble Advantages:
- Variance Reduction: Averaging reduces prediction variance
- Bias Reduction: Diverse models capture complementary patterns
- Stability: Robust to individual model failures
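The variance-reduction claim is easy to verify numerically: averaging K independent, unbiased predictors with noise variance σ² yields a predictor with variance close to σ²/K. A seeded pure-Python simulation (the noise model and numbers are invented for illustration):

```python
import random
from statistics import pvariance

rng = random.Random(42)
true_value = 0.7   # quantity every model tries to predict
K = 5              # number of ensemble members
n_trials = 20000

single_preds, ensemble_preds = [], []
for _ in range(n_trials):
    # K independent unbiased predictors, each with noise std 0.1
    preds = [true_value + rng.gauss(0, 0.1) for _ in range(K)]
    single_preds.append(preds[0])          # one model alone
    ensemble_preds.append(sum(preds) / K)  # simple average

var_single = pvariance(single_preds)
var_ensemble = pvariance(ensemble_preds)
print(var_single / var_ensemble)  # close to K = 5
```

Real base learners are correlated, so the gain is smaller in practice, which is exactly why diverse model families (leaf-wise, level-wise, ordered boosting) are chosen.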

7.2 Two-Level Stacking Architecture#
Architecture Specification:
Level 1 (Base Learners): Diverse gradient boosting implementations
- LightGBM (leaf-wise optimization)
- XGBoost (level-wise with exact greedy)
- CatBoost (ordered boosting)
Level 2 (Meta-Learner): Simple linear model
- Logistic Regression or Ridge Regression
- Rationale: Base learners extract sufficient signal; complex meta-learners risk overfitting
7.3 Out-of-Fold Prediction Generation#
Critical Constraint: Meta-learner training requires predictions where the base model was not trained on the target instance (preventing data leakage).

Data Leakage Warning:
```python
# INCORRECT: Training set predictions (data leakage)
model.fit(X_train, y_train)
train_pred = model.predict(X_train)  # Model has seen these instances!
```
Correct OOF Generation:
```python
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.zeros(len(X_train))

for train_idx, valid_idx in kf.split(X_train, y_train):
    # Positional indexing: .iloc is required for DataFrames/Series
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    y_tr = y_train.iloc[train_idx]
    model.fit(X_tr, y_tr)
    # Predict on the held-out validation fold only
    oof_preds[valid_idx] = model.predict_proba(X_val)[:, 1]
```
7.4 Stacking Implementation#
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from typing import Dict

class StackingEnsemble:
    """
    Two-level stacking ensemble with OOF prediction generation.
    """
    def __init__(
        self,
        base_models: Dict[str, object],
        meta_learner: object
    ):
        self.base_models = base_models
        self.meta_learner = meta_learner
        self.base_predictions = {}

    def fit(
        self,
        X: pd.DataFrame,
        y: pd.Series,
        cv: int = 5
    ) -> 'StackingEnsemble':
        """
        Generate OOF predictions and train meta-learner.
        """
        kf = StratifiedKFold(
            n_splits=cv,
            shuffle=True,
            random_state=42
        )
        # Matrix to store OOF predictions
        n_models = len(self.base_models)
        oof_features = np.zeros((len(X), n_models))
        # Generate OOF predictions for each base model
        for idx, (name, model) in enumerate(self.base_models.items()):
            print(f'Generating OOF predictions: {name}...')
            for train_idx, valid_idx in kf.split(X, y):
                X_tr = X.iloc[train_idx]
                X_val = X.iloc[valid_idx]
                y_tr = y.iloc[train_idx]
                # Fit on training fold
                model.fit(X_tr, y_tr)
                # Predict on validation fold
                oof_features[valid_idx, idx] = (
                    model.predict_proba(X_val)[:, 1]
                )
            self.base_predictions[name] = oof_features[:, idx].copy()
        # Train meta-learner on OOF features
        print('Training meta-learner...')
        self.meta_learner.fit(oof_features, y)
        # Retrain base models on the full dataset for inference
        print('Retraining base models on full data...')
        for name, model in self.base_models.items():
            model.fit(X, y)
        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """
        Generate ensemble predictions.
        """
        # Generate base model predictions
        n_models = len(self.base_models)
        base_features = np.zeros((len(X), n_models))
        for idx, (name, model) in enumerate(self.base_models.items()):
            base_features[:, idx] = model.predict_proba(X)[:, 1]
        # Meta-learner prediction
        return self.meta_learner.predict_proba(base_features)[:, 1]

# Usage (parameter dicts defined in Sections 6.3-6.4)
base_models = {
    'lightgbm': lgb.LGBMClassifier(**LGBM_PARAMS),
    'xgboost': xgb.XGBClassifier(**XGB_PARAMS),
    'catboost': CatBoostClassifier(iterations=1000, verbose=0, random_seed=42)
}
meta_model = LogisticRegression(
    C=1.0,
    solver='lbfgs',
    max_iter=1000
)
ensemble = StackingEnsemble(base_models, meta_model)
ensemble.fit(X_train, y_train)
final_predictions = ensemble.predict(X_test)
```
7.5 Hyperparameter Optimization#

Methodological Comparison:
| Method | Strategy | Strengths | Limitations | Computational Cost |
|---|---|---|---|---|
| Grid Search | Exhaustive enumeration | Comprehensive coverage | Exponential scaling | High |
| Random Search | Random sampling | Efficient exploration | Potential omission | Moderate |
| Bayesian Optimization | Probabilistic surrogate model | Sample-efficient | Implementation complexity | Low-Moderate |
Bayesian Optimization Implementation:
```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer

# Define search space
search_spaces = {
    'num_leaves': Integer(20, 50),
    'learning_rate': Real(0.01, 0.1, prior='log-uniform'),
    'min_child_samples': Integer(10, 100),
    'reg_lambda': Real(1e-8, 10.0, prior='log-uniform'),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.3, 1.0)
}

# Bayesian optimization
opt = BayesSearchCV(
    lgb.LGBMClassifier(
        objective='binary',
        metric='auc',
        boosting_type='gbdt',
        n_estimators=1000,
        verbose=-1
    ),
    search_spaces,
    n_iter=50,
    scoring='roc_auc',
    cv=3,
    n_jobs=-1,
    random_state=42,
    verbose=1
)
opt.fit(X_train, y_train)
print(f'Best CV Score: {opt.best_score_:.4f}')
print(f'Optimal Parameters: {opt.best_params_}')
```
| Model Configuration | Cross-Validation AUC | Public Leaderboard | Private Leaderboard | Relative Improvement |
|---|---|---|---|---|
| LightGBM (single) | 0.7902 | 0.791 | 0.792 | Baseline |
| XGBoost (single) | 0.7854 | 0.787 | 0.788 | -0.004 |
| CatBoost (single) | 0.7881 | 0.789 | 0.790 | -0.002 |
| Simple Average | 0.7920 | 0.793 | 0.794 | +0.002 |
| Stacking (LGB+XGB+CTB+LR) | 0.8053 | 0.807 | 0.808 | +0.016 |
Key Insights:
- Stacking ensemble achieves 1.6 percentage point improvement over best single model
- On Kaggle leaderboards, 0.01 AUC improvement typically corresponds to hundreds of ranking positions
- Simple logistic regression meta-learner outperforms complex alternatives by reducing overfitting
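The "Simple Average" row in the table corresponds to nothing more than averaging the per-model probabilities; rank averaging is a common variant when models are calibrated differently. A sketch with invented toy predictions:

```python
def simple_average(*prediction_lists):
    """Element-wise mean of several models' predicted probabilities."""
    return [sum(ps) / len(ps) for ps in zip(*prediction_lists)]

def rank_average(*prediction_lists):
    """Average of within-model ranks, rescaled to [0, 1]."""
    n = len(prediction_lists[0])
    ranked = []
    for preds in prediction_lists:
        order = sorted(range(n), key=lambda i: preds[i])
        ranks = [0.0] * n
        for r, i in enumerate(order):
            ranks[i] = r / (n - 1)   # lowest score -> 0.0, highest -> 1.0
        ranked.append(ranks)
    return simple_average(*ranked)

lgb_p = [0.10, 0.80, 0.30]   # toy probabilities from three models
xgb_p = [0.20, 0.70, 0.40]
ctb_p = [0.15, 0.90, 0.20]

print([round(v, 2) for v in simple_average(lgb_p, xgb_p, ctb_p)])  # [0.15, 0.8, 0.3]
print(rank_average(lgb_p, xgb_p, ctb_p))                           # [0.0, 1.0, 0.5]
```

Because AUC depends only on the ordering of scores, rank averaging is insensitive to calibration differences between models, which is why it is a popular blending baseline on Kaggle.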
8. Conclusions and Best Practices#
8.1 Complete Technical Pipeline Summary#
```
Raw Data (7 tables, 50M+ records)
        ↓
Data Preprocessing
 ├─ Anomaly detection and treatment
 ├─ Missing value imputation
 └─ Categorical encoding
        ↓
Feature Engineering (1,000+ features)
 ├─ Aggregation operations (count/sum/mean/max/std)
 ├─ Temporal window features (3m/6m/12m/24m)
 ├─ Ratio and interaction features
 └─ Target encoding for high-cardinality variables
        ↓
Model Development
 ├─ LightGBM (primary model)
 ├─ XGBoost (accuracy complement)
 └─ CatBoost (robustness validation)
        ↓
Ensemble Construction
 ├─ Out-of-fold prediction generation
 ├─ Meta-learner training
 └─ Final prediction aggregation
        ↓
Submission (AUC 0.808, Top 5%)
```
8.2 Key Technical Insights#
Data Architecture:
- Relational database schemas require systematic aggregation strategies
- One-to-many relationships necessitate careful feature extraction to prevent information loss
- Temporal sequences provide richer signal than static snapshots
Feature Engineering:
- Domain knowledge fundamentally constrains feature construction possibilities
- Aggregation function selection (mean vs. median vs. max) encodes distinct business assumptions
- Time-windowed features capture behavioral trends that full-history averages smooth away
Modeling Strategy:
- Gradient boosting remains the state-of-the-art for structured data prediction
- Cross-validation serves dual purposes: stability assessment and ensemble preparation
- Stacking ensembles provide consistent, significant performance improvements
8.3 Reproducible Engineering Practices#
- Pipeline Architecture: Modular design enables component testing and replacement
- Configuration Management: Centralized parameter specification facilitates experiment tracking
- Development Mode: Subsampling strategies (--dev_mode) accelerate iteration cycles
- Experiment Tracking: Systematic logging prevents redundant computations
8.4 Future Research Directions#
Near-term Optimizations (1-2 weeks):
- Feature interaction exploration beyond manual specification
- Weighted ensemble construction optimized via validation performance
- Hyperparameter search space refinement
Medium-term Extensions (1-2 months):
- Deep learning feature extraction (autoencoder representations)
- Graph neural networks for relational data modeling
- Model interpretability analysis (SHAP value decomposition)
Long-term Investigations (3+ months):
- Online learning systems for distribution drift adaptation
- A/B testing frameworks for production model validation
- Federated learning architectures for cross-institutional collaboration
8.5 Recommended Resources#
Seminal Publications:
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD.
- Ke, G., et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NIPS.
- Prokhorenkova, L., et al. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS.
Conclusion#
This comprehensive analysis of the Home Credit Default Risk competition solution demonstrates the systematic application of machine learning methodology to real-world credit risk assessment. The project’s value extends beyond the achieved AUC score (0.808) to encompass:
- Engineering Discipline: Pipeline architectures ensure reproducibility and maintainability
- Data-Centric Approach: Exploratory analysis directly informs feature engineering decisions
- Methodical Optimization: Progressive improvement from single models to sophisticated ensembles
Fundamental Principle:
“Feature engineering determines the theoretical performance ceiling; machine learning algorithms merely approximate this ceiling. Investment in data understanding and feature construction consistently outperforms hyperparameter tuning alone.”
The methodologies presented herein transfer directly to related domains:
- Insurance fraud detection
- Marketing response modeling
- Customer churn prediction
- Credit scoring system development
This tutorial presents a complete technical analysis of the Kaggle Home Credit Default Risk competition solution. The open-source implementation is available on GitHub.