Abstract: This article presents a systematic analysis of the Kaggle Home Credit Default Risk competition solution, detailing the complete machine learning pipeline from data preprocessing through feature engineering to model ensemble techniques. We examine the architectural decisions, implementation strategies, and performance optimization methods that achieved competitive results in this large-scale credit risk prediction task. The methodology encompasses data quality assessment, sophisticated feature extraction from relational databases, gradient boosting model optimization, and stacking ensemble strategies. Our analysis provides actionable insights for practitioners working on similar structured data prediction problems in financial risk assessment.

Keywords: Credit Risk Modeling, Gradient Boosting, Feature Engineering, Model Stacking, LightGBM, XGBoost, CatBoost, Machine Learning Pipeline


1. Introduction and Business Context

1.1 Problem Domain and Motivation

The democratization of financial services has emerged as a critical challenge in global economic development. Traditional credit assessment mechanisms rely heavily on conventional credit histories, stable employment records, and collateral assets—criteria that systematically exclude approximately 1.7 billion adults worldwide who lack access to formal banking services (World Bank Global Findex Database, 2017).

Home Credit Group, an international non-banking financial institution operating across 10+ countries, addresses this market gap by serving the “credit invisible” population—individuals without established credit histories who are typically rejected by conventional banking institutions. The fundamental business challenge involves making accurate default predictions within a constrained timeframe (approximately 5 minutes) using alternative data sources.

1.2 The Credit Risk Prediction Task

Task Formulation: Binary classification for probability of default (PD)

\[ Y_i = \begin{cases} 1, & \text{if client } i \text{ defaults (90+ days past due)} \\\\ 0, & \text{if client } i \text{ repays as scheduled} \end{cases} \]

Evaluation Metric: Receiver Operating Characteristic - Area Under Curve (ROC-AUC)

The AUC metric is selected for its robustness to class imbalance and its focus on ranking capability rather than absolute probability calibration:

\[ \text{AUC} = \int_0^1 \text{TPR}(\tau) \, d(\text{FPR}(\tau)) \]

where \(\text{TPR}\) denotes True Positive Rate and \(\text{FPR}\) denotes False Positive Rate at threshold \(\tau\).
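The ranking interpretation is worth making concrete: AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A minimal pure-NumPy check (the helper `auc_rank` is illustrative, not part of the solution code):

```python
import numpy as np

def auc_rank(y_true, scores):
    """AUC as P(score_pos > score_neg) over all positive/negative pairs, ties counted 1/2."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = [0, 0, 1, 0, 1]
s = np.array([0.1, 0.4, 0.35, 0.2, 0.8])

print(auc_rank(y, s))           # 5 of 6 pos/neg pairs ranked correctly -> 0.8333...
print(auc_rank(y, 10 * s - 3))  # identical: AUC is invariant to monotone transforms
```

The second call illustrates why AUC measures ranking rather than calibration: any strictly increasing rescaling of the scores leaves it unchanged.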

Metric Interpretation:

  • AUC = 0.50: Random prediction (no discriminative power)
  • AUC ∈ [0.60, 0.70]: Poor performance
  • AUC ∈ [0.70, 0.80]: Acceptable performance
  • AUC ∈ [0.80, 0.90]: Good performance
  • AUC > 0.90: Excellent performance (requires overfitting verification)

1.3 Alternative Data Sources in Credit Assessment

The competition dataset exemplifies the paradigm shift toward alternative credit scoring, utilizing non-traditional data modalities:

| Traditional Data | Alternative Proxy | Data Provider |
|---|---|---|
| Credit Score | Mobile phone recharge patterns, call duration | Telecommunication operators |
| Income Verification | POS transaction records, rental payment history | Payment processors, property platforms |
| Bank Statements | Installment payment history, credit card statements | Consumer finance companies |
| Employment Verification | E-commerce activity, social media engagement | Internet platforms |

1.4 Competition Outcomes and Methodological Impact

The 2018 Home Credit Default Risk competition attracted 7,194 participating teams globally. Top-performing solutions demonstrated significant methodological innovations:

  1. Feature Engineering: Evolution from basic aggregations to sophisticated temporal window features and trend analysis
  2. Ensemble Architectures: Systematic application of multi-level stacking strategies
  3. Data Preprocessing: Advanced techniques for missing value imputation and outlier treatment

These methodologies have demonstrated transferability to related domains including insurance fraud detection, marketing response prediction, and customer churn modeling.


2. Dataset Architecture and Schema Analysis

2.1 Data Volume and Relational Structure

The dataset comprises seven interconnected relational tables with an aggregate volume exceeding 50 million records, representing a complex relational database schema typical of enterprise financial systems.

Database Schema Architecture

Table Summary Statistics:

| Table | Row Count | Storage Size | Description | Primary Key | Foreign Key |
|---|---|---|---|---|---|
| application_{train,test} | 307,511 / 48,744 | 45MB / 7MB | Primary application table | SK_ID_CURR | — |
| bureau | 1,716,428 | 172MB | Credit bureau records | SK_ID_BUREAU | SK_ID_CURR |
| bureau_balance | 27,299,925 | 574MB | Monthly bureau status | — | SK_ID_BUREAU |
| previous_application | 1,670,214 | 150MB | Historical applications | SK_ID_PREV | SK_ID_CURR |
| installments_payments | 13,605,401 | 730MB | Installment payment records | — | SK_ID_PREV |
| POS_CASH_balance | 10,001,358 | 970MB | POS cash loan statements | — | SK_ID_PREV |
| credit_card_balance | 3,840,312 | 400MB | Credit card statements | — | SK_ID_PREV |

2.2 Entity Relationship Model

The database schema follows a hierarchical relational structure with three primary identifier domains:

SK_ID_CURR: Client-level identifier (primary entity key)
SK_ID_PREV: Previous application identifier (transaction-level)
SK_ID_BUREAU: External credit bureau record identifier

Relationship Topology:

application [1] ───<N>─── bureau [1] ───<N>─── bureau_balance
     ├─<N>─── previous_application [1] ───<N>─── installments_payments
     │                                    ├─<N>─── POS_CASH_balance
     │                                    └─<N>─── credit_card_balance
     └─<N>─── credit_card_balance

The one-to-many (1:N) cardinality relationships necessitate aggregation operations during feature engineering to transform temporal and multi-record observations into static feature vectors suitable for machine learning models.
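In pandas terms, each 1:N relationship is collapsed with a `groupby` on the client key followed by aggregation; a toy sketch using the schema's column names:

```python
import pandas as pd

# Two bureau records for client 100, one for client 200 (toy data)
bureau = pd.DataFrame({
    'SK_ID_CURR':     [100, 100, 200],
    'AMT_CREDIT_SUM': [5000.0, 3000.0, 7000.0],
})

# Collapse the 1:N records into one static feature row per client
flat = bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].agg(['count', 'sum', 'mean'])
# client 100 -> count 2, sum 8000, mean 4000; client 200 -> count 1, sum 7000, mean 7000
```

Section 5 develops this aggregation pattern in full; the point here is simply that every subsidiary table must eventually pass through such a collapse before joining back onto `SK_ID_CURR`.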

2.3 Detailed Field Specifications

2.3.1 Application Table (Primary Entity)

The application table serves as the central entity containing target labels for the training partition.

Demographic and Application Features:

SK_ID_CURR: Integer (primary identifier)
TARGET: Binary (0=non-default, 1=default) - training set only
CODE_GENDER: Categorical (M/F/XNA)
FLAG_OWN_CAR: Binary (Y/N)
FLAG_OWN_REALTY: Binary (Y/N)
CNT_CHILDREN: Integer (count of children)
AMT_INCOME_TOTAL: Float (annual income in local currency)
AMT_CREDIT: Float (loan amount requested)
AMT_ANNUITY: Float (monthly installment amount)
AMT_GOODS_PRICE: Float (price of goods being financed)

Temporal Features (encoded as days relative to application date, negative values indicate past):

DAYS_BIRTH: Integer (age in days, e.g., -10,000 ≈ 27.4 years)
DAYS_EMPLOYED: Integer (employment duration, special value 365243 indicates unemployed)
DAYS_REGISTRATION: Integer (registration change recency)
DAYS_ID_PUBLISH: Integer (identity document issuance recency)
DAYS_LAST_PHONE_CHANGE: Integer (mobile phone change recency)

External Score Features (highly predictive normalized scores):

EXT_SOURCE_1: Float [0,1] (normalized external score 1)
EXT_SOURCE_2: Float [0,1] (normalized external score 2)
EXT_SOURCE_3: Float [0,1] (normalized external score 3)
# Sourced from third-party credit bureaus

2.3.2 Bureau Table (External Credit History)

Records of client’s credit relationships with external financial institutions.

SK_ID_CURR: Integer (foreign key to application)
SK_ID_BUREAU: Integer (unique bureau record identifier)
CREDIT_ACTIVE: Categorical (Active/Closed/Sold/Demand/Bad debt)
CREDIT_CURRENCY: Categorical (currency code)
DAYS_CREDIT: Integer (days since credit application)
CREDIT_DAY_OVERDUE: Integer (current days past due)
DAYS_CREDIT_ENDDATE: Integer (remaining duration to maturity)
DAYS_ENDDATE_FACT: Integer (actual closure date)
AMT_CREDIT_MAX_OVERDUE: Float (maximum historical overdue amount)
CNT_CREDIT_PROLONG: Integer (count of credit prolongations)
AMT_CREDIT_SUM: Float (total credit exposure)
AMT_CREDIT_SUM_DEBT: Float (outstanding debt)
AMT_CREDIT_SUM_LIMIT: Float (credit limit)
AMT_CREDIT_SUM_OVERDUE: Float (current overdue amount)
CREDIT_TYPE: Categorical (loan type: consumer, mortgage, etc.)
DAYS_CREDIT_UPDATE: Integer (recency of bureau update)
AMT_ANNUITY: Float (monthly payment obligation)

2.3.3 Bureau Balance Table (Temporal Bureau Status)

Monthly status snapshots for each bureau record enabling trend analysis.

SK_ID_BUREAU: Integer (foreign key to bureau)
MONTHS_BALANCE: Integer (relative month index, -1=last month, -2=two months ago)
STATUS: Categorical encoding:
    '0': Current (no delinquency)
    '1': 1-29 days past due
    '2': 30-59 days past due
    '3': 60-89 days past due
    '4': 90-119 days past due
    '5': 120-149 days past due
    'C': Closed (paid off)
    'X': Status unknown
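To make these codes aggregable, a common preprocessing choice (an assumption here, not stated in the source) is to map STATUS onto a numeric delinquency scale, treating 'C' and 'X' as no known delinquency, and then take per-loan maxima:

```python
import pandas as pd

# Ordinal encoding of bureau_balance STATUS: digits index DPD buckets, 'C'/'X' -> 0
# (treating closed/unknown as non-delinquent is an assumed convention)
STATUS_TO_DPD_BUCKET = {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, 'C': 0, 'X': 0}

bb = pd.DataFrame({
    'SK_ID_BUREAU':   [101, 101, 101, 102, 102],
    'MONTHS_BALANCE': [-3, -2, -1, -2, -1],
    'STATUS':         ['0', '2', 'C', 'X', '1'],
})
bb['DPD_BUCKET'] = bb['STATUS'].map(STATUS_TO_DPD_BUCKET)

# Worst delinquency bucket ever observed per bureau loan
worst = bb.groupby('SK_ID_BUREAU')['DPD_BUCKET'].max()
# loan 101 -> bucket 2 (30-59 DPD at its worst), loan 102 -> bucket 1
```

The resulting per-loan ordinal can then be aggregated up to `SK_ID_CURR` like any other numeric column.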

2.3.4 Previous Application Table

Historical applications within the Home Credit system.

SK_ID_CURR: Integer (foreign key)
SK_ID_PREV: Integer (unique previous application identifier)
NAME_CONTRACT_TYPE: Categorical (Cash/Consumer/Revolving loans)
AMT_ANNUITY: Float (proposed monthly payment)
AMT_APPLICATION: Float (requested amount)
AMT_CREDIT: Float (approved amount)
AMT_DOWN_PAYMENT: Float (down payment amount)
RATE_DOWN_PAYMENT: Float (down payment ratio)
RATE_INTEREST_PRIMARY: Float (primary interest rate)
RATE_INTEREST_PRIVILEGED: Float (preferential interest rate)
NAME_CONTRACT_STATUS: Categorical (Approved/Refused/Canceled/Unused)
DAYS_DECISION: Integer (days since decision)
CODE_REJECT_REASON: Categorical (rejection reason if applicable)
NAME_CLIENT_TYPE: Categorical (New/Repeat customer)
CNT_PAYMENT: Integer (proposed term in months)

2.3.5 Installments Payments Table

Granular repayment transaction records.

SK_ID_CURR: Integer (foreign key)
SK_ID_PREV: Integer (foreign key to previous_application)
NUM_INSTALMENT_VERSION: Integer (version of installment schedule)
NUM_INSTALMENT_NUMBER: Integer (installment sequence number)
DAYS_INSTALMENT: Integer (scheduled payment date)
DAYS_ENTRY_PAYMENT: Integer (actual payment date)
AMT_INSTALMENT: Float (scheduled amount)
AMT_PAYMENT: Float (actual amount paid)

Derived Metrics:

  • Days Past Due (DPD): \(\text{DPD} = \text{DAYS\_ENTRY\_PAYMENT} - \text{DAYS\_INSTALMENT}\)
  • Payment Deviation: \(\Delta \text{AMT} = \text{AMT\_PAYMENT} - \text{AMT\_INSTALMENT}\)
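Both derived metrics are vectorized one-liners on the installment table; a sketch with toy rows (clipping early payments to zero DPD is a common refinement and an assumption here):

```python
import pandas as pd

inst = pd.DataFrame({
    'DAYS_INSTALMENT':    [-100, -60, -30],  # scheduled payment dates
    'DAYS_ENTRY_PAYMENT': [-98,  -60, -35],  # actual payment dates
    'AMT_INSTALMENT':     [500.0, 500.0, 500.0],
    'AMT_PAYMENT':        [500.0, 450.0, 500.0],
})

# Positive DPD means the payment arrived late; early payments are clipped to 0
inst['DPD'] = (inst['DAYS_ENTRY_PAYMENT'] - inst['DAYS_INSTALMENT']).clip(lower=0)
# Negative deviation means the client underpaid the scheduled amount
inst['PAYMENT_DIFF'] = inst['AMT_PAYMENT'] - inst['AMT_INSTALMENT']
# DPD -> [2, 0, 0]; PAYMENT_DIFF -> [0.0, -50.0, 0.0]
```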

2.4 Data Quality Profile

Class Distribution:

Class 0 (Non-default): 282,686 observations (91.93%)
Class 1 (Default):      24,825 observations (8.07%)
Imbalance Ratio: 11.4:1

Missing Data Summary:

  • EXT_SOURCE_1: 56.38% missing
  • EXT_SOURCE_3: 19.83% missing
  • AMT_ANNUITY: 0.003% missing
  • OCCUPATION_TYPE: 31.35% missing

Anomalous Encodings:

  • DAYS_EMPLOYED = 365,243 (~1,000 years): Sentinel value for unemployment
  • CODE_GENDER = ‘XNA’: Unspecified gender category
  • AMT_INCOME_TOTAL: Extreme outlier at 117,000,000 (potential data error)

3. System Architecture and Pipeline Design

3.1 Framework Selection: Steppy Pipeline Architecture

The solution implements the Steppy framework, a lightweight machine learning pipeline library designed for modular, reproducible data science workflows. Steppy adopts design principles from workflow orchestration systems such as Apache Airflow and Spotify’s Luigi, adapted for the machine learning domain.

Steppy Framework Architecture

Motivation for Pipeline Frameworks:

Traditional imperative machine learning code suffers from several architectural limitations:

# Anti-pattern: Tightly coupled workflow
data = load_data()
data = clean_data(data)
data = extract_features(data)
X_train, X_test, y_train, y_test = split_data(data)
model = train_model(X_train, y_train)
predictions = model.predict(X_test)

Identified Deficiencies:

  1. Coupling: Modifications to one stage require understanding of downstream dependencies
  2. Reproducibility: Intermediate results cannot be cached or versioned
  3. Parallelization: Sequential execution prevents computational resource optimization
  4. Experiment Tracking: Parameter variations are difficult to systematically compare

Steppy’s Declarative Approach:

from steppy.base import BaseTransformer

class DataLoader(BaseTransformer):
    """Loads raw data from persistent storage."""
    
    def transform(self, filepath):
        data = pd.read_csv(filepath)
        return {'data': data}

class DataCleaningTransformer(BaseTransformer):
    """Applies data quality transformations."""
    
    def transform(self, data):
        cleaned = self._handle_outliers(data)
        cleaned = self._impute_missing(cleaned)
        return {'cleaned_data': cleaned}
    
    def _handle_outliers(self, df):
        # Implementation
        pass

class FeatureExtractionTransformer(BaseTransformer):
    """Engineers features from cleaned data."""
    
    def transform(self, cleaned_data):
        features = self._aggregate_features(cleaned_data)
        return {'features': features}

Design Principles:

  1. Standardized Interfaces: All components inherit from BaseTransformer with fit() and transform() methods
  2. Explicit Data Flow: Dictionary-based I/O with named keys for traceability
  3. Composability: Steps connect via Step and Adapter abstractions
  4. Persistence: Intermediate artifacts support caching and checkpointing
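These principles make composition a matter of threading named dictionaries between steps. A self-contained toy illustrating the pattern (this `BaseTransformer` is a minimal stand-in for Steppy's, which additionally handles fitting, persistence, and caching):

```python
import pandas as pd

class BaseTransformer:
    """Minimal stand-in for Steppy's BaseTransformer (no persistence/caching)."""
    def fit(self, **kwargs):
        return self
    def transform(self, **kwargs):
        raise NotImplementedError

class SentinelCleaner(BaseTransformer):
    """Mirrors the cleaning stage: replace the DAYS_EMPLOYED sentinel with NaN."""
    def transform(self, data):
        cleaned = data.replace({'DAYS_EMPLOYED': {365243: float('nan')}})
        return {'cleaned_data': cleaned}

class RatioFeature(BaseTransformer):
    """Mirrors the feature stage: derive a credit-to-income ratio."""
    def transform(self, cleaned_data):
        feats = cleaned_data.copy()
        feats['CREDIT_TO_INCOME'] = feats['AMT_CREDIT'] / feats['AMT_INCOME_TOTAL']
        return {'features': feats}

df = pd.DataFrame({'DAYS_EMPLOYED':     [-1000, 365243],
                   'AMT_CREDIT':       [200000.0, 100000.0],
                   'AMT_INCOME_TOTAL': [100000.0, 50000.0]})

# Dictionary keys name each artifact, so the data flow is explicit and traceable
step1 = SentinelCleaner().transform(data=df)
step2 = RatioFeature().transform(**step1)
```

Because every step consumes and produces named artifacts, any intermediate dictionary can be cached to disk and replayed, which is what enables Steppy's checkpointing.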

3.2 End-to-End Pipeline Flow

Complete Data Processing Pipeline

Stage 1: Data Ingestion and Cleaning

def build_data_ingestion_pipeline(config):
    """Constructs the data loading and cleaning pipeline stage."""
    
    # Load all seven tables
    raw_data = DataLoader(config.data_paths).transform()
    
    # Apply table-specific cleaning transformers
    cleaning_transformers = {
        'application': ApplicationCleaning(),
        'bureau': BureauCleaning(),
        'bureau_balance': BureauBalanceCleaning(),
        'previous_application': PreviousApplicationCleaning(),
        'installments_payments': InstallmentPaymentsCleaning(),
        'pos_cash_balance': PosCashBalanceCleaning(),
        'credit_card_balance': CreditCardBalanceCleaning()
    }
    
    cleaned_data = {
        table: transformer.transform(raw_data[table])
        for table, transformer in cleaning_transformers.items()
    }
    
    return cleaned_data

Stage 2: Feature Engineering

def build_feature_engineering_pipeline(cleaned_data):
    """Constructs the feature extraction pipeline stage."""
    
    # Table-specific feature extraction
    bureau_features = BureauFeatureExtractor().transform(
        cleaned_data['bureau'], 
        cleaned_data['bureau_balance']
    )
    
    prev_app_features = PreviousApplicationFeatureExtractor().transform(
        cleaned_data['previous_application']
    )
    
    installment_features = InstallmentFeatureExtractor().transform(
        cleaned_data['installments_payments']
    )
    
    # Feature consolidation
    all_features = FeatureConcatenator().transform([
        cleaned_data['application'],
        bureau_features,
        prev_app_features,
        installment_features
    ])
    
    # Categorical encoding
    encoded_features = CategoricalEncoder().transform(
        all_features,
        method='target_encoding'
    )
    
    return {
        'features': encoded_features,
        'target': cleaned_data['application']['TARGET'],
        'feature_names': encoded_features.columns.tolist()
    }

Stage 3: Model Training

def build_model_training_pipeline(feature_data, config):
    """Constructs the model training pipeline stage."""
    
    # Train/validation split
    X_train, X_valid, y_train, y_valid = train_test_split(
        feature_data['features'],
        feature_data['target'],
        test_size=0.2,
        stratify=feature_data['target'],
        random_state=config.random_seed
    )
    
    # Model initialization and training
    model = GradientBoostingModel(config.model_params)
    model.fit(X_train, y_train, validation_data=(X_valid, y_valid))
    
    # Performance evaluation
    validation_auc = roc_auc_score(y_valid, model.predict(X_valid))
    
    return {
        'model': model,
        'validation_auc': validation_auc,
        'feature_importance': model.feature_importances_
    }

Stage 4: Ensemble Construction (Stacking)

def build_stacking_ensemble(base_models, meta_learner, X, y, X_test):
    """
    Implements two-level stacking ensemble architecture.
    
    Level 1: Base learners generate out-of-fold predictions
    Level 2: Meta-learner trains on base model outputs
    """
    
    # Generate OOF predictions
    oof_predictions = {}
    test_predictions = {}
    
    for name, model in base_models.items():
        oof_pred, test_pred = generate_oof_predictions(
            model, X, y, X_test, n_folds=5
        )
        oof_predictions[name] = oof_pred
        test_predictions[name] = test_pred
    
    # Train meta-learner
    meta_features = np.column_stack([
        oof_predictions[name] for name in base_models.keys()
    ])
    
    meta_learner.fit(meta_features, y)
    
    # Generate final predictions
    meta_test_features = np.column_stack([
        test_predictions[name] for name in base_models.keys()
    ])
    final_predictions = meta_learner.predict_proba(meta_test_features)[:, 1]
    
    return final_predictions
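The helper `generate_oof_predictions` is called but not shown above; a hedged sketch of the standard out-of-fold scheme it implies (stratified folds built by hand in NumPy, so any object exposing `fit`/`predict_proba` plugs in; the solution's actual implementation may differ in detail):

```python
import numpy as np

def generate_oof_predictions(model, X, y, X_test, n_folds=5, seed=0):
    """Out-of-fold scheme: every training row is predicted by a model that
    never saw it; test predictions are averaged across the fold models."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)

    # Stratified fold assignment: shuffle within each class, deal out fold ids
    fold_id = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])
        fold_id[idx] = np.arange(len(idx)) % n_folds

    oof = np.zeros(len(y))
    test_pred = np.zeros(len(X_test))
    for k in range(n_folds):
        train_mask = fold_id != k
        model.fit(X[train_mask], y[train_mask])          # refit from scratch per fold
        oof[~train_mask] = model.predict_proba(X[~train_mask])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_folds
    return oof, test_pred
```

The crucial property is that `oof` contains no leakage: the meta-learner in `build_stacking_ensemble` trains on predictions generated strictly out-of-fold, which is what keeps the second stacking level honest.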

3.3 Project Directory Organization

open-solution-home-credit/
├── src/                          # Source code modules
│   ├── __init__.py
│   ├── pipeline_manager.py       # Orchestration layer
│   ├── pipelines.py              # Pipeline definitions (9 model configurations)
│   ├── pipeline_blocks.py        # Step factory methods
│   ├── feature_extraction.py     # Feature transformers (20+ implementations)
│   ├── data_cleaning.py          # Data quality transformers (7 table-specific)
│   ├── models.py                 # Model wrappers (LGB/XGB/CTB/NN/RF/LR)
│   ├── pipeline_config.py        # Configuration constants and hyperparameters
│   ├── hyperparameter_tuning.py  # Optimization strategies
│   ├── callbacks.py              # Training monitoring callbacks
│   ├── utils.py                  # Utility functions
│   └── neptune_hacks.py          # Offline experiment tracking support
├── configs/                      # Configuration files
│   └── neptune.yaml              # Main configuration (paths/hyperparameters)
├── data/                         # Data directory
│   ├── raw/                      # Original competition data
│   └── workdir/                  # Intermediate processing artifacts
├── notebooks/                    # Exploratory data analysis
├── blog/                         # Documentation
│   └── images/                   # Visualization assets
├── main.py                       # CLI entry point
├── requirements.txt              # Dependency specification
└── README.md                     # Project documentation

3.4 Configuration Management

The project implements a hybrid configuration strategy combining YAML for experiment parameters and Python modules for code-level constants.

Experiment Configuration (configs/neptune.yaml):

parameters:
  # Data paths
  train_filepath: /data/application_train.csv
  test_filepath: /data/application_test.csv
  
  # Model selection
  pipeline_name: lightGBM
  
  # Feature toggles
  use_application: true
  use_bureau: true
  use_bureau_balance: true
  use_previous_application: true
  use_installments_payments: true
  use_pos_cash_balance: true
  use_credit_card_balance: true
  
  # LightGBM hyperparameters
  lgbm__objective: binary
  lgbm__metric: auc
  lgbm__num_leaves: 35
  lgbm__learning_rate: 0.02
  lgbm__n_estimators: 5000
  lgbm__min_child_samples: 70
  lgbm__subsample: 1.0
  lgbm__colsample_bytree: 0.03
  lgbm__reg_lambda: 100.0
  lgbm__reg_alpha: 0.0
  
  # Cross-validation configuration
  n_cv_splits: 5
  validation_size: 0.2
  stratified_cv: true
  shuffle: true
  random_seed: 90210

Code-Level Configuration (src/pipeline_config.py):

"""Constants and aggregation recipes for feature engineering."""

import numpy as np

# Reproducibility constants
RANDOM_SEED = 90210
DEV_SAMPLE_SIZE = 1000

# Column type definitions
CATEGORICAL_COLUMNS = [
    'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
    'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
    'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
    'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
    'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'
]

NUMERICAL_COLUMNS = [
    'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
    'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
    'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3'
]

# Aggregation recipes for feature extraction
BUREAU_AGGREGATION_RECIPES = [
    (['SK_ID_CURR'], [
        ('SK_ID_BUREAU', 'count'),
        ('AMT_CREDIT_SUM', ['sum', 'mean', 'max', 'std']),
        ('AMT_CREDIT_SUM_DEBT', ['sum', 'mean']),
        ('AMT_CREDIT_SUM_OVERDUE', ['sum', 'mean', 'max']),
        ('DAYS_CREDIT', ['min', 'max', 'mean']),
        ('CREDIT_DAY_OVERDUE', ['sum', 'max', 'mean']),
        ('CNT_CREDIT_PROLONG', 'sum')
    ])
]

PREVIOUS_APPLICATION_AGGREGATION_RECIPES = [
    (['SK_ID_CURR'], [
        ('SK_ID_PREV', 'count'),
        ('AMT_APPLICATION', ['sum', 'mean', 'max']),
        ('AMT_CREDIT', ['sum', 'mean', 'max']),
        ('AMT_DOWN_PAYMENT', ['sum', 'mean']),
        ('RATE_INTEREST_PRIMARY', ['mean', 'max']),
        ('DAYS_DECISION', ['min', 'max', 'mean'])
    ])
]
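These recipe tuples translate mechanically into `groupby(...).agg(...)` calls. A sketch of a helper that expands them (the `apply_recipes` function is ours for illustration; the project wires an equivalent step into its pipeline):

```python
import pandas as pd

def apply_recipes(df, recipes, prefix=''):
    """Expand (groupby_cols, [(col, specs), ...]) recipes into flat feature frames.
    Illustrative helper, not the project's actual aggregation step."""
    frames = []
    for groupby_cols, specs in recipes:
        agg = df.groupby(groupby_cols).agg({col: spec for col, spec in specs})
        # Flatten possible MultiIndex columns into e.g. bureau_AMT_CREDIT_SUM_sum
        agg.columns = [
            prefix + (c if isinstance(c, str) else '_'.join(c).rstrip('_'))
            for c in agg.columns
        ]
        frames.append(agg)
    return pd.concat(frames, axis=1)

bureau = pd.DataFrame({
    'SK_ID_CURR':     [1, 1, 2],
    'AMT_CREDIT_SUM': [5000.0, 3000.0, 7000.0],
})
recipes = [(['SK_ID_CURR'], [('AMT_CREDIT_SUM', ['sum', 'mean'])])]
feats = apply_recipes(bureau, recipes, prefix='bureau_')
# columns: bureau_AMT_CREDIT_SUM_sum, bureau_AMT_CREDIT_SUM_mean
```

Keeping the recipes as data rather than code is what lets new aggregations be added by editing a list instead of writing a new transformer.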

This two-tier configuration approach provides:

  • Accessibility: YAML enables rapid experimentation without code modification
  • Validation: Python constants are importable, lintable, and checked at import time
  • Override Capability: command-line and environment-variable overrides are supported

4. Exploratory Data Analysis and Quality Assessment

4.1 EDA Methodology Framework

Exploratory Data Analysis (EDA) in this context follows a systematic investigative framework designed to answer five fundamental questions:

  1. Data Quality: What anomalies, missing values, or encoding inconsistencies exist?
  2. Distributional Properties: What are the central tendencies, dispersion, and shapes of feature distributions?
  3. Business Insights: Do population segments exhibit differential behaviors?
  4. Predictive Signals: Which features demonstrate statistical association with the target variable?
  5. Feature Engineering Direction: What transformations or aggregations might improve predictive power?

4.2 Critical Findings

4.2.1 Class Imbalance Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load training data
train_df = pd.read_csv('data/application_train.csv')

# Class distribution analysis
target_distribution = train_df['TARGET'].value_counts()
print("Class Distribution:")
print(target_distribution)
print(f"\nClass Proportions:")
print(train_df['TARGET'].value_counts(normalize=True))

# Output:
# Class Distribution:
# 0    282686
# 1     24825
# Name: TARGET, dtype: int64
#
# Class Proportions:
# 0    0.919271
# 1    0.080729
# Name: TARGET, dtype: float64

Interpretation: The 11.4:1 imbalance ratio necessitates careful metric selection. Accuracy would be misleading (91.9% accuracy achievable by predicting majority class), motivating the use of AUC and precision-recall metrics.

4.2.2 External Score Predictive Power

# Analyze EXT_SOURCE features
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, col in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    # Distribution comparison
    sns.kdeplot(
        data=train_df[train_df['TARGET'] == 0][col].dropna(),
        label='Non-default',
        ax=axes[idx],
        fill=True,
        alpha=0.5
    )
    sns.kdeplot(
        data=train_df[train_df['TARGET'] == 1][col].dropna(),
        label='Default',
        ax=axes[idx],
        fill=True,
        alpha=0.5
    )
    axes[idx].set_title(f'{col} Distribution by Target')
    axes[idx].legend()

plt.tight_layout()
plt.savefig('images/ext_source_kde.png', dpi=150)

Key Observations:

  • Defaulting clients demonstrate consistently lower external scores
  • EXT_SOURCE_1 exhibits the strongest discriminative power (highest Information Value)
  • Missing data rates vary: EXT_SOURCE_1 (56.4%), EXT_SOURCE_2 (0.2%), EXT_SOURCE_3 (19.8%)

4.2.3 Income Distribution Analysis

# Income distribution with log transformation
train_df['AMT_INCOME_TOTAL_LOG'] = np.log1p(train_df['AMT_INCOME_TOTAL'])

# Descriptive statistics
print(train_df['AMT_INCOME_TOTAL'].describe())

# Detect extreme outliers
q99 = train_df['AMT_INCOME_TOTAL'].quantile(0.99)
extreme_outliers = train_df[train_df['AMT_INCOME_TOTAL'] > q99 * 10]
print(f"\nExtreme outliers (>10x 99th percentile): {len(extreme_outliers)}")

Statistical Summary:

count    3.075110e+05
mean     1.687979e+05
std      2.371894e+05
min      2.565000e+04
25%      1.125000e+05
50%      1.471500e+05
75%      2.025000e+05
max      1.170000e+08  # Data quality concern

Implications: The right-skewed distribution (coefficient of skewness ≈ 3.2) motivates log-transformation. The extreme maximum (117M vs. mean 168K) suggests potential data entry errors requiring treatment.
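The effect of the log transform on skew is easy to verify numerically; a sketch on a synthetic lognormal income sample (the skewness of ≈ 3.2 quoted above refers to the real data, not this toy draw):

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized central moment."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

rng = np.random.default_rng(42)
# Synthetic right-skewed "income": exp of a normal draw (lognormal)
income = np.exp(rng.normal(loc=12.0, scale=0.5, size=50_000))

raw_skew = skewness(income)            # strongly positive
log_skew = skewness(np.log1p(income))  # approximately symmetric
```

Because gradient boosting models split on rank order, the transform matters less for them than for linear or neural models, but it still stabilizes derived ratio features.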

4.2.4 Temporal Feature Analysis: Age and Default Risk

# Age calculation and risk profiling
train_df['AGE_YEARS'] = -train_df['DAYS_BIRTH'] / 365.25

# Binned analysis
train_df['AGE_BIN'] = pd.cut(
    train_df['AGE_YEARS'],
    bins=[0, 25, 30, 35, 40, 45, 50, 60, 100],
    labels=['<25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-60', '60+']
)

default_by_age = train_df.groupby('AGE_BIN')['TARGET'].agg(['mean', 'count'])
print(default_by_age)

# Visualization
plt.figure(figsize=(10, 6))
default_by_age['mean'].plot(kind='bar', color='steelblue')
plt.title('Default Rate by Age Cohort')
plt.xlabel('Age Group')
plt.ylabel('Default Rate')
plt.axhline(y=train_df['TARGET'].mean(), color='r', linestyle='--', label='Overall Average')
plt.legend()
plt.tight_layout()
plt.savefig('images/default_rate_by_age.png', dpi=150)

Findings: Default rates exhibit an inverse relationship with age, with clients under 25 showing approximately 2.5x higher default rates than those aged 40-50. This aligns with established credit risk theory regarding income stability and financial experience.

4.2.5 Employment Status Encoding Anomaly

# Investigate DAYS_EMPLOYED anomaly
anomaly_count = (train_df['DAYS_EMPLOYED'] == 365243).sum()
anomaly_rate = anomaly_count / len(train_df)

print(f"Anomalous DAYS_EMPLOYED (365243): {anomaly_count} ({anomaly_rate:.2%})")

# Compare default rates
train_df['EMPLOYMENT_STATUS'] = np.where(
    train_df['DAYS_EMPLOYED'] == 365243,
    'Unemployed/Unknown',
    'Employed'
)

employment_risk = train_df.groupby('EMPLOYMENT_STATUS')['TARGET'].mean()
print("\nDefault Rate by Employment Status:")
print(employment_risk)

Results:

Anomalous DAYS_EMPLOYED (365243): 55,374 (18.01%)

Default Rate by Employment Status:
Employed              0.0753
Unemployed/Unknown    0.1047

Interpretation: The value 365,243 functions as a sentinel encoding (equivalent to ~1,000 years), indicating unemployment or data unavailability. The elevated default rate (10.5% vs. 7.5%) among this cohort confirms the encoding’s business relevance.

4.3 Data Preprocessing Strategy

Based on EDA findings, we implement a systematic preprocessing pipeline:

from typing import Dict

class ApplicationDataCleaner(BaseTransformer):
    """
    Implements data quality transformations for the primary application table.
    """
    
    def transform(self, df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        df_cleaned = df.copy()
        
        # 1. Sentinel value treatment
        df_cleaned['DAYS_EMPLOYED'] = df_cleaned['DAYS_EMPLOYED'].replace(365243, np.nan)
        df_cleaned['CODE_GENDER'] = df_cleaned['CODE_GENDER'].replace('XNA', np.nan)
        
        # 2. Infinity value handling
        df_cleaned = df_cleaned.replace([np.inf, -np.inf], np.nan)
        
        # 3. Categorical missing value imputation
        categorical_columns = df_cleaned.select_dtypes(
            include=['object']
        ).columns
        df_cleaned[categorical_columns] = df_cleaned[
            categorical_columns
        ].fillna('Unknown')
        
        # 4. Numerical features: preserve missing values
        # Gradient boosting models handle missing values natively
        
        return {'application_cleaned': df_cleaned}

5. Feature Engineering Methodology

5.1 The Aggregation Problem

The fundamental feature engineering challenge in this dataset stems from the relational structure: clients possess multiple historical records across subsidiary tables (bureau, previous applications, payment histories), while predictive modeling requires a fixed-dimensional feature vector for each client.

Illustrative Example:

Client A - Bureau Records:
├─ Record 1: SK_ID_BUREAU=101, AMT_CREDIT_SUM=5000, DAYS_CREDIT=-365
├─ Record 2: SK_ID_BUREAU=102, AMT_CREDIT_SUM=3000, DAYS_CREDIT=-180
├─ Record 3: SK_ID_BUREAU=103, AMT_CREDIT_SUM=8000, DAYS_CREDIT=-90
└─ Record 4: SK_ID_BUREAU=104, AMT_CREDIT_SUM=2000, DAYS_CREDIT=-30

Required Transformation (Single Row):
- bureau_count: 4
- bureau_amt_sum: 18000
- bureau_amt_mean: 4500
- bureau_amt_max: 8000
- bureau_days_min: -365
- bureau_days_max: -30
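The required transformation is exactly a grouped aggregation; reproducing Client A's numbers from a toy frame matching the records listed:

```python
import pandas as pd

records = pd.DataFrame({
    'SK_ID_CURR':     [1, 1, 1, 1],
    'AMT_CREDIT_SUM': [5000, 3000, 8000, 2000],
    'DAYS_CREDIT':    [-365, -180, -90, -30],
})

# Named aggregation collapses four bureau records into one static feature row
row = records.groupby('SK_ID_CURR').agg(
    bureau_count=('AMT_CREDIT_SUM', 'count'),
    bureau_amt_sum=('AMT_CREDIT_SUM', 'sum'),
    bureau_amt_mean=('AMT_CREDIT_SUM', 'mean'),
    bureau_amt_max=('AMT_CREDIT_SUM', 'max'),
    bureau_days_min=('DAYS_CREDIT', 'min'),
    bureau_days_max=('DAYS_CREDIT', 'max'),
)
# count=4, sum=18000, mean=4500, max=8000, days_min=-365, days_max=-30
```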

5.2 Aggregation Methodology

Feature Aggregation Methodology

Aggregation Operators:

| Operator | Mathematical Definition | Use Case | Business Interpretation |
|---|---|---|---|
| COUNT | \(n = \lvert\{r_1, r_2, \dots, r_n\}\rvert\) | Record frequency | Number of loans, applications |
| SUM | \(\Sigma = \sum_{i=1}^{n} x_i\) | Total exposure | Cumulative debt, total payments |
| MEAN | \(\mu = \frac{1}{n}\sum_{i=1}^{n} x_i\) | Central tendency | Average loan amount |
| MEDIAN | \(\tilde{x} = Q_2(x)\) | Robust central tendency | Median income (outlier-resistant) |
| MAX/MIN | \(\max(x), \min(x)\) | Extremes | Largest loan, earliest record |
| STD | \(\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}\) | Variability | Income stability, payment consistency |
| NUNIQUE | \(\lvert\{x_1, x_2, \dots\}\rvert\) | Cardinality | Number of distinct lenders |

5.3 Table-Specific Feature Extraction

5.3.1 Bureau Feature Engineering

class BureauFeatureExtractor(BaseTransformer):
    """
    Extracts aggregated features from credit bureau records.
    """
    
    def transform(self, bureau: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        # Primary aggregations
        bureau_agg = bureau.groupby('SK_ID_CURR').agg({
            # Exposure metrics
            'SK_ID_BUREAU': 'count',
            'AMT_CREDIT_SUM': ['sum', 'mean', 'max', 'min', 'std'],
            'AMT_CREDIT_SUM_DEBT': ['sum', 'mean', 'max'],
            'AMT_CREDIT_SUM_OVERDUE': ['sum', 'mean', 'max'],
            
            # Delinquency metrics
            'CNT_CREDIT_PROLONG': ['sum', 'mean'],
            'CREDIT_DAY_OVERDUE': ['sum', 'max', 'mean'],
            
            # Temporal metrics
            'DAYS_CREDIT': ['min', 'max', 'mean'],
            'DAYS_CREDIT_ENDDATE': ['min', 'max'],
            'DAYS_CREDIT_UPDATE': ['min', 'max'],
        })
        
        # Flatten multi-level columns
        bureau_agg.columns = [
            '_'.join(col).strip() 
            for col in bureau_agg.columns.values
        ]
        
        # Active credit subset analysis
        active_mask = bureau['CREDIT_ACTIVE'] == 'Active'
        active_loans = bureau[active_mask].groupby('SK_ID_CURR').agg({
            'AMT_CREDIT_SUM': ['sum', 'count'],
            'AMT_CREDIT_SUM_DEBT': 'sum',
        })
        active_loans.columns = [
            'bureau_active_' + '_'.join(col) 
            for col in active_loans.columns
        ]
        
        # Combine feature sets
        features = bureau_agg.join(active_loans, how='left')
        
        return {'bureau_features': features}

Generated Feature Examples:

{
    'SK_ID_BUREAU_count': 5,                    # Total credit relationships
    'AMT_CREDIT_SUM_sum': 45000,                # Total credit exposure
    'AMT_CREDIT_SUM_mean': 9000,                # Average loan size
    'AMT_CREDIT_SUM_max': 20000,                # Maximum single exposure
    'DAYS_CREDIT_min': -730,                    # Oldest relationship
    'DAYS_CREDIT_max': -30,                     # Most recent relationship
    'bureau_active_AMT_CREDIT_SUM_sum': 15000,  # Active exposure
    'bureau_active_SK_ID_BUREAU_count': 2,      # Number of active accounts
}

5.3.2 Previous Application Feature Engineering

class PreviousApplicationFeatureExtractor(BaseTransformer):
    """
    Extracts features from historical Home Credit applications.
    """
    
    def transform(self, prev_app: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        # Named status counters (avoids fragile '<lambda_0>' column labels)
        def n_approved(x):
            return (x == 'Approved').sum()
        
        def n_refused(x):
            return (x == 'Refused').sum()
        
        def n_canceled(x):
            return (x == 'Canceled').sum()
        
        # Core aggregations
        prev_agg = prev_app.groupby('SK_ID_CURR').agg({
            # Application frequency
            'SK_ID_PREV': 'count',
            
            # Approval metrics
            'NAME_CONTRACT_STATUS': [n_approved, n_refused, n_canceled],
            
            # Financial metrics
            'AMT_APPLICATION': ['sum', 'mean', 'max', 'min'],
            'AMT_CREDIT': ['sum', 'mean', 'max'],
            'AMT_DOWN_PAYMENT': ['sum', 'mean'],
            'AMT_ANNUITY': ['mean', 'max'],
            
            # Pricing metrics
            'RATE_INTEREST_PRIMARY': ['mean', 'max'],
            'RATE_DOWN_PAYMENT': ['mean', 'max'],
            
            # Temporal metrics
            'DAYS_DECISION': ['min', 'max', 'mean'],
        })
        
        # Flatten column structure before deriving ratios
        prev_agg.columns = [
            '_'.join(col).strip('_') if isinstance(col, tuple) else col
            for col in prev_agg.columns
        ]
        
        # Derived metrics
        prev_agg['approval_rate'] = (
            prev_agg['NAME_CONTRACT_STATUS_n_approved'] /
            prev_agg['SK_ID_PREV_count']
        )
        prev_agg['credit_to_application_ratio'] = (
            prev_agg['AMT_CREDIT_sum'] /
            prev_agg['AMT_APPLICATION_sum']
        )
        
        return {'previous_application_features': prev_agg}

Key Derived Features:

  • approval_rate: Historical approval probability
  • credit_to_application_ratio: Approved amount relative to requested
  • avg_down_payment_rate: Typical down payment behavior

5.3.3 Installment Payment Feature Engineering

class InstallmentFeatureExtractor(BaseTransformer):
    """
    Extracts repayment behavior features from installment records.
    """
    
    def transform(self, installments: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        # Work on a copy to avoid mutating the caller's DataFrame
        installments = installments.copy()
        
        # Derived metrics: days past due and payment over/underpayment
        installments['DPD'] = (
            installments['DAYS_ENTRY_PAYMENT'] - 
            installments['DAYS_INSTALMENT']
        )
        installments['AMT_DIFF'] = (
            installments['AMT_PAYMENT'] - 
            installments['AMT_INSTALMENT']
        )
        
        # Named counter (avoids fragile '<lambda_0>' column labels)
        def n_positive(x):
            return (x > 0).sum()
        
        # Aggregations
        install_agg = installments.groupby('SK_ID_CURR').agg({
            # Volume metrics
            'NUM_INSTALMENT_VERSION': 'count',
            
            # Delinquency metrics
            'DPD': ['mean', 'max', 'sum', n_positive],
            
            # Payment amount metrics
            'AMT_INSTALMENT': ['sum', 'mean', 'max'],
            'AMT_PAYMENT': ['sum', 'mean', 'max'],
            'AMT_DIFF': ['mean', 'sum', 'max', 'min', n_positive],
        })
        
        # Flatten columns
        install_agg.columns = [
            '_'.join(col).strip() 
            for col in install_agg.columns.values
        ]
        
        return {'installment_features': install_agg}

Critical Derived Metrics:

  • DPD_mean: Average days past due
  • DPD_max: Worst delinquency instance
  • AMT_DIFF_mean: Average payment deviation (overpayment/underpayment)

5.4 Temporal Window Features


Hypothesis: Recent behavioral patterns carry stronger predictive signal than historical averages.

class TemporalWindowFeatureExtractor(BaseTransformer):
    """
    Extracts time-windowed aggregations for trend analysis.
    """
    
    def transform(
        self, 
        data: pd.DataFrame, 
        time_col: str = 'MONTHS_BALANCE'
    ) -> Dict[str, pd.DataFrame]:
        
        window_sizes = [3, 6, 12, 24]  # months
        all_features = {}
        
        for window in window_sizes:
            # Subset to recent history
            recent_mask = data[time_col] >= -window
            recent_data = data[recent_mask]
            
            # Window-specific aggregations
            window_agg = recent_data.groupby('SK_ID_CURR').agg({
                'AMT_BALANCE': ['mean', 'max', 'sum'],
                'SK_ID_PREV': 'count',
            })
            
            # Flatten the MultiIndex and append the window suffix
            window_agg.columns = [
                f"{'_'.join(col)}_last_{window}m"
                for col in window_agg.columns
            ]
            
            all_features[f'window_{window}m'] = window_agg
        
        return all_features

5.5 Categorical Variable Encoding

Encoding Strategy Selection:

| Method | Appropriate For | Advantages | Disadvantages |
|---|---|---|---|
| Label Encoding | Ordinal categories (education level) | Simple, low dimensionality | Introduces false ordinality for nominal data |
| One-Hot Encoding | Low-cardinality nominal data (gender) | No ordinality assumption | Dimensionality explosion |
| Target Encoding | High-cardinality data (occupation, region) | Captures target relationship | Risk of overfitting |
| Frequency Encoding | High-cardinality identifiers | Simple, captures prevalence | Information loss |
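Frequency encoding, listed in the table but not part of the pipeline below, is simple enough to sketch inline (a minimal illustration; the example values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'OCCUPATION_TYPE': ['Laborers', 'Laborers', 'Core staff', 'Drivers']})

# Map each category to its relative frequency in the (training) data
freq = df['OCCUPATION_TYPE'].value_counts(normalize=True)
df['OCCUPATION_TYPE_FREQ'] = df['OCCUPATION_TYPE'].map(freq)

print(df['OCCUPATION_TYPE_FREQ'].tolist())  # [0.5, 0.5, 0.25, 0.25]
```

In practice the frequency map should be fitted on the training split only and reused on validation/test data, mirroring the fit/transform split of the encoders below.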

Implementation:

from category_encoders import TargetEncoder, OneHotEncoder

class CategoricalEncodingPipeline(BaseTransformer):
    """
    Applies appropriate encoding strategies by variable type.
    """
    
    def __init__(self):
        self.encoders = {}
        
    def fit(self, X: pd.DataFrame, y: pd.Series):
        # Target encoding for high-cardinality features
        high_cardinality = [
            'OCCUPATION_TYPE', 'ORGANIZATION_TYPE', 
            'NAME_FAMILY_STATUS'
        ]
        
        for col in high_cardinality:
            encoder = TargetEncoder(cols=[col], smoothing=10.0)
            encoder.fit(X[[col]], y)
            self.encoders[col] = encoder
            
        return self
    
    def transform(self, X: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        X_encoded = X.copy()
        
        for col, encoder in self.encoders.items():
            X_encoded[col] = encoder.transform(X[[col]])[col]
            
        return {'features_encoded': X_encoded}

5.6 Feature Selection

Objective: Reduce dimensionality, eliminate noise, improve training efficiency

from sklearn.feature_selection import mutual_info_classif, SelectKBest

class FeatureSelectionTransformer(BaseTransformer):
    """
    Selects top-K features based on mutual information.
    """
    
    def __init__(self, k: int = 500):
        self.k = k
        self.selector = None
        
    def fit(self, X: pd.DataFrame, y: pd.Series):
        self.selector = SelectKBest(
            score_func=mutual_info_classif,
            k=self.k
        )
        self.selector.fit(X, y)
        self.selected_features = X.columns[
            self.selector.get_support()
        ].tolist()
        return self
    
    def transform(self, X: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        X_selected = X[self.selected_features]
        return {
            'features': X_selected,
            'feature_names': self.selected_features
        }

6. Model Selection, Training, and Evaluation

6.1 Gradient Boosting Decision Trees

Theoretical Foundation:

Gradient Boosting constructs an additive ensemble of weak learners (typically decision trees) through functional gradient descent:

\[F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)\]

where $h_m(x)$ is the weak learner fitted to the pseudo-residuals:

\[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\]

Advantages for Tabular Data:

  1. Automatic Feature Interactions: Tree splits inherently model feature combinations
  2. Missing Value Handling: Native support without imputation requirements
  3. Non-linear Capacity: Captures complex decision boundaries
  4. Interpretability: Feature importance and partial dependence analysis
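The update rule above can be made concrete with a short from-scratch sketch for binary log loss, where the pseudo-residuals reduce to \(y_i - p_i\) and each weak learner is a shallow regression tree (an illustration of the mechanics only, not the optimized LightGBM/XGBoost implementations; dataset and parameters are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

nu = 0.1               # learning rate (shrinkage)
F = np.zeros(len(y))   # F_0: initial raw score (log-odds)
trees = []

for m in range(50):
    p = 1.0 / (1.0 + np.exp(-F))   # current probability estimate
    residuals = y - p              # pseudo-residuals: -dL/dF for log loss
    h = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    F += nu * h.predict(X)         # F_m = F_{m-1} + nu * h_m
    trees.append(h)

auc = roc_auc_score(y, F)
print(f'Training AUC after 50 rounds: {auc:.3f}')
```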

6.2 Comparative Analysis: LightGBM, XGBoost, CatBoost


Algorithmic Characteristics:

| Characteristic | LightGBM | XGBoost | CatBoost |
|---|---|---|---|
| Tree Growth | Leaf-wise | Level-wise | Level-wise |
| Split Finding | Histogram-based | Histogram + Exact | Oblivious trees |
| Key Optimizations | GOSS, EFB | Cache-aware access | Ordered boosting |
| Categorical Support | Limited | Manual encoding | Native support |
| Training Speed | Fastest | Moderate | Moderate |
| Memory Efficiency | Best | Moderate | Good |

Gradient-based One-Side Sampling (GOSS) [LightGBM]: Retains instances with large gradients (high error) while randomly sampling instances with small gradients, maintaining data distribution while accelerating training.

Exclusive Feature Bundling (EFB) [LightGBM]: Bundles mutually exclusive features (rarely non-zero simultaneously) to reduce dimensionality without information loss.

Ordered Boosting [CatBoost]: Eliminates prediction shift by using ordered permutation of training data, providing unbiased gradient estimates.
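The GOSS sampling step can be sketched in a few lines of numpy (an illustration of the idea only; LightGBM performs this per boosting iteration inside the tree builder, and the gradients and the a/b fractions here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
gradients = rng.normal(size=1000)   # per-instance gradients from the current model

a, b = 0.2, 0.1                     # keep top 20% by |gradient|, sample 10% of the rest
n = len(gradients)

order = np.argsort(-np.abs(gradients))
large = order[:int(a * n)]          # retain all large-gradient (high-error) instances
small = rng.choice(order[int(a * n):], size=int(b * n), replace=False)

selected = np.concatenate([large, small])
weights = np.ones(n)
weights[small] = (1 - a) / b        # upweight sampled instances to keep gradient sums unbiased

print(len(selected))  # 300 instances used instead of 1000
```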

Selection Guidelines:

  • Rapid Experimentation: LightGBM (10x training speed advantage)
  • Maximum Accuracy: XGBoost (marginal but consistent gains)
  • Rich Categorical Data: CatBoost (native categorical handling)

6.3 LightGBM Implementation


Hyperparameter Specification:

import lightgbm as lgb

# Model configuration
LGBM_PARAMS = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    
    # Tree structure
    'num_leaves': 35,
    'max_depth': -1,
    'min_child_samples': 70,
    
    # Learning dynamics
    'learning_rate': 0.02,
    'n_estimators': 5000,
    
    # Regularization
    'reg_lambda': 100.0,
    'reg_alpha': 0.0,
    
    # Sampling
    'subsample': 1.0,
    'colsample_bytree': 0.03,
    
    # Categorical handling: pandas 'category' dtype columns are used natively
    
    'verbose': -1,
    'random_state': 42
}

Training Procedure:

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Data preparation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# Dataset construction
train_dataset = lgb.Dataset(X_train, label=y_train)
valid_dataset = lgb.Dataset(X_valid, label=y_valid, reference=train_dataset)

# Model training
model = lgb.train(
    LGBM_PARAMS,
    train_dataset,
    num_boost_round=5000,
    valid_sets=[train_dataset, valid_dataset],
    valid_names=['train', 'valid'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
        lgb.log_evaluation(period=100)
    ]
)

# Performance evaluation
y_pred = model.predict(X_valid, num_iteration=model.best_iteration)
validation_auc = roc_auc_score(y_valid, y_pred)
print(f'Validation AUC: {validation_auc:.4f}')

# Feature importance analysis
importance_df = pd.DataFrame({
    'feature': model.feature_name(),
    'importance_gain': model.feature_importance(importance_type='gain'),
    'importance_split': model.feature_importance(importance_type='split')
}).sort_values('importance_gain', ascending=False)

print("\nTop 20 Features by Gain:")
print(importance_df.head(20))

Hyperparameter Tuning Guidelines:

  • num_leaves: \(2^{\text{max\_depth}}\) provides baseline; reduce to control overfitting
  • learning_rate: 0.01-0.1 range; lower values require more iterations
  • reg_lambda: Increase (1→100) for noisy datasets
  • colsample_bytree: Reduce (1.0→0.3) for high-dimensional features

6.4 XGBoost Implementation

import xgboost as xgb

XGB_PARAMS = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'tree_method': 'hist',
    'seed': 42
}

# DMatrix construction
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dvalid = xgb.DMatrix(X_valid, label=y_valid, enable_categorical=True)

# Training
eval_results = {}
model = xgb.train(
    XGB_PARAMS,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dvalid, 'valid')],
    evals_result=eval_results,
    early_stopping_rounds=100,
    verbose_eval=100
)

# Evaluation
y_pred = model.predict(dvalid)
auc = roc_auc_score(y_valid, y_pred)

6.5 CatBoost Implementation

from catboost import CatBoostClassifier, Pool

# Identify categorical features
categorical_features = [
    i for i, col in enumerate(X_train.columns)
    if X_train[col].dtype == 'object'
]

# Data pools
train_pool = Pool(X_train, y_train, cat_features=categorical_features)
valid_pool = Pool(X_valid, y_valid, cat_features=categorical_features)

# Model configuration
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3.0,
    early_stopping_rounds=100,
    verbose=100,
    random_seed=42
)

# Training
model.fit(train_pool, eval_set=valid_pool)

# Evaluation
y_pred = model.predict_proba(valid_pool)[:, 1]
auc = roc_auc_score(y_valid, y_pred)

6.6 Cross-Validation and Out-of-Fold Prediction

Rationale for Cross-Validation:

  1. Stability Assessment: Reduces variance from single train/test split
  2. Overfitting Prevention: Validates generalization capability
  3. OOF Generation: Produces unbiased predictions for ensemble construction

Stratified K-Fold Implementation:

import numpy as np
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5
kf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=42)

oof_predictions = np.zeros(len(X_train))
test_predictions = np.zeros(len(X_test))
fold_scores = []

for fold, (train_idx, valid_idx) in enumerate(kf.split(X_train, y_train)):
    print(f'\nFold {fold + 1}/{N_FOLDS}')
    
    # Data partitioning
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[valid_idx]
    
    # Model training
    model = lgb.LGBMClassifier(**LGBM_PARAMS)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(100), lgb.log_evaluation(0)]
    )
    
    # Out-of-fold predictions
    oof_predictions[valid_idx] = model.predict_proba(X_val)[:, 1]
    
    # Test set predictions (ensemble across folds)
    test_predictions += model.predict_proba(X_test)[:, 1] / N_FOLDS
    
    # Fold-level evaluation
    fold_auc = roc_auc_score(y_val, oof_predictions[valid_idx])
    fold_scores.append(fold_auc)
    print(f'Fold AUC: {fold_auc:.4f}')

# Aggregate performance
overall_auc = roc_auc_score(y_train, oof_predictions)
print(f'\nOverall OOF AUC: {overall_auc:.4f} (+/- {np.std(fold_scores):.4f})')

7. Ensemble Learning and Model Fusion

7.1 Ensemble Learning Theory

Limitations of Single Models:

Individual models exhibit specific failure modes:

  • LightGBM: Prone to overfitting on sparse features
  • XGBoost: Computationally intensive training
  • CatBoost: Slight accuracy trade-off for robustness

Ensemble Advantages:

  • Variance Reduction: Averaging reduces prediction variance
  • Bias Reduction: Diverse models capture complementary patterns
  • Stability: Robust to individual model failures
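The variance-reduction claim can be checked numerically: averaging \(k\) equicorrelated predictors with error variance \(\sigma^2\) and correlation \(\rho\) yields error variance \(\rho\sigma^2 + (1-\rho)\sigma^2/k\) (a small simulation, not competition code; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 3
rho = 0.5

# Simulate k correlated model errors: shared component (rho) + independent noise
shared = rng.normal(size=n)
errors = np.sqrt(rho) * shared[:, None] + np.sqrt(1 - rho) * rng.normal(size=(n, k))

single_var = errors[:, 0].var()
ensemble_var = errors.mean(axis=1).var()

print(f'single model error variance:    {single_var:.3f}')    # ~1.0
print(f'3-model average error variance: {ensemble_var:.3f}')  # ~rho + (1-rho)/k = 0.667
```

The shared component caps the benefit: this is why diverse base learners (different tree-growth strategies, different randomness) blend better than near-identical ones.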


7.2 Two-Level Stacking Architecture

Architecture Specification:

Level 1 (Base Learners): Diverse gradient boosting implementations

  • LightGBM (leaf-wise optimization)
  • XGBoost (level-wise with exact greedy)
  • CatBoost (ordered boosting)

Level 2 (Meta-Learner): Simple linear model

  • Logistic Regression or Ridge Regression
  • Rationale: Base learners extract sufficient signal; complex meta-learners risk overfitting

7.3 Out-of-Fold Prediction Generation

Critical Constraint: Meta-learner training requires predictions where the base model was not trained on the target instance (preventing data leakage).


Data Leakage Warning:

# INCORRECT: Training set predictions (data leakage)
model.fit(X_train, y_train)
train_pred = model.predict(X_train)  # Model has seen these instances!

Correct OOF Generation:

from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.zeros(len(X_train))

for train_idx, valid_idx in kf.split(X_train, y_train):
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    y_tr = y_train.iloc[train_idx]
    
    model.fit(X_tr, y_tr)
    # Predict on held-out validation set only
    oof_preds[valid_idx] = model.predict_proba(X_val)[:, 1]

7.4 Stacking Implementation

from sklearn.linear_model import LogisticRegression
from typing import Dict

class StackingEnsemble:
    """
    Two-level stacking ensemble with OOF prediction generation.
    """
    
    def __init__(
        self, 
        base_models: Dict[str, object],
        meta_learner: object
    ):
        self.base_models = base_models
        self.meta_learner = meta_learner
        self.base_predictions = {}
        
    def fit(
        self, 
        X: pd.DataFrame, 
        y: pd.Series,
        cv: int = 5
    ) -> 'StackingEnsemble':
        """
        Generate OOF predictions and train meta-learner.
        """
        kf = StratifiedKFold(
            n_splits=cv, 
            shuffle=True, 
            random_state=42
        )
        
        # Matrix to store OOF predictions
        n_models = len(self.base_models)
        oof_features = np.zeros((len(X), n_models))
        
        # Generate OOF predictions for each base model
        for idx, (name, model) in enumerate(self.base_models.items()):
            print(f'Generating OOF predictions: {name}...')
            
            for train_idx, valid_idx in kf.split(X, y):
                X_tr = X.iloc[train_idx]
                X_val = X.iloc[valid_idx]
                y_tr = y.iloc[train_idx]
                
                # Fit on training fold
                model.fit(X_tr, y_tr)
                
                # Predict on validation fold
                oof_features[valid_idx, idx] = (
                    model.predict_proba(X_val)[:, 1]
                )
            
            self.base_predictions[name] = oof_features[:, idx].copy()
        
        # Train meta-learner on OOF features
        print('Training meta-learner...')
        self.meta_learner.fit(oof_features, y)
        
        # Retrain base models on full dataset
        print('Retraining base models on full data...')
        for name, model in self.base_models.items():
            model.fit(X, y)
            
        return self
    
    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """
        Generate ensemble predictions.
        """
        # Generate base model predictions
        n_models = len(self.base_models)
        base_features = np.zeros((len(X), n_models))
        
        for idx, (name, model) in enumerate(self.base_models.items()):
            base_features[:, idx] = model.predict_proba(X)[:, 1]
        
        # Meta-learner prediction
        return self.meta_learner.predict_proba(base_features)[:, 1]

# Usage
base_models = {
    'lightgbm': lgb.LGBMClassifier(**lgb_params),
    'xgboost': xgb.XGBClassifier(**xgb_params),
    'catboost': CatBoostClassifier(**ctb_params, verbose=0)
}

meta_model = LogisticRegression(
    C=1.0,
    solver='lbfgs',
    max_iter=1000
)

ensemble = StackingEnsemble(base_models, meta_model)
ensemble.fit(X_train, y_train)
final_predictions = ensemble.predict(X_test)

7.5 Hyperparameter Optimization


Methodological Comparison:

| Method | Strategy | Strengths | Limitations | Computational Cost |
|---|---|---|---|---|
| Grid Search | Exhaustive enumeration | Comprehensive coverage | Exponential scaling | High |
| Random Search | Random sampling | Efficient exploration | Potential omission | Moderate |
| Bayesian Optimization | Probabilistic surrogate model | Sample-efficient | Implementation complexity | Low-Moderate |

Bayesian Optimization Implementation:

from skopt import BayesSearchCV
from skopt.space import Real, Integer

# Define search space
search_spaces = {
    'num_leaves': Integer(20, 50),
    'learning_rate': Real(0.01, 0.1, prior='log-uniform'),
    'min_child_samples': Integer(10, 100),
    'reg_lambda': Real(1e-8, 10.0, prior='log-uniform'),
    'subsample': Real(0.5, 1.0),
    'colsample_bytree': Real(0.3, 1.0)
}

# Bayesian optimization
opt = BayesSearchCV(
    lgb.LGBMClassifier(
        objective='binary',
        metric='auc',
        boosting_type='gbdt',
        n_estimators=1000,
        verbose=-1
    ),
    search_spaces,
    n_iter=50,
    scoring='roc_auc',
    cv=3,
    n_jobs=-1,
    random_state=42,
    verbose=1
)

opt.fit(X_train, y_train)

print(f'Best CV Score: {opt.best_score_:.4f}')
print(f'Optimal Parameters: {opt.best_params_}')

7.6 Performance Summary

| Model Configuration | Cross-Validation AUC | Public Leaderboard | Private Leaderboard | Relative Improvement |
|---|---|---|---|---|
| LightGBM (single) | 0.7902 | 0.791 | 0.792 | Baseline |
| XGBoost (single) | 0.7854 | 0.787 | 0.788 | -0.004 |
| CatBoost (single) | 0.7881 | 0.789 | 0.790 | -0.002 |
| Simple Average | 0.7920 | 0.793 | 0.794 | +0.002 |
| Stacking (LGB+XGB+CTB+LR) | 0.8053 | 0.807 | 0.808 | +0.016 |

Key Insights:

  • Stacking ensemble achieves 1.6 percentage point improvement over best single model
  • On Kaggle leaderboards, 0.01 AUC improvement typically corresponds to hundreds of ranking positions
  • Simple logistic regression meta-learner outperforms complex alternatives by reducing overfitting
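The "Simple Average" baseline in the table corresponds to blending the models' predicted probabilities directly, with no meta-learner at all (a minimal sketch; the prediction arrays here are hypothetical):

```python
import numpy as np

# Hypothetical predicted default probabilities for 5 applicants
preds = {
    'lightgbm': np.array([0.10, 0.80, 0.35, 0.05, 0.60]),
    'xgboost':  np.array([0.12, 0.75, 0.40, 0.07, 0.55]),
    'catboost': np.array([0.08, 0.78, 0.30, 0.06, 0.65]),
}

# Equal-weight probability average across the three base models
blend = np.mean(list(preds.values()), axis=0)
print(blend)
```

Stacking generalizes this by learning the blend weights (and an intercept) from the out-of-fold predictions instead of fixing them at 1/3 each.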

8. Conclusions and Best Practices

8.1 Complete Technical Pipeline Summary

Raw Data (7 tables, 50M+ records)
    ↓
Data Preprocessing
    ├─ Anomaly detection and treatment
    ├─ Missing value imputation
    └─ Categorical encoding
    ↓
Feature Engineering (1,000+ features)
    ├─ Aggregation operations (count/sum/mean/max/std)
    ├─ Temporal window features (3m/6m/12m/24m)
    ├─ Ratio and interaction features
    └─ Target encoding for high-cardinality variables
    ↓
Model Development
    ├─ LightGBM (primary model)
    ├─ XGBoost (accuracy complement)
    └─ CatBoost (robustness validation)
    ↓
Ensemble Construction
    ├─ Out-of-fold prediction generation
    ├─ Meta-learner training
    └─ Final prediction aggregation
    ↓
Submission (AUC 0.808, Top 5%)

8.2 Key Technical Insights

Data Architecture:

  • Relational database schemas require systematic aggregation strategies
  • One-to-many relationships necessitate careful feature extraction to prevent information loss
  • Temporal sequences provide richer signal than static snapshots

Feature Engineering:

  • Domain knowledge fundamentally shapes which features are worth constructing
  • Aggregation function selection (mean vs. median vs. max) encodes distinct business assumptions
  • Time-windowed features capture behavioral trends better than lifetime averages do

Modeling Strategy:

  • Gradient boosting remains the state-of-the-art for structured data prediction
  • Cross-validation serves dual purposes: stability assessment and ensemble preparation
  • Stacking ensembles provide consistent, significant performance improvements

8.3 Reproducible Engineering Practices

  1. Pipeline Architecture: Modular design enables component testing and replacement
  2. Configuration Management: Centralized parameter specification facilitates experiment tracking
  3. Development Mode: Subsampling strategies (--dev_mode) accelerate iteration cycles
  4. Experiment Tracking: Systematic logging prevents redundant computations
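Practice 3 can be sketched as a small CLI switch (a hypothetical illustration; only the `--dev_mode` flag name comes from the text, everything else is assumed):

```python
import argparse

import numpy as np
import pandas as pd


def maybe_subsample(df: pd.DataFrame, dev_mode: bool, frac: float = 0.05) -> pd.DataFrame:
    """Return a small reproducible sample in dev mode, the full frame otherwise."""
    if dev_mode:
        df = df.sample(frac=frac, random_state=42)
    return df


parser = argparse.ArgumentParser()
parser.add_argument('--dev_mode', action='store_true',
                    help='run on a 5% subsample for quick pipeline checks')
args = parser.parse_args([])  # pass ['--dev_mode'] during development

df = pd.DataFrame({'x': np.arange(1000)})
print(len(maybe_subsample(df, dev_mode=True)))   # 50
print(len(maybe_subsample(df, args.dev_mode)))   # 1000
```

Because the subsample is seeded, a dev-mode run exercises the full pipeline end to end while remaining reproducible across iterations.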

8.4 Future Research Directions

Near-term Optimizations (1-2 weeks):

  • Feature interaction exploration beyond manual specification
  • Weighted ensemble construction optimized via validation performance
  • Hyperparameter search space refinement

Medium-term Extensions (1-2 months):

  • Deep learning feature extraction (autoencoder representations)
  • Graph neural networks for relational data modeling
  • Model interpretability analysis (SHAP value decomposition)

Long-term Investigations (3+ months):

  • Online learning systems for distribution drift adaptation
  • A/B testing frameworks for production model validation
  • Federated learning architectures for cross-institutional collaboration

Documentation:

Seminal Publications:

  • Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD.
  • Ke, G., et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NIPS.
  • Prokhorenkova, L., et al. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS.


Conclusion

This comprehensive analysis of the Home Credit Default Risk competition solution demonstrates the systematic application of machine learning methodology to real-world credit risk assessment. The project’s value extends beyond the achieved AUC score (0.808) to encompass:

  1. Engineering Discipline: Pipeline architectures ensure reproducibility and maintainability
  2. Data-Centric Approach: Exploratory analysis directly informs feature engineering decisions
  3. Methodical Optimization: Progressive improvement from single models to sophisticated ensembles

Fundamental Principle:

“Feature engineering determines the theoretical performance ceiling; machine learning algorithms merely approximate this ceiling. Investment in data understanding and feature construction consistently outperforms hyperparameter tuning alone.”

The methodologies presented herein transfer directly to related domains:

  • Insurance fraud detection
  • Marketing response modeling
  • Customer churn prediction
  • Credit scoring system development

*This tutorial presents a complete technical analysis of the Kaggle Home Credit Default Risk competition solution. The open-source implementation is available on GitHub.*