Credit Risk ML — Loan Default Prediction System

End-to-end credit scoring pipeline: SQL feature engineering → 4-model ensemble comparison → business-aware threshold optimization → Django + FastAPI deployment on Render.

🌐 Live Demo: credit-risk-ai.onrender.com

About This Project

Built to simulate how a real fintech company engineers, trains, and serves a credit scoring model. The system covers:

SQL-first feature engineering from 4 normalized database tables (66 engineered features)
Production-grade preprocessing pipeline with MNAR-aware imputation, ordinal + one-hot encoding, and dual scaling strategies
4-model comparison with RandomizedSearchCV tuning and StratifiedKFold cross-validation
Business-aware threshold optimization using a cost matrix calibrated to real lending economics (₹91.2L savings demonstrated)
Versioned model registry with rollback support and metadata tracking
Django web app with SHAP explainability, animated prediction gauge, and model dashboard
FastAPI ML microservice with Pydantic validation, Swagger docs, and health monitoring
Full deployment stack: Docker Compose (local) + Render Blueprint (cloud)

Project Architecture

Credit_Risk_ML/
├── database/                     # Data generation & SQL feature engineering
│   ├── schema.sql                # 4-table normalized schema (customers, loans, credit_history, payments)
│   ├── feature_engineering.sql   # 66 features via SQL aggregations & window functions
│   ├── generate_data.py          # Initial synthetic dataset generator
│   ├── generate_realistic_data.py # Realistic data with noise, skew, and missing patterns
│   └── build_features.py         # Executes SQL → exports ml_features.csv
│
├── src/                          # Core ML pipeline
│   ├── preprocessing.py          # MissingValueHandler, FeatureEncoder, FeatureScaler, PreprocessingPipeline
│   ├── train_models.py           # ModelTrainer with RandomizedSearchCV for 4 models
│   ├── business_threshold.py     # CostMatrix, ThresholdOptimizer — cost-sensitive threshold selection
│   └── model_registry.py        # Versioned model storage with metadata & rollback
│
├── models/                       # Trained model artifacts (.pkl)
│   ├── Logistic_Regression_model.pkl
│   ├── XGBoost_model.pkl
│   ├── LightGBM_model.pkl
│   ├── Sklearn_GBM_model.pkl
│   ├── feature_scaler.pkl
│   └── registry/                 # Versioned registry entries
│
├── results/                      # Training outputs & reports
│   ├── model_comparison.csv      # AUC-ROC, F1, KS, Gini across all models
│   ├── training_report.json      # Full metrics + top-10 features per model
│   ├── threshold_analysis.png    # 3-panel threshold sweep visualization
│   ├── threshold_optimization.json
│   └── {Model}_classification_report.txt (×4)
│
├── ml_service/                   # FastAPI microservice
│   ├── main.py                   # App + /predict, /health, /model-info endpoints
│   ├── predictor.py              # MLPredictor with SHAP integration
│   ├── schemas.py                # Pydantic schemas (LoanApplication, PredictionResponse)
│   ├── requirements.txt
│   └── Dockerfile
│
├── data/                         # Raw & processed datasets
│   ├── customers.csv             # ~5K customer profiles
│   ├── loan_applications.csv     # ~10K loan records
│   ├── credit_history.csv        # Bureau features
│   ├── repayment_history.csv     # Monthly payment records
│   ├── ml_features.csv           # Final 66-feature dataset (SQL output)
│   └── processed/                # Train/test splits (X_train, X_test, scaled variants)
│
├── webapp/                       # Django web application
│   ├── config/                   # Settings, URLs, WSGI
│   └── predictor/                # Prediction Django app
│       ├── views.py              # Form handling & ML inference
│       ├── forms.py              # LoanApplicationForm with validation
│       ├── ml/                   # Model loader & SHAP explainer
│       ├── templates/            # predict.html, dashboard.html
│       └── static/               # CSS, JS (glassmorphism UI)
│
├── docker-compose.yml            # 3-service stack: Django + FastAPI + PostgreSQL
├── render.yaml                   # Render Blueprint (one-click cloud deploy)
├── build.sh                      # Render build script (migrations + collectstatic)
├── requirements.txt              # Production dependencies
└── .env.example                  # Environment variable template

Model Performance

All four models were tuned with RandomizedSearchCV (5-fold stratified CV, scoring = AUC-ROC). Results on held-out 20% test set:

Model	AUC-ROC	CV AUC	AUC-PR	Precision	Recall	F1	KS Stat	Gini
Logistic Regression 🏆	0.8294	0.8494	0.7065	0.6043	0.7197	0.6570	0.5165	0.6588
XGBoost	0.8293	0.8465	0.7035	0.6672	0.6417	0.6542	0.5149	0.6586
LightGBM	0.8275	0.8454	0.6996	0.6121	0.7086	0.6568	0.5101	0.6551
Sklearn GBM	0.8253	0.8462	0.6970	0.7067	0.5064	0.5900	0.5064	0.6506

KS Statistic > 0.40 is considered "good" in industry scorecards. All four models achieved KS ≈ 0.50–0.52. Gini > 0.60 is "excellent" — all models landed at 0.65+.

Why AUC ≈ 0.83 and not 0.99?

The initial synthetic dataset produced AUC ≈ 0.999 due to feature leakage — post-loan behavioral signals (missed payments, delinquencies) were included as input features. 13 leaky features were removed and signal-to-noise ratio was tuned to achieve the realistic credit scorecard range (0.75–0.85). This is intentional: real-world banking models from the RBI/Basel II era typically score 0.70–0.85 AUC.

Why Logistic Regression as Champion?

Logistic Regression achieved the highest CV AUC (0.8494) despite simpler architecture. More importantly, it is:

Interpretable by design — coefficients show exact feature contribution direction and magnitude
Regulatory-friendly — Basel II/III and RBI guidelines often require explainable models for credit decisions
Fast at inference — microsecond predictions, critical for real-time loan approvals

The gradient boosting models were statistically equivalent but not significantly better, which indicates the signal is primarily in the features — not the model complexity.

ML Engineering Deep Dive

1. SQL-Based Feature Engineering (`database/`)

Features are computed directly from 4 normalized tables using SQL aggregations, window functions, and self-joins. This mirrors production data warehouse workflows (dbt, BigQuery, Redshift).

Key engineered features:

Feature	Source	Type
`dti_ratio`	EMI / annual income	Ratio
`emi_to_income_ratio`	Monthly EMI / monthly income	Ratio
`credit_utilization`	Outstanding / credit limit	Ratio (0–1)
`late_payment_ratio`	Late payments / total payments	Ratio
`employment_stability_ratio`	Years employed / age	Ratio
`income_per_dependent`	Income / (dependents + 1)	Normalized
`loan_to_income_ratio`	Loan amount / annual income	Ratio
`credit_score_tier`	Bucketed credit score	Ordinal
`is_thin_file`	Missing credit history flag	Binary

Total: 66 features from 4 tables (customers, loan_applications, credit_history, repayment_history).

2. MNAR-Aware Preprocessing Pipeline (`src/preprocessing.py`)

Three composable classes, each with fit() / transform() / fit_transform():

MissingValueHandler — Handles Missing Not At Random (MNAR) patterns:

A missing credit_score means the customer is new-to-credit — a risk signal in itself
Creates binary {col}_is_missing indicator columns before imputation to preserve the missingness signal
Numeric: imputed with median from training data only
Categorical: imputed with mode from training data only

FeatureEncoder — Encoding strategy by feature type:

Ordinal: education, credit_score_tier, age_group, utilization_bucket, term_bucket, rate_tier — preserves monotonic ordering that OHE would destroy
One-Hot: gender, marital_status, employment_type, loan_type — nominal categories with no meaningful order
Handles unseen categories at inference gracefully (fills 0, maps to −1)

FeatureScaler — Three scaling strategies:

StandardScaler (z-score): normally-distributed features like age, credit_score, loan_term_months
RobustScaler (median/IQR): outlier-prone features like annual_income, loan_amount, emi_amount
No scaling: binary flags, bounded ratios (0–1), ordinal-encoded integers, OHE columns

PreprocessingPipeline orchestrates all steps in the correct order:

Separate target (is_default) and drop ID columns
Stratified train/test split (80/20, random_state=42) — split happens before any fitting
fit_transform(X_train) → transform(X_test) for each handler
Produces two output versions: scaled (for Logistic Regression) and unscaled (for tree models)

3. Model Training & Tuning (`src/train_models.py`)

Why RandomizedSearchCV over GridSearchCV?
XGBoost's parameter space alone has ~92,000 combinations × 5 folds = 460,800 model fits. RandomizedSearchCV samples 50 random combinations (≈0.05% of the grid) and achieves near-optimal results per Bergstra & Bengio (2012). Production systems use Bayesian optimization (Optuna/Hyperopt) — 3–5× more efficient still.

Model-specific tuning rationale:

Model	Key Parameters	Why
Logistic Regression	`C`, `penalty (L1/L2)`	L1 drives correlated features to zero (feature selection); L2 shrinks all coefficients
XGBoost	`max_depth`, `gamma`, `min_child_weight`, `scale_pos_weight`	`gamma` prunes splits; `scale_pos_weight` handles class imbalance
LightGBM	`num_leaves`, `min_child_samples`, histogram binning	`num_leaves` is the primary complexity control for leaf-wise growth
Sklearn GBM	`min_samples_leaf`, `min_samples_split`, `max_features`	Conservative depth-wise growth; good baseline for reproducibility

Metrics computed per model:
AUC-ROC, AUC-PR, F1, Precision, Recall, Accuracy, KS Statistic, Gini Coefficient — the last two are industry-standard credit scorecard metrics rarely seen in Kaggle notebooks.

4. Business Threshold Optimization (`src/business_threshold.py`)

The default 0.5 threshold assumes that missing a defaulter and rejecting a good customer cost the same. In lending, they emphatically do not.

Cost Matrix:

Error Type	Description	Cost
False Negative (FN)	Approved a customer who defaulted	₹1,00,000
False Positive (FP)	Rejected a customer who would have repaid	₹10,000
True Negative / Positive	Correct decisions	₹0

10:1 asymmetry → optimal threshold shifts left (more aggressive at catching defaults).

Results on LightGBM test set:

	Default Threshold (0.50)	Optimal Threshold (0.11)
Total Business Cost	₹2,11,20,000	₹1,20,00,000
Recall (defaulters caught)	70.86%	98.57%
Precision	61.21%	35.80%
Missed Defaulters (FN)	183	9
Estimated Savings	—	₹91.2 Lakhs (43.2%)

The ThresholdOptimizer sweeps thresholds from 0.05 to 0.95, computes total business cost at each step, and generates a 3-panel analysis plot (cost curve, recall curve, precision curve).

5. Versioned Model Registry (`src/model_registry.py`)

Each trained model version is stored with:

Serialized model + scaler (.pkl via joblib)
Metadata: AUC-ROC, training timestamp, feature count, hyperparameters, optimal threshold
Supports rollback to any previous version
Singleton pattern for efficient loading in Django + FastAPI

6. FastAPI ML Microservice (`ml_service/`)

A stateless inference service designed to scale horizontally independently of the Django UI:

Endpoint	Method	Description
`/predict`	POST	Accepts `LoanApplication` JSON → returns default probability, risk tier, SHAP explanations, EMI estimate
`/health`	GET	Service health + model load status
`/model-info`	GET	Loaded model version, AUC, optimal threshold, available registry versions
`/docs`	GET	Auto-generated Swagger/OpenAPI UI

Request/response validated via Pydantic v2. Startup uses FastAPI's lifespan context manager (model loaded at startup, graceful degradation if load fails).

7. Django Web Application (`webapp/`)

Predict page — LoanApplicationForm with 15 fields; inline validation; animated gauge showing default probability
Dashboard — Stripe/Linear-inspired model comparison table; top feature importance bars (sqrt-scaled for visual balance); SHAP explanation panel
Glassmorphism UI with dark background, micro-animations, responsive layout
Static files served via WhiteNoise (production-ready, no Nginx required)

Quick Start — Local

# 1. Clone
git clone https://github.com/Ravinthra/Credit_Risk_ML.git
cd Credit_Risk_ML

# 2. Install dependencies
pip install -r requirements.txt

# 3. Generate synthetic dataset (SQLite database + CSVs)
python database/generate_realistic_data.py

# 4. Build ML features via SQL
python database/build_features.py

# 5. Preprocess (creates data/processed/ splits)
python src/preprocessing.py

# 6. Train all 4 models (takes ~8 min on CPU)
python src/train_models.py

# 7. Optimize decision threshold
python src/business_threshold.py

# 8. Launch Django app
cd webapp
python manage.py migrate
python manage.py runserver
# → http://127.0.0.1:8000

# Optional: Run FastAPI service in parallel
cd ..
uvicorn ml_service.main:app --host 0.0.0.0 --port 8001 --reload
# → http://127.0.0.1:8001/docs

Docker Deployment (Full Stack)

# Copy and configure environment
cp .env.example .env
# Edit .env: DJANGO_SECRET_KEY, POSTGRES_PASSWORD

# Build and start all 3 services
docker-compose up --build

Services:

Service	Container	URL
Django Web UI	`creditrisk_django`	`http://localhost:8000`
FastAPI ML API	`creditrisk_ml_api`	`http://localhost:8001/docs`
PostgreSQL 16	`creditrisk_db`	Internal (port 5432)

The Django container mounts ./models and ./results as read-only volumes. The FastAPI container reads from the same model registry. PostgreSQL health-checks before Django starts.

Cloud Deployment — Render

Option A: Blueprint (Recommended)

Push this repo to GitHub
Go to render.com → New → Blueprint
Connect your GitHub repo — Render auto-detects render.yaml
Click Apply — live in ~3 minutes

Option B: Manual Web Service

Setting	Value
Runtime	Python 3.12
Build Command	`./build.sh`
Start Command	`cd webapp && gunicorn config.wsgi:application --bind 0.0.0.0:$PORT --workers 2 --timeout 120`
Environment Variables	`DJANGO_SECRET_KEY` (generate), `DJANGO_DEBUG=False`, `DJANGO_ALLOWED_HOSTS=.onrender.com`

The build.sh script installs dependencies, runs collectstatic, and runs database migrations automatically on each deploy.

Environment Variables

Variable	Required	Description
`DJANGO_SECRET_KEY`	✅	Django secret key (generate with `python -c "from django.core.management.utils import get_random_secret_key; print(get_random_secret_key())"`)
`DJANGO_DEBUG`	✅	`False` in production
`DJANGO_ALLOWED_HOSTS`	✅	Comma-separated hostnames (e.g., `.onrender.com`)
`DATABASE_URL`	Docker only	PostgreSQL connection string
`ML_SERVICE_URL`	Docker only	FastAPI base URL (e.g., `http://ml_api:8001`)
`POSTGRES_DB`	Docker only	PostgreSQL database name
`POSTGRES_USER`	Docker only	PostgreSQL username
`POSTGRES_PASSWORD`	Docker only	PostgreSQL password

Copy .env.example to .env and fill in values before running locally or in Docker.

Interview Talking Points

Why SQL instead of pandas for feature engineering?

In production, features live in data warehouses (BigQuery, Redshift, Snowflake). SQL pipelines are scalable, version-controllable via dbt, and reproducible across teams. Pandas-based pipelines don't scale past a single machine and are harder to schedule in Airflow/Prefect.

Why not just use XGBoost and skip Logistic Regression?

Logistic Regression is a performance floor — if a boosting model barely beats it, the problem is with features, not architecture. Additionally, interpretable models are often legally required for credit decisions (Basel II, ECOA, RBI guidelines). LogReg with L1 also acts as embedded feature selection.

How do you prevent training-serving skew?

The preprocessing pipeline uses fit() on training data only and transform() on test/production data. Scalers, encoders, and imputation values are serialized to disk via joblib and loaded at inference time. The FastAPI service loads the same scaler artifact that was fit during training — no recomputation.

Why threshold = 0.11 and not 0.5?

Because costs are asymmetric: a missed defaulter on an unsecured personal loan costs ~₹1L in write-offs; rejecting a good customer costs ~₹10K in lost interest. With a 10:1 FN/FP cost ratio, the expected-cost-minimizing threshold shifts aggressively left. At 0.11, recall jumps from 70.9% to 98.6%, catching 174 additional defaulters at the cost of more false alarms — but the math clearly favors it.

What metrics matter for credit risk?

AUC-ROC for threshold-independent discriminative ability, KS Statistic (max TPR−FPR separation) and Gini Coefficient (= 2×AUC−1) as industry-standard scorecard KPIs. KS > 0.40 is "good", Gini > 0.60 is "excellent" — all four models achieved both. Accuracy is largely meaningless at 31% class imbalance.

Why not CatBoost as the 4th model?

CatBoost's ordered boosting would reduce target leakage in the boosting process and its native categorical handling would eliminate OHE entirely. It requires Python ≤ 3.12 for prebuilt wheels. Sklearn GBM serves as a conservative, fully reproducible sklearn-native baseline in its place. In a production system, I'd replace Sklearn GBM with CatBoost.

Tech Stack

Layer	Technology
Language	Python 3.12
Web Framework	Django 5.1 + Gunicorn
ML API	FastAPI 0.109 + Uvicorn
ML Libraries	scikit-learn 1.3, XGBoost 2.0, LightGBM 4.0
Explainability	SHAP 0.43
Data	pandas 2.1, NumPy 1.26
Database	SQLite (local ML) / PostgreSQL 16 (Docker/Render)
Serialization	joblib
Static Files	WhiteNoise
Containerization	Docker + Docker Compose
Cloud	Render (Blueprint)

About

Ravinthra Amulraj
MCA Graduate · Python Developer · Aspiring ML Engineer

Built to demonstrate real-world ML engineering beyond Kaggle notebooks — SQL feature engineering, modular preprocessing, cost-sensitive optimization, microservice architecture, and cloud deployment working together as a coherent system.

GitHub: github.com/Ravinthra
Live Project: credit-risk-ai.onrender.com

License

MIT License — see LICENSE for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Credit Risk ML — Loan Default Prediction System

About This Project

Project Architecture

Model Performance

Why AUC ≈ 0.83 and not 0.99?

Why Logistic Regression as Champion?

ML Engineering Deep Dive

1. SQL-Based Feature Engineering (`database/`)

2. MNAR-Aware Preprocessing Pipeline (`src/preprocessing.py`)

3. Model Training & Tuning (`src/train_models.py`)

4. Business Threshold Optimization (`src/business_threshold.py`)

5. Versioned Model Registry (`src/model_registry.py`)

6. FastAPI ML Microservice (`ml_service/`)

7. Django Web Application (`webapp/`)

Quick Start — Local

Docker Deployment (Full Stack)

Cloud Deployment — Render

Option A: Blueprint (Recommended)

Option B: Manual Web Service

Environment Variables

Interview Talking Points

Tech Stack

About

License

About

Uh oh!

Releases 1

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
database		database
ml_service		ml_service
models		models
results		results
src		src
webapp		webapp
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.sh		build.sh
docker-compose.yml		docker-compose.yml
render.yaml		render.yaml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Credit Risk ML — Loan Default Prediction System

About This Project

Project Architecture

Model Performance

Why AUC ≈ 0.83 and not 0.99?

Why Logistic Regression as Champion?

ML Engineering Deep Dive

1. SQL-Based Feature Engineering (database/)

2. MNAR-Aware Preprocessing Pipeline (src/preprocessing.py)

3. Model Training & Tuning (src/train_models.py)

4. Business Threshold Optimization (src/business_threshold.py)

5. Versioned Model Registry (src/model_registry.py)

6. FastAPI ML Microservice (ml_service/)

7. Django Web Application (webapp/)

Quick Start — Local

Docker Deployment (Full Stack)

Cloud Deployment — Render

Option A: Blueprint (Recommended)

Option B: Manual Web Service

Environment Variables

Interview Talking Points

Tech Stack

About

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Contributors

Uh oh!

Languages

1. SQL-Based Feature Engineering (`database/`)

2. MNAR-Aware Preprocessing Pipeline (`src/preprocessing.py`)

3. Model Training & Tuning (`src/train_models.py`)

4. Business Threshold Optimization (`src/business_threshold.py`)

5. Versioned Model Registry (`src/model_registry.py`)

6. FastAPI ML Microservice (`ml_service/`)

7. Django Web Application (`webapp/`)