Predicting Credit Default Risk: A Full-Stack Data Science Project
Introduction
Credit default prediction is central to financial risk management. In this project, I built an end-to-end machine learning pipeline to predict whether a credit card client will default in the next month, trained on a dataset from the UCI Machine Learning Repository. The pipeline covers data preprocessing, feature reduction, model training (Logistic Regression, Random Forest, and XGBoost), and hyperparameter tuning. The final model was deployed with Streamlit Cloud; the live link is in the Project Links section at the end of this post.
Dataset Overview
The dataset contains 30,000 observations and 24 features including:
- LIMIT_BAL: Credit limit for the client
- SEX, EDUCATION, MARRIAGE: Demographic info
- PAY_0 to PAY_6: Repayment status over the last six months
- BILL_AMT1 to BILL_AMT6: Historical billing amounts
- PAY_AMT1 to PAY_AMT6: Repayment amounts
The target variable is default.payment.next.month, where 1 means the client defaulted and 0 means they paid successfully.
Exploratory Data Analysis
Initial exploration revealed no missing values. A class imbalance was evident: only ~22% of clients defaulted. Strong predictors included repayment history (especially PAY_1) and bill amounts. All features were numerical, allowing seamless integration into scikit-learn pipelines.
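The two checks above can be sketched as follows. The toy DataFrame here is a stand-in for the real 30,000-row UCI file; only the column names follow the dataset's convention.

```python
import pandas as pd

# Toy stand-in for the UCI frame (the real file has 30,000 rows)
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 50000],
    "PAY_0": [2, -1, 0, 0, -1],
    "default.payment.next.month": [1, 0, 0, 0, 1],
})

# Missing-value check: the full dataset has none
missing = df.isna().sum().sum()

# Class balance: fraction of clients who defaulted (~22% on the full data)
default_rate = df["default.payment.next.month"].mean()
print(f"Missing values: {missing}, default rate: {default_rate:.1%}")
```

On the real data, `default_rate` comes out near 0.22, which is what motivates looking at F1 and ROC-AUC rather than accuracy alone.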
Feature Selection and Preprocessing
I reduced the feature set from 24 to 10, keeping the most predictive ones such as LIMIT_BAL, PAY_1 (the most recent repayment status, named PAY_0 in the raw UCI file), PAY_AMT1, and BILL_AMT1. Continuous features were standardized with StandardScaler.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# In the full pipeline the scaler is fit on the training split only,
# then reused to transform the test split, to avoid leakage
X_scaled = scaler.fit_transform(df[feature_list])
Modeling and Evaluation
I evaluated three models: Logistic Regression (baseline), Random Forest, and XGBoost. The dataset was split using train_test_split with 20% for testing. Below is a snippet used to fit XGBoost:
from xgboost import XGBClassifier

# use_label_encoder=False silences a deprecation warning on xgboost 1.x;
# newer releases ignore the argument
xgb = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
xgb.fit(X_train, y_train)
Evaluation metrics:
- Accuracy: Measures overall correct predictions
- F1-Score: Harmonic mean of precision and recall, more informative than accuracy under class imbalance
- ROC-AUC: Model discrimination ability
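The split-and-score step can be sketched as below. This runs on synthetic data that mimics the ~22% positive rate; the model and variable names are illustrative, not the exact code from the repo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled feature matrix; ~22% positives
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.78], random_state=42)

# Stratified 80/20 split preserves the class imbalance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # ROC-AUC needs probabilities

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
print(f"Accuracy: {acc:.3f}  F1: {f1:.3f}  ROC-AUC: {auc:.3f}")
```

Note that ROC-AUC is computed from predicted probabilities, not hard labels, which is easy to get wrong when comparing models.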
Hyperparameter tuning was performed using both GridSearchCV and RandomizedSearchCV. XGBoost emerged as the top-performing model with the highest F1 and ROC-AUC.
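The randomized-search side of the tuning step looks roughly like this. The parameter grid here is hypothetical (the actual grids are in the repo), and a Random Forest stands in for XGBoost so the snippet runs without the xgboost package; the same pattern applies to either model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in so the example runs quickly
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.78], random_state=0)

# Hypothetical search space, sampled uniformly by RandomizedSearchCV
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [4, 6, 8, None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, scoring="f1", cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scoring on F1 rather than accuracy matters here: with 78% non-defaulters, accuracy rewards models that rarely predict the positive class.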
Deployment
The deployment goal was to host the model via an API. However, AWS Lambda had package size limitations that made deploying XGBoost infeasible under the free tier. As a result, I pivoted to Streamlit Cloud, which integrates directly with GitHub and supports Python natively.
import joblib

joblib.dump(xgb_best, "xgb_model.pkl")
The final app was deployed from the following GitHub repository: github.com/jkazalekor/credit-default-app
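A minimal sketch of the serialization round trip: the model is dumped once at training time and reloaded when the app starts. A Logistic Regression stands in here for the tuned XGBoost model so the example is self-contained.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model; the deployed app stores the tuned XGBoost the same way
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "xgb_model.pkl")
joblib.dump(model, path)    # done once, at training time
loaded = joblib.load(path)  # the Streamlit app does this at startup

# Round-trip sanity check: the restored model predicts identically
same = (loaded.predict(X) == model.predict(X)).all()
print(f"Predictions identical after reload: {same}")
```

One caveat worth remembering: joblib pickles are tied to the library versions used at dump time, so the deployment environment should pin the same scikit-learn/xgboost versions as training.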
Web App Features
- User inputs financial info: credit limit, repayment history, and bill/payment amounts
- Option to choose between the Random Forest and XGBoost models
- Displays prediction: Will the client default?
- Shows confidence probability and selected model's performance
App is live here: credit-default-app.streamlit.app
Reflections and Learnings
This project deepened my understanding of both data preprocessing and deployment challenges in machine learning. Choosing the right model and tuning it for real-world performance required both statistical and engineering tradeoffs. I learned:
- Deployment constraints should be considered from the start
- Random Forest and XGBoost perform well on tabular financial data
- Feature importance scores can guide useful dimensionality reduction
Project Links
- GitHub Repository: github.com/jkazalekor/credit-default-app
- Live Web App: credit-default-app.streamlit.app
- Dataset: UCI Machine Learning Repository (Default of Credit Card Clients)