Predicting Credit Default Risk: A Full-Stack Data Science Project


Introduction

Credit default prediction is central to financial risk management. In this project, I built a comprehensive machine learning pipeline to predict whether a credit card client will default in the next month. The model was trained on a dataset sourced from the UCI Machine Learning Repository. The pipeline includes data preprocessing, feature reduction, model training (Logistic Regression, Random Forest, and XGBoost), and hyperparameter tuning. The final model was deployed using Streamlit and is live at credit-default-app.streamlit.app.

Dataset Overview

The dataset contains 30,000 observations and 24 features including:

  • LIMIT_BAL: Credit limit for the client
  • SEX, EDUCATION, MARRIAGE: Demographic info
  • PAY_0 to PAY_6: Repayment status over the last six months (PAY_0, the most recent month, is often renamed PAY_1)
  • BILL_AMT1 to BILL_AMT6: Historical billing amounts
  • PAY_AMT1 to PAY_AMT6: Repayment amounts

The target variable is default.payment.next.month, where 1 means the client defaulted and 0 means they paid successfully.

Exploratory Data Analysis

Initial exploration revealed no missing values. A class imbalance was evident: only ~22% of clients defaulted. Strong predictors included repayment history (especially PAY_1) and bill amounts. All features were numerical, allowing seamless integration into scikit-learn pipelines.
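Both checks are one-liners with pandas; a minimal sketch, shown here on a tiny synthetic stand-in for the real DataFrame (the column names match the dataset, the values are illustrative):

```python
import pandas as pd

# Tiny synthetic stand-in for the UCI credit dataset (same target column name)
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 200000],
    "PAY_0": [2, -1, 0, 0, 1],
    "default.payment.next.month": [1, 0, 0, 0, 1],
})

# Count missing values anywhere in the frame (zero in the real data)
missing_total = df.isna().sum().sum()

# Class balance: fraction of clients who defaulted (~22% in the real data)
default_rate = df["default.payment.next.month"].mean()

print(missing_total, default_rate)
```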

Feature Selection and Preprocessing

I reduced the number of features from 24 to 10, focusing on the most predictive ones like LIMIT_BAL, PAY_1, PAY_AMT1, and BILL_AMT1. Standard scaling was applied using StandardScaler to normalize continuous features.


from sklearn.preprocessing import StandardScaler

# Fit the scaler on the selected features and transform them in one step
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_list])

Modeling and Evaluation

I evaluated three models: Logistic Regression (baseline), Random Forest, and XGBoost. The dataset was split using train_test_split with 20% for testing. Below is a snippet used to fit XGBoost:


from xgboost import XGBClassifier

# eval_metric="logloss" suits binary classification; use_label_encoder=False
# silences a deprecation warning in older XGBoost versions
xgb = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
xgb.fit(X_train, y_train)
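The 80/20 split mentioned above, which happens before fitting, can be sketched as follows; stratifying on the target preserves the ~22% default rate in both splits (the data here is a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and imbalanced target
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.78], random_state=42)

# Hold out 20% for testing; stratify keeps the class ratio consistent
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```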

Evaluation metrics:

  • Accuracy: Measures overall correct predictions
  • F1-Score: Harmonic mean of precision and recall
  • ROC-AUC: Model discrimination ability
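All three metrics map directly onto scikit-learn functions; a minimal sketch with toy predictions (not the project's actual scores):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy ground truth, hard predictions, and predicted probabilities
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 0, 0]
y_prob = [0.1, 0.2, 0.6, 0.9, 0.8, 0.3, 0.4, 0.2]

acc = accuracy_score(y_true, y_pred)  # share of correct predictions
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)   # ranking quality across all thresholds
print(acc, f1, auc)
```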

Hyperparameter tuning was performed using both GridSearchCV and RandomizedSearchCV. XGBoost emerged as the top-performing model with the highest F1 and ROC-AUC.
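As one illustration of the tuning step, here is a RandomizedSearchCV sketch using Random Forest; the search space and data are illustrative placeholders, not the grids actually used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data for the sketch
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search space; the real project tuned both RF and XGBoost
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,
    scoring="f1",   # optimize the metric that drove model selection
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```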

Deployment

The deployment goal was to host the model via an API. However, AWS Lambda had package size limitations that made deploying XGBoost infeasible under the free tier. As a result, I pivoted to Streamlit Cloud, which integrates directly with GitHub and supports Python natively.


import joblib

# Persist the tuned model so the Streamlit app can load it at startup
joblib.dump(xgb_best, "xgb_model.pkl")
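On the Streamlit side, the app reloads the serialized model with joblib.load. A minimal dump-and-reload roundtrip, with a toy logistic regression standing in for the tuned XGBoost model:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy fitted model standing in for the tuned estimator
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Dump and reload the fitted model, as the deployed app does at startup
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model reproduces the original predictions exactly
same = bool((model.predict(X) == reloaded.predict(X)).all())
print(same)
```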

The final app was deployed from the following GitHub repository: github.com/jkazalekor/credit-default-app

Web App Features

  • User inputs financial info: credit limit, repayment history, and bill/payment amounts
  • Option to choose between the Random Forest and XGBoost models
  • Displays prediction: Will the client default?
  • Shows confidence probability and selected model's performance

App is live here: credit-default-app.streamlit.app

Reflections and Learnings

This project deepened my understanding of both data preprocessing and deployment challenges in machine learning. Choosing the right model and tuning it for real-world performance required both statistical and engineering tradeoffs. I learned:

  • Deployment constraints should be considered from the start
  • Random Forest and XGBoost perform well on tabular financial data
  • Feature importance scores can guide useful dimensionality reduction

Project Links