Predicting Credit Default Risk: A Full-Stack Data Science Project
Introduction
Credit default prediction is central to financial risk management. In this project, I built an end-to-end machine learning pipeline to predict whether a credit card client will default in the next month, trained on a dataset from the UCI Machine Learning Repository. The pipeline covers data preprocessing, feature reduction, model training (Logistic Regression, Random Forest, and XGBoost), and hyperparameter tuning. The final model was deployed with Streamlit Cloud; the live link is in the Project Links section at the end of this post.
Dataset Overview
The dataset contains 30,000 observations and 24 features including:
- LIMIT_BAL: Credit limit for the client
- SEX, EDUCATION, MARRIAGE: Demographic info
- PAY_0 to PAY_6: Repayment status over the last six months
- BILL_AMT1 to BILL_AMT6: Historical billing amounts
- PAY_AMT1 to PAY_AMT6: Repayment amounts
The target variable is default.payment.next.month, where 1 means the client defaulted and 0 means they paid successfully.
Exploratory Data Analysis
Initial exploration revealed no missing values. A class imbalance was evident: only ~22% of clients defaulted. Strong predictors included repayment history (especially PAY_1) and bill amounts. All features were numerical, allowing seamless integration into scikit-learn pipelines.
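The two checks above can be sketched as follows. The toy DataFrame here is a stand-in for the real 30,000-row UCI file; only the column names follow the dataset's convention.

```python
import pandas as pd

# Toy stand-in for the UCI frame (the real file has 30,000 rows)
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 50000],
    "PAY_0": [2, -1, 0, 0, -1],
    "default.payment.next.month": [1, 0, 0, 0, 1],
})

# Missing-value check: the full dataset has none
missing = df.isna().sum().sum()

# Class balance: fraction of clients who defaulted (~22% on the full data)
default_rate = df["default.payment.next.month"].mean()
print(f"Missing values: {missing}, default rate: {default_rate:.1%}")
```

On the real data, `default_rate` comes out near 0.22, which is what motivates looking at F1 and ROC-AUC rather than accuracy alone.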
Feature Selection and Preprocessing
I reduced the feature set from 24 to 10, keeping the most predictive ones such as LIMIT_BAL, PAY_1 (the most recent repayment status, named PAY_0 in the raw UCI file), PAY_AMT1, and BILL_AMT1. Continuous features were standardized with StandardScaler.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# In the full pipeline the scaler is fit on the training split only,
# then reused to transform the test split, to avoid leakage
X_scaled = scaler.fit_transform(df[feature_list])
Modeling and Evaluation
I evaluated three models: Logistic Regression (baseline), Random Forest, and XGBoost. The dataset was split using train_test_split with 20% for testing. Below is a snippet used to fit XGBoost:
from xgboost import XGBClassifier

# use_label_encoder=False silences a deprecation warning on xgboost 1.x;
# newer releases ignore the argument
xgb = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
xgb.fit(X_train, y_train)
Evaluation metrics:
- Accuracy: Measures overall correct predictions
- F1-Score: Harmonic mean of precision and recall, more informative than accuracy under class imbalance
- ROC-AUC: Model discrimination ability
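The split-and-score step can be sketched as below. This runs on synthetic data that mimics the ~22% positive rate; the model and variable names are illustrative, not the exact code from the repo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled feature matrix; ~22% positives
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.78], random_state=42)

# Stratified 80/20 split preserves the class imbalance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # ROC-AUC needs probabilities

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
print(f"Accuracy: {acc:.3f}  F1: {f1:.3f}  ROC-AUC: {auc:.3f}")
```

Note that ROC-AUC is computed from predicted probabilities, not hard labels, which is easy to get wrong when comparing models.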
Hyperparameter tuning was performed using both GridSearchCV and RandomizedSearchCV. XGBoost emerged as the top-performing model with the highest F1 and ROC-AUC.
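The randomized-search side of the tuning step looks roughly like this. The parameter grid here is hypothetical (the actual grids are in the repo), and a Random Forest stands in for XGBoost so the snippet runs without the xgboost package; the same pattern applies to either model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in so the example runs quickly
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.78], random_state=0)

# Hypothetical search space, sampled uniformly by RandomizedSearchCV
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [4, 6, 8, None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, scoring="f1", cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scoring on F1 rather than accuracy matters here: with 78% non-defaulters, accuracy rewards models that rarely predict the positive class.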
Deployment
The deployment goal was to host the model via an API. However, AWS Lambda had package size limitations that made deploying XGBoost infeasible under the free tier. As a result, I pivoted to Streamlit Cloud, which integrates directly with GitHub and supports Python natively.
import joblib

joblib.dump(xgb_best, "xgb_model.pkl")
The final app was deployed from the following GitHub repository: github.com/jkazalekor/credit-default-app
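A minimal sketch of the serialization round trip: the model is dumped once at training time and reloaded when the app starts. A Logistic Regression stands in here for the tuned XGBoost model so the example is self-contained.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model; the deployed app stores the tuned XGBoost the same way
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "xgb_model.pkl")
joblib.dump(model, path)    # done once, at training time
loaded = joblib.load(path)  # the Streamlit app does this at startup

# Round-trip sanity check: the restored model predicts identically
same = (loaded.predict(X) == model.predict(X)).all()
print(f"Predictions identical after reload: {same}")
```

One caveat worth remembering: joblib pickles are tied to the library versions used at dump time, so the deployment environment should pin the same scikit-learn/xgboost versions as training.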
Web App Features
- User inputs financial info: credit limit, repayment history, and bill/payment amounts
- Option to choose between the Random Forest and XGBoost models
- Displays prediction: Will the client default?
- Shows confidence probability and selected model's performance
App is live here: credit-default-app.streamlit.app
Reflections and Learnings
This project deepened my understanding of both data preprocessing and deployment challenges in machine learning. Choosing the right model and tuning it for real-world performance required both statistical and engineering tradeoffs. I learned:
- Deployment constraints should be considered from the start
- Random Forest and XGBoost perform well on tabular financial data
- Feature importance scores can guide useful dimensionality reduction
Project Links
- GitHub Repository: github.com/jkazalekor/credit-default-app
- Live Web App: credit-default-app.streamlit.app
- Dataset: UCI Machine Learning Repository (Default of Credit Card Clients)