Bank Loan Classifier Training
The purpose of this project is to train a classifier to identify whether or not a loan will be paid back based on past data.
Overview
A fictional credit lending company has tasked me with creating a classifier to determine whether or not a loan will be paid back. I have been supplied with a dataset containing the details of 9,500 previous loans. For this assignment, correctly identifying loans that will not be paid back is considered significantly more important than correctly identifying loans that will be.
Files: Python Notebook, Loan Data CSV
Objectives
- Inspect the dataset for variables correlated with unpaid loans
- Clean the dataset of potentially conflicting data
- Train a classifier to accurately detect whether a loan will be defaulted on
Process
- Data was acquired from an open source dataset (available here)
- Data was displayed using a correlation matrix in order to determine whether some of the variables were either uncorrelated or unnecessary for analysis.
- Data was cleaned and prepared for use in training a classifier.
- Because less data was available on unpaid loans than on paid ones, the unpaid loans were upsampled to give the classifier a more balanced training set.
- A classifier was trained to detect potential cases where loans will not be paid back.
Analysis
Potential outliers in data
My first step in analyzing the data was to look for outliers that might throw off accuracy in training. After reviewing the variables most correlated with unpaid loans in the generated correlation matrix, I concluded that no single data point was significantly skewing the data one way or the other.
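As an illustration of the correlation check described above, here is a minimal sketch that ranks variables by their Pearson correlation with the unpaid-loan label. It uses only the standard library and synthetic data; the column names (`interest_rate`, `noise`, `not_paid`) and the way the data is generated are assumptions made for the example, not details of the actual dataset or notebook.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

# Synthetic stand-in for the loan data; the column names are hypothetical.
rng = random.Random(0)
rows = []
for _ in range(1000):
    interest_rate = rng.uniform(0.05, 0.25)
    noise = rng.random()  # unrelated to the outcome by construction
    # Higher rates are made more likely to go unpaid, for illustration only.
    not_paid = 1 if rng.random() < interest_rate * 2 else 0
    rows.append({"interest_rate": interest_rate, "noise": noise, "not_paid": not_paid})

# Correlate each candidate variable with the label and rank by strength.
target = [r["not_paid"] for r in rows]
correlations = {
    name: pearson([r[name] for r in rows], target)
    for name in ("interest_rate", "noise")
}
for name, value in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>15}: {value:+.3f}")
```

Variables whose correlation with the label stays close to zero are candidates for removal, while a single extreme row that dominates a correlation would be worth inspecting as a possible outlier.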
Training of Classifier
As less data was available for unpaid loans than for paid loans, it was first necessary to oversample the unpaid loans. After some tweaking of the classifier's methodology, I was able to obtain a 67% accuracy rate in detecting unpaid loans within a test set.
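The oversampling step can be sketched as follows. This is a simplified illustration, not the notebook's actual code: it assumes the label is stored under a hypothetical `not_paid` key with 1 marking an unpaid loan, and it assumes the test set is held out before oversampling so that duplicated rows cannot appear in both sets.

```python
import random

def split_then_upsample(rows, label_key="not_paid", test_fraction=0.2, seed=0):
    """Hold out a test set, then upsample unpaid loans in the training set only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    minority = [r for r in train if r[label_key] == 1]
    majority = [r for r in train if r[label_key] == 0]
    if not minority or len(minority) >= len(majority):
        return train, test  # nothing to balance
    # Sample unpaid rows with replacement until both classes are the same size.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced, test

# Hypothetical data: roughly one loan in six is unpaid.
rows = [{"id": i, "not_paid": 1 if i % 6 == 0 else 0} for i in range(600)]
train, test = split_then_upsample(rows)
print(len(train), len(test))
```

Leaving the test set untouched means the reported detection rate reflects the original class balance rather than the artificially balanced one.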
Overall Accuracy of Classifier
Throughout the training of the classifier, some tradeoffs were necessary. Because detecting loans that are likely not to be paid back was more important than correctly identifying loans that would be, the model was trained more aggressively to detect those loans (bottom right). This led to a significant number of loans that would actually be paid back being misclassified (top right). Though not optimal, I consider this tradeoff acceptable given the goal of this project.
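To make the tradeoff concrete, the sketch below builds a 2x2 confusion matrix in the same layout as the one referenced above (rows are actual outcomes, columns are predictions) and shows how flagging loans more aggressively raises the share of unpaid loans caught while also flagging more loans that are actually paid back. Lowering a decision threshold is only one common way to get this effect and is not necessarily what was done in this project; the scores and outcomes are made up for the example.

```python
def confusion_matrix(actual, predicted):
    """2x2 counts: rows = actual class, columns = predicted class (0 = paid, 1 = not paid)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def unpaid_recall(matrix):
    """Share of truly unpaid loans that were flagged (bottom right over bottom row)."""
    missed, caught = matrix[1]
    return caught / (missed + caught) if (missed + caught) else 0.0

# Hypothetical predicted probabilities of non-payment, with the true outcomes.
scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.55, 0.30, 0.60, 0.70, 0.90]
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

for threshold in (0.5, 0.3):
    # A lower threshold flags more loans as likely to go unpaid.
    predicted = [1 if s >= threshold else 0 for s in scores]
    m = confusion_matrix(actual, predicted)
    print(f"threshold={threshold}: matrix={m}, "
          f"unpaid recall={unpaid_recall(m):.2f}, paid loans flagged={m[0][1]}")
```

In this toy data, moving the threshold from 0.5 to 0.3 catches every unpaid loan, but the number of paid loans wrongly flagged (top right) rises from 1 to 4.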
Conclusions
With 67% accuracy in identifying loans that will not be paid back, I believe this model would be useful for flagging loans at risk of non-payment. Because it misclassifies many loans that will remain in good standing, the model would most likely have to be used as a tool for identifying loans that require a closer watch or further investigation.