Identifying Fraudulent Credit Card Transactions
Creating and training a machine learning classifier to accurately identify fraudulent credit card transactions, with supplemental analysis of fraudulent charges
Scenario
A new credit card company has just entered the market in the western United States. The company is promoting itself as one of the safest credit cards to use. They have hired you as their data scientist in charge of identifying instances of fraud. The executive who hired you has provided you with data on credit card transactions, including whether or not each transaction was fraudulent. The executive wants to know how accurately you can predict fraud using this data. She has stressed that the model should err on the side of caution: it is not a big problem to flag transactions as fraudulent when they aren't, just to be safe. In your report, you will need to describe how well your model functions and how it adheres to these criteria.
Files: Python Notebook, Loan Data CSV, Trained Model
Objectives
- Inspect the dataset for the most common characteristics of fraudulent transactions
- Prepare and clean data for use in a classifier
- Train a machine learning model to accurately predict whether or not a charge is fraudulent
Process
- Data was acquired from an open source dataset (available here)
- Data was split into subsets grouped by different attributes in order to determine which attributes were most common in fraud
- Data was analyzed and visualizations were created in order to explore aspects of the data
- Data was cleaned and prepared for use in a machine learning model
- Model was trained to accurately detect whether a transaction was genuine or fraudulent
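The cleaning and preparation step can be illustrated with a minimal sketch. It is not the notebook's actual code: the column names (category, state, amt, is_fraud), the sample rows, and the helper functions are assumptions made for illustration. It only shows the general idea of converting string fields into numeric features and integer labels that a classifier can consume.

```python
# Hypothetical sample rows; the column names and values are assumptions,
# not taken from the project's dataset.
raw_rows = [
    {"category": "grocery_pos", "amt": "104.50", "state": "CA", "is_fraud": "1"},
    {"category": "shopping_net", "amt": "12.99", "state": "OR", "is_fraud": "0"},
    {"category": "grocery_pos", "amt": "58.10", "state": "WA", "is_fraud": "0"},
]


def build_codes(values):
    """Map each distinct string to an integer code (sorted for determinism)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def prepare(rows):
    """Convert string fields to numeric features and integer labels."""
    category_codes = build_codes(row["category"] for row in rows)
    state_codes = build_codes(row["state"] for row in rows)
    features = [
        [category_codes[row["category"]], state_codes[row["state"]], float(row["amt"])]
        for row in rows
    ]
    labels = [int(row["is_fraud"]) for row in rows]
    return features, labels


features, labels = prepare(raw_rows)
print(features)
print(labels)
```

Integer codes are the simplest encoding to show here; a real pipeline might prefer one-hot encoding for categorical columns, depending on the classifier.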
Analysis
Supplemental Analysis
Fraudulent Charges by Type
Analyzing all illegitimate charges in the dataset, I found that the category most responsible for fraud is 'Grocery Point of Sale' (grocery_pos), closely followed by online shopping (shopping_net).
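A ranking like this can be produced by counting fraudulent rows per category. The sketch below uses only the standard library and a handful of made-up rows; the field names (category, is_fraud), the helper function, and every value other than grocery_pos and shopping_net are assumptions. In a notebook, an equivalent grouped count with pandas would be a natural alternative.

```python
from collections import Counter

# Hypothetical sample; these counts are illustrative, not the project's results.
transactions = [
    {"category": "grocery_pos", "is_fraud": 1},
    {"category": "shopping_net", "is_fraud": 1},
    {"category": "grocery_pos", "is_fraud": 1},
    {"category": "gas_transport", "is_fraud": 0},
    {"category": "shopping_net", "is_fraud": 0},
]


def fraud_counts_by(rows, key):
    """Count fraudulent rows for each distinct value of the given field."""
    return Counter(row[key] for row in rows if row["is_fraud"] == 1)


category_counts = fraud_counts_by(transactions, "category")
ranked = category_counts.most_common()  # highest fraud count first
print(ranked)
```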
Fraud by State
Analysis of the dataset showed that the state responsible for the most charges was California. In my opinion, this isn't particularly useful for detecting whether or not a charge is fraudulent, as the data seems to come mostly from the western United States, and California is the most populous state in this region.
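One way to check whether a state's raw count simply reflects its transaction volume is to compare fraud rates instead of fraud counts. The following is a small sketch under stated assumptions: the (state, is_fraud) pairs are invented, and the function name is hypothetical. It shows how a state can have the most fraudulent charges while having a lower fraud rate than a smaller state.

```python
from collections import defaultdict

# Hypothetical (state, is_fraud) pairs: CA has more fraud in absolute terms,
# but a lower share of its transactions are fraudulent.
records = [("CA", 1)] * 2 + [("CA", 0)] * 6 + [("OR", 1)] + [("OR", 0)]


def fraud_rate_by_state(pairs):
    """Return the fraction of transactions marked fraudulent for each state."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for state, is_fraud in pairs:
        totals[state] += 1
        frauds[state] += is_fraud
    return {state: frauds[state] / totals[state] for state in totals}


rates = fraud_rate_by_state(records)
print(rates)
```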
Classifier
Accuracy of Classifier
Training the classifier involved converting all data within the dataset to usable data types and testing multiple methods of classification. After tweaking the values used to train the model, I was able to achieve 98% correct identification of legitimate transactions and 73% correct identification of fraudulent transactions (see below for full confusion matrix), which I believe is an acceptable level of accuracy.
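The two percentages quoted here are per-class rates that can be read off a confusion matrix: the share of fraudulent transactions caught (recall) and the share of legitimate transactions correctly passed (specificity). The sketch below shows the calculation on hypothetical labels chosen only so that the rates come out to 73% and 98%; the counts are not the project's actual confusion matrix, and the function name is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means fraud."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:  # actual == 1 and predicted == 0
            fn += 1
    return tp, fp, tn, fn


# Hypothetical labels: 100 fraudulent and 1000 legitimate transactions.
y_true = [1] * 100 + [0] * 1000
y_pred = [1] * 73 + [0] * 27 + [1] * 20 + [0] * 980

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
fraud_recall = tp / (tp + fn)      # share of fraud correctly flagged
legit_specificity = tn / (tn + fp) # share of legitimate charges correctly passed
print(fraud_recall, legit_specificity)
```

Reporting the rates per class matters for imbalanced data, because overall accuracy would be dominated by the much larger legitimate class.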
Conclusions
Because it correctly identifies 73% of illegitimate charges, this model would be useful as part of a system that blocks fraudulent charges when they are made. In addition, the model returns false positives for only 2% of legitimate transactions. If used with a secondary system, this could definitely be a beneficial first step in protecting customers from credit card theft or fraud.
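To see why pairing the model with a secondary system is sensible, it helps to translate the rates into expected volumes. The arithmetic below is only a sketch: the 73% and 2% rates come from the report, but the batch size and the 1% fraud prevalence are hypothetical assumptions, so the resulting numbers are illustrative rather than measured.

```python
# Rates taken from the report.
fraud_detection_rate = 0.73   # share of fraudulent charges flagged
false_positive_rate = 0.02    # share of legitimate charges flagged

# Hypothetical assumptions for illustration only.
total_transactions = 100_000
assumed_fraud_share = 0.01

fraud_total = round(total_transactions * assumed_fraud_share)
legit_total = total_transactions - fraud_total

true_flags = round(fraud_total * fraud_detection_rate)   # fraud correctly flagged
false_flags = round(legit_total * false_positive_rate)   # legitimate charges flagged
flagged = true_flags + false_flags
share_of_flags_that_are_fraud = true_flags / flagged

print(true_flags, false_flags, flagged)
print(round(share_of_flags_that_are_fraud, 3))
```

Under these assumed numbers most flagged charges would be legitimate, which fits the scenario's preference for erring on the side of caution and leaves room for a second check before a charge is blocked outright.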