Chapter 3-Predictive Analytics with Ensemble Learning/

Started: 2025-11-28

Python Seaborn NumPy

Project Progress 100%

About this project

Online Payments Fraud Detection

Online Payments Fraud Detection with Machine Learning

Author: Leon Motaung

Technologies: Python, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Plotly

Project Overview

This project focuses on detecting fraudulent online payment transactions using machine learning. The dataset contains 284,808 transactions, each described by 30 input features: 28 anonymized features (V1–V28) plus Amount and Time. The target column is Class (fraud = 1, normal = 0). The dataset is highly imbalanced, with far fewer fraudulent transactions than normal ones.

Process & Steps Taken

Data Cleaning and Preprocessing: Handled missing values, corrected data types, scaled the Amount and Time features, removed outliers using the interquartile range (IQR) rule, and added a log-transformed scaled_amount feature. The cleaned dataset was saved as creditcard_final.csv.
Data Visualization: Used boxplots, histograms, correlation heatmaps, and scatterplots to explore feature distributions, detect anomalies, and understand the class imbalance.
Baseline Models: Trained Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN). Focused on Recall and F1-score due to class imbalance.
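The IQR filtering and log transform mentioned in the preprocessing step can be sketched as follows. This is a minimal, dependency-free illustration, not the project's actual code: the helper names (iqr_bounds, log_scale), the 1.5 multiplier, the use of log1p, and the sample amounts are all assumptions made for the example. In the project itself the same steps would more likely be done with Pandas and NumPy.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def log_scale(amounts):
    """Apply log(1 + x), which compresses a long right tail and is defined at 0."""
    return [math.log1p(a) for a in amounts]


# Hypothetical transaction amounts, not taken from the dataset.
amounts = [1.0, 2.5, 3.0, 4.75, 5.0, 6.2, 7.0, 8.1, 9.9, 2500.0]

lower, upper = iqr_bounds(amounts)
kept = [a for a in amounts if lower <= a <= upper]  # drop values outside the fences
scaled_amount = log_scale(kept)

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(f"kept {len(kept)} of {len(amounts)} amounts")
```

With these sample values, only the single extreme amount falls outside the fences and is dropped before the log transform is applied.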

Baseline Model Evaluation

Model	Accuracy	Precision	Recall	F1-score	Notes
Logistic Regression	0.9768	0.0577	0.8929	0.1083	Simple, interpretable baseline
Decision Tree	0.9993	0.8478	0.6964	0.7647	Feature importance can be visualized
K-Nearest Neighbors	0.9995	0.9744	0.6786	0.8000	Works well for small datasets
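As a quick consistency check on the table, the F1-score can be recomputed from the reported Precision and Recall, since F1 is their harmonic mean. The sketch below uses only the numbers from the table above; the function name f1_from_pr is an illustrative choice, not part of the project.

```python
def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation table.
reported = {
    "Logistic Regression": (0.0577, 0.8929, 0.1083),
    "Decision Tree": (0.8478, 0.6964, 0.7647),
    "K-Nearest Neighbors": (0.9744, 0.6786, 0.8000),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from_pr(precision, recall)
    print(f"{name}: recomputed F1 = {f1:.4f}, reported F1 = {f1_reported:.4f}")
```

The recomputed values agree with the table to within rounding of the four-decimal inputs.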

Findings & Insights

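Class Imbalance Matters: Accuracy alone is misleading; Recall and F1-score are more meaningful for fraud detection. Logistic Regression, for example, reached 0.9768 accuracy and the highest Recall (0.8929), but its Precision was only 0.0577, so most of its fraud alerts were false alarms.
Decision Tree and KNN performed well: Both models gave the strongest F1-scores (0.7647 and 0.8000), with the Decision Tree also providing feature importance insight. Both, however, had lower Recall than Logistic Regression (0.6964 and 0.6786).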
Feature Relationships: Correlation heatmaps revealed subtle patterns useful for feature engineering.
Visualization is Key: Scatterplots and boxplots helped detect anomalies and better understand distributions.
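The first finding, that accuracy is misleading on imbalanced data, can be illustrated with a trivial classifier that labels every transaction as normal. The counts below (20 frauds in 10,000 transactions) are hypothetical and chosen only to make the arithmetic easy to follow; they are not the dataset's real class distribution.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # By convention, report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical imbalanced sample: 20 frauds among 10,000 transactions.
y_true = [1] * 20 + [0] * 9980
# A "model" that never predicts fraud.
y_pred = [0] * len(y_true)

scores = evaluate(y_true, y_pred)
print(scores)
```

The always-normal classifier scores very high on accuracy while catching no fraud at all, which is why Recall and F1-score were used to compare the baselines.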

Next Steps

Handle class imbalance with SMOTE or undersampling.
Train advanced models: XGBoost, LightGBM.
Evaluate models using Precision, Recall, F1-score, ROC-AUC.
Deploy model via Flask or Streamlit dashboard for real-time fraud detection.
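Of the options listed for handling class imbalance, random undersampling is the simplest to sketch without extra libraries. The example below is an assumption-laden illustration, not the project's planned implementation: the function name undersample, the row layout (a list of dicts with a Class key), the fixed seed, and the synthetic rows are all hypothetical. SMOTE, by contrast, would typically come from a separate library and is not shown here.

```python
import random


def undersample(rows, label_key="Class", seed=42):
    """Keep all fraud rows and an equally sized random sample of normal rows."""
    rng = random.Random(seed)  # local generator so results are reproducible
    fraud = [r for r in rows if r[label_key] == 1]
    normal = [r for r in rows if r[label_key] == 0]
    sampled_normal = rng.sample(normal, k=min(len(fraud), len(normal)))
    balanced = fraud + sampled_normal
    rng.shuffle(balanced)  # avoid having all fraud rows grouped at the start
    return balanced


# Synthetic rows for illustration: 5 frauds among 1,000 transactions.
rows = [{"Amount": float(i), "Class": 1 if i < 5 else 0} for i in range(1000)]

balanced = undersample(rows)
n_fraud = sum(r["Class"] for r in balanced)
print(f"{len(balanced)} rows after undersampling, {n_fraud} of them fraud")
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution and the reported metrics stay realistic.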

Project Structure

app.py – Main fraud detection script
draw.py – Visualization scripts
creditcard_final.csv – Cleaned dataset
Images – Boxplots, scatterplots, heatmaps, charts

This project gave me hands-on experience in data preprocessing, visualization, and baseline model evaluation. It reinforced the importance of appropriate evaluation metrics for imbalanced datasets and prepared me to tackle more advanced predictive analytics tasks.