Towards Optimizing Performance of Machine Learning Algorithms on Unbalanced Dataset

Authors

Asitha Thumpati and Yan Zhang, California State University San Bernardino, USA

Abstract

Imbalanced data, a common occurrence in real-world datasets, presents a challenge for machine learning classification models. These models are typically designed under the assumption of balanced class distributions, leading to lower predictive performance on imbalanced data. To address this issue, this paper applies data preprocessing techniques, including the Synthetic Minority Oversampling Technique (SMOTE) for oversampling and random undersampling, to unbalanced datasets. Additionally, genetic programming is used for feature selection to improve both performance and efficiency. In our experiments, we use an imbalanced bank marketing dataset from the UCI Machine Learning Repository. To evaluate the effectiveness of these techniques, we apply them to four classification algorithms: Decision Tree, Logistic Regression, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). We compare evaluation metrics, including accuracy, balanced accuracy, recall, F-score, the Receiver Operating Characteristic (ROC) curve, and the Precision-Recall (PR) curve, across four scenarios: unbalanced data, oversampled data, undersampled data, and data cleaned with Tomek links. Our findings show that all four algorithms perform better when the minority class is oversampled to half the size of the majority class and the majority class is then undersampled to match the minority class. Applying Tomek links to the balanced dataset further improves performance.
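As a rough illustration of the resampling strategy the abstract describes (oversample the minority class to half the majority's size, then undersample the majority to match), here is a minimal NumPy sketch. The pairwise-interpolation oversampler is a simplified stand-in for SMOTE (which interpolates toward k-nearest neighbors), the toy two-feature data and 0/1 labels are illustrative assumptions, and the Tomek-links cleaning step is omitted; this is not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_oversample(X_min, n_new, rng):
    """Generate n_new synthetic minority points by interpolating between
    random pairs of minority samples (a simplified SMOTE-style step;
    real SMOTE interpolates toward k-nearest neighbors)."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    gap = rng.random((n_new, 1))  # interpolation fractions in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])

def rebalance(X, y, rng):
    """Oversample the minority (label 1) to half the majority size,
    then randomly undersample the majority (label 0) to match it."""
    X_maj, X_min = X[y == 0], X[y == 1]
    target_min = len(X_maj) // 2
    if len(X_min) < target_min:
        X_new = smote_like_oversample(X_min, target_min - len(X_min), rng)
        X_min = np.vstack([X_min, X_new])
    # Undersample the majority to the (now enlarged) minority size.
    keep = rng.choice(len(X_maj), size=len(X_min), replace=False)
    X_maj = X_maj[keep]
    Xb = np.vstack([X_maj, X_min])
    yb = np.array([0] * len(X_maj) + [1] * len(X_min))
    return Xb, yb

# Toy imbalanced data: 1000 majority vs 100 minority samples.
X = np.vstack([rng.normal(0, 1, (1000, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 1000 + [1] * 100)
Xb, yb = rebalance(X, y, rng)
# After rebalancing, both classes have 500 samples each.
```

On the bank marketing dataset the paper instead uses standard SMOTE and random undersampling (as implemented in common libraries), but the class-size targets shown here match the configuration the abstract reports as best-performing.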

Keywords

Unbalanced Dataset, Oversampling, Undersampling, Feature Selection, Classification

Full Text | Volume 13, Number 19