
Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering

Abstract

The classification of imbalanced datasets is a prominent task in text mining and machine learning. In an imbalanced dataset, the number of samples is not uniformly distributed across classes; one class contains a large number of samples while another has very few. Training on such data leads to model overfitting and, consequently, poor performance. In this study, we compare different oversampling techniques, namely the synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), Borderline SMOTE, K-means SMOTE, and adaptive synthetic (ADASYN) oversampling, to address the issue of imbalanced datasets and enhance the performance of machine learning models. Preprocessing significantly enhances the quality of input data by reducing noise, redundant data, and unnecessary data. This enables the machines to identify crucial patterns that facilitate the extraction of significant and pertinent information from the preprocessed data. This study preprocesses the data using several standard preprocessing steps. Furthermore, two imbalanced Twitter datasets are used to compare the performance of the oversampling techniques with six machine learning models: random forest (RF), SVM, K-nearest neighbor (KNN), AdaBoost (ADA), logistic regression (LR), and decision tree (DT). In addition, the bag of words (BoW) and term frequency-inverse document frequency (TF-IDF) feature extraction approaches are used to extract features from the tweets. The experiments indicate that SMOTE and ADASYN perform much better than the other techniques, providing higher accuracy. Overall, SVM with a linear kernel attains the highest accuracy of 99.67% and a recall score of 1.00 on the ADASYN-oversampled dataset, and 99.57% accuracy on the SMOTE-oversampled dataset, using TF-IDF features. In 10-fold cross-validation experiments, the SVM model achieved a mean accuracy of 97.40% with a standard deviation of 0.008. Our approach achieved 2.62% greater accuracy compared to other current methods.

Introduction

Text mining and machine learning with imbalanced datasets have received an increasing amount of attention in recent years, both from theoretical and practical standpoints, and pose a considerable classification challenge. Datasets are imbalanced when one or more classes have a much lower number of samples than the other classes, resulting in a skewed class distribution. A variety of strategies is available to deal with this issue; approaches that produce synthetic data to establish a balanced class distribution are more adaptable than methods that manually tweak the data. Imbalanced classification arises in many applications, including text categorization [1], information retrieval and filtering [2], and fraud detection [3].

Class imbalance has a negative impact on the prediction capabilities of classification methods. The prediction accuracy of machine learning algorithms is often used to evaluate their overall performance, and many of these algorithms are designed to maximize classification accuracy, a metric that is skewed in favor of the majority class. A classifier can thus obtain high classification accuracy even if it does not accurately predict a single occurrence of a minority class. According to Japkowicz and Stephen [4], the complexity of the model increases with the intensity of the class imbalance, and the impact of this issue on the classifiers is intensified as the quantity of training examples decreases. Another study [5] focuses on the class imbalance problem encountered when utilizing machine-based models to analyze the sentiment of tweets. The authors performed experiments on an imbalanced dataset collected from Twitter using two well-known classifiers to prove their point. Later, the synthetic minority oversampling technique (SMOTE) was used to ensure that the dataset was properly balanced. Their findings demonstrated that using an oversampling strategy can improve the performance of machine-based models.

Similarly, the study [6] employed SMOTE to balance the datasets and investigated the classification of political tweets in two different languages. The obtained findings demonstrate that their strategy handles the class imbalance issue by improving recognition of minority classes while simultaneously achieving a substantial increase in the overall geometric mean criterion. Despite several works investigating the influence of individual oversampling approaches, the literature lacks a comprehensive study that compares the performance of various oversampling approaches.

Scope and importance

Oversampling is a machine learning approach that can successfully handle issues associated with class imbalance. Class imbalance arises when the number of samples in different classes differs drastically, which deprives the minority class of sufficient training examples. To address this issue, oversampling procedures deliberately increase the representation of the minority class in the sample. This leaves room for improving the efficiency of machine learning techniques such as neural networks, random forests, and support vector machines (SVM): training on a more evenly distributed dataset allows these algorithms to better predict outcomes on novel data and to learn sounder decision boundaries. To adapt to different data distributions, quickly manage large datasets, and fine-tune algorithmic parameters, oversampling approaches draw on powerful machine learning techniques. Furthermore, machine learning can facilitate the development of sophisticated sampling algorithms that produce synthetic samples while maintaining crucial properties of the data.

Machine learning pipelines rely heavily on feature engineering, which includes the steps to clean and prepare raw data for use in training the models. Two widely used methods for extracting text features are BoW and TF-IDF. BoW represents each document as a vector of term frequencies over the corpus. While this method works well for sentiment analysis, document categorization, and spam detection, it does not take word order or other contextual aspects into account. After processing the text, the Bag-of-Words (BoW) technique converts it into a high-dimensional sparse matrix in which each row represents a document and every column represents a word. The TF-IDF algorithm improves upon the BoW approach by taking document-level word importance into account in the context of the full corpus: TF and IDF are multiplied so that frequent words become less important and rare terms more significant in the computation. Such features support the analysis of textual data, including consumer comments, social media posts, and product evaluations. Common approaches employed in this study include term frequency, BoW, and TF-IDF. Information retrieval methods based on TF-IDF and BoW match user queries against relevant documents, using the textual content of the documents to evaluate their relevance. Businesses have begun integrating feature engineering methods like TF-IDF and BoW, together with oversampling methods, into machine learning to make predictive models more accurate and versatile for real-world applications like sentiment analysis and fraud detection. Better decision-making, in turn, can only increase a company's value.

Most modern companies embrace the strategy of continuously enhancing their products. Many businesses use customer feedback as a strategic tool to assess their customers' perceptions. Remote monitoring, surveys, assessments, and questionnaires are all effective methods for collecting valuable customer feedback, which companies can leverage to enhance their products and services. Consumers rely on online resources and social media platforms to make informed decisions about the most suitable products. E-commerce has become increasingly essential in our daily lives, offering a wide range of information, communication, educational, shopping, and entertainment options. Twitter is highly valuable for businesses compared to other platforms because it enables users to share concise or detailed feedback on any product. Organizations and businesses often face difficulties in gathering tweets and analyzing the sentiments expressed in them. Automated sentiment analysis enables rapid evaluation and categorization of tweets as positive or negative.

Contributions

This paper focuses on analyzing the suitability and efficacy of different sampling approaches to solve the problem of imbalanced datasets and improve the performance of machine learning models. If the classes are not properly balanced, the model's performance varies depending on whether the number of class samples is low or high. This research makes use of two extremely imbalanced Twitter datasets, the EndViolence tweets dataset and the E-commerce-related tweets dataset, both obtained from Kaggle. This study makes the following contributions in this regard:

  1. The impact of various sampling approaches on the performance of machine learning models for imbalanced datasets is investigated. For this purpose, five sampling approaches are employed: SMOTE, support vector machine SMOTE (SVM-SMOTE), K-means SMOTE, adaptive synthetic (ADASYN) oversampling, and Borderline SMOTE. These sampling approaches are selected for their wide use in the existing literature and their reported performances.

  2. To analyze the influence of these approaches on the performance of models, six widely used machine learning models are selected: random forest (RF), SVM, K-nearest neighbor (KNN), AdaBoost (ADA), logistic regression (LR), and decision tree (DT). The performance is measured in terms of accuracy, precision, recall, and F1 score.

  3. Two publicly available and highly imbalanced datasets are utilized in experiments to check the performance of oversampling techniques on machine learning models. For feature engineering, the bag of words (BoW) and term frequency-inverse document frequency (TF-IDF) approaches are also utilized.

The rest of the paper is organized into four sections. Section "Related work" covers the relevant work on oversampling approaches. Section "Materials and proposed methodology" presents the methodology adopted to carry out the experiments. Section "Results and discussions" discusses the experiments and results of our study, while section "Conclusion" provides the conclusion.

Related work

Data balancing is an important task to reduce model skewness, and several works have made use of oversampling approaches to reduce model overfitting and increase performance [7,8,9]. Sarakit et al. [10] employed the SMOTE method to detect emotion in imbalanced YouTube datasets using three machine learning classifiers. The findings suggest that the proposed method improves emotion classification while also addressing the issue of an imbalanced dataset. For the categorization of toxic comments, Rupapara et al. [11] introduced an ensemble approach based on imbalanced features, with SMOTE used for data balancing. A soft voting ensemble model is used that combines the LR and support vector classifiers, and BoW and TF-IDF features are used for the proposed approach. Using TF-IDF features and the SMOTE method, their RVVC model achieves 97% accuracy.

Flores et al. [12] used the SMOTE method for sentiment analysis to test SVM and Naive Bayes on imbalanced datasets. Results show that preprocessing, an effective train-test split, and data balancing are all prevalent factors in achieving improved results. For sentiment analysis of Arabic tweets related to COVID-19, the study [13] used ensemble classifiers with SMOTE; Word2vec embeddings are employed, and single and ensemble classifiers are evaluated with and without SMOTE. The ensemble approach using word embedding and the SMOTE technique outperforms the competition. The research [14] focused on the performance of machine learning models in determining the polarity score for extremely imbalanced datasets using word-embedding features. A number of basic classifiers and ensemble classifiers are investigated with and without SMOTE. The results show that using ensemble methods with SMOTE and word embeddings improved the F1 score by more than 15% on average over the baseline technique. Another research [15] classified news using the SMOTE and Borderline SMOTE techniques on imbalanced datasets to show that using SMOTE produces better results.

Along the same lines, the study [16] used oversampling approaches to overcome the problem of imbalanced dataset classes in sarcasm detection from social media tweets. The authors demonstrated that SMOTE and Borderline SMOTE are efficient in classifying sarcastic sentiments. Borderline SMOTE and ADASYN oversampling approaches for multi-class text classification problems are compared in the study [17], which uses KNN and SVM models; results show the supremacy of SMOTE in improving the performance of the models. Another research [18] employed the SMOTE approach for sentiment and emotion classification on an imbalanced dataset. Sentiment analysis of spam reviews is carried out in [19] using the SMOTE approach. Similarly, the study [20] employed machine learning techniques and the K-means SMOTE method to balance the dataset for predicting school student performance, and showed superior results using the balanced dataset.

The study [21] conducted sentiment analysis on tweets concerning online education using an imbalanced dataset. The study employed an e-learning dataset that was extracted from the Twitter API using various keywords. The authors employed lexicon-based and feature engineering techniques to label the tweets and extract features from them. The experiments are carried out with machine and deep learning models on SMOTE-balanced datasets, and the performance of the models is better with SMOTE. The authors employ SMOTE in [22] to undertake a comparative analysis of three classifiers for sentiment analysis; the study states that SMOTE used with TF-IDF features provides the best results. Another research [23] analyzed the sentiments of people regarding e-sports education using an imbalanced Twitter dataset. The study compared the performance of Naive Bayes and SVM using SMOTE oversampling. Experimental results reveal that the Naive Bayes technique has the best accuracy score when used with the SMOTE-balanced dataset.

Balaji et al. [24] employ robust machine learning methods to provide a comprehensive evaluation of numerous applications of sentiment analysis. The study commences with an examination of machine learning methods specifically utilized for sentiment analysis, followed by a broader review of machine learning methodologies for sentiment analysis and a discussion of the limitations and benefits of employing machine learning in social media analysis. Another study [25] offers a comprehensive statistical analysis of the Extensive Feature Selector (EFS), a new method for feature selection that uses class-based and corpus-based probabilities. On four benchmark datasets, it compares EFS to nine alternative feature selection approaches using KNN, support vector machine, and multinomial Naive Bayes classifiers. The results show that out of the ten methods tested, EFS consistently yields the best results.

The study [26] examined how globalization tactics impacted local feature selection (LFS) techniques using feature-rich datasets. The authors used the weighted-sum (AVG), summation, and maximum approaches to analyze globalization, and examined the effects of the globalization initiatives using the DFSS, odds ratio (OR), and chi-square (CHI2) techniques. According to the findings, the AVG technique achieved the highest degree of globalization success. The DFSS approach outperformed OR and CHI2 in terms of MCB and MCU characteristics, while the CHI2 method was found to be more accurate than the DFSS and OR techniques.

The paper [27] conducted a comprehensive examination of the DL and ML techniques used in the diagnosis of depression and also emphasized the constraints of the current research efforts; however, this research lacks a comprehensive review of prior investigations. The authors in the paper [28] used a neural network architecture to develop a deep learning model that accurately identifies the musical genre from aural inputs. The proposed approach boasted an impressive accuracy rate of 90.3%, surpassing the results of competing research. The deep model's performance consistency was evaluated by conducting K-fold cross-validation with different values of k. The study [29] provided a cutting-edge deep learning model. By deftly combining deep learning approaches with various word embedding techniques, the model conducts multi-class sentiment analysis on a dataset consisting of tweets from six prominent US airlines. The selected systems use a range of deep learning approaches to detect emotional content and embed words. The approach begins with cleaning the tweets and applying pre-processing algorithms to the raw data before feeding it to the CNN and DNN models.

There is a lot of research on oversampling and performance analysis that uses ensemble methods or simple machine learning models. These works used word-embedding or bag-of-words models to extract semantic information from texts. Even though these models work, they struggle to generalize to complicated speech patterns and attitudes that depend on context. In machine learning, class imbalance, i.e., the underrepresentation of one sentiment class, is a major issue: inequalities in the data may make it harder for skewed models to work well on minority classes. In an effort to address this problem, the existing literature has focused almost exclusively on SMOTE and Borderline SMOTE. Since these methods do not always perform well, how best to address class imbalance remains an open question in this research area.

Text data preparation is another essential yet frequently overlooked aspect of tweet analysis in machine learning research. Preparing raw text for machine learning models is an essential step, yet numerous studies either do not preprocess at all or employ methods that consume a great deal of computational power without significantly enhancing performance. Ineffective preparation affects the model's performance, the duration of the training process, and operational costs. Carelessly normalized and extracted features from textual data can be problematic, degrading the performance of sentiment classifiers.

Another drawback of existing approaches is the selection of appropriate oversampling and feature engineering techniques. The authors in studies such as [14,15,16,17,18,19] do not select appropriate sampling and feature techniques. Some studies used only the SMOTE technique to validate the performance of machine learning on balanced data; a single oversampling technique alone does not validate or provide assurance of the superiority of a model in text classification. Feature engineering is also crucial to enhancing the performance of models. Another shortcoming is the use of basic machine learning models without optimization and hyperparameter tuning when training on large datasets.

In the literature, we noticed that without proper preprocessing, an inadequate selection of oversampling and feature engineering techniques may lead to unsatisfactory results. The authors in the study [10] only utilized SMOTE and simple models; the performance of the machine learning models was not satisfactory, and hence the overall results were very poor. The study [11] also utilized SMOTE with ensemble ML models. An ensemble of three simple models takes a long time to process and produce predictions, and using only a single SMOTE technique cannot ensure that the ensemble works best on the balanced data. The study [12] likewise utilized the single SMOTE technique and did not apply data preprocessing properly to reduce the training resources; their study achieved poor results. The study [14] employed SMOTE to balance and enhance the dataset samples, but conducted experiments with only 1798 samples in total, which is very limited and causes overfitting with ensemble methods; despite using SMOTE, they achieved only 78% accuracy.

From the above-discussed research works, several points can be deduced. First, the performance of machine learning models tends to improve when a balanced dataset is used: dataset balancing provides an approximately equal number of samples for model training and overcomes the problem of model skewness toward the majority class. Second, SMOTE is the most widely used oversampling approach for data balancing and is predominantly used by existing studies. Third, TF-IDF is most widely used along with SMOTE and shows better results than other feature engineering approaches. Last but not least, despite several works investigating the influence of oversampling approaches on the performance of machine learning models, studies analyzing the comparative performance of various oversampling approaches rarely exist. Therefore, this study selects the most commonly used oversampling approaches and performs a comparative study to fill this gap. Table 1 summarizes the state-of-the-art works along with their limitations and contributions.

Table 1 Summary of the state-of-the-art works along with their limitations and contributions

Materials and proposed methodology

This section describes the datasets, oversampling approaches, machine learning models, and evaluation parameters used in this study in detail. Figure 1 shows the flow of the methodology adopted for this study. Starting with the data acquisition, the study performs the preprocessing followed by applying various oversampling approaches to balance the dataset. The data is then split into train and test subsets for training and testing, respectively.

Fig. 1
figure 1

Proposed work diagram

Description of datasets

This study utilized the EndViolence dataset taken from the Kaggle databases [30]. The Twitter API and Python code were used to gather the tweets: a daily search for the hashtag #EndViolence was carried out for a specified number of days in order to acquire a larger number of tweets. The tweets include information such as the user's name, location, description, time, followers, friends, favorites, verified status, hashtags, and source. A total of 10,000 tweets were collected, of which 8537 are neutral and 919 are positive, while 544 are deemed negative. The second dataset is related to E-commerce tweets and is also taken from Kaggle; it contains tweets related to various online e-commerce websites, online business stores, etc. Both datasets have a highly imbalanced distribution of positive, negative, and neutral tweets, so we are dealing with multiclass text classification on imbalanced data. Several oversampling procedures are employed to address the issue of imbalanced classification, in which one class has a much higher number of samples compared to the others; the problems of the minority classes are addressed using these sampling techniques. The quantity of the Twitter datasets before and after performing the sampling techniques is presented in Table 2.

Table 2 The quantity of the Twitter dataset before and after performing sampling techniques

Preprocessing

Preprocessing is the process by which unstructured data is transformed into representations that are intelligible and suitable for machine learning models [31]. Preprocessing is largely utilized in order to improve the quality of input data by minimizing the amount of noise, redundant data, and unnecessary data. This allows the machines to detect the essential patterns that can be used to extract meaningful and relevant information from the preprocessed data. The data in this research is preprocessed using the steps described below in Table 3.

Table 3 Pre-processing steps followed in this study
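As a rough illustration, the following minimal Python sketch applies tweet-cleaning steps of the kind typically listed in such pipelines (lowercasing, URL/mention/hashtag removal, punctuation stripping, stopword removal, and stemming). The exact steps and their order used in this study are those of Table 3, so this is an assumed, representative pipeline rather than the authors' exact code.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # fetch the stopword list once

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(tweet: str) -> str:
    """Apply common tweet-cleaning steps (assumed, not the paper's exact list)."""
    text = tweet.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)           # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)          # keep letters only
    tokens = [stemmer.stem(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

print(preprocess("Help create multiple online platforms! #EndViolence https://t.co/xyz"))
```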

Feature engineering

For training the machine learning model, feature engineering is used to extract important information from the raw data. Preparing data for the models is a critical step; before building a learning model, one must first pick what features to employ [32]. In order to increase prediction accuracy, learning algorithms use features retrieved from data depending on the data’s structure. The model’s performance is improved as a result of selecting the appropriate feature extraction approach. For sentiment classification, the choice of feature engineering approach plays a significant role. In view of their wide use for sentiment classification, this study uses BoW and TF-IDF features.

In feature engineering, features are extracted from the tweet datasets using the Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction methods. The utilization of the BoW method in this work was motivated by its user-friendly nature, computational efficiency, and satisfactory performance on extensive textual datasets. However, to overcome the constraints associated with BoW and enhance the efficacy of the feature engineering process, we also employed the more sophisticated TF-IDF methodology. These methodologies demonstrate enhanced efficiency in handling intricate textual data. The utilization of other methodologies, such as word embeddings and transformers, is limited in our study due to their high computing requirements and intricate architectural design. Transformers have a substantial computational overhead, necessitating a significant amount of time for computation, and their feature extraction technique is characterized by a higher degree of complexity compared to alternative methodologies. Furthermore, such approaches can exhibit limitations in successfully capturing contextual information and can introduce biases in the data as a result of their large model size and limited training data.

Bag of words

The BoW approach is easy to comprehend and implement, and it provides a great deal of customization for individual text data. It has been utilized successfully in a variety of prediction tasks, including language modeling and document classification. The BoW model [33] disregards the syntax and order of a text but retains the frequency score of its words, which aids in feature generation. The BoW features of the following sample texts are represented in Table 4

Sample 1: help create multiple online platforms work online

Sample 2: help multiple internet platforms

Sample 3: collaborate across online media

Table 4 BoW features for the sample text
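The following sketch reproduces this kind of BoW matrix for the three samples above using scikit-learn's CountVectorizer; the resulting counts illustrate the structure of Table 4, although the column ordering may differ from the table.

```python
from sklearn.feature_extraction.text import CountVectorizer

samples = [
    "help create multiple online platforms work online",
    "help multiple internet platforms",
    "collaborate across online media",
]

vectorizer = CountVectorizer()            # BoW: raw term counts, word order ignored
bow = vectorizer.fit_transform(samples)   # sparse document-term matrix

print(vectorizer.get_feature_names_out()) # vocabulary (matrix columns)
print(bow.toarray())                      # one row per sample, one column per word
```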

Term frequency-inverse document frequency

TF-IDF uses word-frequency scores to assign weights to the most important terms in a text, even if they do not appear frequently in the content as a whole. It is an approach for producing sparse features from text data. The frequency of a word in a document is calculated and multiplied by its inverse document frequency, which is derived from the number of documents containing that word. TF is calculated using

$$\begin{aligned} tf = \frac{f_{xy}}{n_y} \end{aligned}$$
(1)

In Eq. (1), \(f_{xy}\) is the occurrence of term x in document y and \(n_y\) is the total words in the given document. The inverse document frequency of each term is calculated by

$$\begin{aligned} idf=1+log \frac{N}{d_{fx}} \end{aligned}$$
(2)

In Eq. (2), idf represents the inverse document frequency, N represents the total number of documents in the corpus, and \(d_{fx}\) represents the number of documents in which term x appears. The log function is used to streamline interpretation and to compress the influence of a large corpus. The TF-IDF weight of each term is then calculated by

$$\begin{aligned} w_{xy} = tf_{xy} \times idf_x \end{aligned}$$
(3)

In Eq. (3), we multiply \(tf_{xy}\) and \(idf_{x}\) to attain the TF-IDF score \(w_{xy}\) for term x in document y. Machine learning models do not take data in textual form; all the data is vectorized by TF-IDF and fed to the machine learning models for training [34].
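As a worked illustration of Eqs. (1)-(3), the sketch below computes a TF-IDF weight by hand on the sample texts from the previous subsection. Note that library implementations such as scikit-learn's TfidfVectorizer use smoothed variants of the IDF, so their numbers will differ slightly from this direct translation of the formulas.

```python
import math

docs = [
    "help create multiple online platforms work online".split(),
    "help multiple internet platforms".split(),
    "collaborate across online media".split(),
]
N = len(docs)  # total number of documents in the corpus

def tfidf(term: str, doc: list[str]) -> float:
    tf = doc.count(term) / len(doc)    # Eq. (1): f_xy / n_y
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = 1 + math.log(N / df)         # Eq. (2), using the natural log
    return tf * idf                    # Eq. (3): w_xy

print(round(tfidf("online", docs[0]), 4))  # frequent in doc 0, present in 2 of 3 docs
print(round(tfidf("help", docs[0]), 4))
```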

Comparison between TFIDF and BoW

TF-IDF is considered a more advanced and efficient method for text analysis than BoW because of its ability to take into account the significance of terms and their contextual relevance. By doing so, TF-IDF can minimize noise and enhance the overall quality of the features obtained from text data. The BoW approach operates based on the frequency score of each word. One limitation of the BoW approach is its tendency to obscure the semantic meaning of individual words. For example, the semantic counterpart of "not awful" might be described as "decent" or even "pleasant"; however, when considered in isolation, the words "not" and "awful" elicit negative sentiment. Several techniques exist for reducing the dimensionality of the feature space, but the BoW approach still suffers from the drawback of prioritizing terms based purely on their frequency counts. To address this issue, the TF-IDF strategy is presented as a simple modification of the conventional BoW approach. TF-IDF addresses the following problems faced by the BoW approach.

  • TF-IDF handles word importance and down-weights common words.

  • The sequential nature of data is not taken into consideration by the BoW method used in text classification, potentially leading to the loss of important information and data sequence.

  • The BoW method treats synonyms as discrete, distinct features and is therefore ineffective at capturing the data's inherent semantic information.

  • A large corpus or a high-dimensional vocabulary may result in computational challenges and increased memory requirements.

  • The BoW method may struggle with negations and modifiers. This method struggles to manage out-of-vocabulary terms not encountered during training.

Description of oversampling techniques

Machine-based methods perform poorly on imbalanced classification datasets. The main reason is that machine-based methods are designed to operate on an equal number of samples for each class in a classification dataset. In the case of an imbalanced dataset, the models tend to train on the majority class and show skewed performance. Data sampling techniques provide a variety of ways to handle and transform the dataset into a balanced class distribution. In oversampling, there is no loss of information from the original data, since it keeps all the samples of the minority and majority classes. In undersampling, on the other hand, samples from the majority class are randomly removed, which causes information loss. Predominantly, therefore, oversampling approaches are recommended by researchers, and this study also adopts the concept of oversampling. The workflow of the oversampling techniques is represented in Fig. 2.

Fig. 2
figure 2

Workflow diagram for oversampling techniques

Synthetic minority oversampling technique

In a traditional oversampling process, the minority data is simply replicated from the minority set of data. Although this increases the quantity of data available, it does not provide the classification model with any additional knowledge or variation. To address this problem, Chawla et al. [35] introduced SMOTE, which is the most effective and commonly used technique. To balance the dataset, it generates fresh synthetic samples using the KNN technique. SMOTE selects a minority feature vector, finds its nearest neighbors, computes the difference between the feature vector and one of those neighbors, and multiplies this difference by a random value between 0 and 1; a new point is then placed at the computed distance on the line segment between the two samples. This procedure is repeated for each of the feature vectors that have been selected. In terms of preventing overfitting and underfitting, SMOTE performs better than basic sampling approaches [36].
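A minimal sketch of this procedure with the imbalanced-learn library is shown below, using synthetic data in place of the tweet features; the k_neighbors parameter corresponds to the KNN step described above.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the TF-IDF feature matrix (90/10 split).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))                   # majority class dominates

smote = SMOTE(k_neighbors=5, random_state=42)  # k_neighbors drives the KNN step
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))               # classes are now balanced
```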

Borderline SMOTE

Minority class points sometimes lie inside the domain of the majority class, producing bridges of minority class points. This problem in SMOTE can be solved via Borderline SMOTE, a modified version of SMOTE. Borderline SMOTE generates synthetic data only within the class decision boundary, so only a small number of examples along the borderline are oversampled [37]. Most classification methods learn the borderline of every class throughout the training process in order to produce better results, which is especially crucial for classification tasks.

The standard SMOTE method utilizes a stochastic selection procedure to determine a sample from the set of minority samples inside the specified category. The Borderline SMOTE technique, in contrast, selectively oversamples minority instances whose surrounding instances mostly belong to the majority class. To clarify, Borderline SMOTE is specifically designed to be applied only to instances located on the outskirts of the minority class.

Advantages

  • Borderline SMOTE generates samples at the decision boundary, where classification is most challenging, and thus yields significant performance improvement in this region.

  • As previously discussed, Borderline SMOTE mitigates the risk of overfitting by producing samples in close proximity to the boundary line.

  • This strategy effectively mitigates data noise by generating samples at predetermined locations along the boundary line.

  • Borderline SMOTE is a versatile approach that effectively improves the performance of classifiers in several areas because of its compatibility with a diverse set of machine learning techniques.

Limitations

  • Borderline SMOTE utilizes more computational resources than conventional SMOTE due to its focus on borderline samples.

  • The efficacy of the classification process may be influenced by the challenging task of selecting and optimizing the parameters.

  • Standard SMOTE consistently generates synthetic samples across the whole feature space, whereas Borderline SMOTE may not exhibit the same coverage. This can be a limitation in cases where the decision boundary is complex and encompasses many regions.

  • In certain regions of the feature space, the utilization of Borderline SMOTE may lead to an excessive presence of artificially generated samples, thus introducing bias into the model.

SVM-SMOTE

SVM-SMOTE is also known as Borderline-SMOTE SVM. The primary distinction between SVM-SMOTE and other SMOTE variants is that, rather than employing KNN to detect misclassification as in Borderline SMOTE, SVM-SMOTE uses the SVM algorithm [38]. The SVM is used to find the decision boundary specified by the support vectors; new samples are then formed at random along the lines connecting each minority class support vector with a number of its nearest neighbors, using interpolation or extrapolation depending on the density of the majority class samples nearby.

Adaptive synthetic oversampling

The ADASYN technique [39] adaptively produces minority data examples according to their densities. ADASYN is an enhanced version of SMOTE. In comparison to Borderline SMOTE, ADASYN adopts a different approach: whereas Borderline SMOTE synthesizes data around the decision boundary, ADASYN generates synthetic data according to the data density. The amount of synthetic data generated is inversely related to the density of the minority class: more synthetic data is created in regions of the feature space where the density of minority examples is low, and fewer or none are created where the density is high. The algorithm's fundamental function is to assign weights to the various minority class samples, generating a different amount of synthetic data for each example.

K-Means SMOTE

K-Means SMOTE is an oversampling approach for class-imbalanced data. It helps classification by creating minority class samples in safe and important portions of the input space. The method reduces noise while effectively resolving class imbalance. The K-means clustering technique is used to cluster all of the data and choose clusters with a high proportion of minority class samples; more synthetic samples are then assigned to clusters where minority class samples are sparse [40].
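All five oversampling techniques described above are available in the imbalanced-learn package; the following sketch applies each of them to the same toy imbalanced data. The K-Means SMOTE parameters here are illustrative, since its clustering step may need tuning on sparse text features.

```python
from collections import Counter

from imblearn.over_sampling import (ADASYN, SMOTE, BorderlineSMOTE,
                                    KMeansSMOTE, SVMSMOTE)
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

samplers = {
    "SMOTE": SMOTE(random_state=42),
    "Borderline SMOTE": BorderlineSMOTE(random_state=42),
    "SVM-SMOTE": SVMSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    # K-Means SMOTE clustering may need tuning; this threshold is illustrative.
    "K-Means SMOTE": KMeansSMOTE(random_state=42, cluster_balance_threshold=0.1),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)      # resampled features and labels
    print(f"{name:18s} -> {Counter(y_res)}")       # class counts after balancing
```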

Machine learning models

Machine learning is a key component of artificial intelligence because it allows machines to discover patterns in large amounts of data and make future predictions based on them. Machines can enhance their performance on both small and large datasets. The Scikit-learn package is used to create the supervised machine learning models. RF, DT, SVM, LR, ADA, and KNN are utilized in this study. The hyperparameters for the machine learning models are presented in Table 5.

Table 5 Hyperparameters settings for machine learning models
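Since Table 5 itself is not reproduced here, the sketch below instantiates the six scikit-learn models with placeholder hyperparameters; the actual values used in the study are those of Table 5. The linear kernel for SVM is the one stated in the text.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder hyperparameters; substitute the values from Table 5.
models = {
    "RF":  RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="linear", random_state=42),  # linear kernel, as in the paper
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "ADA": AdaBoostClassifier(n_estimators=50, random_state=42),
    "LR":  LogisticRegression(max_iter=1000),
    "DT":  DecisionTreeClassifier(random_state=42),
}

# Typical usage: model.fit(X_train, y_train) then model.predict(X_test)
```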

Random forest

RF is a classification model that constructs multiple decision trees from the data. An RF [41] produces appropriate results even without tuned hyperparameters and, owing to its simplicity and diversity, is one of the most widely used machine learning algorithms. Using a collection of weak learners, RF creates a strong learner: it builds a set of decision trees that are independent of one another, and a majority vote over the decision trees decides the classification.

Support vector machine

The SVM is a machine learning model that may be used for both classification and regression problems and is based on support vectors. SVM is well suited for classification since it is a simple algorithm that performs better than others in specific natural language processing applications [42]. Each data element is treated as a point in n-dimensional space, and the hyperplane that best distinguishes between these points is produced. The optimum hyperplane is chosen based on the maximum margin between the data points of different classes. This study utilizes a linear kernel for SVM since it was the most effective in our scenario.

K-nearest neighbor

KNN is one of the most straightforward supervised classification algorithms available in machine learning today. KNN compares new data to older examples and places it in the most similar category. The distance between the new data point and the existing classes is used to compute similarity; common distance measures include Euclidean and Manhattan (city block) distance. The KNN technique is utilized for both regression and classification problems. KNN is a non-parametric method and makes no assumptions about the data. KNN has several settings that can be fine-tuned for accuracy. The benefit of KNN over other algorithms is that it can explain the classification result in situations where black-box models fail [43].

Logistic regression

LR, one of the most often used machine learning algorithms, is widely applied to classify data using a statistical function called the logistic function (also known as the sigmoid function). Logistic regression directly models the probability ratio; the function maps real values to the range between 0 and 1, which allows it to forecast probabilities [44]. The supervised machine learning method LR is employed to ascertain the probability of an output variable [45]. It works best when the output or dependent variable is binary, but it may also be useful for multi-class data categorization. The data is categorized using the logistic function, and the LR approach makes it possible to predict a dichotomous dependent variable. For the LR equation, maximum likelihood estimation is utilized to identify which variables are statistically significant.

Decision tree

A DT is another well-known supervised machine learning technique used to carry out classification tasks in the context of data mining [46]. A tree-like structure describes the predictive process in a DT, where decisions are implemented along branches and the final prediction occurs at a leaf node. The DT is built recursively from the training dataset. The learning phase ends when no more splitting can be done in the tree or when the output at a node is the same for the target label or class (whichever comes first). DT does not need parameter setup or domain knowledge, making it an excellent tool for finding relevant and meaningful patterns in large amounts of data.

AdaBoost

To improve the overall performance of the final classifier, the AdaBoost (Adaptive Boosting) classifier combines several weak classifiers into a single strong classifier, adding another classifier to the combination at each step. ADA is based on the idea that one-level DTs (weak learners) are introduced sequentially into the ensemble classifier [47].

Evaluation parameters

The performance of a machine learning algorithm is determined using evaluation parameters. A trained machine learning model is applied to test data that the algorithm has not seen before, to determine how well it performs on unseen data. Evaluation techniques analyze the model's performance and assign it a score based on its efficiency. Typically, machine learning models are assessed using four parameters: accuracy, precision, recall, and F1 score. TP is the true positive count, which refers to positive-class samples predicted as positive. TN stands for true negative, which refers to the model's correct negative predictions across all negative data. FP stands for false positive, which refers to actual negative samples that are predicted as positive. FN stands for false negative, which refers to records that belong to the positive class but are predicted as negative by the models.

The accuracy of classifiers on test data is defined as the ratio of correct predictions to the total number of predictions made by the classifiers. The highest accuracy value is 1, demonstrating that all predictions from the classifier are accurate, while the lowest accuracy score is 0. Accuracy may well be estimated using the formula

$$\begin{aligned} Accuracy= \frac{(TP+TN)}{(TP+TN+FP+FN)} \end{aligned}$$
(4)

Precision, also known as positive predictive value, measures the proportion of correctly classified positive cases among all samples classified as positive. A precision value of 1 indicates that every occurrence of data that has been classified as positive is actually positive.

$$\begin{aligned} Precision= \frac{TP}{(TP+FP)} \end{aligned}$$
(5)

Recall is a statistic that measures the proportion of actual positive samples that are correctly predicted as positive. In contrast to precision, which reports the accurate positive predictions among all positive predictions, recall indicates how many of the actual positives the model captures.

$$\begin{aligned} Recall= \frac{TP}{(TP+FN)} \end{aligned}$$
(6)
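The F1 score is listed among the four evaluation parameters but its formula is not given above; for completeness, the standard definition, the harmonic mean of precision and recall, is

$$\begin{aligned} F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{aligned}$$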

Results and discussions

In order to investigate the effect of oversampling techniques on imbalanced datasets, as well as to evaluate the performance of machine learning models, several experiments are carried out in this study. Two tweet datasets are used for the experiments to achieve this goal. The train-test split is in the ratio of 0.75 to 0.25, where 75% of the data is used for training the model and 25% for testing. To extract relevant features from the tweets, two well-known feature engineering approaches, BoW and TF-IDF, are adopted.

Results of ML models on oversampled datasets

Table 6 shows the accuracy results of the machine learning models using different oversampling techniques with BoW and TF-IDF features. Five oversampling techniques, SMOTE, SVM-SMOTE, K-Means SMOTE, ADASYN, and Borderline SMOTE, are used in this study. Of the employed machine learning models, the SVM model performs best, achieving 99.67% accuracy with the ADASYN technique, 99.57% with SMOTE, and 99.59% with Borderline SMOTE, all using the TF-IDF feature extraction approach. The SMOTE oversampling technique, which uses the KNN algorithm to artificially enhance the data, outperforms all other techniques for the EndViolence tweets dataset, which is extremely imbalanced. SMOTE also performs well for the E-commerce tweets dataset, attaining a 99.26% accuracy score using TF-IDF features with RF. However, on these datasets the BoW features do not perform as well as the TF-IDF features.

Several patterns can be observed from these experiments. First, machine learning models perform better when used with TF-IDF features for both datasets. Second, SMOTE, ADASYN, and Borderline SMOTE predominantly show better results for all the machine learning models, specifically when used with TF-IDF features. Third, the performance of the models is smoother and more consistent when TF-IDF is used with the oversampling techniques, whereas with BoW there are large performance gaps across the various oversampling approaches.

Table 6 Accuracy score of different models on balanced datasets
Table 7 Precision score for positive tweets using different models with sampled datasets

Table 7 shows the precision results of the oversampling techniques with the machine learning models using BoW and TF-IDF features on the oversampled datasets. The precision score can also be used to evaluate oversampling techniques with machine learning models. The SMOTE, SVM-SMOTE, ADASYN, and Borderline SMOTE techniques achieved 99.47% with the RF and SVM models, whereas RF with TF-IDF feature extraction achieved 98.03% with the SVM-SMOTE methodology. As for BoW features with the RF model, ADASYN and K-Means SMOTE work well and achieve 97.23% precision.

Table 8 represents the recall scores of the ML models with the oversampling techniques on the balanced datasets using TF-IDF and BoW features. The results show that RF, SVM, and LR perform superbly on the EndViolence tweets dataset using TF-IDF features with all oversampling techniques, attaining a 99.99% recall score. KNN obtains the same recall score for positive tweets using BoW and TF-IDF features with all oversampling techniques except K-means SMOTE, where its recall score is 32.45% with BoW features and 96.06% with TF-IDF features. The results demonstrate that TF-IDF features are very effective for the balanced datasets, while BoW features do not perform well for the balanced tweet datasets in our work. Overall, Borderline SMOTE shows performance superior to the other oversampling approaches.

Table 8 Recall score for positive tweets using oversampling techniques

Table 9 represents the precision scores of the machine learning models on the oversampled datasets with BoW and TF-IDF features. ADASYN and SVM-SMOTE attained a 99.99% precision score with TF-IDF features, and ADA attained a 99.44% precision score using BoW features on the E-commerce tweets dataset.

Table 9 Precision score of negative tweets machine learning models using balanced dataset
Table 10 Recall score of negative tweets using different machine learning models

Table 10 presents the recall scores for tweets of the ML models on the oversampled datasets using BoW and TF-IDF features. LR, SVM, and RF perform exceptionally well using TF-IDF features, increasing the models' overall performance, but these models do not perform well using BoW features.

Finally, all oversampling techniques artificially increase the size of the datasets used to train the machine learning models. For positive and negative tweets, the RF model achieves the maximum accuracy, precision, and recall scores on the SMOTE- and ADASYN-oversampled datasets, according to the results of experiments using the TF-IDF features. KNN is the only model that performs well with both BoW and TF-IDF features, with a recall score of 1.00 in the K-means SMOTE case.

Fig. 3
figure 3

Accuracy of different models on balanced datasets. a EndViolence tweets dataset with BoW features, b EndViolence tweets dataset with TF-IDF features, c E-commerce tweets dataset with BoW features, and d E-commerce tweets dataset with TF-IDF features

Figure 3 represents the accuracy of different supervised machine learning models on the two imbalanced datasets, i.e., the EndViolence tweets dataset and the E-commerce tweets dataset. Experiments utilized BoW features on the balanced EndViolence tweets with six models, and the graph shows that SVM-SMOTE works very well; K-means SMOTE also performs well, but in Fig. 3(b) K-means SMOTE is on top of all techniques with the highest accuracy using TF-IDF features. Figure 3(c) shows that SVM-SMOTE performs best with some models and K-means SMOTE with the RF model using BoW features. Figure 3(d) shows that SMOTE and ADASYN attained much better results with TF-IDF features. Overall, SVM-SMOTE and K-means SMOTE perform well using BoW, while SMOTE and ADASYN attained the highest accuracy with TF-IDF features as compared to BoW features.

K-fold cross-validation results on oversampled datasets

In 10-fold cross-validation, a dataset is divided randomly into 10 parts. In each iteration, 90% of the data is used for training and the remaining 10% is used as a holdout set for testing; a different part is reserved for testing in each of the 10 repetitions. It is a well-known and widely used method to show the efficiency and appropriateness of models. The 10-fold cross-validation results are shown in Table 11 and Table 12. The experiments show that with 10-fold cross-validation, all ML models perform significantly well, while SVM attains the highest mean accuracy of 0.973 using TF-IDF features on the SMOTE-oversampled E-commerce dataset.

Table 11 10-fold results using models on oversampled E-Commerce dataset

Similarly, the 10-fold cross-validation results for the EndViolence dataset are given in Table 12. The highest accuracy of 0.995 is obtained using SVM on the SMOTE-oversampled dataset with TF-IDF features. The performance of the machine learning models is better and more consistent, with small variations, when TF-IDF features are used with the oversampling approaches.

Table 12 10-fold cross-validation results using models on oversampled EndViolence tweets dataset
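A sketch of this 10-fold protocol is given below. Placing the oversampler inside an imbalanced-learn pipeline (an implementation choice assumed here, not stated in the paper) resamples only each training split, so synthetic samples never leak into the held-out fold.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy imbalanced data standing in for the TF-IDF matrix and labels.
X, y = make_classification(n_samples=1000, n_features=50,
                           weights=[0.85, 0.15], random_state=42)

# SMOTE inside the pipeline is refit on each training split only.
pipe = Pipeline([("smote", SMOTE(random_state=42)),
                 ("svm", SVC(kernel="linear"))])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```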

Results of deep learning models on balanced datasets

In this section, the results of deep learning models on the E-commerce and EndViolence datasets are presented. Figure 4 provides the details of the deep learning model implementation. These models are utilized with the 'categorical-crossentropy' loss function for three classes. The 'softmax' activation function is used, while the 'Adam' optimizer is utilized for optimization. All models are fitted with 100 epochs and the batch size is set to 64.

Fig. 4
figure 4

Deep-learning architecture
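Since Fig. 4 is not reproduced here, the following Keras sketch shows one plausible instantiation of the described setup for the LSTM variant; the layer sizes are assumptions, while the loss, activation, optimizer, epoch count, and batch size follow the text above.

```python
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.models import Sequential

# Illustrative layer sizes; the actual architecture is given in Fig. 4.
model = Sequential([
    Embedding(input_dim=10_000, output_dim=100),  # assumed vocabulary/embedding sizes
    LSTM(128),                                    # recurrent layer (CNN/BiLSTM variants analogous)
    Dense(3, activation="softmax"),               # three sentiment classes
])

# Loss, optimizer, and training settings as stated in the text.
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

# model.fit(X_train, y_train_onehot, epochs=100, batch_size=64)
```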

Table 13 shows the results of the CNN, LSTM, and BiLSTM models on the two datasets. The CNN model shows better performance compared to the other two models, with an accuracy score of 96.11%. The BiLSTM model does not perform as well, attaining an accuracy of 95.65% on the EndViolence tweets dataset and 92.82% on the E-commerce tweets dataset.

Table 13 Results of deep learning models on E-commerce and EndViolence tweets dataset

Statistical T-test comparison

To determine the superiority of our approach over others, we conduct statistical tests to evaluate its performance, comparing the performance metrics of our models with those of the other approaches. We used the null hypothesis \((H_{0})\) that there is no significant difference between the performance of the two models. To achieve this, we first collected the primary performance measures and the corresponding predictions on the same datasets. Next, we performed a statistical t-test on these metrics with the alpha level at 0.05: if the p-value is lower than the alpha level, the test rejects the null hypothesis; conversely, if the p-value is higher than the alpha level, the test fails to reject it. Table 14 presents the statistical t-test results. According to the statistical t-test, our approach is superior to the other methods, as there is a significant difference between them.

Table 14 Statistical T test results
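A sketch of this test on per-fold accuracy scores is shown below; the fold scores are hypothetical placeholders, and a paired t-test is assumed since both models are evaluated on the same folds.

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies for two models on the same 10 folds.
ours = [0.97, 0.98, 0.97, 0.96, 0.98, 0.97, 0.97, 0.98, 0.96, 0.97]
baseline = [0.94, 0.95, 0.93, 0.94, 0.95, 0.94, 0.93, 0.95, 0.94, 0.94]

t_stat, p_value = ttest_rel(ours, baseline)  # paired t-test on matched folds
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0, the difference is significant")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```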

Performance comparison with existing studies

Table 15 demonstrates the comparison of different studies with the current study in terms of accuracy. Mujahid et al. [21] used different ML-based models to analyze public sentiment towards education during COVID-19 using an oversampled dataset and achieved an accuracy of 95.45% with the SMOTE oversampling technique. Rupapara et al. [11] used the RVVC model for toxic tweet classification on highly imbalanced data with the SMOTE technique. The RF model attained 91.72% accuracy using ADASYN in Alhudhaif Adi [48]. Another study [49] employed RF with SMOTE and attained 89.00% accuracy. Mahmud et al. [50] developed a KNN classifier employing SMOTE to enhance accuracy and address class imbalance in the data; as a result, they achieved an accuracy of 93.47%. In order to enhance results, Aditya et al. [51] utilized a deep learning LSTM model that consisted of many layers; nevertheless, their methodology led to significant computing expenses and unsatisfactory results when implementing the SMOTE technique. Lavanya and Sasikala [52] utilized the ADASYN technique to train an SVM model, and their study achieved an accuracy of 92.23%. Based on the aforementioned studies, several conclusions can be inferred. Utilizing a balanced dataset tends to improve the effectiveness of machine learning: dataset balancing resolves the problem of model skewness toward the majority class by evenly distributing a similar number of samples for model training. Furthermore, the majority of completed research has utilized SMOTE, the prevailing technique for data balancing through oversampling. In this study, the SVM model achieved 99.67% accuracy with ADASYN, and the RF model obtained 99.26% accuracy with the SMOTE-oversampled tweets dataset. The Borderline SMOTE oversampling technique also performs excellently, with accuracies of 99.59% and 99.21% with the SVM and RF models, respectively, using TF-IDF features.

Table 15 Performance comparison of the latest approach introduced to other current methods

Discussion

This research examines the use of machine and deep learning to address the problem of imbalanced datasets through oversampling strategies, as well as to assess the efficacy of machine learning models. Several experiments are conducted in this research, using two Twitter datasets to accomplish the above goal. We divide the data into a train-test split with a ratio of 0.75 to 0.25, allocating 75% of the data for model training and reserving 25% for testing. In order to identify significant features from the tweets, two well-recognized methods of feature engineering, BoW and TF-IDF, are used. The proposed approach effectively examines the issue of oversampling for imbalanced classes.

The results cover the machine learning models with the various oversampling techniques using both BoW and TF-IDF features. The study utilizes five different oversampling approaches: SMOTE, SVM-SMOTE, K-Means SMOTE, ADASYN, and Borderline SMOTE. With an accuracy of 99.67% with the ADASYN method, 99.57% with SMOTE, and 99.59% with Borderline SMOTE using the TF-IDF feature extraction strategy, the SVM model outperforms the other machine learning models considered. The SMOTE oversampling strategy, which uses the KNN algorithm to artificially augment the data, outperforms all other techniques when applied to the EndViolence tweets dataset, which is notable for its significant imbalance. Using TF-IDF features and RF, the SMOTE algorithm also achieves exceptional performance on the e-commerce Twitter dataset, with an accuracy score of 99.26%. However, on these datasets, the BoW features do not outperform the TF-IDF features. Multiple patterns have emerged from these experiments. Utilizing TF-IDF features for both datasets enhances the performance of the machine learning models. SMOTE, ADASYN, and Borderline SMOTE are the most effective techniques, especially when combined with TF-IDF features. Furthermore, the use of TF-IDF with the oversampling approaches results in a more consistent and seamless performance of the models, in contrast to the use of BoW, where there are significant variations in model performance across the different oversampling algorithms.

We also evaluate the performance of the machine learning models with the oversampling techniques using 10-fold cross-validation experiments. These experiments demonstrate that all machine learning models perform very well, with the support vector machine achieving the best accuracy of 0.973 (97.3%) using TF-IDF features on the SMOTE-oversampled e-commerce dataset. In addition, deep learning models were tested to assess the oversampling techniques: the gated recurrent unit (GRU) attained the lowest performance, while the CNN attained 96.11% accuracy. To assess the superiority of our technique over the alternatives, we conduct performance analyses using statistical tests, comparing the performance indicators of our models against those of competing approaches.
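A minimal sketch of such a cross-validation setup follows, assuming imbalanced-learn's Pipeline so that TF-IDF fitting and SMOTE resampling are repeated inside every fold and synthetic samples never leak into the validation folds; `load_preprocessed_tweets` is a hypothetical loader standing in for the study's preprocessed texts and labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Hypothetical loader for the preprocessed tweet texts and labels.
tweets, labels = load_preprocessed_tweets()

# Samplers such as SMOTE are applied only when the pipeline is fit,
# never at predict time, so each fold is scored on real samples only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("smote", SMOTE(random_state=42)),
    ("svm", SVC(kernel="linear")),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, tweets, labels, cv=cv, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```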

This research demonstrates a significant improvement in average performance compared to previous models. The proposed approach outperformed other models in terms of both performance metrics and oversampling strategies. The SVM demonstrated the highest level of performance, with the other models following closely behind, and the proposed framework's accuracy of 99.67% further highlights this strong performance.

Conclusion

A prominent issue in machine learning is class imbalance, which leads to model overfitting and degrades the performance of machine learning models. Models tend to over-train on the majority class, and the skewed distribution of class samples reduces their performance. For imbalanced datasets, a high degree of accuracy can be achieved merely by forecasting the majority class, while the minority class, which is frequently the very reason for developing the model in the first place, is missed. Although both undersampling and oversampling can be applied in this regard, oversampling is preferred because it incurs no loss of information. To resolve the issue of data imbalance, this study applies oversampling techniques to two extremely imbalanced Twitter datasets using BoW and TF-IDF features; the experiments involve six machine learning and three deep learning models. Results show that balancing the data lessens the possibility of the model overfitting that occurs when an imbalanced dataset is used for training. Several inferences can be made from the results. First, the SVM model performed best overall on ADASYN and SMOTE oversampled data, achieving an accuracy of 99.67% and a recall score of 1.00 when used with TF-IDF features, while the RF model reached 99.26% accuracy with SMOTE. Second, TF-IDF tends to outperform BoW features when oversampled data is used for experiments. Third, although all oversampling approaches improve performance, SMOTE, ADASYN, and Borderline-SMOTE yield the best results for the most part.

Availability of data and materials

The data can be requested from the corresponding author.

Code availability

Not applicable.

References

  1. Zheng Z, Wu X, Srihari R. Feature selection for text categorization on imbalanced data. ACM Sigkdd Explor Newsl. 2004;6(1):80–9.

  2. Lewis DD, Catlett J. Heterogeneous uncertainty sampling for supervised learning. In: Cohen WW, Hirsh H, editors. Machine learning proceedings 1994. New Brunswick: Elsevier; 1994. p. 148–56.

  3. Mohammed RA, Wong K-W, Shiratuddin MF, Wang X. Scalable machine learning techniques for highly imbalanced credit card fraud detection: a comparative study. In: Geng X, Kang BH, editors. Pacific Rim international conference on artificial intelligence. Nanjing: Springer; 2018. p. 237–46.

  4. Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intell Data Anal. 2002;6(5):429–49.

  5. Ghosh K, Banerjee A, Chatterjee S, Sen S. Imbalanced twitter sentiment analysis using minority oversampling. In: Ghosh K, editor. 2019 IEEE 10th international conference on awareness science and technology (iCAST). Morioka: IEEE; 2019.

  6. Ah-Pine J, Soriano-Morales E-P. A study of synthetic oversampling for twitter imbalanced sentiment analysis. In: Ah-Pine J, editor. Workshop on interactions between data mining and natural language processing (DMNLP 2016). Riva del Garda: DMNLP; 2016.

  7. Aljedaani W, Rustam F, Ludi S, Ouni A, Mkaouer MW. Learning sentiment analysis for accessibility user reviews. In: Aljedaani W, editor. 2021 36th IEEE/ACM International conference on automated software engineering workshops (ASEW). Melbourne: IEEE; 2021. p. 239–46.

  8. Hasib KM, Azam S, Karim A, Al Marouf A, Shamrat FJM, Montaha S, Yeo KC, Jonkman M, Alhajj R, Rokne JG. Mcnn-lstm: combining CNN and LSTM to classify multi-class text in imbalanced news data. IEEE Access. 2023. https://doi.org/10.1109/ACCESS.2023.3309697.

  9. Hasib KM, Towhid NA, Faruk KO, Al Mahmud J, Mridha M. Strategies for enhancing the performance of news article classification in bangla: handling imbalance and interpretation. Eng Appl Artif Intell. 2023;125: 106688.

  10. Sarakit P, Theeramunkong T, Haruechaiyasak C. Improving emotion classification in imbalanced youtube dataset using smote algorithm. In: Sarakit P, editor. 2015 2nd International conference on advanced informatics: concepts, theory and applications (ICAICTA). Chonburi: IEEE; 2015. p. 1–5.

  11. Rupapara V, Rustam F, Shahzad HF, Mehmood A, Ashraf I, Choi GS. Impact of smote on imbalanced text features for toxic comments classification using rvvc model. IEEE Access. 2021;9:78621–34.

  12. Flores AC, Icoy RI, Peña CF, Gorro KD. An evaluation of SVM and naive bayes with smote on sentiment analysis data set. In: Flores AC, editor. 2018 International conference on engineering, applied sciences, and technology (ICEAST). Phuket: IEEE; 2018. p. 1–4.

  13. Al-Hashedi A, Al-Fuhaidi B, Mohsen AM, Ali Y, Gamal Al-Kaf HA, Al-Sorori W, Maqtary N. Ensemble classifiers for Arabic sentiment analysis of social network (twitter data) towards COVID-19-related conspiracy theories. Appl Comput Intell Soft Comput. 2022. https://doi.org/10.1155/2022/6614730.

  14. Al-Azani S, El-Alfy E-SM. Using word embedding and ensemble learning for highly imbalanced data sentiment analysis in short arabic text. Proc Comput Sci. 2017;109:359–66.

  15. Rivera G, Florencia R, García V, Ruiz A, Sánchez-Solís JP. News classification for identifying traffic incident points in a spanish-speaking country: a real-world case study of class imbalance learning. Appl Sci. 2020;10(18):6253.

  16. Banerjee A, Bhattacharjee M, Ghosh K, Chatterjee S. Synthetic minority oversampling in addressing imbalanced sarcasm detection in social media. Multimed Tools Appl. 2020;79(47):35995–6031.

  17. Glazkova A. A comparison of synthetic oversampling methods for multi-class text classification. arXiv preprint. 2020. arXiv:2008.04636.

  18. Xu R, Chen T, Xia Y, Lu Q, Liu B, Wang X. Word embedding composition for data imbalances in sentiment and emotion classification. Cogn Comput. 2015;7(2):226–40.

  19. Saumya S, Singh JP. Detection of spam reviews: a sentiment analysis approach. CSI Trans ICT. 2018;6(2):137–48.

  20. Hasib KM, Rahman F, Hasnat R, Alam MGR. A machine learning and explainable AI approach for predicting secondary school student performance. In: Hasib KM, editor. 2022 IEEE 12th annual computing and communication workshop and conference (CCWC). Las Vegas: IEEE; 2022. p. 399–405.

  21. Mujahid M, Lee E, Rustam F, Washington PB, Ullah S, Reshi AA, Ashraf I. Sentiment analysis and topic modeling on tweets about online education during COVID-19. Appl Sci. 2021;11(18):8438.

  22. Liu J, Lu S, Lu C. Exploring and monitoring the reasons for hesitation with COVID-19 vaccine based on social-platform text and classification algorithms. Healthcare. 2021;9:1353.

  23. Ardianto R, Rivanie T, Alkhalifi Y, Nugraha FS, Gata W. Sentiment analysis on e-sports for education curriculum using naive bayes and support vector machine. Jurnal Ilmu Komputer dan Informasi. 2020;13(2):109–22.

  24. Balaji T, Annavarapu CSR, Bablani A. Machine learning algorithms for social media analysis: a survey. Comput Sci Rev. 2021;40: 100395.

  25. Parlak B, Uysal AK. A novel filter feature selection method for text classification: extensive feature selector. J Inform Sci. 2023;49(1):59–78.

  26. Parlak B, Uysal AK. The effects of globalisation techniques on feature selection for text classification. J Inform Sci. 2021;47(6):727–39.

  27. Hasib KM, Islam MR, Sakib S, Akbar MA, Razzak I, Alam MS. Depression detection from social networks data based on machine learning and deep learning techniques: An interrogative survey. IEEE Trans Comput Soc Syst. 2023. https://doi.org/10.1109/TCSS.2023.3263128.

  28. Hasib KM, Tanzim A, Shin J, Faruk KO, Al Mahmud J, Mridha MF. Bmnet-5: a novel approach of neural network to classify the genre of bengali music based on audio features. IEEE Access. 2022;10:108545–63.

  29. Hasib KM, Habib MA, Towhid NA, Showrov MIH. A novel deep learning based sentiment analysis of twitter data for us airline service. In: Hasib KM, editor. 2021 International conference on information and communication technology for sustainable development (ICICT4SD). Dhaka: IEEE; 2021.

  30. Kaggle: ENDviolence Tweets. 2021. https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction/metadata. Accessed 22 Feb 2024.

  31. Vijayarani S, Ilamathi MJ, Nithya M, et al. Preprocessing techniques for text mining-an overview. Int J Comput Sci Commun Netw. 2015;5(1):7–16.

  32. Scott S, Matwin S. Feature engineering for text classification. Citeseer. 1999;99:379–88.

  33. Zhang Y, Jin R, Zhou Z-H. Understanding bag-of-words model: a statistical framework. Int J Mach Learn Cybern. 2010;1(1):43–52.

  34. Cong Y, Chan Y-B, Ragan MA. A novel alignment-free method for detection of lateral genetic transfer based on tf-idf. Sci Rep. 2016;6(1):1–13.

  35. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. Smote: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.

  36. Li Y, Guo H, Zhang Q, Gu M, Yang J. Imbalanced text sentiment classification using universal and domain-specific knowledge. Knowl Based Syst. 2018;160:1–15.

  37. Han H, Wang W-Y, Mao B-H. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: Huang DS, editor. International conference on intelligent computing. Cham: Springer; 2005. p. 878–87.

  38. Tang Y, Zhang Y-Q, Chawla NV, Krasser S. Svms modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern Part B (Cybernetics). 2008;39(1):281–8.

  39. He H, Bai Y, Garcia EA, Li S. Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: He H, editor. 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). Hong Kong: IEEE; 2008. p. 1322–8.

  40. Douzas G, Bacao F, Last F. Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inform Sci. 2018;465:1–20.

  41. Fauzi MA. Random forest approach for sentiment analysis in Indonesian. Indonesian J Elect Eng Comput Sci. 2018;12(1):46–50.

  42. Yuan R, Li Z, Guan X, Xu L. An SVM-based machine learning method for accurate internet traffic classification. Inform Syst Front. 2010;12(2):149–56.

  43. Chen Y, Hu X, Fan W, Shen L, Zhang Z, Liu X, Du J, Li H, Chen Y, Li H. Fast density peak clustering for large scale data based on KNN. Knowl Based Syst. 2020;187: 104824.

  44. Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform. 2002;35(5–6):352–9.

  45. Ramadhan W, Novianty SA, Setianingsih SC. Sentiment analysis using multinomial logistic regression. In: Ramadhan W, editor. 2017 International conference control electronics, renewable energy and communications (ICCREC). Yogyakarta: IEEE; 2017. p. 46–9.

  46. Sharma H, Kumar S. A survey on decision tree algorithms of classification in data mining. Int J Sci Res (IJSR). 2016;5(4):2094–7.

  47. Chen S, Shen B, Wang X, Yoo S-J. A strong machine learning classifier and decision stumps based hybrid adaboost classification algorithm for cognitive radios. Sensors. 2019;19(23):5077.

  48. Alhudhaif A. A novel multi-class imbalanced eeg signals classification based on the adaptive synthetic sampling (adasyn) approach. PeerJ Comput Sci. 2021;7:523.

  49. Rodríguez-González A, Tuñas JM, Prieto Santamaría L, Fernández Peces-Barba D, Menasalvas Ruiz E, Jaramillo A, Cotarelo M, Conejo Fernández AJ, Arce A, Gil A. Identifying polarity in tweets from an imbalanced dataset about diseases and vaccines using a meta-model based on machine learning techniques. Appl Sci. 2020;10(24):9019.

  50. Mahmud FG, Hermanto TI, Nugroho IM. Implementation of k-nearest neighbor algorithm with smote for hotel reviews sentiment analysis. Sinkron. 2023;8(2):595–602.

  51. Aditya K, Wicaksono GW, Heryawan HAS, Aditya CSK. Sentiment analysis of the 2024 presidential candidates using smote and long short term memory. J Inform. 2023;8(2):279–86.

  52. Lavanya P, Sasikala E. Enhanced performance of drug review classification from social networks by improved adasyn training and natural language processing techniques. In: Hemanth DJ, editor. Computational intelligence methods for sentiment analysis in natural language processing applications. Amsterdam: Elsevier; 2024. p. 111–27.

Acknowledgements

The authors are thankful for the support of the Artificial Intelligence & Data Analytics Lab (AIDA), CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia. The authors would also like to thank Prince Sultan University, Riyadh, Saudi Arabia for supporting the APC of this publication.

Funding

This research is funded by the European University of Atlantic.

Author information

Contributions

MM conceived the idea, performed data analysis, and wrote the original draft. EK conceived the idea, performed data curation, and wrote the original draft. FR performed data curation and formal analysis, and designed the methodology. MGV carried out project administration, dealt with software, and performed visualization. ESA acquired the funding for the research and performed visualization and initial investigation. IdlTD dealt with software, carried out project administration, and performed validation. IA supervised the study, performed validation, and reviewed and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Imran Ashraf.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mujahid, M., Kına, E., Rustam, F. et al. Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering. J Big Data 11, 87 (2024). https://doi.org/10.1186/s40537-024-00943-4
