Table 1 A summary of various missing data techniques in machine learning

From: A survey on missing data in machine learning

Refs.

Dataset

Performance objective

Mechanism

Summary

Limitations

[124]

Balance, Breast, Glass, Bupa, Cmc, Iris, Housing, Ionosphere, Wine

To study the influence of noise on missing value handling methods when noise and missing values are distributed throughout the dataset

MCAR, MAR, MNAR

The study showed that noise had a negative effect on imputation methods, particularly at high noise levels

Division of qualitative values may have been a problem

[85]

German, Glass (g2), heart-statlog, ionosphere, kr-vs-kp, labor, Pima-indians, sonar, balance-scale, iris, waveform, lymphography, vehicle, anneal, glass, satimage, image, zoo, LED, vowel, letter

To experiment with methods for handling incomplete training and test data under various missing data proportions and mechanisms

MCAR, MAR

The study discussed the relative strengths and weaknesses of decision trees for missing value imputation

The approach did not consider correlations between features

[125]

Los Angeles ozone pollution and Simulated data

To study classification and regression problems using a variety of missing data mechanisms in order to compare the approaches on high dimensional problems

MCAR, MAR

The authors tested the potential dependence of imputation techniques on the correlation structure of the data

Random choice of missing values may have weakened the consistency of the experiment

[38]

Breast Cancer

To evaluate the performance of statistical and machine learning imputation techniques that were used to predict recurrence in breast cancer patient data

 

The machine learning techniques proved to be the best-suited imputation methods and led to a significant improvement in prognosis accuracy compared to statistical techniques

Only one type of data was used for the imputation model; therefore, the presented results may not generalise to different datasets

[126]

Iris, Wine, Voting, Tic-Tac-Toe, Hepatitis

To propose a novel technique to impute missing values based on feature relevance

MCAR, MAR

The approach employed mutual information to measure feature relevance and was shown to reduce classification bias

Random choice of missing values may have weakened the consistency of the experiment

[127]

Liver, Diabetes, Breast Cancer, Heart, WDSC, Sonar

To experiment on missing data handling using Random Forests and specifically analyse the impact of feature correlation on the imputation results

MCAR, MAR, MNAR

The imputation approach was reported to be generally robust, with performance improving as correlation increased

Random choice of missing values in MNAR could have weakened the consistency of the experiment

[128]

Wine, Simulated

To create an improved imputation algorithm for handling missing values

MCAR, MAR, MNAR

Demonstrated the superiority of a new algorithm over existing imputation methods in the accuracy of imputing missing data

Features may have had different percentages of missing data; MAR and MNAR may also have been weakened

[129]

De novo simulation, Health surveys S1, S2 and S3

To compare various techniques of combining internal validation with multiple imputation

MCAR, MAR

The approach was regarded as comprehensive with regard to the use of simulated and real data with different data characteristics, validation strategies and performance measures

The results may have been biased by the relationship between effect strengths and missingness in covariates

[130]

Pima Indian Diabetes dataset

To experiment on a missing values approach that takes feature relevance into account

 

The results showed that the hybrid algorithm outperformed the existing methods in terms of accuracy, RMSE and MAE

The missing values mechanism was not considered

[13]

Iris, Voting, Hepatitis

Proposed an iterative KNN that took into account the presence of the class labels

MCAR, MAR

The technique considered class labels and performed well against other imputation methods

The approach has not been theoretically proven to converge, though it was empirically shown

[74]

Camel, Ant, Ivy, Arc, Pcs, Mwl, KC3, Mc2

To develop a novel incomplete-instance based imputation approach that utilized cross-validation to improve the parameters for each missing value

MCAR, MAR

The study demonstrated that the approach was superior to other missing values approaches

 

[131]

Blood, breast-cancer, ecoli, glass, ionosphere, iris, Magic, optdigits, pendigits, pima, segment, sonar, waveform, wine, yeast, balance-scale, Car, chess-c, chess-m, CNAE-9, lymphography, mushroom, nursery, promoters, SPECT, tic-tac-toe, abalone, acute, card, contraceptive, German, heart, liver, zoo

To develop a missing data handling approach with effective imputation results

MCAR

The method was based on calculating the class center of every class and using the distances between it and the observed data to define a threshold for imputation. The method performed better and required less imputation time

Only one missingness mechanism was implemented

[132]

Groundwater

Developed a multiple imputation method that can handle the missingness in a groundwater dataset with a high rate of missing values

MAR

The technique used to handle the missing values was chosen for its ability to consider the relationships between the variables of interest

There was no prior knowledge of the label of missing data, which may have made imputation difficult

[133]

Dukes’ B colon cancer, the Mice Protein Expression and Yeast

Developed a novel hybrid Fuzzy C means Rough parameter missing value imputation method

 

The technique handled the vagueness and coarseness in the dataset and proved to produce better imputation results

There was no report of missing values mechanisms used for the experiment

[134]

Forest fire, Glass, Housing, Iris, MPG, MV, Stocks, Wine

The method proposed a variant of the forward stage-wise regression algorithm for data imputation by modelling the missing values as random variables following a Gaussian mixture distribution. Categorical

 

The method proved to be effective compared to other approaches that combined standard missing data approaches and the original FSR algorithm

There was no report of missing values mechanisms used for the experiment

[135]

Weather dataset

The study applied four missing data handling methods (listwise deletion, multiple imputation, KNN and MICE) to the training data before classification

 

The authors concluded that, of the methods applied, KNN was the most effective missing data imputation method for photovoltaic forecasting

There was no report of missing values mechanisms used for the experiment

[136]

Air quality data

To make time series prediction for missing values using three machine learning algorithms and identify the best method

 

The study concluded that deep learning performed better on large datasets, whereas machine learning models produced better results on smaller ones

Training the most effective method (deep learning) incurred heavy costs in time and computational power

[137]

Traumatic Brain Injury and Diabetes

To demonstrate how performance varies with different missing value mechanisms and the imputation method used and further demonstrate how MNAR is an important tool to give confidence that valid results are obtained using multiple imputation and complete case analysis

MCAR, MAR, MNAR

The study showed that both complete case analysis and multiple imputation can produce unbiased results under more conditions

The method was limited by the absence of nonlinear terms in the substantive models

[138]

Grades Dataset

To develop a new decision tree approach for missing data handling

MCAR, MAR, MNAR

The method produced a higher accuracy than other missing values handling techniques and yielded a more interpretable classifier

The algorithm suffered from a weakness when the gating variable had no predictive power

[139]

Air Pressure System data

The study proposed a sorted missing percentages approach for filtering attributes when building machine learning classification models using sensor readings with missing data

 

The technique proved to be effective for scenarios dealing with missing data in industrial sensor data analysis

The proposed approach could not meet the needs of automation

[139]

Abalone and Boston Housing

To test the reliability of missing value handling when data are not missing at random

MAR

The results of the study indicated that the approach achieved satisfactory performance in solving the lower incomplete problem compared to six other methods

The approach did not consider any missingness rate which may have affected the analysis

[140]

Cleveland Heart disease

Proposed a systematic methodology for the identification of missing values using KNN, MICE, mean and mode imputation with four classifiers: Naive Bayes, SVM, logistic regression and random forest

 

The results of the study demonstrated that MICE imputation performed better than the other imputation methods used in the study

The approach compared state-of-the-art methods with simple imputation methods (mean and mode), which are biased and produce unrealistic results

[141]

Iris, Wine, Ecoli and Sonar datasets

To retrieve missing data by considering the attribute correlation in the imputation process, by means of a class center-based adaptive approach using the firefly algorithm

MCAR

The result of the experiment demonstrated that the class center-based firefly algorithm was an efficient method for handling missing values

Imputation was tested on only one missing value mechanism

[15]

Abalone, Iris, Lymphography and Parkinsons

Proposed a novel tuple-based region splitting imputation approach that used a new metric, the mean integrity rate, to measure the degree of missingness of a dataset and impute various types of missing data

 

The region splitting imputation model outperformed competing imputation models

A random generator was used to impute missing values, and other mechanisms for missing values were not considered

[142]

Artificial and real metabolomics data

To develop a new kernel weight function-based imputation approach that handles missing values and outliers

MAR

The proposed kernel weight-based approach proved superior to other data imputation techniques

The method was tested on only one type of dataset and may not perform as reported on other types of data
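
The Mechanism column above uses the standard abbreviations MCAR (missing completely at random), MAR (missing at random) and MNAR (missing not at random). The following is a minimal sketch of how the three mechanisms differ, not a procedure taken from any of the surveyed studies. The function names `make_missing` and `observed_mean`, the `base_rate` parameter, the logistic drop probabilities and the synthetic two-feature data are all illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_missing(rows, mechanism, rng, base_rate=0.3):
    """Return a copy of rows (x, y) with some y values replaced by None.

    MCAR: every y is dropped with the same probability.
    MAR:  the drop probability depends only on the always-observed x.
    MNAR: the drop probability depends on the value of y itself.
    The logistic form and its slope are assumptions for illustration.
    """
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = base_rate
        elif mechanism == "MAR":
            p = sigmoid(2.0 * x)
        elif mechanism == "MNAR":
            p = sigmoid(2.0 * y)
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        out.append((x, None if rng.random() < p else y))
    return out


def observed_mean(rows):
    """Mean of y over the rows where y is still observed."""
    values = [y for _, y in rows if y is not None]
    return sum(values) / len(values)


rng = random.Random(0)

# Synthetic data (assumption): y is positively correlated with x.
full = []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)
    y = 0.8 * x + rng.gauss(0.0, 0.6)
    full.append((x, y))

true_mean = sum(y for _, y in full) / len(full)
results = {
    m: observed_mean(make_missing(full, m, rng))
    for m in ("MCAR", "MAR", "MNAR")
}

for mechanism, mean in results.items():
    print(mechanism, round(mean - true_mean, 3))
```

In this sketch the complete-case mean of y stays close to the full-data mean under MCAR, while under MAR and MNAR it shifts downward because larger values are more likely to be dropped. Under MAR that shift could in principle be corrected using the observed x; under MNAR it cannot be corrected from the observed data alone.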