A survey on missing data in machine learning

Emmanuel, Tlamelo; Maupong, Thabiso; Mpoeleng, Dimane; Semong, Thabo; Mphago, Banyatsang; Tabona, Oteng

doi:10.1186/s40537-021-00516-9

Journal of Big Data

Table 1 A summary of various missing data techniques in machine learning

Refs.	DataSet	Performance objective	Mechanism	Summary	Limitations
[124]	Balance, Breast, Glass, Bupa, Cmc, Iris, Housing, Ionosphere, Wine	To study the influence of noise on missing value handling methods when noise and missing values are distributed throughout the dataset	MCAR, MAR, MNAR	The study showed that noise had a negative effect on imputation methods, particularly when the noise level was high	Division of qualitative values may have been a problem
[85]	German, Glass(g2), heart-statlog, ionosphere, kr-vs-kp, labor, Pima-indians, sonar, balance-scale, iris, waveform, lymphography, vehicle, anneal, glass, satimage, image, zoo, LED, vowel, letter	To experiment with methods for handling incomplete training and test data under various proportions and mechanisms of missing data	MCAR, MAR	The study discussed the relative strengths and weaknesses of decision trees for missing value imputation	The approach did not consider correlations between features
[125]	Los Angeles ozone pollution and Simulated data	To study classification and regression problems using a variety of missing data mechanisms in order to compare the approaches on high dimensional problems	MCAR, MAR	The authors tested how strongly the imputation techniques depend on the correlation structure of the data	Random choice of missing values may have weakened the experiment consistency
[38]	Breast Cancer	To evaluate the performance of statistical and machine learning imputation techniques that were used to predict recurrence in breast cancer patient data		The machine learning techniques proved to be the most suited imputation and led to a significant enhancement of prognosis accuracy compared to statistical techniques	One type of data was used for the imputation model, therefore, the presented results may not generalise to different datasets
[126]	Iris, Wine, Voting, Tic-Tac-Toe, Hepatitis	To propose a novel technique to impute missing values based on feature relevance	MCAR, MAR	The approach employed mutual information to measure feature relevance and proved to reduce classification bias	Random choice of missing values may have weakened the experiment consistency
[127]	Liver, Diabetes, Breast Cancer, Heart, WDSC, Sonar	Experimented on missing data handling using Random Forests and specifically analysed the impact of correlation of features on the imputation results	MCAR, MAR, MNAR	The imputation approach was reported to be generally robust, with performance improving as correlation increased	Random choice of missing values in MNAR could have weakened the consistency of the experiment
[128]	Wine, Simulated	To create an improved imputation algorithm for handling missing values	MCAR, MAR, MNAR	Demonstrated the superiority of a new algorithm over existing imputation methods in the accuracy of imputing missing data	Features may have had different percentages of missing data, also MAR and MNAR may have been weakened
[129]	De novo simulation, Health surveys S1, S2 and S3	To compare various techniques of combining internal validation with multiple imputation	MCAR, MAR	The approach was regarded as comprehensive with regard to the use of simulated and real data with different data characteristics, validation strategies and performance measures	The results may have been subject to bias arising from the relationship between effect strengths and missingness in covariates
[130]	Pima Indian Diabetes dataset	To experiment with a missing values approach that takes feature relevance into account		The results showed that the hybrid algorithm was better than the existing methods in terms of accuracy, RMSE and MAE	Missing values mechanism was not considered
[13]	Iris, Voting, Hepatitis	Proposed an iterative KNN that took into account the presence of the class labels	MCAR, MAR	The technique considered class labels and proved to perform well against other imputation methods	The approach has not been theoretically proven to converge, though it was empirically shown
[74]	Camel, Ant, Ivy, Arc, Pcs, Mwl, KC3, Mc2	To develop a novel incomplete-instance based imputation approach that utilized cross-validation to improve the parameters for each missing value	MCAR, MAR	The study demonstrated that their approach was superior to other missing values approaches
[131]	Blood, breast-cancer, ecoli, glass, ionosphere, iris, Magic, optdigits, pendigits, pima, segment, sonar, waveform, wine, yeast, balance-scale, Car, chess-c, chess-m, CNAE-9, lymphography, mushroom, nursery, promoters, SPECT, tic-tac-toe, abalone, acute, card, contraceptive, German, heart, liver, zoo	To develop a missing data handling approach with effective imputation results	MCAR	The method was based on calculating the class center of every class and using the distances between it and the observed data to define a threshold for imputation. The method performed better and had less imputation time	Only one missing mechanism was implemented
[132]	Groundwater	Developed a multiple imputation method that can handle the missingness in a groundwater dataset with a high rate of missing values	MAR	The technique used to handle the missing values was chosen for its ability to consider the relationships between the variables of interest	There was no prior knowledge of the label of missing data, which may have made imputation difficult
[133]	Dukes’ B colon cancer, the Mice Protein Expression and Yeast	Developed a novel hybrid Fuzzy C means Rough parameter missing value imputation method		The technique handled the vagueness and coarseness in the dataset and proved to produce better imputation results	There was no report of missing values mechanisms used for the experiment
[134]	Forest fire, Glass, Housing, Iris, MPG, MV, Stocks, Wine	The method proposed a variant of the forward stage-wise regression algorithm for data imputation by modelling the missing values as random variables following a Gaussian mixture distribution. Categorical		The method proved to be effective compared to other approaches that combined standard missing data approaches and the original FSR algorithm	There was no report of missing values mechanisms used for the experiment
[135]	Weather dataset	This method applied four (listwise deletion, multiple imputation, KNN, MICE) missing data handling methods to the training data before classification		Of the imputation methods applied, the authors concluded that the most effective missing data imputation method for photovoltaic forecasting was the KNN method	There was no report of missing values mechanisms used for the experiment
[136]	Air quality data	To make time series prediction for missing values using three machine learning algorithms and identify the best method		The study concluded that deep learning performed better when the dataset was large and machine learning models produced better results when less data was available	Heavy costs in time consumption and computational power for training when implementing their most effective method (deep learning)
[137]	Traumatic Brain Injury and Diabetes	To demonstrate how performance varies with different missing value mechanisms and the imputation method used and further demonstrate how MNAR is an important tool to give confidence that valid results are obtained using multiple imputation and complete case analysis	MCAR, MAR, MNAR	The study showed that both complete case analysis and multiple imputation can produce unbiased results under more conditions	The method was limited by the absence of nonlinear terms in the substantive models
[138]	Grades Dataset	To develop a new decision tree approach for missing data handling	MCAR, MAR, MNAR	The method produced a higher accuracy compared to other missing values handling techniques and yielded a more interpretable classifier	The algorithm suffered from a weakness when the gating variable had no predictive power
[139]	Air Pressure System data	The study proposed a sorted missing percentages approach for filtering attributes when building a machine learning classification model using sensor readings with missing data		The technique proved to be effective for scenarios dealing with missing data in industrial sensor data analysis	The proposed approach could not meet the needs of automation
[139]	Abalone and Boston Housing	To test the reliability of missing value handling when data are not missing at random	MAR	The results of the study indicated that the approach achieved satisfactory performance in solving the lower incomplete problem compared to six other methods	The approach did not consider any missingness rate, which may have affected the analysis
[140]	Cleveland Heart disease	Proposed a systematic methodology for the identification of missing values using KNN, MICE, mean, and mode imputation with four classifiers: Naive Bayes, SVM, logistic regression, and random forest		The results of the study demonstrated that MICE imputation performed better than the other imputation methods used in the study	The approach compared state-of-the-art methods with simple imputation methods (mean and mode), which are biased and may give unrealistic results
[141]	Iris, Wine, Ecoli and Sonar datasets	To retrieve missing data by considering the attribute correlation in the imputation process using a class center-based adaptive approach using the firefly algorithm	MCAR	The result of the experiment demonstrated that the class center-based firefly algorithm was an efficient method for handling missing values	Imputation was tested on only one missing value mechanism
[15]	Abalone, Iris, Lymphography and Parkinsons	Proposed a novel tuple-based region splitting imputation approach that used a new metric, mean integrity rate, to measure the missing degree of a dataset in order to impute various types of missing data		The region splitting imputation model outperformed the competitive models of imputation	Random generator was used to impute missing values and other mechanisms for missing values were not considered
[142]	Artificial and real metabolomics data	To develop a new kernel weight function-based imputation approach that handles missing values and outliers	MAR	The proposed kernel weight-based approach proved to be superior compared to other data imputation techniques	The method was experimented on one type of dataset and may not perform as reported on other types of data
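
The Mechanism column refers to three standard missingness mechanisms: MCAR (missing completely at random), MAR (missing at random) and MNAR (missing not at random). As a rough illustration of how an experiment might inject missing values under each mechanism, the sketch below masks the second column of a two-column dataset. The function name `make_missing`, the median threshold and the 30% rate are assumptions chosen for illustration and are not taken from any of the studies listed above.

```python
import random


def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows (list of (x, y)) with some y values set to None.

    MCAR: every y is dropped with the same probability `rate`.
    MAR:  the drop probability depends only on the fully observed x.
    MNAR: the drop probability depends on the value of y itself.
    """
    rng = random.Random(seed)
    # Medians used as illustrative thresholds (an assumption of this sketch).
    x_med = sorted(x for x, _ in rows)[len(rows) // 2]
    y_med = sorted(y for _, y in rows)[len(rows) // 2]
    high = min(1.0, 2 * rate)  # drop probability in the affected half

    masked = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = high if x > x_med else 0.0
        elif mechanism == "MNAR":
            p = high if y > y_med else 0.0
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        masked.append((x, None if rng.random() < p else y))
    return masked


# Synthetic data: two independent standard normal columns.
data_rng = random.Random(1)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(1000)]

for mech in ("MCAR", "MAR", "MNAR"):
    n_missing = sum(y is None for _, y in make_missing(rows, mech))
    print(mech, n_missing)
```

All three settings remove roughly the same share of values, but only under MCAR is the remaining sample representative of the full data; under MNAR the observed values of y are systematically lower than the masked ones, which is why the mechanism used in an experiment matters when comparing imputation methods.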
