
Table 1 A summary of various missing data techniques in machine learning

From: A survey on missing data in machine learning



Performance objective





Balance, Breast, Glass, Bupa, Cmc, Iris, Housing, Ionosphere, Wine

To study the influence of noise on missing value handling methods when noise and missing values are distributed throughout the dataset


The study showed that noise had a negative effect on imputation methods, particularly at high noise levels

Division of qualitative values may have been a problem


German, Glass (g2), heart-statlog, ionosphere, kr-vs-kp, labor, Pima-indians, sonar, balance-scale, iris, waveform, lymphography, vehicle, anneal, glass, satimage, image, zoo, LED, vowel, letter

To experiment with methods for handling incomplete training and test data under various missingness proportions and mechanisms


The study discussed the relative strengths and weaknesses of decision trees for missing value imputation

The approach did not consider correlations between features


Los Angeles ozone pollution and Simulated data

To study classification and regression problems using a variety of missing data mechanisms in order to compare the approaches on high dimensional problems


The authors tested how the performance of imputation techniques depends on the correlation structure of the data

Random choice of missing values may have weakened the experiment consistency


Breast Cancer

To evaluate the performance of statistical and machine learning imputation techniques that were used to predict recurrence in breast cancer patient data


The machine learning techniques proved to be the best-suited imputation methods and significantly improved prognosis accuracy compared with statistical techniques

Only one type of data was used for the imputation model; therefore, the presented results may not generalise to different datasets


Iris, Wine, Voting, Tic-Tac-Toe, Hepatitis

To propose a novel technique to impute missing values based on feature relevance


The approach employed mutual information to measure feature relevance and was shown to reduce classification bias

Random choice of missing values may have weakened the experiment consistency


Liver, Diabetes, Breast Cancer, Heart, WDSC, Sonar

To experiment with missing data handling using Random Forests and, specifically, to analyse the impact of feature correlation on the imputation results


The imputation approach was reported to be generally robust, with performance improving as feature correlation increased

Random choice of missing values in MNAR could have weakened the consistency of the experiment


Wine, Simulated

To create an improved imputation algorithm for handling missing values


Demonstrated that the new algorithm imputed missing data more accurately than existing imputation methods

Features may have had different percentages of missing data; in addition, the MAR and MNAR mechanisms may have been weakened


De novo simulation, Health surveys S1, S2 and S3

To compare various techniques of combining internal validation with multiple imputation


The approach was regarded to be comprehensive with regard to the use of simulated and real data with different data characteristics, validation strategies and performance measures

The results may have been biased by the relationship between effect strengths and missingness in covariates


Pima Indian Diabetes dataset

To experiment with a missing values approach that takes feature relevance into account


The results showed that the hybrid algorithm outperformed existing methods in terms of accuracy, RMSE and MAE

The missing values mechanism was not considered


Iris, Voting, Hepatitis

Proposed an iterative KNN that took into account the presence of the class labels


The technique considered class labels and performed well against other imputation methods

The approach has not been theoretically proven to converge, though it was empirically shown


Camel, Ant, Ivy, Arc, Pcs, Mwl, KC3, Mc2

To develop a novel incomplete-instance based imputation approach that utilized cross-validation to improve the parameters for each missing value


The study demonstrated that the approach was superior to other missing values approaches



Blood, breast-cancer, ecoli, glass, ionosphere, iris, Magic, optdigits, pendigits, pima, segment, sonar, waveform, wine, yeast, balance-scale, Car, chess-c, chess-m, CNAE-9, lymphography, mushroom, nursery, promoters, SPECT, tic-tac-toe, abalone, acute, card, contraceptive, German, heart, liver, zoo

To introduce a missing data handling approach with effective imputation results


The method was based on calculating the class center of every class and using the distances between it and the observed data to define a threshold for imputation. The method performed better and required less imputation time

Only one missing mechanism was implemented



Developed a multiple imputation method that can handle missingness in a groundwater dataset with a high rate of missing values


The technique used to handle the missing values was chosen for its ability to consider the relationships between the variables of interest

There was no prior knowledge of the labels of the missing data, which may have made imputation difficult


Dukes’ B colon cancer, the Mice Protein Expression and Yeast

Developed a novel hybrid Fuzzy C means Rough parameter missing value imputation method


The technique handled the vagueness and coarseness in the dataset and proved to produce better imputation results

There was no report of missing values mechanisms used for the experiment


Forest fire, Glass, Housing, Iris, MPG, MV, Stocks, Wine

The method proposed a variant of the forward stage-wise regression algorithm for data imputation by modelling the missing values as random variables following a Gaussian mixture distribution. Categorical


The method proved to be effective compared to other approaches that combined standard missing data approaches and the original FSR algorithm

There was no report of missing values mechanisms used for the experiment


Weather dataset

This method applied four missing data handling methods (listwise deletion, multiple imputation, KNN, MICE) to the training data before classification


Of the imputation methods applied, the authors concluded that KNN was the most effective missing data imputation method for photovoltaic forecasting

There was no report of missing values mechanisms used for the experiment


Air quality data

To make time series prediction for missing values using three machine learning algorithms and identify the best method


The study concluded that deep learning performed better when the dataset was large, whereas machine learning models produced better results when the dataset was small

Training the most effective method (deep learning) incurred heavy costs in time and computational power


Traumatic Brain Injury and Diabetes

To demonstrate how performance varies with different missing value mechanisms and the imputation method used and further demonstrate how MNAR is an important tool to give confidence that valid results are obtained using multiple imputation and complete case analysis


The study showed that both complete case analysis and multiple imputation can produce unbiased results under more conditions

The method was limited by the absence of nonlinear terms in the substantive models


Grades Dataset

To develop a new decision tree approach for missing data handling


The method produced higher accuracy than other missing values handling techniques and yielded a more interpretable classifier

The algorithm suffered from a weakness when the gating variable had no predictive power


Air Pressure System data

The study proposed a sorted missing percentages approach for filtering attributes when building a machine learning classification model using sensor readings with missing data


The technique proved to be effective for scenarios dealing with missing data in industrial sensor data analysis

The proposed approach could not meet the needs of automation


Abalone and Boston Housing

To test the reliability of missing value handling under the missing not at random (MNAR) mechanism


The results of the study indicated that the approach achieved satisfactory performance in solving the lower incomplete problem compared to six other methods

The approach did not consider any missingness rate which may have affected the analysis


Cleveland Heart disease

Proposed a systematic methodology for the identification of missing values using KNN, MICE, mean, and mode imputation with four classifiers: Naive Bayes, SVM, logistic regression, and random forest


The results of the study demonstrated that MICE imputation performed better than the other imputation methods used in the study

The approach compared state-of-the-art methods with simple imputation methods (mean and mode), which yield biased and unrealistic results


Iris, Wine, Ecoli and Sonar datasets

To retrieve missing data by considering the attribute correlation in the imputation process using a class center-based adaptive approach using the firefly algorithm


The results of the experiment demonstrated that the class center-based firefly algorithm was an efficient method for handling missing values

Imputation was tested on only one missing value mechanism


Abalone, Iris, Lymphography and Parkinsons

Proposed a novel tuple-based region splitting imputation approach that used a new metric, the mean integrity rate, to measure the degree of missingness of a dataset and to impute various types of missing data


The region splitting imputation model outperformed the competitive models of imputation

Random generator was used to impute missing values and other mechanisms for missing values were not considered


Artificial and real metabolomics data

To develop a new kernel weight function-based imputation approach that handles missing values and outliers


The proposed kernel weight-based approach proved to be superior compared to other data imputation techniques

The method was tested on only one type of dataset and may not perform as reported on other types of data
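
Several rows above compare imputation methods by RMSE on values that were removed artificially under a chosen missingness mechanism. The following is a minimal sketch of that kind of evaluation, not the procedure of any study in the table. It assumes synthetic correlated data, an MCAR mechanism with a 20% missing rate, and simple mean and KNN imputers; the helper names `mask_mcar`, `impute_mean`, `impute_knn` and `rmse` are hypothetical.

```python
import numpy as np

def mask_mcar(X, rate, rng):
    """Return a copy of X with entries set to NaN completely at random (MCAR)."""
    mask = rng.random(X.shape) < rate
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X_miss, mask

def impute_mean(X_miss):
    """Replace each missing entry with the mean of the observed values in its column."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)

def impute_knn(X_miss, k=5):
    """Replace each missing entry with the mean of that feature over the k nearest rows.

    Distances use only the features observed in both rows. Entries with no
    usable neighbour keep the column-mean fallback.
    """
    filled = impute_mean(X_miss)
    obs = ~np.isnan(X_miss)
    X0 = np.where(obs, X_miss, 0.0)
    for i in np.where(~obs.all(axis=1))[0]:
        shared = obs & obs[i]                  # features observed in row i and in each other row
        n_shared = shared.sum(axis=1)
        sq = (((X0 - X0[i]) * shared) ** 2).sum(axis=1)
        dist = np.full(X_miss.shape[0], np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt(sq[ok] / n_shared[ok])
        dist[i] = np.inf                       # a row is never its own neighbour
        for j in np.where(~obs[i])[0]:
            cand = np.where(obs[:, j] & np.isfinite(dist))[0]
            if cand.size == 0:
                continue
            nearest = cand[np.argsort(dist[cand])[:k]]
            filled[i, j] = X_miss[nearest, j].mean()
    return filled

def rmse(X_true, X_imp, mask):
    """Root mean squared error over the artificially removed entries only."""
    return float(np.sqrt(np.mean((X_true[mask] - X_imp[mask]) ** 2)))

# Synthetic data with correlated features (an assumption made for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

X_miss, mask = mask_mcar(X, rate=0.2, rng=rng)
X_mean = impute_mean(X_miss)
X_knn = impute_knn(X_miss, k=5)
rmse_mean = rmse(X, X_mean, mask)
rmse_knn = rmse(X, X_knn, mask)
print(f"mean imputation RMSE: {rmse_mean:.3f}")
print(f"KNN imputation RMSE:  {rmse_knn:.3f}")
```

Only `mask_mcar` would need to change to test a MAR or MNAR mechanism, for example by making the probability of removal depend on another feature or on the value itself. Several limitations listed in the table concern exactly this step.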