 Research
 Open access
Adaptive multiple imputations of missing values using the class center
Journal of Big Data volume 9, Article number: 52 (2022)
Abstract
Big data has become a core technology to provide innovative solutions in many fields. However, the datasets collected for analysis in various domains often contain missing values. Missing value imputation is the primary method for resolving problems involving incomplete datasets: missing attribute values are replaced with values estimated from a selected set of observed data using statistical or machine learning methods. Although machine learning techniques can generate reasonably accurate imputation results, they typically require longer imputation times than statistical techniques. This study proposes the adaptive multiple imputations of missing values using the class center (AMICC) approach to produce effective imputation results efficiently. AMICC is based on the class center and defines a threshold from the weighted distances between the center and other observed data for the imputation step. Additionally, the missing values can be estimated adaptively from either the nearest neighbors or the class center. The experiments use 27 numerical, categorical, and mixed datasets from the University of California Irvine (UCI) Machine Learning Repository, with missing values introduced at rates from 10 to 50%. The proposed AMICC approach outperforms the other missing value imputation methods, with an average accuracy of 81.48%, about 9 to 14% higher than that of the other methods. Furthermore, its execution time is about seven seconds longer than that of the Mean/Mode method, and about 10 to 14 s shorter than that of some machine learning approaches.
Introduction
Big data has become a critical technology for developing novel solutions in a wide variety of fields. Large and complex amounts of structured and unstructured data are growing at high-speed rates [1]. The development of big data will enhance the discovery of useful information, such as hidden patterns and unknown correlations, that can be useful in many fields, including healthcare, finance, manufacturing, and social life [2].
Many real-world applications suffer from a common drawback: missing or unknown data. Typical causes include the failure or malfunction of the sensors that provide information, or partial observation of an object of interest because of some hidden phenomenon. For example, some results may be lost in an industrial experiment due to mechanical faults during the data gathering procedure. Likewise, in medical diagnosis, some tests may not be appropriate for certain patients, or the medical report proforma permits the omission of specific qualities.
The quality of data [3] is a significant concern for conducting effective data analytics. Although the outcome of data analysis tasks depends on several factors such as attribute selection, algorithm selection, and sampling techniques, a critical dependency is the efficient handling of missing values [4]. Data that is missing or incorrectly entered by a human results in incorrect predictions [5, 6], as missing values degrade performance. Therefore, missing data is a significant issue in big data analytics, as it can significantly increase the cost of computation and skew the results [7]. As a result, data quality is a fundamental requirement for big data processing, and data quality suffers when missing values are present [8].
A data analysis algorithm cannot handle incomplete datasets directly by itself. The simplest way to deal with this problem is case deletion, which means directly removing all cases that contain missing values [9, 10]. However, if the missing value rate is high, the deletion approach leaves little complete data and can reduce the accuracy of the results [11]. As a result, reliable imputation techniques are necessary to address the problem of missing data. Additionally, imputation of missing data helps maintain the completeness of a dataset, which is critical in both small-scale data mining projects and big data analytics.
To date, missing value imputation (MVI) has been proposed as a promising solution for incomplete datasets [12,13,14,15,16,17,18,19]. MVI can be broadly classified into statistical and machine learning techniques. The mean and mode are common statistical MVI measurements that typically require a short time to compute. However, machine learning MVI techniques, such as support vector machine (SVM) and random forest (RF) methods, require a long computation time to achieve high accuracy [20,21,22,23]. On the other hand, the k-nearest neighbor (KNN) technique [24] requires much less imputation time than other machine learning techniques [25,26,27,28]. However, the KNN method performs only an online search of the nearest neighbors through the Euclidean distance function [29].
Among the KNN-based methods, Troyanskaya et al. [30] and Daberdaku et al. [31] presented a weighted KNN algorithm for missing data imputation. Further, Cheng et al. [32] proposed a KNN method that used purity to enhance the performance of the K nearest neighbors. Fan et al. [33] proposed a weighted KNN approach, which uses the inverse of the Euclidean distance as the weight for each data point. In all these KNN-based weighted methods, the set of nearest neighbors is computed from the weighted distance between the data with missing values and the complete data.
Although more complex algorithms might produce better imputation results, they generally incur a higher computational cost, which is a consideration when choosing between machine learning and statistical techniques [34]. Most machine learning techniques are more computationally expensive than statistical techniques because of the model training and construction process.
Recently, class center-based missing value imputation was proposed to produce relatively better imputation results at a lower computational cost [35,36,37]. The class center is based on the mean of the data samples in a specific class, which is similar to the idea of the cluster center (or centroid) applied in the k-means algorithm [38]. Thereafter, the Euclidean distances between each data sample and the class center are measured to define a threshold that guides the later imputation, as described in “Materials and methods” section.
This study aims to propose an algorithm for missing value imputation that achieves high accuracy yet requires minimal time. It presents a novel imputation method: the adaptive multiple imputations of missing values using the class center (AMICC) approach. The key contributions of our work are as follows: (1) the class center is based on the mean/mode of the data samples and replaces values appropriately, according to the attribute type of the dataset; (2) the proposed adaptive threshold value follows the standard deviation (STD) values, which indicate how the data are spread over the normal range and are used to filter outliers; and (3) for outlier data, the missing values are replaced with more appropriate values by using the median and the average weight distance values of the class.
The remainder of this article is organized as follows: “Materials and methods” section presents the related work and describes the proposed model, the diversity operator, and the AMICC algorithm design. The experimental results are presented in “Experiments and results” section, while “Discussion” section offers a discussion and the conclusions are presented in “Conclusions” section.
Materials and methods
In this section, we first present the missing value imputation in “Missing value imputation” section. “Our adaptive multiple imputations of missing values using class centers” section details our proposed method, the AMICC algorithm.
Missing value imputation
MVI uses a statistical or machine learning method to estimate the observed data chosen to replace the missing values. The simplest statistical methods for continuous and discrete variables are mean and mode imputation [39], respectively. Besides statistical techniques, MVI also uses machine learning methods to estimate the observed data chosen to replace the missing values. For instance, MVI analyzes a pattern classification task where the missing feature is employed as the target output for the classification model. The rest of the complete features are the input attributes used to train and test the model [40].
One of the most widely used machine learning techniques is KNN imputation [41], where missing values are imputed using values calculated from the nearest observed neighbors. To find the nearest neighbors, the preferred choice is the Euclidean distance, which is defined as:
\[ dist(x_{i}, x_{j}) = \sqrt{\sum_{n=1}^{N} \left( x_{ni} - x_{nj} \right)^{2}} \qquad (1) \]
where function \(dist(x_{i}, x_{j})\) computes the distance between the instances \(x_{i}\) and \(x_{j}\), N is the number of attributes or features, and \(x_{ni}\) represents the ith instance in the nth attribute.
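As a minimal sketch of KNN imputation with the Euclidean distance above, the following Python fragment fills each missing attribute with the mean of that attribute over the k nearest complete rows. The function names, the use of `None` to mark a missing value, and the choice to measure distance only over the observed attributes are illustrative assumptions, not details taken from the cited KNN methods.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_impute(row, complete_rows, k=3):
    """Fill None entries in `row` with the mean of the k nearest complete rows.

    Assumption: distances are measured only over the attributes observed in `row`.
    """
    observed = [i for i, v in enumerate(row) if v is not None]

    def dist(other):
        return euclidean([row[i] for i in observed], [other[i] for i in observed])

    neighbors = sorted(complete_rows, key=dist)[:k]
    return [
        v if v is not None else sum(n[i] for n in neighbors) / len(neighbors)
        for i, v in enumerate(row)
    ]


# Toy data: the third attribute of the query row is missing.
complete = [[0.0, 0.0, 1.0], [0.1, 0.0, 3.0], [1.0, 1.0, 9.0]]
filled = knn_impute([0.0, 0.1, None], complete, k=2)
```

Here the two nearest rows are the first two, so the missing attribute is filled with the mean of 1.0 and 3.0.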
The baselines compared with the proposed method are usually based on statistical and machine learning techniques. Two other well-known machine learning techniques for MVI are the SVM [42] and RF [43] algorithms. The SVM algorithm uses kernel functions for the nonlinear mapping of an original feature space into a higher-dimensional feature space to build a hyperplane. The RF algorithm is an ensemble of decision tree classifiers, which establishes the outcome based on the predictions of the decision trees. RF predicts the outcome by taking the mean of the outputs of the various trees.
Our adaptive multiple imputations of missing values using class centers
Many real-world datasets contain missing values, which appear as not-a-numbers (NaNs), blank fields, or other placeholders. Training a model on a dataset with many missing values can drastically degrade the quality of the machine learning or statistical model and results in higher computational costs. If the amount of missing data is large, the efficiency will fluctuate accordingly.
As indicated in “Missing value imputation” section, missing values are commonly replaced by mean/mode. Hence, in the AMICC method, the class center is based on the mean/mode of the data samples in a specific class. In the MVI approach, the AMICC method replaces the missing values with the mean or mode depending on the attribute type. In the outlier data of each class, the AMICC method identifies the threshold values for checking the outlier data; the threshold values are calculated based on the distances between the class centers and their correspondence to the complete data.
As shown in Fig. 1, the AMICC approach comprises three modules: the first performs data preprocessing (“Data preprocessing” section), the second identifies the threshold (“Threshold identification” section), and the third imputes the missing values (“Imputation of missing values” section). These three modules are described in the following subsections.
Data preprocessing
In the data preprocessing module, missing values are introduced into the UCI datasets under the missing completely at random (MCAR) mechanism, in which the presence of missing data does not depend on the input values per se [44]. Therefore, in large datasets plagued by MCAR missing data, samples with missing values can be discarded without biasing the distribution of the remaining data.
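This study simulated missing rates of 10%, 20%, 30%, 40%, and 50% [9, 15, 20, 32, 35, 41] to compare the proposed method with the other imputation methods in the UCI dataset experiments. As shown in Eq. (2), the missing rate is the number of missing values expressed as a percentage of the total number of values in the dataset:
\[ \text{Missing rate (\%)} = \frac{\text{number of missing values} \times 100}{\text{number of examples} \times \text{number of features}} \qquad (2) \]
All variables except the class attribute had their missing values simulated.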
The missing rate of 50% in this study is the highest when the number of examples and features is considered. For example, consider the Blood dataset, which contains 748 examples and nine features, as illustrated in Table 1. According to equation (2), there were 3,366 missing values when the missing rate was set to 50% (50 = 3366 × 100 / (748 × 9)).
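The Blood example can be checked, and an MCAR pattern simulated, with a short Python sketch. The helper names and the use of `None` as the missing marker are assumptions made for illustration; the rows passed in are taken to hold the feature attributes only, since the class attribute is never blanked.

```python
import random


def missing_count(n_examples, n_features, rate_percent):
    """Number of cells to blank out for a given missing rate (in percent)."""
    return round(rate_percent * n_examples * n_features / 100)


def inject_mcar(rows, rate_percent, seed=0):
    """Return a copy of `rows` with cells set to None uniformly at random."""
    rng = random.Random(seed)
    n, f = len(rows), len(rows[0])
    cells = [(r, c) for r in range(n) for c in range(f)]
    chosen = rng.sample(cells, missing_count(n, f, rate_percent))
    out = [list(row) for row in rows]  # copy so the input stays untouched
    for r, c in chosen:
        out[r][c] = None
    return out


# Blood example from the text: 748 examples, nine features, 50% missing rate.
blood_missing = missing_count(748, 9, 50)

# Toy dataset with 5 x 4 = 20 cells, half of which become missing.
toy = inject_mcar([[1, 2, 3, 4]] * 5, 50)
```

Because every cell is equally likely to be selected regardless of its value, the resulting pattern is independent of the data, which is the defining property of MCAR.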
Additionally, normalization is a technique frequently used during the data preparation process. The goal of normalization is to change the values of numeric columns in the dataset to a standard scale, without distorting differences in the ranges of values. The incomplete dataset must be normalized in the domain [0,1], as normalized data on the same scale avoids the effect of different attribute ranges on distance calculation. Thereafter, the incomplete dataset is divided into two subsets: one is the incomplete data containing the missing values for later imputation, and the other is the complete data without missing values for calculating the initial values of the next step.
For example, Fig. 2 shows a three-class incomplete dataset with ten feature dimensions (\(F=10\)), in which the question marks represent attributes with missing values. Class i (i = 1 to N; N = 3) of D, denoted by \(D_{i}\), is divided into \(D_{i\_complete}\) and \(D_{i\_incomplete}\).
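The normalization and splitting steps can be sketched as follows for numerical attributes. This is an illustration under stated assumptions rather than the authors' implementation: missing cells are marked with `None`, every column has at least one observed value, a constant column is mapped to 0.0, and the split is shown for a single class.

```python
def min_max_normalize(rows):
    """Scale each numeric column to [0, 1], ignoring None (missing) cells."""
    n_cols = len(rows[0])
    out = [list(r) for r in rows]
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        lo, hi = min(observed), max(observed)
        span = hi - lo
        for r in out:
            if r[c] is not None:
                # A constant column has no spread; map it to 0.0.
                r[c] = (r[c] - lo) / span if span else 0.0
    return out


def split_by_completeness(rows):
    """Separate rows without missing values from rows that contain any."""
    complete = [r for r in rows if None not in r]
    incomplete = [r for r in rows if None in r]
    return complete, incomplete


data = [[2.0, 10.0], [4.0, None], [6.0, 30.0]]
scaled = min_max_normalize(data)
complete, incomplete = split_by_completeness(scaled)
```

Normalizing before the split means that complete and incomplete rows share the same scale when distances are computed later.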
Threshold identification
Figure 3 shows the process of identifying the threshold based on the distances between the class centers and their corresponding complete data, which is described in more detail below.
From the incomplete dataset D containing N classes, dataset D is divided into complete (\(D_{\_complete}\)) and incomplete (\(D_{\_incomplete}\)) subsets, where \(D_{\_incomplete}\) contains missing values. For the ith class of \(D_{i\_complete}\), the class center (cent(\(D_{i}\))), mode, and median are calculated. When computing the class center values for a numerical attribute, the mean is used as the class center. Otherwise, if the attribute is categorical, the mode value is the class center value.
Next, the Euclidean distances between cent(\(D_{i}\)) and every data sample in Class i are computed. Figures 4 and 5 show an example of calculating the center of Class 1, cent(\(D_{1}\)), and the distances between cent(\(D_{1}\)) and the other data samples.
In Fig. 4a, the mean is used as the class center for calculating the distances for a numerical dataset; in Fig. 4b, the mode is used for a categorical dataset; and in Fig. 5, the mean or mode is used for a mixed dataset. The mean or mode of these distances is then used as the threshold (\(T_{1}\)) for Class 1. This step is repeated until the threshold for each class is obtained. The pseudocode for the threshold identification module is shown in Algorithm 1.
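For the numerical case, threshold identification can be sketched in a few lines of Python. The function names are hypothetical, and the sketch covers only the numerical path (mean as class center, mean distance as threshold); the categorical path based on the mode is not shown.

```python
import math
from statistics import mean


def class_center(rows):
    """Column-wise mean of the complete rows of one class (numerical case)."""
    return [mean(col) for col in zip(*rows)]


def class_threshold(rows):
    """Return the class center and the mean Euclidean distance to it."""
    center = class_center(rows)
    dists = [math.dist(r, center) for r in rows]
    return center, mean(dists)


# Complete data of a toy Class 1: the four corners of a square.
d1_complete = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]
center, t1 = class_threshold(d1_complete)
```

In this toy class every sample lies at the same distance from the center, so the threshold equals that common distance; with real data the threshold is the average spread of the class around its center.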
Imputation of missing values
Imputation techniques can be straightforward or quite complicated. These techniques compute the mean/mode of the nonmissing values in the complete data and replace the missing values in incomplete data. A single value replaces a missing value for a single imputation, such as the mean of the entire dataset. Multiple imputations are widely accepted as the standard for dealing with missing data in a variety of research fields. Multiple imputations are used to derive unbiased and valid estimates from available data.
In outlier data, the AMICC method checked the normal distribution using the STD value to determine whether the given measurement deviates from the mean. In statistics, STD is a frequently used yardstick of measurement variability. A low STD value indicates that data points are typically very close to the norm, whereas a high STD value indicates that data points span a wide range of values.
Figure 6 shows that the imputation of missing values consists of two steps: the first step performs a preliminary imputation of the missing value using the mean/mode of each attribute in a class, and the second step checks for outlier data using the STD values. There are two ways to handle outlier data. (1) If STD \(<=\) 1, the outlier data are checked by calculating the distance between the imputed sample and the class center; if the distance exceeds the threshold, the missing value is considered outlier data and replaced with the median value. (2) If STD > 1, the missing value is considered an outlier. The average weight distance is calculated from the weight distance between the missing value and its nearest neighbors in the complete data, and the missing value is then replaced with the average weight distance. The proposed method for imputing values is described in detail below.
Step 1: For Class i, the incomplete dataset (\(D_{i\_incomplete}\)) contains Num data samples with missing values. Figures 7, 8, and 9 illustrate examples of a Class 1 incomplete dataset (\(D_{1\_incomplete}\)) for numerical, categorical, and mixed datasets, respectively, where the data j (j = 1 to Num) contain one missing value in Figs. 7a, 8a, and 9a and multiple missing values in Figs. 7b, 8b, and 9b. In the examples shown in these figures, the missing feature of data j, cent(\(D_{1}\)), and the imputed values are shown in red text. The distance between cent(\(D_{1}\)) and the imputed data j is calculated and compared with the threshold \(T_{1}\) in the next step.
Step 2: This step consists of two cases. In case (1), if STD \(<=\) 1, the algorithm takes the preliminarily imputed dataset from Step 1 and compares each imputed sample with the threshold value of its class; Fig. 10 illustrates how outlier data are imputed in this case. In Fig. 11, for example, if the distance is less than \(T_{1}\), the imputation process for data j is complete; otherwise, each outlier datum is imputed with the median of Class 1.
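Step 1 and case (1) of Step 2 can be sketched together for numerical attributes. The function name, the `None` missing marker, and the strict "greater than" comparison against the threshold are assumptions made for this illustration; the STD check that selects this branch is not shown.

```python
import math
from statistics import mean, median


def impute_low_spread(row, class_complete, threshold):
    """Mean-impute `row`; fall back to the class median if the imputed row
    lands farther from the class center than `threshold`."""
    cols = list(zip(*class_complete))
    center = [mean(c) for c in cols]
    medians = [median(c) for c in cols]
    missing = [i for i, v in enumerate(row) if v is None]
    # Step 1: preliminary imputation with the class center (column means).
    filled = [center[i] if v is None else v for i, v in enumerate(row)]
    # Step 2, case (1): compare the distance to the center with the threshold.
    if math.dist(filled, center) > threshold:
        for i in missing:
            filled[i] = medians[i]
    return filled


# Toy class whose second attribute is skewed by one large value.
cls = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 4.0]]
near = impute_low_spread([0.3, None], cls, threshold=0.5)  # keeps the mean
far = impute_low_spread([0.9, None], cls, threshold=0.5)   # uses the median
```

The second call shows why the median is a useful fallback: the class mean of the skewed attribute is pulled toward the single large value, whereas the median is not.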
In the other case (2), if STD > 1, Fig. 12 shows how the outlier data are imputed. According to Eq. (3), the average weight distance [33] is obtained by calculating the weight distance between the missing value and its nearest neighbors in the complete data.
where \(W_{i}\) is a weight distance of ith outlier data, \(y_{i}\) is the ith instance of outlier data, and \(x_{1}\) is the first instance of complete data. From “Missing value imputation” section, function \(dist(y_{i}, x_{j})\) computes the distance between the instance \(y_{i}\) and \(x_{j}\). This step is repeated until the average weight distance for each instance is obtained.
After all average weight distances have been computed, the outlier data are imputed with the average weight distance. Algorithm 2 gives the pseudocode for missing value imputation.
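To give a feel for distance-weighted imputation, the sketch below uses a generic inverse-distance-weighted average over the k nearest complete rows, in the spirit of the weighted KNN idea attributed to [33] in the Introduction. It is a stand-in and makes no claim to reproduce the exact form of Eq. (3); the function name, the parameter `k`, the small constant `eps`, and the `None` missing marker are all assumptions.

```python
import math


def weighted_neighbor_impute(row, class_complete, k=3, eps=1e-9):
    """Fill None entries with an inverse-distance-weighted neighbor average.

    Assumption: weights are 1 / (distance + eps), with distances measured
    over the attributes that are observed in `row`.
    """
    observed = [i for i, v in enumerate(row) if v is not None]

    def dist(other):
        return math.dist([row[i] for i in observed], [other[i] for i in observed])

    nearest = sorted(class_complete, key=dist)[:k]
    weights = [1.0 / (dist(n) + eps) for n in nearest]
    total = sum(weights)
    return [
        v if v is not None
        else sum(w * n[i] for w, n in zip(weights, nearest)) / total
        for i, v in enumerate(row)
    ]


cls = [[0.0, 10.0], [2.0, 20.0], [10.0, 50.0]]
estimate = weighted_neighbor_impute([1.0, None], cls, k=2)
```

In this toy case the two nearest rows are equally distant, so they receive equal weight and the estimate falls midway between their values; closer neighbors would otherwise dominate.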
Experiments and results
This section presents the performance evaluation and comparison of the proposed AMICC method and statistical and machine learning methods.
Experimental setup
The experimental data included 13 numerical, six categorical, and eight mixed datasets collected from the UCI Machine Learning Repository [45]. These datasets have been the subject of several studies on machine learning methods and cover examples of small, medium, and large-size datasets [9, 12, 13, 24, 25, 35, 41]. The characteristics of these datasets are shown in Tables 1, 2, and 3. All the datasets show considerable diversity in the number of examples, features, and classes.
In Table 3, the numbers of numerical and categorical attributes are indicated in parentheses for each dataset. A mixed dataset is treated as a numerical dataset if the number of numerical attributes is greater than that of categorical attributes. Otherwise, the dataset is treated as a categorical dataset. For example, as the Abalone dataset consists of seven numerical attributes and one categorical attribute, this mixed dataset is treated as numerical, and distances are calculated based on the mean. On the other hand, as the Acute dataset consists of one numerical and five categorical attributes, it is treated as categorical, and distances are calculated based on the mode.
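The rule for mixed datasets is small enough to state as code. The function name is hypothetical; the tie case (equal counts) falls under "otherwise" and is therefore treated as categorical in this sketch.

```python
def dataset_kind(n_numerical, n_categorical):
    """Treat a mixed dataset as numerical only if numerical attributes dominate."""
    return "numerical" if n_numerical > n_categorical else "categorical"


abalone = dataset_kind(7, 1)  # seven numerical, one categorical
acute = dataset_kind(1, 5)    # one numerical, five categorical
```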
Tables 4, 5, and 6 summarize the numbers of complete and incomplete data at each missing rate for the numerical, categorical, and mixed dataset types, respectively, where c is the number of complete data and i is the number of incomplete data. As shown in Eq. (2) in “Data preprocessing” section, the missing rate is the number of missing values expressed as a percentage of the total number of values in the dataset. All variables except the class attribute had their missing values simulated, at missing rates of 10%, 20%, 30%, 40%, and 50%.
The missing rate of 10% in this study is the lowest when the number of examples and features is considered. For example, consider the Spect dataset, which contains 267 examples and 22 features, as illustrated in Table 2. According to equation (2), there were 587 missing values (incomplete data) when the missing rate was set to 10% (10 ≈ 587 × 100 / (267 × 22)).
Next, K-fold cross-validation was used to decrease the bias of the test results [46]. This is an effective method of improving the evaluation and comparison of learning algorithms by dividing the data into K segments. In each iteration, one of the K segments is used to test the model, and the other K − 1 segments are combined to form a training set. This study used tenfold cross-validation to reduce the bias associated with random sampling [47, 48].
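The fold construction can be sketched with the standard library alone. The generator name, the shuffling seed, and the round-robin assignment of samples to folds are illustrative assumptions; any partition into K disjoint segments would serve the same purpose.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # K disjoint segments
    for i, test in enumerate(folds):
        # The other K - 1 segments are combined to form the training set.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each split would then be used to impute the training portion, train the classifier on it, and measure accuracy on the held-out segment.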
In the classification phase, after different techniques individually imputed the missing values of the incomplete training subset, each training subset was used to train an SVM classifier. The testing subset was used to examine the classification accuracy of the SVM classifier. The MCAR mechanism for the incomplete dataset was implemented ten times for each missing rate, to avoid biased results, as indicated in “Missing value imputation” section.
During the MVI process, the proposed AMICC approach was compared to baseline approaches consisting of statistical methods (Mean/Mode imputation), machine learning methods (SVM [42], KNN [27], and RF [43]), and a class center-based MVI approach (CCMVI [35]). In statistical MVI methods, the mean/mode are common statistical measurements used to replace all missing values with the mean/mode value. In machine learning MVI techniques, SVMs are effective in various pattern recognition problems and provide superior classification performance due to their modeling flexibility. KNN is among the most widely used data mining techniques and has been adapted for missing value imputation. Additionally, the RF significantly improves correlation; it is reasonable for missing data ranging from moderate to high. On the other hand, CCMVI is based on determining the class center and using the distances between the class center and other observed data to define a threshold for later imputation.
In the evaluation phase, the accuracy of the results obtained from the model is defined, as described by Eq. (4).
where N is the number of data points in the dataset to be classified (the test set), \(n \in N\), and \(nc\) is the original class of item n. Function f is equal to 1 if \(classify(n)= nc\); otherwise it is 0.
The root-mean-square error (RMSE) is a commonly used metric for comparing the actual values to the values imputed by various MVI techniques [12, 22, 41]. This measure is solely appropriate for numerical data values [35]. The RMSE of a model prediction for an estimated variable of \(X_{model}\) is given below.
where \(X_{obs}\) is the observed value and \(X_{model}\) is the modeled value. This study used the RMSE to measure the error of the imputation method because a relatively high RMSE is undesirable. The smaller the error is, the more accurate the model.
In addition to classification accuracy, the hit rate is the number of hits divided by the size of the test dataset. The predicted rating is called a hit if its rounded value is equal to the actual rating identified in the test dataset. The hit rate can be used to evaluate the performance of a model for categorical data [35], as it represents the percentage of instances where the model correctly predicts the actual class of an observation, as described by Equation (6).
where \(n_{hits}\) is the number of hits associated with the actual rating and \(n_{total}\) is the number of test samples.
The performance of the proposed predictive model was measured using the accuracy, RMSE, and hit rate to determine the efficiency of the proposed model compared to that of other existing methods.
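The three measures can be sketched in Python from their verbal definitions above. The function names are hypothetical, the results are returned as fractions rather than percentages (multiply by 100 to compare with reported percentages), and the RMSE uses the standard mean-of-squared-differences form, which is assumed to match Eq. (5).

```python
import math


def accuracy(predicted, actual):
    """Share of test items whose predicted class equals the original class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def rmse(observed, modeled):
    """Root-mean-square error between observed and imputed numeric values."""
    return math.sqrt(
        sum((o - m) ** 2 for o, m in zip(observed, modeled)) / len(observed)
    )


def hit_rate(imputed, actual):
    """Hits (rounded imputed value equals the actual value) over test size."""
    return sum(round(i) == a for i, a in zip(imputed, actual)) / len(actual)


acc = accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"])  # 3 of 4 correct
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])                # one value off by 2
hits = hit_rate([0.9, 2.2, 3.6], [1, 2, 3])                 # 3.6 rounds to 4
```

Accuracy evaluates the downstream classifier, whereas RMSE and hit rate evaluate the imputed values themselves, for numerical and categorical attributes respectively.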
Accuracy analysis
Tables 7, 8, and 9 summarize the average classification performance for the numerical, categorical, and mixed dataset types, respectively, under the setup described in “Experimental setup” section. The results in Table 7 show the average classification of the Mean method for numerical datasets, while the results in Table 8 show the average classification of the Mode method for categorical datasets. The average results indicated that the AMICC method achieved the highest accuracy across all dataset types, at least 79.349%, 87.865%, and 77.721%, respectively, and significantly outperformed the other methods (\(p<\) 0.001). Similarly, Table 9 shows the average results for real data from Srinagarind hospital, revealing that the AMICC approach attained the maximum accuracy of at least 81.094%.
Subsequently, the CCMVI method outperformed the SVM, KNN, and RF methods, all of which produced comparable average results, whereas the Mean/Mode method performed poorly.
Figures 13, 14, and 15 show the average classification performance of numerical, categorical, and mixed datasets, respectively. Each bar represents the variation in accuracy over the five missing rates (10–50%). For the Mean/Mode, SVM, KNN, and RF methods, high accuracy is associated with a low missing value rate [9, 18, 35]. On the other hand, the CCMVI and AMICC methods produce highly accurate results even at a high missing rate and achieve higher accuracy than the other methods.
AMICC and CCMVI performed well despite high missing rates because they replaced missing values with class statistics. Thus, even when the missing rate was high and the number of mean/mode values used to replace missing values increased, the results remained highly accurate.
RMSE and hit rate analysis
Tables 10, 11, 12, and 13 illustrate the distribution of the RMSE/hit rate values over all the experiments performed on the numerical, categorical, and mixed datasets, respectively. Each result contains the RMSE/hit rate values attained by each imputation method.
Tables 10 and 11 show the average RMSEs of all missing rates of the MVI methods for the numerical and mixed datasets, respectively. The AMICC method outperformed the other methods, with the RMSEs for numerical and mixed datasets under 0.716 and 0.785, respectively. The CCMVI method was the next best, with RMSEs under 1.016 for the numerical datasets and under 0.993 for the mixed datasets. The other MVI methods demonstrated similar average RMSE results.
On the other hand, Table 11 illustrates the average result for real data from Srinagarind hospital, demonstrating that the AMICC technique outperformed the other methods, with RMSEs for mixed datasets under 0.502.
Tables 12 and 13 show the average hit rates, also known as recall or sensitivity, of all missing rates of the MVI methods for the categorical and mixed datasets, respectively. The AMICC method outperformed the other methods with the hit rate for categorical and mixed datasets at 50.654% and 33.791%, respectively. The CCMVI method was the next best method for categorical and mixed datasets at 49.623% and 19.614%, respectively. The Mean/Mode, SVM, KNN, and RF methods demonstrated similar average hit rate results.
Additionally, Table 13 shows the average hit rate for real data from Srinagarind hospital, indicating that the AMICC technique outperformed the other methods, with a hit rate at 73.797% for mixed datasets.
Figure 16 shows the boxplots of the average RMSEs of the missing values for the MVI methods on the numerical and mixed datasets. Each boxplot displays the average RMSE result for all the missing rates. The red pluses indicate the means, and the blue horizontal line within each box shows the median. The AMICC approach had an RMSE value less than those of all the other MVI methods on the numerical and mixed datasets, with values under 1.
Figure 17 shows the boxplots of the average hit rate of the missing values of the MVI methods on the categorical and mixed datasets. Each boxplot displays the average hit rate result for all missing rates. The AMICC approach had a hit rate higher than those of all the other MVI methods. In addition, the median for the AMICC approach lay at the midpoint of the interquartile range, suggesting that its hit rates were approximately normally distributed.
Execution time analysis
When choosing a suitable algorithm for missing value imputation, it is necessary to consider not only algorithm accuracy but also algorithm execution time.
Table 14 shows the overall average execution time of the MVI methods for the datasets. The average execution time results show that the Mean/Mode method required the least execution time at 9.612 s. The KNN method had the second-fastest execution time of 10.905 s, approximately 1.293 s longer than the Mean/Mode method. The KNN method’s execution time was much shorter than that of the other machine learning techniques because the KNN algorithm is a lazy learning method that does not require a model learning process.
The AMICC method was the third-fastest at 16.392 s, about 6.780 s longer than the Mean/Mode method; it took more execution time because it needed to calculate the class center distance of the imputed missing data and replace the missing values [27, 35].
Discussion
The experimental results show that the proposed method with multiple imputations using the class center outperforms the other MVI methods. As shown in Table 15, the AMICC approach comprises three important algorithms: the first algorithm (1) checks the attribute type; the second algorithm (2) defines the class center; and the third algorithm (3) replaces outliers with weight distance and median values. These three algorithms are discussed below.
Table 15 summarizes the average classification accuracies for the component algorithms of the proposed method. AMICC verified the type of attribute before replacing the missing values. As described in “Imputation of missing values” section, lines 3 to 7 of the pseudocode in Algorithm 2 check the attribute type and replace the missing values with the mean/mode value. For example, when attribute type checking was added to the baseline, the accuracy improved to 65.660%, 75.436%, and 61.664% for the numerical, categorical, and mixed dataset types, respectively. The performance was enhanced because if missing data were substituted with a value inappropriate for the attribute type, the imputed value became noise. In other words, if the pseudocode for checking the attribute type were removed, the result would be as described in row 1 of Table 15 for the Baseline algorithm.
In addition, the AMICC method outperformed the others because it defined a class center algorithm. The class center was calculated using the mean of the data samples within a particular class, which is similar to the cluster center or centroid concept, which can represent the content of a class. Section “Threshold identification” defines a class center algorithm for lines 8 to 12 of the pseudocode in Algorithm 1. For example, when a defined class center algorithm was included, the accuracy rose to 69.220%, 80.701%, and 64.634% for the numerical, categorical, and mixed datasets, respectively. Additionally, the efficiency was boosted because if missing data were replaced with an outer class mean value that is not suitable for the class center, the imputed value became inaccurate.
Furthermore, the AMICC method specified a threshold value for outlier detection and replaced outliers with weight distance and median values. In “Threshold identification” section, lines 18 to 21 of the pseudocode in Algorithm 1 illustrate how the threshold value is identified. The imputation of missing values occurs in lines 15 to 18 of the pseudocode in Algorithm 2. For example, when imputed missing values were combined with weight distance and median values, the accuracy increased to 79.349%, 87.865%, and 77.721% for the numerical, categorical, and mixed datasets, respectively. By comparison, the proposed method outperformed the CCMVI method, which relied on threshold values and replaced outliers by subtracting and adding STD values. Other MVI techniques, such as the SVM, KNN, RF, and Mean/Mode, did not provide a threshold or check for outlier data; consequently, if missing data were replaced and then became outlier data, the imputed value became noise.
Conclusions
Big data has been applied to provide effective solutions in several fields. However, much of the big data collected in various domains contains missing values. In this study, we proposed the adaptive multiple imputations of missing values using the class center (AMICC) approach to produce promising imputation results. The AMICC method is composed of three modules. The first module performs data preprocessing: the incomplete dataset is normalized to a common scale and then split into incomplete and complete data. The second module determines the threshold by calculating the distance between data samples and their associated class centers. Finally, the third module performs the missing value imputation.
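The preprocessing performed by the first module can be sketched as below. Min-max scaling to the range [0, 1] is an assumption, since the text only states that the data are normalized to the same scale; the function names and the `None` marker for missing values are likewise illustrative.

```python
def min_max_normalize(rows):
    """Scale each attribute to [0, 1] using its observed minimum and
    maximum; missing entries (None) are left untouched."""
    cols = list(zip(*rows))
    lows = [min(v for v in col if v is not None) for col in cols]
    highs = [max(v for v in col if v is not None) for col in cols]

    def scale(v, low, high):
        if v is None:
            return None
        # A constant attribute carries no spread; map it to 0.0.
        return 0.0 if high == low else (v - low) / (high - low)

    return [[scale(v, lo, hi) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def split_complete(rows):
    """Separate samples without missing values from those with them."""
    complete = [r for r in rows if None not in r]
    incomplete = [r for r in rows if None in r]
    return complete, incomplete


data = [[0.0, 10.0], [5.0, None], [10.0, 30.0]]
complete, incomplete = split_complete(min_max_normalize(data))
print(complete)
print(incomplete)
```

The complete subset would then feed the class center and threshold computation of the second module, and the incomplete subset is what the third module imputes.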
The experiments were conducted using numerical, categorical, and mixed datasets. The AMICC method was compared with two statistical techniques (i.e., the CCMVI and Mean/Mode imputation methods) and three well-known machine learning methods (i.e., the SVM, RF, and KNN algorithms). The results showed that the proposed AMICC method outperformed the other techniques on all the datasets by achieving the highest accuracy, the lowest RMSE, and the highest hit rate among the six methods evaluated. The performance of the AMICC method was superior because it checked the type of each attribute in the dataset and replaced values according to the attribute type. For numerical and categorical variables, the AMICC method replaced missing values with the class mean and mode, respectively. Additionally, it replaced outlier data using the weighted distance and median values.
As part of our future work, we intend to investigate two questions. First, different distance functions could be used to define the threshold values and compared to determine the optimal function. Second, for the outlier threshold values, an investigation into the choice of method (median, STD, or mean) may lead to higher accuracy and faster computation.
Availability of data and materials
The datasets used in this study appear in http://archive.ics.uci.edu/ml (accessed on 1 May 2021).
Abbreviations
AMICC: Adaptive multiple imputations of missing values using the class center
CCMVI: Class center-based missing value imputation approach
KNN: K-nearest neighbor
MCAR: Missing completely at random
MVI: Missing value imputation
NaNs: Not-a-number values
RMSE: Root mean square error
RF: Random forest
STD: Standard deviation
SVM: Support vector machine
UCI: University of California Irvine Machine Learning Repository
Acknowledgements
The authors acknowledge the financial support from both funding parties.
Funding
This research was funded by the Thailand Research Fund under grant No. RDG6050040 and the Faculty of Engineering, Khon Kaen University, Khon Kaen province under grant No. Ph.D001/2561 as well as NSERC (Canada) and University of Manitoba.
Author information
Contributions
The author confirms the sole responsibility for this manuscript fully as a sole author for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Phiwhorm, K., Saikaew, C., Leung, C.K. et al. Adaptive multiple imputations of missing values using the class center. J Big Data 9, 52 (2022). https://doi.org/10.1186/s40537-022-00608-0
DOI: https://doi.org/10.1186/s40537-022-00608-0