Missing values compensation in duplicates detection using hot deck method

Duplicate record is a common problem within data sets especially in huge volume databases. The accuracy of duplicate detection determines the efficiency of duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within the records where during the clustering and matching process, missing values can cause records deemed similar to be inserted into the wrong group, hence, leading to undetected duplicates. In this paper, duplicate detection improvement was proposed despite the presence of missing values within a data set through Duplicate Detection within the Incomplete Data set (DDID) method. The missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. The results were analyzed, then, the performance of duplicate detection was evaluated by using the Hot Deck method to compensate for the missing values in the key attributes. It was hypothesized that by using Hot Deck, duplicate detection performance would be improved. Furthermore, the DDID performance was compared to an early duplicate detection method namely DuDe, in terms of its accuracy and speed. The findings yielded that even though the data sets were incomplete, DDID was able to offer a better accuracy and faster duplicate detection as compared to DuDe. The results of this study offer insights into constraints of duplicate detection within incomplete data sets.

Duplicates are detected by searching for all objects (or records) that represent the same real-world entity to ensure database consistency [6].A duplicate detection mechanism tends to search similarities between two or more entities by using similarity measures.The similarity measure is based on equivalence relation where, suppose the real-world entities are er, es, and et, if er ≡ es and er ≡ et then es ≡ et .Because of this transitive relationship, the duplicate detection will be partitioning all database entities into nonempty and disjointed partition classes where each system refers to another real-world entity.Besides, an entity is represented as a relation in a relational database.
Although clustering is the result, decisions upon the presence of duplication are often made in a pairwise fashion before they are combined with a globally consistent result, namely duplicate clustering.The pairwise comparison approach relies on the use of similarity in attribute values to indicate real-world equivalence between the corresponding database entities.According to the score of similarity between the attribute values during the candidate pair comparison, each entity pair is assigned to the MATCHES class (assumed duplicates) or the UNMATCHES class (no duplicates are assumed).The similarity measure can calculate the similarity for the attributes individually and aggregate their similarity to an overall record similarity.If necessary, a domain expert might be called to verify the similarity results.
In general, duplicate detection methods are prone to two types of error: (a) false-positives: detection of the actual non-duplicates as duplicates, and (b) false negatives: no detection of the actual duplicates [7].Relaxing the matching criteria increases the number of false positives while tightening them may increase false negatives.There is usually a trade-off between both types of errors in duplicate detection.Some applications consider that false positives are worse than false-negative which results in their effect on the accuracy of the detected duplicates, whereas the opposite is true with other applications [8].

The challenge of detecting duplicates
Detecting duplicates is challenging especially within incomplete data sets.In this case, pairwise matching tends to be unsuccessful without a pre-processing step.Such pre-processing step is removing attributes with high rates of missing values or replacing missing values with estimated values based on the data set existing values (imputation) [9,10].Besides, before matching can be performed, attribute selection must be conducted.The desired high matching ratio (that relies on the results of attribute selection) is crucial in determining valid string comparison at a later stage.
Detecting duplicates within incomplete data sets pose a unique challenge [11][12][13][14][15].This is because missing values that are present within records will make these records look unique although they refer to the same real-world entity.As a result, several records (the duplicates) of the same real-world entity are kept and treated as if they are different.Therefore, the challenges facing the duplicate detection process within incomplete data sets can be summarized as follows: • Data clustering: In duplicate detection, the data sets are partitioned into small groups or clusters to reduce the time for pairwise comparisons.Clustering incomplete data sets is difficult especially if the attributes used to create a sorting key contain missing values.As a consequence, the probability of the records (that are likely to be similar) being grouped will be reduced.Thus, candidate attributes that can be used to create the sorting key based on property uniqueness and completeness need to be identified accurately.Candidate attributes refer to the attributes with low repetition value rate and low missing value rate.The selection of these attributes determines the success of duplicate grouping.However, the success ratio depends on the number of missing values of the selected attributes.Therefore, the process of compensating the missing values in the selected attributes is necessary to increase the probability of having records that refer to the same entity of the real world in the same group [16].• Data matching: This process is important in finding duplicate records in a data set.
The matching process is performed by comparing the values of the attributes in several ways such as by character to character, token to token, or sound to sound.The standard matching methods usually fail when the data sets under consideration are incomplete, especially in the case involving nominal values [17].
Comparing all attribute values is expensive.Therefore, to find a suitable string (called a token) from the attributes to increase the probability of matching records is deemed crucial.The creation of the comparison strings with the same mechanism of the selected attributes to form sorting keys may seem logical.Although a large number of presented methods for duplicate detection exist, there is still a need for substantial improvement to deal with a huge number of errors daily [18].Therefore, this calls for a new method to address the problem of duplicate record detection within incomplete data sets.Full duplicate detection is often unattainable, and whether the false-positive results are preferred more than the false negatives depends on the requirement of database applications.An independent implementation process for duplicate detection algorithm needs to adjust the values of the input parameters, such as values of similarity measures or threshold [19][20][21].These values have to be adapted according to the given detection scenario.Thus, the configuration of duplicate detection depends heavily on the domain under study and some quality characteristics (such as completeness or accuracy) for the attributes within the data sets.
The rest of this paper is structured as follows.In the Related Work section, the problem of missing values and the work related to duplicate detection techniques are presented.The details on the mechanism of the proposed method for detecting duplicates within incomplete data sets are provided in the Duplicate Detection within Incomplete Data Sets (DDID) section, followed by a section on Experimental Setup, Results, Discussion and Conclusion.

Related work
The problem of duplicate detection has been addressed under different names such as record linkage, entity matching, record reconciliation, or merge/purge.In general, the proposed algorithms for duplicate detection are commonly aimed to achieve: 1.The efficiency of duplicate detection process by reducing the number of comparisons for candidate pairs.In this case, the speed of the algorithm is of concern and influenced by the number and cost of comparisons; and 2. The effectiveness of duplicate detection to classify candidate pairs of records as duplicates or non-duplicates accurately.In this case, the accuracy of the algorithm is of concern [22,23].[25,26]).In general, researchers are aware of the difficulty of detecting duplicates within incomplete data sets [12,18].Incomplete data sets due to missing values have been reported as one of the challenges in data duplication [10,27], along with other issues.These issues are typographical errors, abbreviations, and different representations of the same logical values [12,23,[28][29][30][31]. Records with missing entries are of two forms: fully missing (all attributes' values except the primary key are missing) or partially missing (some of the attributes' values are missing).In the case of fully missing, as information about the record is missing entirely, the probability to label it as a duplicate is lower as compared to the partially missing case.In reality, fully missing is a rare case.Detecting duplicate records with partially missing values is difficult especially in the case where a real-world entity is accidentally represented by multiple key attribute values in a data set (i.e., as a result of data integration) or in the case where the key attribute is missing (or unknown) [32].
Missing value problem is usually handled either by predicting the value or by disregarding the missing value completely [33].The most common methods to deal with the strings that contain missing values can be categorized as follow: 1. Deletion: A completely missing point is skipped when utilizing a data set with the missing value [34].2. Interpolation: In a geometric sense, a line is drawn between the ends of sequences on either side of the missing value [16].3. Imputation: This operation involves fixing of value at the point where it is missing in the string using a specific algorithm like K-means, where the missing value will be replaced with a substituted value.
4. Smoothing: Smoothing is applied in estimating missing values in time series data for univariate data.Traditional time series analysis is commonly directed toward scalarvalued data and represented by moving average, or autoregressive-moving average models to smooth them into one continuous curve.
Incomplete data sets have been reported as an obstacle in the data clustering phase [10], as missing values can affect the creation of the sorting keys.Besides, the presence of missing values leads to failure in the matching process due to two factors.Firstly, the rate of missing values in the records to match.If the selected attributes contain a high rate of missing values, a high rate of false-negative and false-positive will be produced.Secondly, the type of attributes that form comparative strings "Tokens".For example, if the "Token" is created from the attributes whose value repetitiveness is high (such as "city" or "district"), the record pair will be identified as duplicates even though they are not.
Several duplicate detection methods highlight the issue of compiling and partially comparing complete records.The problem of incomplete data that can be found in data sets takes several patterns (see [35,36]).Missing value is a type of data completeness issue that affects duplicate detection.Several methods as mentioned earlier are proposed to deal with the problem of missing values in data sets, such as through ignoring records with missing entries, manually imputing missing data, and using the expectation-maximization (EM) algorithm.In cases where there are a large number of missing entries, or the data sets contain a large number of attributes where the probability of missing at least one entry is high, the first two methods are not practical.Moreover, it has been reported that a method such as EM is impractical to estimate missing data within a large number of attributes (i.e., the feature set is large), being computationally difficult as well as the algorithm convergence rate (EM) will be very slow if the missing rate is high (a large portion of the data is missing) [32,[37][38][39].
Tamilselvi and Saravanan presented a model for duplicate detection (and elimination) caused by errors and missing values [23].The proposed model is based on several steps: 1. Use an attribute selection algorithm for the best attribute for duplicate identification; 2. Convert the selected attribute value to a token; 3. Partition data by a clustering algorithm; 4. Compute match rate by using similarity measures; and 5. Remove duplicates and merge.
In this model, the completeness criterion for the selected attributes relies on token matching.
Van Gennip et al. proposed a method for extending the TF-IDF soft method to address the two common issues in the duplicate detection method namely, the sparsity (that occurs due to missing entries) and having more than one instances for the same object [32].The proposed method is based on: (1) calculating similarity scores between records, (2) grouping records together into independent groups, and (3) comparing different methods to generate similarity scores between records (plus a different set of string matching, frequency-inverse methods, and ngram techniques).They also suggested the possibility of replacing the method of dealing with missing values in both TF-IDF and Jaro-Winkler by performing a data imputation with a potential candidate.
Nevertheless, duplicate detection methods still need further refinement to overcome obstacles that prevent the detection of records that are likely to be similar.Duplicate detection methods are found to be limited in handling incomplete data sets in the following aspects: 1. Lack of an optimal mechanism to avoid the problem of missing values during the attribute selection using appropriate criteria for creating the sorting keys.2. No appropriate compensation mechanism for selected attribute missing values.3. Improvement of duplicate detection time and precision matching.4. Lack of intuitive method for using duplicate detecting tools that require less user's experience and expertise.

Duplicates detection within the incomplete data sets (DDID)
In this paper, Duplicate Detection within the Incomplete Data sets (DDID) was proposed to address the limitation of existing works.DDID was inspired by a framework of "DuDe" for duplicate detection toolkit (See [40]).Several stages of duplicate detection were extended in DDID to address the problem of missing values that affected the duplicate detection processes.These stages involved automatic attribute selection by using criteria (uniqueness and completeness) that would ensure the quality of the selected attributes.A dynamic mechanism was adopted to create the sorting key from the selected attributes for each record in the data set for the clustering stage.Finally, comparative strings were created for the matching stage.The attribute selection phase was the cornerstone for detecting duplicates within incomplete data sets of the DDID method.In this stage, the "appropriate" attributes were defined for generating the sort keys for the clustering stage and the matching strings to calculate the similarity.The attribute selection process depended on the rate of duplicate values and missing values in each attribute as well as the threshold value of the acceptance rate.The attribute selection process underwent three main sequential stages as shown in Algorithm 1.The goal of the attribute selection algorithm was to reduce the time spent by data specialists in identifying appropriate attributes for detecting duplicates and increasing the speed in the later stages.The goals were achieved by avoiding selecting attributes that increased the number of unnecessary comparisons (because of the missing values or the high rate of duplicates values in it).The algorithm began by defining the value of the threshold T, where 0 > T > 1 .The higher the value of T the greater the acceptability of the amount of repetition and the missing values in the attribute.Good selection of the threshold value is essential because choosing a very low threshold value affects the creation of the sort key, which, in turn, affects the detection of duplicates in the data set.
The uniqueness factor in the attribute selection is a decisive factor in limiting unnecessary comparisons due to the high frequency of some attribute values such as (gender) whose values are within a specified range, which may lead to a high rate of false-positive.The uniqueness function (UF) counts the number of duplicates in each attribute by using a HashSet class.Attributes are arranged according to the number of duplicates in ascending order.In DDID, the sort keys and matching strings were created from three attributes, hence, the completeness function (CF) calculated the numbers of missing values in the first three (low repeated) attributes after excluding the primary key attributes.Figure 1 shows the result of applying the uniqueness and completeness criteria to the restaurant data set.Nevertheless, as there is a possibility for the presence of missing values in the selected attributes that eventually will cause a clustering problem, the missing value problem needs to be resolved first.Therefore, the use of an appropriate compensation method for missing values is necessary to improve the duplicate detection.

Missing values compensation
To a large extent, the compensation method such as Hot Deck is generally acceptable to deal with missing values, especially when the record has more than one occurrence in the data set.In the Hot Deck imputation, missing values are imputed from other records in the database that share attributes related to the incomplete attribute [41].
As indicated by the Hot Deck strategy, the missing value for the record is compensated by searching for the missing value in the complete records that are most similar to the record that consists of missing values [42,43].This technique depends on the correlation of influencing values where matching of more specific attributes is used.Hence, missing values are imputed from other records in the database that share attributes related to the incomplete records.
In DDID, the Hot Deck imputation method was used to compensate the missing values in the high-rank attributes that were specified in the attribute selection stage to avoid any defect in the creation phase of the dynamic sorting key and comparative strings.Listing 1 shows the method used to compensate for the missing values in the high-rank attributes.
The similarity of the selected attributes for the two records during the compensation process was a condition for the compensation procedure.To illustrate this, a simplified case for the data set is as shown in Fig. 2. The selected attributes were name, addr and class, where record 3 and 4 were regarded as similar.Therefore, the missing name value in record 3, was compensated by the value 'art's deli' taken from the name value in record 4; as records 207 and 208 were similar, the addr value which was missing from record 208 was compensated by addr value from record 207.Details on the similarity algorithm used are presented in the Experimental Setup section.

Restaurant data set
This data set was originally taken from Fodor's and Zagat's restaurant guides.This data set consisted of attributes namely name, address, city, phone, type, and class.The data set comprised 866 records of which 13% of them were duplicates.

CD data set
This data set was bigger than the Restaurant data set as it contained 9763 records and 107 attributes.CD-related attributes such as artist, title, category, genre, cdextra, year, and track were kept in this data set.Only 3% of the records were duplicates.
These two data sets underwent a series of changes as reported in Draisbach and Naumann (see [40]) in addition to the changes mentioned in [15].The unique identifiers that were added by the authors had been removed so that the creation of the dynamic sorting keys was not affected by them, as these identifiers did not belong to the actual data sets.

MusicBrainz
These data sets were generated using the data pollution tool called as DaPo [54].The first data set of MusicBrainz contains 193,750 tuples while the second data set contains 800,000 tuples extracted from 1,937,500 tuples, where each tuple described a certain audio recording.These data sets consisted of 12 attributes such as TID, CID, CTID, SourceID, id, number, title, length, artist, album, year, and language.

Evaluation of DDID
To evaluate DDID, an experimental approach was adopted.We hypothesized that even though under the constraint of incomplete data set, DDID was capable to accurately detect duplicate records.DuDe, an earlier duplicate detection method, was used as a benchmark to evaluate DDID's performance.In addition to accuracy, the speed of detection was also observed.The following measures were used in the experiment: 1. Duplicates detection accuracy: To evaluate the accuracy of DDID in detecting duplicates within incomplete data sets, the measure of Draisbach and Naumann's (2010) study (refer [40]) was used.The creation of a statistical component was used to read the file containing the gold standard.Duplicate pairs from the transitive closure and non-duplicate pairs from the comparison stage were added to the statistical component and compared with the gold standard.The statistical component calculated the number of true-negatives and false-negatives of the actual classified pairs.The statistical component needed non-duplicate pairs to obtain an additional measure for the quality of the comparison.The accuracy of DuDe and DDID performance was determined by calculating the F-Score that combined recall and precision through the harmonic mean.Thus, it is a measure of the accuracy of the experiment [22,55].The Recall, Precision, and F-Score were computed by using certain formulas: (1) The formula construct descriptions are as follows: • False positive: Represents the number of candidate pairs, wrongly declared as duplicate records.• False negative: Represents the number of candidate pairs, wrongly declared as non-duplicate records.• True positive: Represents the number of candidate pairs, correctly declared as duplicate records.• True negative: Represents the number of candidate pairs, correctly declared as non-duplicate records.
2. Statistical analysis: was used to determine whether the performance of DDID was statistically significant from DuDe.

Experimental setup
To experiment with duplicate detection within an incomplete data set, an arbitrary pattern had been used to hypothetically add missing values randomly within the data sets under study.4% missing values were added in the key attributes of the Restaurant data set whereas 1.5% missing values were added in the CD data set.For the MusicBrainz data sets, the missing values were not injected due to the high percentage of missing values (blank, null, and unknown) from the source.Table 1 shows the rate of missing values (in percentage) in the key attributes for two MusicBrainz data sets.The data sets were stored in CSV format.The experiment was conducted with Java 8.2 installed on Windows 10 platform using a Dell laptop with 4GB RAM, Core i3 processor, 500GB HDD. Figure 3 illustrates the steps adopted in the experiment.Once the data sets were ready, a Java class that had been created to read incomplete data sets was run.For DDID, attribute selection routines, known as uniqueness and completeness functions, were used to define high-ranked attributes before dynamic sort keys could be generated.Next, the missing values were compensated by the compensation algorithm (using the Hot Deck imputation method).In addition, the missing values compensation was only performed by DDID, instead of DuDe.However, the steps afterward were the same for both DDID and DuDe.
The sorted neighborhood algorithm (see [56]) with a window size of 20 records was used to sort the keys for Restaurant data set with the size of 50 records for CD data set and 150 records for MusicBrainz data sets.Levenshtein Distance similarity algorithm (see [57]) with a similarity threshold set to 0.9 was used to measure the similarity between the comparative strings in DDID and the chosen attributes in DuDe.The major difference between DuDe and DDID was in the sort keys generation.DuDe depended on the attribute chosen by the users as the value for the sort keys, such as the name attribute in the Restaurant data set and the artist attribute for both CD and MusicBrainz data sets.In contrast, DDID was based on the fulfillment of criteria for the uniqueness and completeness functions in the selected attributes to dynamically generate the sort keys and also the comparative/matching strings.The sort keys in DDID combined a portion of the selected attribute values (e.g., the first three elements of the value of each selected attribute) together into a single string, where the same applied to comparative/matching strings with a certain length (e.g., the first eight elements of the value of each selected attribute as used with CD data set).
The results of duplicate pairs from the transitive closure and non-duplicate pairs from the comparison stage became the input for the statistical component generation.Key outputs such as the detection runtime, number of generated record pairs, and number of classified duplicates were calculated.Error types in duplicate detection (false-positives, false-negatives, true-positives, and true-negatives) were also measured where the gold data was used as reference.The final key performance indicators (KPIs), namely Precision, Recall, F-measure, and Accuracy were calculated by using the statistical outputs.The statistical outputs were stored in CSV, whereas the classified duplicates were kept in a JSON file.

Key performance results
Table 2 shows the results of the key outputs yielded in the experiment from the implementation of DuDe and DDID.Duplicate detection by both methods involved 16,245 candidate pairs for the Restaurant data set and 477,162 candidate pairs for the CD data set and 28857575 candidate pairs for MusicBrainz (A) data set and 119188974 for Music-Brainz (B) data set.
Both methods showed the same number of declared duplicates for the Restaurant data set, however, for the bigger data sets which are CD, MusicBrainz (A), and MusicBrainz(B), DuDe declared more duplicates than DDID.For all data sets, DDID showed a better performance than DuDe by producing fewer errors.
Figure 4 shows the results for the key performance indicators namely the Recall, F-score, and Precision by both methods.Overall, the scores of DDID were higher than DuDe in all categories.
It is important to point out that the procedures used in DDID did not only increase the efficiency of the duplicate detection within incomplete data sets.Therefore, to clarify the behavior of DDID in dealing with duplicates within the complete data sets, both methods were tested on the completed restaurant data set before adding the missing values.Table 3 showed that DDID performed well in detecting duplicates within the complete data set as compared with DuDe.
In the experiment, the value of the elapsed time for duplicate detection was represented by milliseconds for both DuDe and DDID.A statistical analysis was conducted using the SPSS to determine the significance of performance demonstrated by DDID against DuDe.As shown in Fig. 5, since the result of the data distribution for the elapsed time for both DDID and DuDe was normally distributed, a parametric test, the t-test was used.Hypotheses used in this test were as follows: • Null hypothesis, H 0 : The mean elapsed time for both DuDe and DDID is the same.
• Alternative hypothesis, H a : The mean elapsed time for both DuDe and DDID is not the same.As P < 0.05 , the null hypothesis was rejected and the alternative hypothesis was accepted.Thus, there was a significant difference in terms of duplicate detection speed between DuDe and DDID for the data sets.
Moreover, the elapsed time factor was directly related to the following:  1.Both methods used sorted neighborhood algorithm, where the bigger size of window, the larger number of comparisons, hence, increasing duplication detection elapsed time.2. The size of the data within the attributes of the data set: the larger the data size, the greater the number of comparisons, and since the duplicate detection model or DDID depended on specific parts of the values of the selected attributes to generate matching strings, the elapsed time would be lower.3. The presence of a high rate of missing data, as in the case of the MusicBrainz data sets, led to an increase in the false-positive, thus, increasing the elapsed time.In DDID, this problem was reduced by using the compensation algorithm (Hot Deck).
In addition, the sort keys and matching strings were created from multiple attributes, hence, reducing the possibility of missing value presence.
From the results, the Hot Deck method used in the DDID method can improve the performance of elapsed time and accuracy in duplicate detection.More importantly, the improvements made by DDID are statistically significant.At the same time, the percentage of missing values negatively affects the accuracy of duplicate detection if it forms a large fraction of the key attribute values.

Conclusion
In conclusion, Duplicate Detection within the Incomplete Data sets (DDID) method has been proposed to encounter missing value problem in duplicate detection.Attribute selection mechanism is introduced in DDID where the uniqueness and completeness criteria are fundamental elements for sorting key creation and string comparison.To reduce the impact of missing values on the selected attributes of duplicate records, a compensation method has been included using the Hot Deck algorithm to compensate for missing values in the selected attributes.Then, the accuracy and speed of duplicate detection of DDID have been evaluated against DuDe which is used only with complete data sets, instead of data sets with missing values.The results show that DDID behaves better than DuDe in all aspects under study.In addition, the Hot Deck compensation method has triggered DDID to achieve higher number of declared duplicates and fewer errors as compared to DuDe.The results of the statistical test show that there is a significant improvement made by DDID in duplicate detection speed even though the data sets are incomplete.

Fig. 1
Fig. 1 Implementing the uniqueness and completeness functions on a restaurant dataset

Fig. 2
Fig. 2 An example of missing values compensation

Table 1
Missing values rate in MusicBrainz data sets

Table 2
Experiment results

Table 4
shows the results of the t-test on the elapsed time of DuDe and DDID.In the t-test results, DuDe had a larger mean than DDID with M = 2317.90ms ( SD = 184.628),M = 11945.07ms ( SD = 574.791),M = 1083948.73ms ( SD =

Table 3
Implementation result on the complete restaurant data set

Table 4
Independent samples T-test for elapsed time