Griffeth RW, Hom PW, Gaertner S. A meta-analysis of antecedents and correlates of employee turnover: update, moderator tests, and research implications for the next millennium. J Manag. 2000;26(3):463–88.
Shilane P, Chitloor R, Jonnala UK. 99 deduplication problems. In: 8th USENIX workshop on hot topics in storage and file systems (HotStorage 16), USENIX Association, Denver, CO. 2016. p. 1–5.
Xia W, Jiang H, Feng D, Douglis F, Shilane P, Hua Y, Fu M, Zhang Y, Zhou Y. A comprehensive study of the past, present, and future of data deduplication. Proc IEEE. 2016;104(9):1681–710.
Chernov I, Ivashko E, Rumiantsev A, Ponomarev V, Shabaev A. Survey on deduplication techniques in flash-based storage. In: 2018 22nd conference of open innovations association (FRUCT). IEEE, Jyvaskyla, Finland. 2018.
Xu L, Pavlo A, Sengupta S, Ganger GR. Online deduplication for databases. In: proceedings of the 2017 ACM international conference on management of data. ACM, Chicago Illinois USA. 2017; p. 1355–68.
Wandhekar V. Validation of deduplication in data using similarity measure. Int J Comput Appl. 2015;116(21):18–22.
Menestrina D, Whang SE, Garcia-Molina H. Evaluating entity resolution results. In: proceedings of the VLDB endowment. VLDB Endowment, Singapore. 2010;3:208–19.
Panse F. Duplicate detection in probabilistic relational databases. University of Hamburg, PhD thesis. 2015.
Umathe VH, Chaudhary G. A review on incomplete data and clustering. Int J Comput Sci Inf Technol. 2015;6(2):1225–7.
Wang S, Li M, Hu N, Zhu E, Hu J, Liu X, Yin J. K-means clustering with incomplete data. IEEE Access. 2019;7:69162–71.
Subramaniyaswamy V, Pandian C. A complete survey of duplicate record detection using data mining techniques. Inf Technol J. 2012;11(8):941–5.
Sadinle M. Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann Appl Stat. 2014;8(4):2404–34.
Chen Q, Zobel J, Zhang X, Verspoor K. Supervised learning for detection of duplicates in genomic sequence databases. PLoS ONE. 2016;11(8):1–20.
Huang Y, Chiang F. Refining duplicate detection for improved data quality. In: TDDL/MDQual/Futurity@TPDL. 2017.
Ali A, Emran NA, Asmai SA, Thabet A. Duplicates detection within incomplete data sets using blocking and dynamic sorting key methods. Int J Adv Comput Sci Appl. 2018;9(9).
Wubetie HT. Missing data management and statistical measurement of socio-economic status: application of big data. J Big Data. 2017;4(1):47.
Lazar A, Jin L, Spurlock CA, Wu K, Alex S. Data quality challenges with missing values and mixed types in joint sequence analysis. In: 2017 IEEE international conference on big data (Big Data). Boston, MA, USA: IEEE. 2017; p. 2620–7.
Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.
Monge AE, Elkan CP. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: DMKD; 1997.
Bilenko M, Mooney RJ. Learning to combine trained distance metrics for duplicate detection in databases. Technical report, Department of Computer Sciences University of Texas at Austin. 2002.
Chen L, Tang C, Yang J, Gao Y. A multilevel and domain-independent duplicate detection model for scientific database. In: Tang C, Yang J, Chen L, Gao Y, editors. Web-Age Information Management. Berlin: Springer; 2010. p. 729–41.
Naumann F, Herschel M. An introduction to duplicate detection. Synth Lect Data Manag. 2010;2(1):1–87.
Tamilselvi J, Saravanan V. Detection and elimination of duplicate data using token-based method for a data warehouse: a clustering based approach. Int J Dyn Fluids. 2009;5(2):145–64.
Köpcke H, Rahm E. Frameworks for entity matching: a comparison. Data Knowl Eng. 2010;69(2):197–210.
Alrehamy H, Walker C. SemLinker: automating big data integration for casual users. J Big Data. 2018;5(1):14.
Konstantinou N, Abel E, Bellomarini L, Bogatu A, Civili C, Irfanie E, Koehler M, Mazilu L, Sallinger E, Fernandes AAA, Gottlob G, Keane JA, Paton NW. VADA: an architecture for end user informed data preparation. J Big Data. 2019;6(74):1–32.
Haque S, Mengersen K, Stern S. Assessing the accuracy of record linkages with Markov chain based Monte Carlo simulation approach. J Big Data. 2021;8(8):1–25.
Lehti P, Fankhauser P. Unsupervised duplicate detection using sample non-duplicates. In: Spaccapietra S, editor. Journal on data semantics VII. Berlin, Heidelberg: Springer; 2006. p. 136–64.
Bronselaer A, Van Britsom D, De Tré G. Propagation of data fusion. IEEE Trans Knowl Data Eng. 2015;27(5):1330–43.
Bharathi B, Reddy CS. Duplicate record deletion in relational database management systems. Int J Sci Eng Res. 2017;8(5):266–71.
Babu SA. Duplicate record detection and replacement within a relational database. Adv Comput Sci Technol. 2017;10(6):1893–901.
van Gennip Y, Hunter B, Ma A, Moyer D, de Vera R, Bertozzi LA. Unsupervised record matching with noisy and incomplete data. Int J Data Sci Anal. 2018;6(2):1–21.
Sitaram D, Dalwani A, Narang A, Das M, Auradkar P. A measure of similarity of time series containing missing data using the Mahalanobis distance. In: 2015 second international conference on advances in computing and communication engineering (ICACCE). Dehradun, India: IEEE. 2015; p. 622–7.
Abdallah L, Shimshoni I. A distance function for data with missing values and its application. Int J Comput Sci Eng. 2013; p. 7.
Emran NA. Data completeness measures. In: Abraham A, Muda AK, Choo Y-H, editors. Pattern analysis. Intelligent security and the internet of things. Cham: Springer International Publishing; 2015. p. 117–30.
Emran NA, Embury SM, Missier P. Model-driven component generation for families of completeness. In: QDB/MUD, CTIT workshop proceedings series, Auckland, New Zealand. 2008; p. 123–32.
Horton NJ, Kleinman KP. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat. 2007;61(1):79–90.
Pigott TD. A review of methods for missing data. Educ Res Eval. 2001;7(4):353–83.
Ng SK, Krishnan T, McLachlan GJ. The EM algorithm. In: Gentle JE, Härdle WK, Mori Y, editors. Handbook of computational statistics: concepts and methods. Berlin: Springer; 2004. p. 137–68.
Draisbach U, Naumann F. DuDe: the duplicate detection toolkit. In: proceedings of the international workshop on quality in databases (QDB), Singapore. 2010.
Ellis B. A consolidated macro for iterative hot deck imputation. Paper presented at the NorthEast SAS Users Group 2007. 2007.
Song Q, Shepperd M, Chen X, Liu J. Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation. J Syst Softw. 2008;81(12):2361–70.
Sim J, Lee JS, Kwon O. Missing values and optimal selection of an imputation method and classification algorithm to improve the accuracy of ubiquitous computing applications. Mathematical Problems in Engineering. 2015. p. 1–14.
Bilenko M, Mooney RJ. On evaluation and training-set construction for duplicate detection. In: Proceedings of the KDD-2003 workshop on data cleaning, record linkage, and object consolidation. ACM, Washington, DC. 2003; p. 7–12.
Ong S, Pei A. A comparative study of record matching algorithms. PhD thesis, University of Edinburgh, Scotland. 2008.
Ektefa M, Marzanah AJ, Sidi F, Memar S, Ibrahim H, Ramali A. A threshold-based similarity measure for duplicate detection. In: IEEE Conference on open systems, IEEE, Langkawi, Malaysia. 2011; p. 37–41.
Daggupati B. Unsupervised duplicate detection (UDD) of query results from multiple web databases. PhD thesis, California State University Channel Islands. 2011.
Leitao L, Calado P, Herschel M. Efficient and effective duplicate detection in hierarchical data. IEEE Trans Knowl Data Eng. 2013;25(5):1028–41.
Skandar A, Rehman M, Anjum M. An efficient duplication record detection algorithm for data cleansing. Int J Comput Appl. 2015;127(6):28–37.
Bo C, Wang K, Fox JJ, Skadron K. Entity resolution acceleration using the automata processor. In: proceedings—2016 IEEE international conference on big data, Big Data. 2016; p. 311–8.
Priyanka M, Baby A. A survey on various duplicate detection methods. Int J Comput Sci Inf Technol. 2017;8(1):7–9.
Meshram MT. Duplicate detection with map reduce and deletion procedure. Int J Comput Trends Technol. 2017;48(2):51–3.
Zieger T. Self-adaptive data quality automating duplicate detection. PhD thesis, Potsdam. 2018.
Hildebrandt K, Panse F, Wilcke N, Ritter N. Large-scale data pollution with apache spark. IEEE Trans Big Data. 2020;6(2):396–411.
Yan S, Lee D, Kan M-Y, Giles LC. Adaptive sorted neighborhood methods for efficient record linkage. In: proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries. ACM, Vancouver BC Canada. 2007; p. 185–94.
Hernández MA, Stolfo SJ. The merge/purge problem for large databases. ACM SIGMOD Rec. 1995;24:127–38.
Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady. 1966;10(8).