Privacy preserved incremental record linkage

Khan, Shahidul Islam; Khan, Abir Bin Ayub; Hoque, Abu Sayed Md Latiful

doi:10.1186/s40537-022-00655-7

Research
Open access
Published: 14 November 2022

Privacy preserved incremental record linkage

Shahidul Islam Khan^1,2,
Abir Bin Ayub Khan¹ &
Abu Sayed Md Latiful Hoque²

Journal of Big Data volume 9, Article number: 105 (2022) Cite this article

2342 Accesses
3 Citations
Metrics details

Abstract

Using an incremental approach to solve the record linkage problem is a relatively new research area. In incremental record linkage, every inserted record is compared with some existing clusters of records based on its blocking key value. Then, considering similarity, either the record will be put into an existing cluster, or a new cluster will be created for it. Although few papers have presented their solutions for incremental record linkage targeting the linkage quality or efficiency, privacy issue regarding the approach has not yet been discussed. Privacy is a major concern when record linkage is performed for sensitive data, e.g., health records, financial records, etc. In this regard, we have come up with a novel concept privacy-preserving incremental record linkage (PPiRL) which encapsulates privacy-preserving techniques with an incremental record linkage approach. In this chapter, we have proposed an end-to-end framework as our solution for PPiRL. For preserving privacy, we have used two types of privacy techniques namely phonetic encoding and generalization. We have used a recently developed phonetic algorithm “nameGist” to handle text-based features. For generalization, we have used the K-anonymization algorithm for numeric and categorical features. For handling incremental updates and internal linkage, we have used the Naive incremental clustering approach using Hierarchical Agglomerative clustering as the base clustering algorithm. We have performed various experiments to test the privacy and linkage quality of PPiRL. We have compared our work with the existing incremental record linkage framework and also with existing privacy-preserved record linkage techniques. It is apparent from our results that other than a small trade-off in linkage quality, our framework works better as a combined package of privacy and linkage solutions that any existing frameworks do not yet provide.

Introduction

Nowadays, fast-growing datasets that contain hundreds of millions of records are being collected, stored, processed, analyzed, and mined. To enable an in-depth analysis of such large datasets, information from multiple data sources is often required to be integrated. For getting maximum insight from integrated data (e.g., correlations among diseases in the case of a medical dataset), record linkage is essential. Record linkage is the process of identifying record pairs from different information systems that belong to the same real-world entity, i.e., a customer, or a patient. Given two repositories of records, the record linkage process consists of determining all pairs that are similar to each other. Record linkage faces two challenges on the edge of big data. First, the high velocity of data updates swiftly makes previous linkage results extinct. Second, a massive volume of data requires a long time for applying record linkage in the traditional (batch linkage) approach. These two challenges require an incremental solution so that when data updates appear, we can swiftly update linkage results [18].

Usually, distinct identifiers, e.g., primary keys, are not always present in the databases that need to be linked. This makes record linkage a problematic task. So, to perform linkage, the common attributes of datasets are used in many cases. These include the name, birth date, address, and other personal details of an entity. Currently, maintaining privacy and confidentiality are significant challenges for record linkage. During the linking of databases across organizations using personal information, careful protection of the privacy of this information is a must. The process of discovering records of similar individuals from separate databases without disclosing identifying attributes of these individuals is known as ‘privacy-preserved linkage of records,’ ‘linkage of blind data,’ or the ‘private linkage of records’ problem [41].

Although few papers such as [8, 18, 29, 32, 38, 44, 45] have presented their solutions for incremental record linkage targeting the linkage quality or efficiency, the privacy issue of this approach is yet to discuss. In this regard, we have come up with a new idea called “privacy-preserving incremental record linkage” or in short “PPiRL,” which encapsulates privacy-preserving techniques with an incremental approach to record linkage problems. We have the following key contributions to this research. We propose the concept of Privacy Preserved Incremental Record Linkage (PPiRL) for the first time and derived the required mathematical model for it. We also develop a PPiRL framework and finally implemented it. We test the performance of the PPiRL framework and compare it with the existing state-of-the-art techniques.

Importance of privacy preservation

Information privacy of an individual or organization deals with the ability to determine what data in a computer system to be shared with others. It is considered an important aspect of information sharing wherever personally identifiable information is accumulated in any form. Depending on the category of information, e.g., health, finance, etc. sometimes it is more important to maintain privacy than others.

Medical databases, in particular, involve highly private information. Detailed information about patients might become obtainable when databases are linked, such as some people having certain chronic diseases and also having financial problems. Security weaknesses of a healthcare information system can result in privacy losses as Protected Health Information (PHI) of patients has high sell values in the underground markets. It is an increasing target for hackers to break down the security of health systems and expose the private data of patients for money.

Nobody likes his medical records to be revealed to unauthorized persons. Hackers or eavesdroppers generally aim to exploit personally identifiable information to do business with the extracted information or extort a famous person or celebrity. A social security number (SSN) is sold for twenty-five cents, and a credit card number can be priced at one dollar in the United States underground market. Surprisingly, a medical record’s price in the same market ranges from ten dollars up to one thousand dollars. These sold records normally have a date of birth, the name of the patient, numbers of health policy, codes of diagnosis, and other significant ID numbers. Eavesdropper uses this data for creating duplicated IDs to buy health-related products, insurance, drugs, etc. From the year of 2014, the extent of hacking over healthcare servers has increased by a significant margin. The attackers’ motivation is to get huge PHI in a single successful hack. Analyzing the data provided by the U.S.Department of Health and Human Services, it is found that hackers are increasingly targeting healthcare servers which is very alarming to the health information systems using record linkage [23, 25, 27, 36].

On the other hand, the banking industry always looks for new methods which help to create more precise customer profiles. For example, exploitation of internal and external data, even aggregate information about customers’ purchasing habits to other organizations using record linkage. These methods help enhance customer credibility. However, there are potential risks to violate financial ethics which can occur as data abuse. So, in the case of financial records, privacy preservation is vital [17, 30].

Application areas of PPiRL

Privacy needs careful consideration when data from several organizations are linked. Many fields like public health research, health surveillance, census, and centralized data warehouses are in constant need of privacy-preserved linkage as there are many parties involved in the linkage process.

In public health research, researchers often set on to investigate the types of injuries caused by car accidents, intending to uncover the correlation between types of accidents and the resulting injuries [41]. This kind of research can have a significant influence on potentially lifesaving changes in policymaking. In this scenario, several parties such as hospitals, the police, as well as public and private health insurers are involved.

Financial organizations such as online marketplaces, e-commerce sites, and banks require to develop a complete and up-to-date profile of their customers by linking data from different organizations. Here also several financial institutions such as banks and e-commerce sites are involved [17].

In health surveillance, early outbreak detection systems to prevent infectious diseases require data from various sources to be gathered and linked, such as human health data, animal health data, and consumed drugs data. Privacy is a prime concern when such data are linked and stored at a central location [27, 40].

Our contributions

We have three major contributions which are apparent throughout the paper.

1.
To the best of our knowledge, We are the first to recognize privacy-preserving incremental record linkage as a new field of research. Recognition of this field paves the way for solving the problems of record linkage, integration, and data mining relating to the volume and velocity of data along with privacy issues.
2.
We have proposed a novel end-to-end framework that encompasses both privacy and linkage of data in an incremental approach. We have named it privacy-preserving incremental record linkage (PPiRL). We have also provided the necessary definition and mathematical model for PPiRL.
3.
We have implemented our PPiRL framework and provide various comparisons of our framework with traditional privacy-preserving record linkage (PPRL) techniques and incremental record linkage (IRL) techniques.

Organization of the paper

We have organized the paper as follows. In the “Literature review” section, we provided a review of the important literature related to record linkage, privacy-preserving record linkage, and incremental record linkage. We have provided some background and formulated the problem of privacy-preserving incremental record linkage in the “Background knowledge and problem formulation” section. Our main contribution, the PPiRL framework, is explained next. Experimental results, privacy evaluation, and comparisons are presented in “Experimental Results” section. We have presented a short discussion on the result in the next section. Finally, the “Summary” section concludes this paper.

Literature review

Record linkage

Record linkage or entity resolution refers to the process of identifying and aggregating records from one or more datasets, which represent the same real-world entities. Recently the world has encountered an eruption in the volume and velocity of data that is being accumulated by individuals as well as organizations. All these data are either generated by the people or about the people. To achieve good results in data mining, the quality of the data is essential. A serious obstacle to proper data analysis is the noise in the collected data [3, 28, 35]. Low-quality data which contain erroneous, missing data, or out-of-date values generate low accuracy outcomes after data analysis [9]. In order to improve the quality of data and do complex data analysis and mining, a solution is to integrate data from different sources. This integration of data paves the way to identify conflicting data values, enrich data, or impute missing values [22]. A traditional approach to getting the linkage of records is to find the similarity among record pairs. After similarity calculation, supervised or unsupervised algorithms can be applied to extract the linkage result.

Record linkage [16, 22], schema matching [6], and data fusion [6, 7] are the three main tasks in data integration. Among them, record linkage is aimed at identifying all records that refer to similar real-world entities in several databases. It can also be applied to detect identical records in a single database [15, 33]. For record linkage, three significant complications can be recognized. Firstly, linkage quality plays a crucial role in the linkage of records. As the fact that real-world data are ‘dirty’ is responsible for the loss of quality of linkage [21]. Only the exact matching of personal identifying features is not able to give us the desired output. In this case, we need approximate matching besides accurate matching to achieve good accuracy in linkage quality [9, 14]. To decrease the number of potential comparisons required between scalability is essential. The usage of expensive similarity comparison methods creates a performance bottleneck [4, 12]. This challenge can be overcome by using proper indexing techniques [11].

Figure 1 illustrates the outline of the general record linkage process comprising several steps. Data preprocessing which includes data cleaning and standardization is the first step in this process. It is a crucial step because real-world data contain inconsistent, noisy data [3, 35]. Indexing [11] is the second step in the process. In the comparison step, record pairs are compared elaborately with the help of similarity functions [10]. The classification step classifies the record pairs using a decision model thus generating matches, non-matches, or possible matches [19]. Finally, evaluation measures of different types are deployed to measure the complexity [11] and quality of resultant linkage [12].

Privacy-preserving record linkage (PPRL)

Definition 1

Privacy-preserving record linkage (PPRL): Assume P$_{1}$, . . . , P$_{m}$ are the m owners of the databases D$_{1}$, . . . , D$_{m}$, respectively. They wish to determine which of their records R$_{1}^{i} \in D_{1}$, R$_{2}^{j} \in D_{2}$,..., R$_{m}^{k} \in D_{m}$ match based on their demographic data according to a decision model C(R$_{1}^{i}$, R$_{2}^{j}$,..., R$_{m}^{k}$ ) that classifies records of different datasets into one of the two classes: M (Match), and N(Non-match). P$_{1}$, . . . , P$_{m}$ do not wish to reveal their actual records R$_{1}^{i}$, R$_{2}^{j}$,..., R$_{m}^{k}$ with any other party. They however are prepared to disclose to each other, or to an external party, the actual values of some selected attributes of the records that are in class M to allow analysis [40]. A viable PPRL solution that can be used in real-world applications should have three properties: scalability, linkage quality, and privacy.

Figure 2 illustrates the outline of the privacy preserving record linkage process. Large databases across organizations needed to be linked. At the same time preserving the privacy of the records stored in these databases is also crucial. This necessity directs a new research area called privacy-preserving record linkage (PPRL) [13, 39, 42]. PPRL is alternatively called as privacy record linkage [1, 2, 24, 46] and blind-folded record linkage [13, 43]. Due to privacy concerns, commercial interests, or legal restrictions, it is often not allowed to exchange private or confidential data between organizations. When there arises a cross-organizational project, databases of different organizations need to link in such a way that no sensitive information is being exposed to any of the parties involved, and no outsider can eavesdrop on the data to learn anything. PPRL ensures that after the end of a linkage project, only a limited amount of information is disclosed to the exchanging parties. The disclosed information may contain, (i) the number of records that have been classified as matches, (ii) the attributes of these matched records, and (iii) a selected set of attributes from these matched records [40].

Incremental record linkage

Incremental record linkage (IRL) is the clustering process where only the newly arrived records will be compared with existing clusters. Then, based on similarity, either the new records will be put into some existing cluster(s), or a new cluster will be created for it if the new records are dissimilar to existing clusters according to some threshold value. Incremental record linkage has been studied in [44, 45]; however, they focused on evolving matching rules and discussed concisely only evolving data.

On the other hand, incremental graph clustering methods have been proposed by some researchers. Mathieu et al. [29] studied incremental correlation clustering for the following two cases: (1) one vertex is added each time, and (2) already identified clusters need to be preserved. Charikar et al. [8] studied incremental clustering when the number of clusters is predefined. Both papers focused on theoretical analysis rather than implementations. A novel incremental heuristic algorithm was presented in [38] for the Clique Partition Problem (CPP), a well-studied graph partitioning problem. The algorithm was much faster for the tested datasets comparing the batch linkage algorithm. Privacy issues were not considered in any of the above research.

An efficient approach for incremental record linkage has been proposed in [18] where the authors presented a framework using several algorithms and showed viable efficiency compared to the previous works. Nasciment et al. [32] proposed heuristic-based approaches to speed up the performance of the IRL algorithm. Both papers deal with linkage quality and efficiency. None of them considers the privacy issues for record linkage. To the best of our knowledge, our framework is the first to perform an incremental linkage that considers privacy issues.

Background knowledge and problem formulation

Some key terms related to incremental record linkage are discussed below.

Base dataset: A large collection of database records having both identifiable and non-identifiable attributes denoted by D here.

Increment: A dataset that contains records that need to be merged with the base dataset denoted by $\varDelta D$.

Batch record linkage: Here, for each increment dataset $\varDelta D$, the record linkage process is executed for the whole dataset D+$\varDelta D$. Let us assume a scenario in which our base dataset contains one million records whereas each increment contains one thousand records. In batch record linkage, we have one million as our starting point to apply to the cluster. When the first increment arrives, we have to perform clustering over one million and one thousand combined records. It is a time-consuming process and hence inefficient. The process is illustrated in Fig. 3.

The figure is divided into four boxes where an arrow indicates the direction of one box to another. Each of the boxes represents a distinct stage in the batch record linkage process.

Incremental record linkage (IRL): An incremental record linkage process preserves the clusters developed from the base dataset D and merges the records from the incremental dataset $\varDelta$D using a similarity function. The IRL process creates new clusters if some of the records of$\varDelta$D are not similar to any of the existing clusters based on a similarity function. In order to get a practical understanding of how incremental record linkage works, Fig. 4 will be helpful. Here, three boxes in the figure represent three distinct stages of the IRL process.

Definition 2

Incremental Record Linkage (IRL): Let D be a set of records and $\varDelta$D be an increment to D. Let $\rho _D$ be the clustering of records in D. Incremental record linkage clusters records in D + $\varDelta D$ based on $\rho _D$. We denote the incremental record linkage method by f and denote the results by f(D, $\varDelta D ,\rho _D$).

The aim of incremental record linkage is to improve performance significantly compared to its corresponding batch linkage algorithm especially if the increment is small [18]. Specifically, the computation of f(D, $\varDelta D ,\rho _D$) should be faster than the computation of F(D + $\varDelta D$) if |$\varDelta D|<<|D|$ holds. At the same time, incremental record linkage should achieve equivalent accuracy to its reference batch algorithm. We denote this constraint as f(D, $\varDelta D ,\rho _D) \approx F(D + \varDelta D$).

Now we formally define the problem of privacy-preserving incremental record linkage. For a set of records, privacy-preserving incremental record linkage is essentially a combination of linkage and privacy preservation. In this problem, each cluster generally contains records where privacy is ensured with the help of several privacy-preserving techniques. The records of the cluster represent a distinct real-world entity. The linkage should have both a high recall value and a high precision value.

Definition 3

Privacy-Preserving Incremental Record Linkage (PPiRL): Let D be a set of records and A is the set of attributes of D. Let ${\bar{A}}$ is the set of sensitive attributes of D and ${\bar{A}} \subset A$. $\varDelta$D is an increment to D. We denote the privacy preservation method by $\varGamma$ and denote the privacy preserved results $\varGamma (D)$ by ${\bar{D}}$, and $\varGamma (\varDelta D)$ by $\bar{\varDelta D}$ respectively. Let $\bar{\rho _D}$ be the clustering of records in ${\bar{D}}$. Privacy-preserving Incremental record linkage clusters records in ${\bar{D}}$ + $\bar{\varDelta D}$ based on $\bar{\rho _D}$. We denote the privacy-preserved incremental record linkage method by $f^\prime$, and denote the results by $f^\prime$(${\bar{D}}, {\bar{\varDelta }} D ,\bar{\rho _D}$).

Privacy preserving incremental record linkage (PPiRL) has three goals. PPiRL wants to ensure the privacy of sensitive records. it aims to improve performance significantly compared to the corresponding privacy-preserving batch clustering algorithm. Specifically, the computation of $f^\prime$(${\bar{D}}, {\bar{\varDelta }} D ,\bar{\rho _D}$) should be faster than the computation of F(${\bar{D}} + {\bar{\varDelta }} D$) if |${\bar{\varDelta }} D|<<|{\bar{D}}$| holds. On the other hand, PPiRL tries to achieve equivalent accuracy to its reference batch algorithm. We denote this constraint as $f^\prime$(${\bar{D}}, {\bar{\varDelta }} D ,\bar{\rho _D}$) $\approx$ F(${\bar{D}} + {\bar{\varDelta }} D$).

PPiRL, an end-to-end Framework

Our proposed solution is an end-to-end framework for record linkage which considers a significant reduction of time for performing record linkage along with privacy preservation without compromising the linkage quality. There are five basic steps in the framework with different functionality. Data pre-processing, privacy preservation, blocking, clustering, and evaluation are the five stages of our framework illustrated for the base dataset in Fig. 5 and for increments in Fig. 6.

Data pre-processing

Pre-processing of data helps improve the condition of data by handling errors and inconsistencies from data. Although data quality issues are found in a single dataset, quality issues become serious when data is integrated from multiple sources into a warehouse [35]. Some essential steps of data pre-processing are feature selection, data standardization, data cleaning, missing data imputation, normalization, etc. Details of data pre-processing are out of the scope of this paper.

Privacy preservation

Depending on the attributes which are more prevalent in the healthcare datasets we have selected two types of privacy techniques to be implemented in our framework. To best suit our purpose at times we have used state-of-the-art algorithms.

Phonetic encoding

The phonetic encoding algorithm groups values that have similar pronunciations. It inherently provides privacy and increases scalability as well. The procedure is illustrated in Fig. 7. One of the most sensitive attributes in healthcare datasets is the name of the patients. A leak of a patient’s name could jeopardize the patient’s privacy in a bad way. Using phonetic encoding we got the following advantages:

1.
Names will be encoded. So they can not be easily identified.
2.
Names will be generelized. That is similar pronouncing names with different spelling will produce same code.
3.
Because of generalization of names, the output code is robust against noises and spelling errors, which are common in healthcare centers specially in the developing countries [26].

We used nameSignificance algorithm for phonetic encoding as it produces better results than Soundex, Metaphone, and some other commonly used phonetic algorithms.

K-anonymization method

K-anonymization is a popular generalization algorithm. The main purpose of generalization is to help overcome the problem that lies with record linkage which is the re-identification of entities. The data generalization process generalizes data in a way that re-identifying the data to its source record is quite impossible. There are many generalization techniques. Among them, the K-anonymization method has been proven to be an effective privacy technique that can preserve the privacy of record linkage results [5]. In the K-anonymization technique, we assume that data related to a specific person is gathered in a dataset. The anonymization process is started by removing all the identifying features like SSN, explicit identifiers, etc. Even after removing the identifying features, it is possible to find a person’s data by finding a pattern in other features. To tackle that K-anonymization generalizes the feature values as much as possible with a value of K either being fixed at the beginning of getting an adaptive value as the linkage process continues. Fig. 7 explains the K-anonymization process elaborately.

In Fig. 8, we can see that common identifying feature such as name is suppressed at the very beginning. Then, other features which could be used to identify a person in a dataset are generalized by varying the value of K at a different time. In our case, we have used the adaptive value of K which allows us to select the best value of K depending on the results found.

Blocking

Before moving on to the explanation of blocking methods for our framework we should first understand the due importance of blocking as an inseparable phase of new incremental record linkage or standard record linkage procedures [4, 37]. To find the best match of records record pair comparison is needed to be done on the dataset. When the possible quantity of matched records increases the number of record pair comparisons also increase rapidly. For large datasets, this approach is computationally impractical. Blocking allows us to divide the whole dataset into blocks depending on some criteria. There are many existing techniques to achieve the task of blocking.

Feature set for blocking

For our framework, we have five identifiable attributes of the patient which we have applied to our clustering algorithm. These attributes are Patient Name, Gender, Age, Contact Number, and Address. We have to select an attribute or a group of attributes as a blocking key such that it maintains a balance between the computation and communication cost. We have carried out several experiments to find the best set of features which would bring effective results. To achieve this, we took all the possible feature sets from the five already selected features for our framework, and after careful inspection of the results, we selected the one with the best outcome. In our case, the Gender-Address set of features was better than other options. So, for further experiments, we have used this set of features for blocking.

Clustering

There are many clustering algorithms available. Because of the simple approach and easy implementation process compared to existing approaches we have opted to go for agglomerative hierarchical clustering.

The Davies-Bouldin index (DB) is calculated as follows. For each cluster C, the similarities between C and all other clusters are computed, and the highest value is assigned to C as its cluster similarity. Then the DB index can be obtained by averaging all the cluster similarities. The smaller the index is, the better the clustering result is. By minimizing this index, clusters are the most distinct from each other and, therefore, achieve the best partition. The Davies-Bouldin index was originally defined for a Euclidean space; applying it to record linkage requires some adjustment for the definition of distance. We adopt the definition, described as follows. For each cluster C, the intra-cluster distance is defined as the complement of average similarity between records in the cluster; that is,

$$\begin{aligned} D(C) = 1-Avg_{r,r ^{\prime }\in C} sim(r,r ^{\prime }) \end{aligned}$$

For each pair of distinct clusters C and C’, the inter-cluster distance is defined as the complement of average similarity between records across the clusters; that is,

$$\begin{aligned} D(C, C^{\prime }) = 1-Avg_{r \in C,r ^{\prime }\in C^{\prime }} sim(r,r^{\prime }) \end{aligned}$$

The separation measure between C and C’is then defined as

$$\begin{aligned} M(C, C^{\prime }) = \frac{D(C) + D(C^{\prime }) + \alpha }{D(C,C^{\prime }) + \beta } \end{aligned}$$

Where alpha and beta are small positive numbers such that the denominator or numerator would affect the result even when the other is 0.2 for each cluster C, we define its separation measure as

$$\begin{aligned} M(C) = max_{C^{\prime }\ne C} M(C,C^{\prime }) \end{aligned}$$

DB-index is defined as the average separation measure for all clusters and we wish to minimize it.

Correlation penalty: For each pair of records in the same cluster, there is a cohesion penalty being the complement of the similarity; for each pair of records in different clusters, there is a correlation penalty being the similarity. We wish to minimize the sum of the penalties.

$$\begin{aligned} \begin{aligned} CC(L_G) = \sum _{C \in L_G, r, r^{\prime }\in C} (1-sim(r,r^{\prime })) + \\ \sum _{C,C^{\prime }\in L_G, C \ne C^{\prime }, r \in C, r^{\prime }\in C^{\prime }} sim(r,r^{\prime }) \end{aligned} \end{aligned}$$

A special case for correlation clustering is when we take binary similarities: the similarity between two records is either 0 (dissimilar) or 1 (similar).

We have the group averaged agglomerative version for our framework. Although ward’s criterion is popularly used to compute the distance between two clusters during agglomerative clustering in our case we needed something customized which would serve our purpose directly. Ward’s criterion uses the K-means squared error criterion to determine the distance. It is also interpreted as the squared Euclidean distance between the centroids of the merged clusters. However, in our clustering, we did not use centroids rather all the data objects’ average similarity was used to determine the suitable clusters. In order to achieve that we applied our similarity metric. There we examined the identifying attributes of a patient and calculated similarity with pre-assigned weights of the attributes. We chose the weights from domain knowledge, global standards, and local trends.

Impact of incremental approach in PPRL

Figures 5 and 6 provide a comparison of the impact of the incremental feature on the overall framework. Figure 5 represents the part of the framework at the initial state with a base dataset. That is the same as the traditional batch processing approach. Here the whole dataset is clustered using step 3, 4 and 5 that is based on blocking, clustering and evaluation results. But when the increments come, the framework acts differently than the batch processing framework. It does only process the increment part of the dataset and adjusts it with the previous clustering results, shown in Fig. 6. The incremental record linkage (IRL) process preserves the clusters developed from the base dataset D and merges the records from the incremental dataset $\varDelta D$ using any similarity function. Figure 4 illustrates the inner concept of the IRL Framework. Figure 9 illustrates the PPiRL framework for clustering of a base dataset as well as multiple increments.

IRL improves the linkage speed significantly compared to its corresponding batch linkage algorithm especially if the increments are small. Specifically, the computation of f(D, $\varDelta D ,\rho _D$) should be faster than the computation of F(D + $\varDelta D$) if |$\varDelta D|<<|D|$ holds. For example, if the Base dataset contains 1 Million records and increment 1 contains 10 thousand records, in the batch processing framework, it will need to process 1.01 million records to produce the correct clusters including the 1st increment. On the other hand, our proposed incremental framework will only need to process 10 thousand records instead of 1.01 million records to produce correct clusters incorporating the 1st increment. So our proposed framework can produce record linkage much faster with producing similar results as batch processing which is present in the "Result" section.

Evaluation

The outcome of the PPiRL technique needs to be evaluated from different aspects. As our main goal is to combine the privacy technique with incremental record linkage, so we need to evaluate our framework in light of privacy and clustering validation.

Linkage evaluation

Evaluation of linkage evaluates the quality of clustering results. Attaining success in different clustering applications has been acknowledged as a key task. Linkage evaluation can be implemented in two ways. External validation of clustering can be implemented when external information like the class label of a cluster is already present for the dataset. However, when this type of external validating information is not present internal validation measures could be used. For external validation, we have used F-measure to validate the outcomes of our framework. For internal validation, we have used the Davies Bouldin Index penalty and correlation penalty.

Privacy evaluation

To strengthen the privacy of our framework, it is imperative to evaluate the framework with more than one type of privacy analysis technique. Hence the implementation of frequency analysis and dictionary attack analysis was integrated with the framework. To ensure the privacy aspect of our framework we have gone for several privacy evaluation techniques. Frequency analysis, dictionary attack, and adversary model simulation proof testing are the key evaluation measures that we have taken for the framework.