Assessing the accuracy of record linkages with Markov chain based Monte Carlo simulation approach

Record linkage is the process of finding matches and linking records from different data sources so that the linked records belong to the same entity. Record linkage is increasingly used in statistical, health, government and business organisations to link administrative, survey, population census and other files, creating a more complete set of information for more comprehensive analysis. To make valid inferences using a linked file, it has become increasingly important to have effective and efficient methods for linking data from different sources. It is therefore necessary to assess the ability of a linking method to achieve high accuracy, or to compare methods with respect to accuracy. This motivates the development of a method for assessing the linking process and facilitating decisions about which linking method is likely to be more accurate for a particular linking task. This paper proposes a Markov Chain based Monte Carlo simulation approach, MaCSim, for assessing a linking method and illustrates the utility of the approach using a realistic synthetic dataset received from the Australian Bureau of Statistics, which avoids the privacy issues associated with using real personal information. A linking method applied by MaCSim is also defined. To assess the defined linking method, correct re-link proportions for each record are calculated using our developed simulation approach. The accuracy is determined for a number of simulated datasets. The analyses indicate promising performance of MaCSim in assessing the accuracy of the linkages. The computational aspects of the methodology are also investigated to assess its feasibility for practical use.


Introduction
Record linkage (Newcombe et al. 1959; Fellegi and Sunter 1969) is the process of finding matches and linking records from one or more data sources (e.g., the Census and various health registries or Centrelink datasets) such that the linked records represent the same entity. An entity might be a business, a person, or some other type of listed unit. The term record linkage came originally from the area of public health and also from epidemiological and survey applications (Winkler 1999). In record matching algorithms, records in two files are compared with one another, typically using variables such as name, address, date of birth and sex. The individual variables used for connecting records are generally called linking variables or linking fields, while a collection of linking variables together is called a linking key.
The most commonly used methods in record linkage are deterministic and probabilistic linkage methods. In a deterministic approach, two records are said to be a link if they agree on a high quality identifier (e.g. social security number, tax file number, driver license, etc.) or a combination of identifiers (e.g. first name, date of birth and street name), where quality is usually assessed in terms of precision and stability over time.
In a probabilistic method, no unique identifier is available. Record pairs from different files are compared using a set of identifying information comprising one or more linking fields. Each record pair is given a weight based on the likelihood that they are a match. This weight is determined by assessing each linking field for agreement or disagreement, assigning a weight based on this assessment, then summing these individual weights over all linking fields for that pair. This summation is based on the premise of conditional independence, which means that for a record pair the agreement on a linking field is independent of agreement on any other linking field for that pair (Fellegi and Sunter 1969). A decision rule, typically based on a cut-off value, finally determines whether the record pair is asserted to be linked, non-linked or should be considered further as a possible link. Probabilistic record linkage methods are now well accepted and widely used (Herzog et al. 2007; Winkler 2001, 2005).
In recent years, large amounts of data have been collected by organizations in the private and public sectors, as well as by researchers and individuals. Analysing these data can provide huge benefits to businesses and government organizations.
Technological advancement now makes it possible to store and process these massive databases. However, data from different sources relating to the same entity need to be linked. Moreover, data within a single source may also need to be linked, for example, if there are multiple records for an entity over time. Connecting data from different data sources can improve data quality and give better modelling structure (Newcombe et al. 1959; Wallgren 2007). To make correct inferences using a linked file, it is important to assess the accuracy of the linkages. This motivates two research challenges: to develop a method for assessing the linking process, and to find techniques to improve the linking process to achieve higher accuracy, where the overall accuracy assessment approach can be used with any method.
Perfect linkage means all records belonging to the same individual are matched and there are no links between records that belong to different individuals. However, in the absence of an error-free unique identifier, perfect linkage is very unlikely. This is because linking variables that may be suitable for identifying similar records, such as name, address and date of birth, may not uniquely identify a person; for example, names may change over time, ages may be entered incorrectly, or addresses may be displayed in different formats, all of which can result in erroneous linkage. In addition to the challenges of missing values, typographical or spelling errors and non-standardized formats of data, sometimes it is hard to identify a correct link even after clerical review. Linkage must also deal with issues of privacy and confidentiality. For example, a person may choose not to enter their age, or individuals' names may not be provided in a de-identified file made available to an analyst or manager. Finally, the linkage method must be scalable, in order to provide fast results for increasingly large datasets.
One way of measuring linkage error is by the proportion of links that are correct matches. Incorrect links create measurement error and bias the analysis (Harron et al. 2014; Chipperfield et al. 2011; Chipperfield and Chambers 2015; Chambers et al. 2009; Lahiri and Larsen 2005). Larsen and Rubin (2001) use the posterior probability of a match for estimating true match status and improve the classification of matches and non-matches through clerical review. However, clerical review can be expensive and time consuming for large databases. Moreover, even after clerical review it is not possible to be certain whether a link is actually correct or incorrect. Lahiri and Larsen (2005) do not consider 1-1 linkage, where every record from one file is linked to a distinct record in another file, and their analytic estimates of precision are poor for 1-1 probabilistic linkage (Chipperfield and Chambers 2015).
As a quality measure, Christen (2012) suggests precision, which is the proportion of links that are true matches. Winglee et al. (2005) use a simulation-based approach, Simrate, to estimate linkage quality. Their method uses the observed distribution of data in matched and non-matched pairs to generate a large simulated set of record pairs. They assign a match weight to each record pair following specified match rules, and use the weight distribution for error estimation. The simulated distribution is used to select an appropriate cut-off for estimating the error rates, but they do not explicitly consider precision. Their simulation approach does not simulate the linking process; instead, it simulates the comparison outcomes used for linkage quality measures.
Moreover, most of the work on quality measures has focused on overall file accuracy. Chipperfield and Chambers (2015) developed a parametric bootstrap method of making inferences for binary variables using a probabilistically linked file created under the 1-1 linkage constraint. They showed that using the posterior probability of a match for the estimation of a true match can produce biased results.
We take a different approach by simulating the linking process using simulated agreement matrices and measuring the corresponding accuracy. We also estimate the accuracy for individual records as well as for the overall file, which includes all records. This paper develops a Markov Chain based Monte Carlo simulation (MaCSim) approach to assess linkage accuracy. MaCSim utilizes two linked files with known true match status, which helps us to estimate the necessary parameter values for the MaCSim algorithm. We create an agreement matrix and then generate re-sampled versions of this matrix. At each iteration, we use the simulated matrix to link the two datasets using a defined linkage method, and estimate the accuracy of the link. This ultimately indicates the accuracy of the linking method that has been followed to re-link the records.
The MaCSim algorithm can be used as a stand-alone method to assess the accuracy of previously linked files. Alternatively, it can be used to evaluate or compare other linking methods. Based on the obtained accuracy results, the user can decide on a preferred method or evaluate whether it is worth linking the two files at all. The computational aspects of this methodology are investigated using a simulated dataset received from the Australian Bureau of Statistics (ABS) to assess its feasibility for practical use. The dataset contains 400,000 records corresponding to 400,000 hypothetical individuals.
The paper is organised as follows. Section 2 describes the proposed assessment method, MaCSim. A range of analyses using the method is described in Section 3. The paper concludes with a summary and discussion of future work in Section 4.

Method
The aim of the MaCSim method is to assess the linking process using a Markov Chain based Monte Carlo simulation approach. The simulation algorithm (Section 2.3) maintains internal consistency in the patterns of agreement while preserving the underlying probabilistic linking structure.
Consider a pair of linked files $X$ and $Y$, where $X$ contains $R_X$ entries and $Y$ contains $R_Y$ entries. There are $L$ linking fields in each file. We define $m_l$ to be the probability that the $l$-th linking field in both files has the same value for a matched pair of records and $u_l$ to be the probability that the $l$-th linking field values in both files are the same for a non-matched pair of records. Further, let $g_l$ be the probability that either or both of the $l$-th linking field values in any record pair are missing, regardless of whether the record pair is matched or non-matched. We assume that all missing values occur at random, and denote by $c_l$ the probability that the $l$-th linking field has a missing value in either file $X$ or file $Y$, individually. Hence, the probability that neither value is missing (from both files) is $1 - g_l = (1 - c_l)^2$. Therefore, we obtain $c_l = 1 - \sqrt{1 - g_l}$.
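The relation between the pair-level and per-file missingness probabilities can be checked with a few lines of Python. This is only a minimal sketch: the function name `per_file_missing_prob` is hypothetical, and the input value 0.19 is an illustrative number, not an estimate from the ABS data.

```python
import math


def per_file_missing_prob(g: float) -> float:
    """Per-file missingness probability c solving 1 - g = (1 - c)**2.

    g is the probability that either or both values of a linking field
    are missing in a record pair, assuming values go missing independently
    and at random in each file.
    """
    if not 0.0 <= g <= 1.0:
        raise ValueError("g must lie in [0, 1]")
    return 1.0 - math.sqrt(1.0 - g)


# Illustrative value only: a pair-level missingness of 0.19
# corresponds to a per-file missingness of about 0.1.
c = per_file_missing_prob(0.19)
recovered_g = 1.0 - (1.0 - c) ** 2  # recovers g up to rounding error
```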

Creating Agreement Matrix 𝑨
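An agreement matrix, $A$, is created from the two files to be linked, $X$ and $Y$, where $A = (A_{ijl})$, $i = 1, \dots, R_X$, $j = 1, \dots, R_Y$, $l = 1, \dots, L$, and
$$A_{ijl} = \begin{cases} 1 & \text{if the $l$-th linking field values of record $i$ in $X$ and record $j$ in $Y$ agree,} \\ -1 & \text{if the values disagree,} \\ 0 & \text{if either or both values are missing.} \end{cases}$$
We assume that $R_X \le R_Y$, and each record in file $X$ has a single true matching record in file $Y$. We also assume for simplicity of notation that $A_{iil}$ represents the agreement value of the $l$-th linking field for the true matched record pair in both files.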
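As an illustration of how such an agreement matrix could be built, the sketch below compares every record of one file with every record of the other, field by field, using the ternary code 1 (agree), -1 (disagree) and 0 (missing). The representation of records as tuples with `None` for a missing value, and the names `compare` and `agreement_matrix`, are assumptions made for this example only.

```python
from typing import List, Optional, Sequence

Record = Sequence[Optional[object]]  # one entry per linking field; None = missing


def compare(a: Optional[object], b: Optional[object]) -> int:
    """Ternary comparison of two field values: 1 agree, -1 disagree, 0 missing."""
    if a is None or b is None:
        return 0
    return 1 if a == b else -1


def agreement_matrix(file_x: Sequence[Record],
                     file_y: Sequence[Record]) -> List[List[List[int]]]:
    """Return A with A[i][j][l] the agreement code of field l for pair (i, j)."""
    return [[[compare(x[l], y[l]) for l in range(len(x))]
             for y in file_y]
            for x in file_x]


# Toy files with two linking fields; record i in X truly matches record i in Y.
toy_x = [(1, "a"), (2, None)]
toy_y = [(1, "a"), (2, "b"), (3, "a")]
A = agreement_matrix(toy_x, toy_y)
```

With this layout, `A[i][i]` holds the agreement pattern of the true matched pair, mirroring the notational convention in the text.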

Probabilistic Record Linkage
The basis of a probabilistic linkage method supposes that there are two files $X$ and $Y$ with record pairs $(i, j)$, where $i \in X$, $j \in Y$. All possible pairs of records from these two files can be divided into two disjoint sets, $M$ (for matched pairs) and $U$ (for non-matched pairs). A pair of records will be an element of the set $M$ if they are truly matched (i.e. both represent the same entity). Otherwise, it will be an element of the set $U$ (i.e. the records represent two different entities). The probabilistic method aims to classify the record pair as an element of either $M$ or $U$. It will be observed whether or not each record pair agrees on the values of the $l$-th linking variable to help decide whether they belong to set $M$ or $U$ (Fellegi and Sunter 1969).
The conditional probabilities $m_l$ and $u_l$ can be written as
$$m_l = P\{A_{ijl} = 1 \mid (i, j) \in M\}, \qquad u_l = P\{A_{ijl} = 1 \mid (i, j) \in U\}.$$
The ratio of these conditional probabilities can be used for considering the accuracy of $(i, j)$ as a link.
The estimates of $m_l$ and $u_l$ can be used to calculate the odds ratios for agreement and disagreement on the $l$-th linking variable. The agreement and disagreement weights are then defined as follows:
$$w_l^{a} = \log_2\left(\frac{m_l}{u_l}\right), \qquad w_l^{d} = \log_2\left(\frac{1 - m_l - g_l}{1 - u_l - g_l}\right),$$
where $w_l^{a}$ and $w_l^{d}$ represent the agreement and disagreement weights for the $l$-th linking variable respectively. The base of the logarithm used is immaterial, and base 2 is chosen here as it allows a comparison to information theory results (Newcombe et al. 1959; Fellegi and Sunter 1969).
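The weight calculation can be sketched as follows. The sketch assumes the missingness-adjusted disagreement ratio (1 - m - g)/(1 - u - g) that the paper uses when creating observed links, and sums field weights over a record pair under the conditional independence premise; the function names and the parameter values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def field_weights(m: float, u: float, g: float = 0.0) -> Tuple[float, float]:
    """Return (agreement weight, disagreement weight) in base 2 for one field."""
    if not (0.0 < u and 0.0 < m and 1.0 - m - g > 0.0 and 1.0 - u - g > 0.0):
        raise ValueError("need m, u > 0 and 1 - m - g, 1 - u - g > 0")
    agree = math.log2(m / u)
    disagree = math.log2((1.0 - m - g) / (1.0 - u - g))
    return agree, disagree


def pair_weight(agreement: Sequence[int],
                params: Sequence[Tuple[float, float, float]]) -> float:
    """Sum field weights over one record pair.

    agreement[l] is 1, -1 or 0; params[l] is the (m, u, g) triple of field l.
    A missing value (0) contributes log2(1) = 0 to the total.
    """
    total = 0.0
    for code, (m, u, g) in zip(agreement, params):
        w_agree, w_disagree = field_weights(m, u, g)
        if code == 1:
            total += w_agree
        elif code == -1:
            total += w_disagree
    return total


# Illustrative (m, u, g) values for two linking fields, not estimates from the data.
toy_params = [(0.8, 0.1, 0.0), (0.5, 0.25, 0.0)]
w_full_agreement = pair_weight([1, 1], toy_params)
w_all_missing = pair_weight([0, 0], toy_params)
```

Because $m_l > u_l$ is assumed, the agreement weight is positive and the disagreement weight negative, so pairs that agree on more fields rank higher.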

Simulating Agreement Matrix 𝑨
The idea is to generate re-sampled versions of the agreement matrix $A$ in such a way as to preserve the underlying probabilistic linking structure. For this purpose, the MaCSim algorithm develops a Markov Chain $\{A^{(n)}\}_{n=0,1,2,\dots}$ on $\mathcal{A}$ = {set of possible agreement pattern arrays}, with $A^{(0)} = A$, the observed agreement pattern array for the files $X$ and $Y$. The key step is to simulate from the observed agreement matrix $A$ to create $A^*$, which includes all the simulated agreement matrices, and then apply a linking method to link records using the simulated agreement matrix in each simulation. We estimate the linkage accuracy for each record in every simulation. These estimates are collated and summarized to provide an overall linkage accuracy as described in Section 3.

Simulation Algorithm
Markov Chain Monte Carlo (MCMC) (Gelman et al. 1995; Gilks et al. 1996) is an algorithm that constructs a Markov Chain which converges after a certain number of steps to the desired probability distribution and then samples efficiently from this distribution. The generated sample is used as an approximation to the probability distribution for further inference.
A Markov Chain is a process whereby the next step or iteration of the process only depends on the current step, not on the previous steps in the process. That is, a sequence $X_1, X_2, \dots$ of random elements of some set is a Markov chain if the conditional distribution of $X_{n+1}$ given $X_1, \dots, X_n$ depends only on $X_n$ (Geyer 2011). The set from which the values of $X_n$ are taken is called the state space of the Markov chain.
In the case of a finite state space, say $\{x_1, \dots, x_d\}$, the initial distribution can be defined as
$$P(X_1 = x_i) = \lambda_i, \quad i = 1, \dots, d,$$
where $(\lambda_1, \dots, \lambda_d)$ is a vector. The transition probability matrix $P$ comprises probabilities $p_{ij}$ defined by
$$p_{ij} = P(X_{n+1} = x_j \mid X_n = x_i), \quad i, j = 1, \dots, d.$$
The structure of the transition probabilities for the MCMC algorithm employed by MaCSim is now outlined. Given the current state of the chain, $A^{(n)}$, the next state, $A^{(n+1)}$, will be constructed as follows:
Step 1: Initially, set $A_{ijl}^{(n+1)} = A_{ijl}^{(n)}$ for all $i, j, l$.
Note that at this point, we will assume that a missing value will remain as it is (0) in the agreement matrix, since it is easy to consider a value as a missing entry but not easy to assign a value to a missing entry.
It is also important to note that the transition probabilities $p$ and $q$ are specific for each linking field $l$, where $l \in \{1, \dots, L\}$. However, for simplicity of notation we use $p_1, p_2$ instead of $p_{l1}, p_{l2}$ and $q_1, q_2, q_3$ instead of $q_{l1}, q_{l2}, q_{l3}$ respectively.
Once values for $p$ and $q$ are determined to ensure the stationary distribution of the chain has the desired structure (see Sections 2.4 and 2.5), this Markov chain can be used to generate an appropriate set of re-sampled agreement matrices.
In practice, every $k$-th iteration can be retained, where $k > 1$ is a specified constant, to reduce autocorrelation (see Section 3.3).

Underlying Intuition and Maintaining Consistency
The transition structure as defined above is designed to replicate circumstances whereby a random record of one of the files is selected and then a change in its value for the $l$-th linking variable is made with probability based on its current agreement status with its corresponding partner in the opposite file. It is noted that if a change does occur, this has the consequent effect of changing the agreement patterns in the associated non-matching record pairs. For instance, if the selected linking variable value in the selected record of the selected file matches its counterpart in the opposite file and was changed, then any agreement indicator for which the associated record in the opposite file was unity (indicating agreement of the values for the selected linking variable) must be re-set to -1, as in the relevant sub-steps of Step 4, as they can no longer agree.
Alternatively, for non-matched records for which the agreement indicator was -1, the values now may or may not agree, so we reset the indicator value to 1 with the given probability.With this underpinning, it is clear that the internal consistency patterns of agreement will be maintained.

Maintaining Marginal Distributions
In addition to internal agreement consistency, we need to ensure that the stationary distribution of the Markov chain maintains the required probabilities of agreement for both matched and non-matched records across the two files. This requires appropriate selection of the transition probability parameters $p = (p_1, p_2)$ and $q = (q_1, q_2, q_3)$.
In particular, we require that the probability that linking field values for matched record pairs agree remains equal to $m_l$. That is, $P\{A_{iil}^{(n+1)} = 1\} = m_l$.
Assuming that the chain starts in a state with $P\{A_{iil}^{(n)} = 1\} = m_l$, $P\{A_{iil}^{(n)} = -1\} = 1 - m_l - g_l$ and $P\{A_{iil}^{(n)} = 0\} = g_l$, the probability of agreement is preserved when transitions from agreement to disagreement balance those in the opposite direction. Thus, we require $p_2 = p_1 m_l / (1 - m_l - g_l)$. Of course, this requirement puts limits on $p_1$, since any value of $p_1 > (1 - m_l - g_l)/m_l$ would result in $p_2 > 1$. However, if $m_l > 0.5(1 - g_l)$ (which it certainly should be for any reasonable and useful linking variable), the necessary constraint of $p_1 \le (1 - m_l - g_l)/m_l$ is always satisfied. $p_1$ in this scenario can be thought of as a "mixing rate" parameter, and thus the value of $p_1$ should be set as large as possible for using our Markov chain in a computationally efficient manner (i.e. allowing the use of a relatively small value of $k$). This means, without any other constraints, we should select $p_1 = (1 - m_l - g_l)/m_l$, which then implies that $p_2 = 1$. However, as we shall now see, whether we can choose this option for $p_1$ depends on the values of $u_l$.
In our approach, the key assumptions are $(1 - m_l - g_l) \ge 0$, $(1 - u_l - g_l) \ge 0$, and that $m_l$, the probability of agreement for a matched record pair, is always greater than the probability of agreement for a non-matched record pair, $u_l$, i.e. $m_l > u_l$.
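The balance condition for matched pairs can be checked numerically. The sketch below is a simplification made for illustration, not the full MaCSim transition: it assumes a matched pair's indicator moves from agreement to disagreement with probability $p_1$, from disagreement to agreement with probability $p_2$, and that missing entries stay missing. The function names and parameter values are hypothetical.

```python
def p2_from_p1(p1: float, m: float, g: float) -> float:
    """Return p2 = p1 * m / (1 - m - g), the value that keeps P(agree) = m."""
    disagree = 1.0 - m - g  # stationary probability of disagreement
    if disagree <= 0.0:
        raise ValueError("need 1 - m - g > 0")
    p2 = p1 * m / disagree
    if p2 > 1.0 + 1e-12:
        raise ValueError("p1 exceeds (1 - m - g) / m, so p2 would exceed 1")
    return min(p2, 1.0)  # guard against rounding just above 1


def next_agree_prob(m: float, g: float, p1: float, p2: float) -> float:
    """P(agree) after one update, starting from P(agree) = m, P(disagree) = 1 - m - g."""
    return m * (1.0 - p1) + (1.0 - m - g) * p2


# Illustrative values only.
m, g = 0.9, 0.05
p1_max = (1.0 - m - g) / m          # largest admissible "mixing rate"
p2_at_max = p2_from_p1(p1_max, m, g)  # equals 1 up to rounding
agree_after = next_agree_prob(m, g, p1_max, p2_at_max)  # stays at m
```

Any admissible $p_1$ preserves the agreement probability; choosing the largest admissible value simply makes the chain move faster.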
Choosing appropriate values for the $q$ parameters arises from the requirement to maintain the probability of agreement between values of the linking variable among non-matched records. In other words, we must ensure that $P\{A_{ijl}^{(n+1)} = 1\} = u_l$. To this end, we note that, based on the steps in the algorithm described in Section 2.3, the required probabilities are calculated from the relationships between the values of $A_{ijl}^{(n)}$ and $A_{iil}^{(n)}$. For example, $P\{A_{ijl}^{(n)} = 0\} = g_l$, but $P\{A_{ijl}^{(n)} = 0 \mid A_{iil}^{(n)} \ne 0\} = c_l$. In addition, we assume that the agreement status on any linking variable for a non-matched record pair, $(i, j)$, is independent of the agreement status of the linking variables for the associated true matched record pair, $(i, i)$. This means that for any matched record pair, whether the value of any linking variable from file $X$ matches the value of the same linking variable in file $Y$ from a randomly selected non-matched record is independent of whether the true matched pair agreed on the linking variable or not.
Based on the above discussion, in order to maintain the marginal probabilities of matching, we choose the transition probability parameters $p = (p_1, p_2)$ and $q = (q_1, q_2, q_3)$ so that the constraints derived above are satisfied.

Maintaining Correlation Structure of the $A_{ijl}$'s
The choice of $p$ and $q$ values in the previous section maintains the marginal distributions of the $A_{ijl}$'s. However, since the transition probabilities do not depend in any way on the correlation structure in the vectors $A_{ij} = (A_{ij1}, \dots, A_{ijL})$, we cannot ensure that the members of $\{A^{(n)}\}_{n=0,1,2,\dots}$ will maintain the original dependence structure. To ensure this structure we need estimates of the relevant conditional probabilities. Given values for these probabilities, it is possible to select values for $p$ and $q$ so that the stationary distribution of our Markov chain, $\{A^{(n)}\}_{n=0,1,2,\dots}$, will have the desired conditional probability structure.

Estimating 𝒎, 𝒖 and 𝒈 probabilities
In the comparison stage, each linking field value for a record pair from the two files is compared; the result is a ternary code, 1 (when values agree), -1 (when values disagree) and 0 (when either or both values are missing). Hence, the comparison outcomes (i.e. the agreement matrix, $A$) contain the values 1, -1 and 0. According to these codes, each linking field is given a weight using the probabilities $m$, $u$ and $g$. To recap, $m$ is the probability that the field values agree when the record pair represents the same entity; $u$ is the probability that the field values agree when the record pair represents two different entities; and $g$ is the probability that the field values are missing from either or both records in the pair.
For each linking field using the synthetic data, $m_l$, $u_l$ and $g_l$ are estimated in the following way:
$m_l$ = number of values that agree for matched record pairs / total number of matched record pairs;
$u_l$ = number of values that agree for non-matched record pairs / total number of non-matched record pairs;
$g_l$ = total number of record pairs for which one or both values are missing / total number of possible record pairs.
These probabilities can be estimated using a linked file or they may be known from previous linkages of similar types of data.
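These estimators translate directly into counts over the agreement matrix. The following sketch assumes the matrix is stored as nested lists `A[i][j][l]` with the true match of row `i` in column `i` (the notational convention used earlier); the function name `estimate_mug` and the toy matrix are hypothetical.

```python
from typing import List, Tuple


def estimate_mug(A: List[List[List[int]]], l: int) -> Tuple[float, float, float]:
    """Estimate (m, u, g) for linking field l from a ternary agreement matrix.

    Assumes pair (i, i) is the true match of record i and every other
    pair (i, j), j != i, is a non-match.
    """
    n_x, n_y = len(A), len(A[0])
    matched = [A[i][i][l] for i in range(n_x)]
    non_matched = [A[i][j][l] for i in range(n_x) for j in range(n_y) if i != j]
    all_pairs = matched + non_matched
    m = sum(v == 1 for v in matched) / len(matched)
    u = sum(v == 1 for v in non_matched) / len(non_matched)
    g = sum(v == 0 for v in all_pairs) / len(all_pairs)
    return m, u, g


# Toy matrix: 2 records in X, 3 records in Y, a single linking field.
toy_A = [[[1], [-1], [0]],
         [[1], [-1], [-1]]]
m_hat, u_hat, g_hat = estimate_mug(toy_A, 0)
```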

Creating an observed link
To create the observed links, weights are calculated from the agreement matrix $A$ using the probabilities $m_l$, $u_l$ and $g_l$. For any $(i, j)$-th record pair and any linking variable $l$, if the agreement value is 1 (i.e. $A_{ijl} = 1$) then the weight is calculated using $w_{ijl} = \log_2(m_l / u_l)$; if the value is -1 (i.e. $A_{ijl} = -1$), the weight is calculated using $w_{ijl} = \log_2\big((1 - m_l - g_l)/(1 - u_l - g_l)\big)$; and for a missing value (i.e. $A_{ijl} = 0$), the weight formula is $w_{ijl} = \log_2(g_l / g_l) = \log_2(1) = 0$.
Given the assumption that missingness occurs at random, and thus has the same chance of occurring in a true matched pair as in a non-match, missing values will not contribute to the weight.
Once the weights of all record pairs, $w_{ij}$, are calculated, the observed links are created following the steps described below:
a. First, all record pairs are sorted by their weight, from largest to smallest.
b. The first record pair in the ordered list is linked if it has a weight greater than the chosen cut-off value.
c. All other record pairs that contain either of the records linked in step b are removed from the list. Thus, possible duplicate links are discarded.
d. Go to step b for the next record pair remaining in the list, and so on until no more records can be linked.
Finally, the maximum number of records that are considered as links will be less than or equal to the number of records in the smaller file since it contains the maximum number of possible matches.
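Steps a to d amount to a greedy one-to-one assignment, which can be sketched as below. Rather than physically deleting pairs from the list, the sketch skips any pair whose record has already been linked, which gives the same result. The function name `greedy_links`, the dictionary layout of the weights and the toy values are assumptions for illustration.

```python
from typing import Dict, Tuple


def greedy_links(weights: Dict[Tuple[int, int], float],
                 cutoff: float) -> Dict[int, int]:
    """Link record pairs greedily by descending weight under a 1-1 constraint.

    weights maps (i, j) -> total weight of record i in X and record j in Y.
    Returns a mapping {i: j} of accepted links.
    """
    links: Dict[int, int] = {}
    used_y = set()
    # Step a: sort all pairs from largest to smallest weight.
    ordered = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    for (i, j), w in ordered:
        # Step b: only pairs above the cut-off can be linked.
        if w <= cutoff:
            break
        # Step c: pairs sharing a record with an earlier link are discarded.
        if i in links or j in used_y:
            continue
        links[i] = j
        used_y.add(j)
        # Step d: continue with the next remaining pair.
    return links


# Illustrative weights for 3 records in X and 2 records in Y.
toy_weights = {(0, 0): 5.0, (1, 0): 4.5, (0, 1): 4.0, (1, 1): 3.0, (2, 1): -1.0}
toy_links = greedy_links(toy_weights, cutoff=0.0)
```

In the toy example, record 2 of file X remains unlinked because its only candidate falls below the cut-off, so the number of links never exceeds the size of the smaller file.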

Data
A synthetic dataset received from the Australian Bureau of Statistics is used for demonstration and analysis to avoid privacy issues associated with using real personal information. Moreover, for synthetic data, it is possible to assign a unique identifier to every record and link them back for verification. Thus, it is possible to calculate the matching quality and validate the accuracy of the model predictions. Many critical issues related to the linking process can be investigated under the controlled conditions that synthetic datasets provide.
A large file $Y$ is generated that comprises 400,000 randomly ordered records corresponding to 400,000 hypothetical individuals. Then, the first 50,000 records are taken to form file X. Every record has eight data fields (Table 1). For a record, the value of each variable is generated independently (e.g. the value of BDAY is independent of the value of SA1) and a discrete uniform distribution is used to generate its value, except for the value of COB. 300,000 records are assigned a value '1101' for 'Born in Australia'. The remaining 100,000 records are randomly assigned one of about 300 country codes according to the corresponding proportion of people in the 2006 Australian Census. In file X, the RECID (Record Identifier) stays matched to the Y file for each record. This makes it easy to identify true matches and non-matches in the linking process.

Table 1 Data fields of the synthetic dataset
Data field | Value
RECID (Record identifier) | 7 alphanumeric characters, ranging from 'A000001' to 'A400000'
BDAY | Numeric, ranging from 1 to 366
BYEAR (Birth Year) | Numeric, ranging from 1955 to 2009
SEX (Male/Female) | The values 1 and 2 represent male and female respectively. Exactly 50% of all records are male and the remaining 50% are female
EYE (Eye Colour) | Values are numbered from 1 to 5 and are evenly distributed
COB (Country of Birth) | '1101' for 'Born in Australia'; otherwise one of about 300 country codes

Some values in file X are changed intentionally to simulate errors in linking fields. The value of a variable in file X is changed by replacing it either with a randomly chosen value from the records in file Y or by setting the value to 'missing'. For this modification, individual records are selected independently. The SA1 field is changed to an adjacent SA1 for 500 (1%) records, and the first five digits of the corresponding Meshblock code are altered appropriately. For 1,500 (3%) records, the MB is changed to another MB within the same SA1 region. BDAY is changed to 'missing' for 4,000 (8%) records. For 500 records (1%), the day and month corresponding to the numeric code are altered.
In the BYEAR field, 50 records are replaced with 'BYEAR-2' and 50 with 'BYEAR+2'; 1,200 records are reset to 'BYEAR-1' and 1,200 to 'BYEAR+1'. For the SEX field, the value of 50 records (0.1%) is reversed. For 5,000 records (10%), the value of the EYE field is set to 'missing'. For another 5,000 records (10%) a valid alternative is chosen as a replacement value. The COB field is set to 'missing' for 750 records (approximately 2%) of the records coded to '1101'. COB is also set to 'missing' for 250 records (approximately 2%) with another country code. For 125 of these cases, records are replaced with 'Australia' and for the remaining 125 cases, records in COB are recoded to another country within the same broad geographical region (e.g. with the same two-digit SACC code) (Australian Bureau of Statistics, Internal report, 'Simulating Probabilistic Record Linkage', Peter Rossiter, Analytical Services Branch, July 2014).

Blocking strategy
In the linking process, the number of possible record pairs to compare depends on the size of the two files. For large data files, comparing and calculating weights for every record pair can cause a significant performance bottleneck. Moreover, it is not computationally efficient, and often not possible, to run matching algorithms that search through entire large data files to find matches. To overcome these challenges, the files are split into blocks where the matches are most likely. Thus, blocking reduces the large number of comparisons by only comparing record pairs that have the same value for a blocking variable. In this paper, different blocking strategies are applied and the accuracy of linkages is observed.
The analysis used two different blocking variables, namely SA1, and SA1 & SEX. In the second case, the two variables are combined into a single blocking variable. For every blocking variable, the number of records in each block in file X is different. Due to the introduced misclassification error described above, the values of the variable SA1 are changed in file X. Therefore, while blocking with SA1, we took the original value of SA1 to make sure all the true matches are within this block. Similarly, while blocking with SA1 & SEX, we consider the original values of this combined variable.
In each case, the variables that are involved in the blocking variable are not used for linking.
Following the specific blocking strategy, an agreement array, $A$, is created from the two files to be linked for a single block. Block-specific $m$, $u$ and $g$ probabilities are calculated for each linking variable, following the procedure described in Section 2.7.
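Blocking on a single or combined variable can be sketched as grouping record indices by a key and comparing only within groups. The dictionary record layout, the function names and the toy values below are assumptions for illustration; as in the analysis above, the keys are assumed to be the original (error-free) blocking values so that true matches fall in the same block.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def block_records(records: Sequence[dict],
                  key_fields: Sequence[str]) -> Dict[tuple, List[int]]:
    """Group record indices by the tuple of their blocking-field values."""
    blocks: Dict[tuple, List[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[tuple(rec[f] for f in key_fields)].append(idx)
    return dict(blocks)


def candidate_pairs(file_x: Sequence[dict], file_y: Sequence[dict],
                    key_fields: Sequence[str]) -> List[Tuple[int, int]]:
    """Return only the (i, j) pairs whose records share a blocking key."""
    blocks_x = block_records(file_x, key_fields)
    blocks_y = block_records(file_y, key_fields)
    return [(i, j)
            for key, rows_x in blocks_x.items() if key in blocks_y
            for i in rows_x
            for j in blocks_y[key]]


# Toy files: values are illustrative, not taken from the ABS dataset.
toy_x = [{"SA1": 10, "SEX": 1}, {"SA1": 10, "SEX": 2}, {"SA1": 11, "SEX": 1}]
toy_y = toy_x + [{"SA1": 11, "SEX": 2}]
pairs_sa1 = candidate_pairs(toy_x, toy_y, ("SA1",))
pairs_sa1_sex = candidate_pairs(toy_x, toy_y, ("SA1", "SEX"))
```

The combined key produces smaller blocks, so fewer pairs are compared: 3 candidate pairs instead of 6 in the toy example, out of 12 possible pairs without blocking.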

Simulation (create simulated values of 𝑨)
The initial agreement matrix $A$ is simulated following the steps described in Section 2.3. The thinning value $k$ is set as 1,000 and the number of desired replicates of $A$, say $A^*$, is $N$ = 1,000. Hence, 1,000,000 MCMC simulations are run and $N$ samples $A^{*(s)}$, $s = 1, \dots, 1000$, are retained. In $A^*$, we have 1000 instances of the agreement matrix $A$.
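The thinning scheme (run $k \times N$ transitions, keep every $k$-th state) can be sketched generically. The `step` argument stands in for one MaCSim transition, which is not reproduced here; the counter used in the usage lines is a placeholder chain chosen only so the retained iterations are easy to read, and the function name is hypothetical.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")


def thinned_samples(initial: State, step: Callable[[State], State],
                    k: int, n_keep: int) -> List[State]:
    """Run k * n_keep transitions and retain every k-th state."""
    state = initial
    kept: List[State] = []
    for n in range(1, k * n_keep + 1):
        state = step(state)
        if n % k == 0:
            kept.append(state)
    return kept


# Placeholder chain: the state is just the iteration counter.
retained = thinned_samples(0, lambda s: s + 1, k=10, n_keep=5)
```

With the settings used in the paper ($k$ = 1,000 and 1,000 replicates) the same loop performs 1,000,000 transitions and retains 1,000 states.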

Examine simple distance between 𝑨 * entries
The distances of the $A^*$ entries ($A^{*(2)}, A^{*(3)}, \dots, A^{*(N)}$) from the initial agreement matrix $A^{*(1)}$ are calculated. In every simulation, the distance is calculated as the total number of agreement values that are changed from the initial values divided by the total number of agreement values. In this way we obtain the proportion of agreement values that change in each simulation. For SA1, the distance plot allows estimation of a "burn-in" period for the chain as well as the thinning parameter ($k$) to ensure that the retained simulated matrices are less correlated. From the distance graph on blocking variable 'SA1' (Fig. 1 (i)), the chain appears to have converged after 50 iterations, when approximately 11% of the values in the elements of $A^*$ are changed. The chain stays stable over the 1000 simulations. In the case of the combined blocking variable SA1 & SEX, we see from the plot (Fig. 1 (ii)) that the chain converges after 180 iterations to around 0.24. Hence, compared to the single blocking variable SA1, the chain for the combined variable took more iterations to settle.
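The distance used here is simply the share of entries that differ between two agreement matrices. A minimal sketch, assuming both matrices are nested lists of identical shape `[i][j][l]` and using a hypothetical function name:

```python
from typing import List

Matrix = List[List[List[int]]]


def changed_proportion(initial: Matrix, simulated: Matrix) -> float:
    """Proportion of agreement values that differ from the initial matrix."""
    flat_initial = [v for row in initial for pair in row for v in pair]
    flat_simulated = [v for row in simulated for pair in row for v in pair]
    if len(flat_initial) != len(flat_simulated):
        raise ValueError("matrices must have the same shape")
    changed = sum(a != b for a, b in zip(flat_initial, flat_simulated))
    return changed / len(flat_initial)


# Toy example: one record in X, two in Y, two linking fields.
a_first = [[[1, -1], [0, 1]]]
a_later = [[[1, 1], [0, -1]]]
distance = changed_proportion(a_first, a_later)
```

Plotting this quantity against the simulation index gives the kind of trace used in the text to judge burn-in and thinning.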
Table 2 shows the comparison of the percentages of agree, disagree, and missing values for the blocking variables SA1 and SA1_SEX. From the table we notice that the percentage of agreement is higher for SA1 than for SA1_SEX. In the simulation algorithm, the changes of agreement/disagreement values in the next state depend on the agreement/disagreement values of the current state; thus, for these two

Proportion of times each record in File X is correctly re-linked
Based on the agreement values from $A^*$, in every simulation we link records following the same linking process described earlier (Section 2.8) and observe how many times each record has been re-linked to the record to which it was originally linked. We perform this analysis on the first block when blocking with SA1 and also with the combined variable SA1 & SEX. When we block the data with SA1, the first block contains 59 records in File X. Figure 4 (i) shows the proportion of correct links of each X record for this block in 1000 simulations. From this plot, we see that the correct re-link proportion for all 59 records lies between 93.5% and 100%. The plot also shows the average accuracy with the red line, which is 99%. We have a very low error rate for each record. The maximum error we obtained was 6.5% for record number 44.
With the combined variable SA1 & SEX, there are 26 records in the first block in File X. Figure 4 (ii) shows the correct re-link proportion of each X record in 1000 simulations. Here we obtained an accuracy in excess of 98% for every record. The average accuracy is 99.8%, which is shown by the green line. The maximum error is only 1.2%, for record number 8.

Correct re-link proportion in every simulation
In this analysis, we estimate the accuracy in every simulation for all records in File X for the first block when blocking with the variable SA1 and also with the combined blocking variable SA1 & SEX. Figure 5 (i) shows the correct re-link proportion of all 59 records in each of 1000 simulations. We obtained 100% accuracy in most of the simulations. For some simulations, 98.3% accuracy is obtained, where 58 records (out of 59) are correctly linked to their original records and one record is incorrectly re-linked. The smallest accuracy, 93.2% (=55/59), is found in only three simulations, where 4 records are incorrectly linked. Note that the average accuracy (indicated by the red line in Fig. 5 (i)) for all records in every simulation is 99%, which is exactly the same as the average accuracy for each record in all simulations (Fig. 4 (i)), as expected.
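The per-record and per-simulation summaries reported here are two views of the same table of outcomes, which is why their averages agree. The following is a minimal sketch of that bookkeeping, not the authors' implementation. It assumes a hypothetical boolean matrix `correct` with one row per simulation and one column per File X record, and it fills that matrix with randomly generated outcomes purely for illustration.

```python
import numpy as np

# Hypothetical outcomes: correct[s, i] is True when record i of File X was
# re-linked to its original partner in simulation s. The values are randomly
# generated for illustration and are not the paper's results.
rng = np.random.default_rng(0)
n_simulations, n_records = 1000, 59
correct = rng.random((n_simulations, n_records)) < 0.99

# Per-record view (as in Fig. 4): share of simulations in which each record
# is correctly re-linked.
per_record = correct.mean(axis=0)

# Per-simulation view (as in Fig. 5): share of records correctly re-linked
# in each simulation.
per_simulation = correct.mean(axis=1)

# Both averages equal the grand mean of the same matrix, so they coincide.
print(per_record.mean(), per_simulation.mean())
```

Because every simulation covers the same set of records, averaging over rows first or over columns first gives the same grand mean.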

Conclusion
With ever-expanding overlapping datasets, both administrative and substantive, accurately assessing the linkage of these databases is crucial. This research has focussed on providing methods to estimate the confidence of individual links in these integrated databases. This will prove extremely important in applying analysis techniques that can adequately account for the errors associated with linkage.
It is also important for assessing which, if any, linking method is likely to be more accurate for a linkage task.
The proposed assessment approach (MaCSim) serves as a tool for assessing a linking method. With this tool, when we apply a linking method to re-link records in each simulation and estimate the accuracy of the links by the correct re-link proportions, we essentially assess that linking method. In our approach, the accuracy is determined for a number of simulated datasets and therefore better represents uncertainty than an estimate from just one dataset.
We have shown two initial results from simulated output on a synthetic dataset received from the Australian Bureau of Statistics. The results indicate high correct re-link proportions for matched records using our simulation approach.
Future work will continue investigating optimal choices of block sizes and cut-off values. We will also use the approach to assess the effect of missing information and of the conditional independence assumption on linkage accuracy, and enhance the Markov chain methodology to account for the case of conditional dependence. Moreover, in this work we have used two datasets; how the approach will work on more than two datasets is yet to be investigated.
; Bakker and Daas 2012). The Australian Census Longitudinal Dataset (ACLD) is created by linking the 2006 and 2011 Australian Population Censuses. To analyse how the characteristics of cohorts change over time, the Australian Bureau of Statistics performed probabilistic linkage of person records in its 2006 and 2011 Censuses of Population and Housing (Zhang and Campbell 2012). Wilkins et al. (2009) used a linked data set, obtained by merging data collected in the Canadian Community Health Survey with data held in Statistics Canada's Hospital Person-Oriented Information database, to model the relationship between an individual's smoking status and both the probability of hospitalization and the length of time subsequently spent in hospital. To determine the 2005 prevalence rates of chronic diseases for the remote indigenous population of the Northern Territory of Australia, Zhao et al. (2008) required linkage of a primary care chronic disease register with hospital inpatient databases. In all of these applications, different data sets related to the same individuals at different points in time are linked, thus allowing longitudinal data analysis.
$\mathbf{A} = (A_{ijl})$; $i = 1, \dots, n_X$, $j = 1, \dots, n_Y$, $l = 1, \dots, L$, is a three-dimensional array denoting the agreement pattern of all linking fields across all records in the two files. Here, $A_{ijl} = 1$ if the $l$th linking field values for record $i$ of file $X$ and record $j$ of file $Y$ are the same; $A_{ijl} = -1$ if these values are not the same; and $A_{ijl} = 0$ if either or both of the values are missing.
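The three-valued coding described here (1 for agreement, -1 for disagreement, 0 when either value is missing) can be illustrated with a short sketch. It is a toy construction under stated assumptions, not the authors' code: the function name `agreement_array`, the use of `None` for missing values, and the example records are all hypothetical.

```python
import numpy as np

def agreement_array(file_x, file_y):
    """Return A with A[i, j, l] = 1 (agree), -1 (disagree) or 0 (missing)."""
    n_x, n_y, n_fields = len(file_x), len(file_y), len(file_x[0])
    A = np.zeros((n_x, n_y, n_fields), dtype=int)
    for i, rec_x in enumerate(file_x):
        for j, rec_y in enumerate(file_y):
            for l, (value_x, value_y) in enumerate(zip(rec_x, rec_y)):
                if value_x is None or value_y is None:
                    A[i, j, l] = 0      # either or both values missing
                elif value_x == value_y:
                    A[i, j, l] = 1      # values agree
                else:
                    A[i, j, l] = -1     # values disagree
    return A

# Hypothetical toy records with three linking fields; None marks a missing value.
file_x = [("10001", "F", "1980-02-01"),
          ("10001", "M", None)]
file_y = [("10001", "F", "1980-02-01"),
          ("10002", "M", "1975-07-30"),
          ("10001", None, "1980-02-01")]

A = agreement_array(file_x, file_y)
print(A.shape)   # (2, 3, 3): records in X, records in Y, linking fields
print(A[1, 1])   # [-1  1  0]
```

The triple loop is written for readability; for files of realistic size, comparisons would normally be restricted to record pairs within the same block.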
$u_l(s) = P\{A_{ijl} = 1 \mid S_l = s\}$, for any $i \neq j$, where $S_l = h(A_{ij1}, \dots, A_{ij(l-1)}, A_{ij(l+1)}, \dots, A_{ijL}) = h(A_{ij,-l})$ is a summary statistic depending on the agreement values associated with all the other linkage variables. The simplest form of the $h$-function would just be the sum of the other agreement values, so that we would assume the probability of agreement on the $l$th linking variable for two non-matched record pairs depends only on how many other agreements there are among the other linking variable values. Alternatively, the $h$-function could simply be the identity function, implying potentially different $u_l(s)$ values for each of the $2^{L-1}$ distinct possible arrangements of the remaining agreement indicators. Of course, the more complex the $h$-function, the more distinct $u_l(s)$ values will need to be estimated. Note that if the conditional independence assumption actually holds, then $u_l(s) = u_l$ for any value of $s$.
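To make the role of the summary statistic concrete, the sketch below estimates the conditional agreement probability for one linking variable empirically, grouping record pairs by the value of a simple summary of the other fields. It is illustrative only and rests on assumptions: the summary counts the other fields that agree (values equal to 1), which matches the "how many other agreements" reading above, and the function names and toy agreement patterns are hypothetical.

```python
from collections import defaultdict

import numpy as np

def h_count(pattern, l):
    """Summary of the other fields: number of agreements, excluding field l."""
    others = np.delete(pattern, l)
    return int(np.sum(others == 1))

def estimate_conditional_agreement(patterns, l):
    """Empirical P(field l agrees | summary = s) over the supplied pairs.

    Pairs whose value on field l is missing (0) count as not agreeing.
    """
    totals = defaultdict(int)
    agrees = defaultdict(int)
    for pattern in patterns:
        s = h_count(pattern, l)
        totals[s] += 1
        agrees[s] += int(pattern[l] == 1)
    return {s: agrees[s] / totals[s] for s in sorted(totals)}

# Hypothetical agreement patterns for non-matched pairs, three linking fields.
patterns = np.array([
    [ 1,  1, -1],
    [-1,  1, -1],
    [ 1, -1, -1],
    [-1, -1, -1],
    [ 1,  1,  1],
    [-1,  0,  1],
])

estimates = estimate_conditional_agreement(patterns, l=0)
print(estimates)
```

If the estimates were roughly constant across the values of the summary, that would be consistent with the conditional independence assumption; estimates that vary with the summary suggest dependence between linking variables.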
SA1 (Statistical Area 1): A hypothetical two-level geographical location system. Each SA1 contains exactly 400 records. The values are 5-digit codes numbered from 10001 to 11000.
MB (Meshblock): Every SA1 consists of exactly 5 Meshblocks (MB). Each Meshblock contains 80 records in file Y and 10 records in file X. The values are 7-digit codes ranging from 1000101 to 1100009.
BDAY (Birth Day): 20,000 consecutive days from 1 January 1955 to 3 October 2009.
Country of Birth: 75% of the total records are assigned the value '1101' for 'Born in Australia'. The remaining 25% of records are randomly assigned one of about 300 country codes according to the corresponding proportion of people in the 2006 Census.
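The ranges quoted for SA1 and BDAY can be checked with a few lines of date and integer arithmetic. This is only a sanity check of the stated ranges, using hypothetical variable names, and not part of the data generation itself.

```python
from datetime import date, timedelta

# BDAY: 20,000 consecutive days starting on 1 January 1955.
first_day = date(1955, 1, 1)
n_days = 20_000
last_day = first_day + timedelta(days=n_days - 1)
print(last_day)  # 2009-10-03

# SA1: 5-digit codes numbered from 10001 to 11000 inclusive.
n_sa1 = 11000 - 10001 + 1
print(n_sa1)  # 1000
```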

Figure 1: Distances in 1000 simulations using two blocking strategies. The number of value changes in each simulation is expected to differ between the two blocking variables, which is why convergence occurs at two different points.

Figure 2: Total number of agreement values,

Figure 4: Correct re-link proportion of each X record

Figure 5: Correct re-link proportion in every simulation

With the combined blocking variable SA1 & SEX, Figure 5 (ii) shows the correct re-link proportion of all 26 records in each of 1000 simulations. We obtained 100% accuracy in most of the simulations. For some simulations, 96.1% accuracy is obtained, where 25 records (out of 26) are correctly linked to their original records. The smallest accuracy, 92.3% (=24/26), is found in only one simulation, where 2 records are incorrectly linked. Note that the average accuracy (indicated by the green line in Fig. 5 (ii)) for all records in every simulation is 99.8%, which is exactly the same as the average accuracy for each record in all simulations (Fig. 4 (ii)), as expected.

Total number of agreement value changes in $\mathbf{A}^*$ in each simulation: (i) total number of agree (1) values in $\mathbf{A}^*$; (ii) total number of disagree (-1) values in $\mathbf{A}^*$.