Tuesday, May 24, 2011

A basic framework for matching or linking records

Let A be a file containing n records. The Cartesian product A X A, i.e. the set of all possible pairs of records, is ideally composed of three subsets: MP, SP and UP.
MP is the set of pairs of matching records, SP is the set of pairs of suspected matching records, and UP is the set of non-matching pairs.

Our aim is to find a match rule (L) such that any pair of records in A X A falls in one of the three following sets:
M = set of definite matching pairs.
S = set of suspected matching pairs.
U = set of non-matching pairs.
Any record in A is composed of several fields. Ideally, we build several match keys involving these fields. For example, let us consider the fields on the records to be:
Given Name, Middle Name, Surname, Name Suffix, Street Number, Street Type, Apartment Number, Floor, Post Code, Locality, City, State, Country, Telephone Number, Mobile Number, SSN.
A match key may involve Given Name, Surname, Street Number, Street Type, and Apartment Number.
Another match key may involve Given Name, Surname, Telephone Number, Post Code and City.
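
As a rough sketch, the two match keys above could be written down as ordered tuples of field names. The snake_case field names, the `MATCH_KEYS` dictionary and the `key_values` helper are assumptions made for illustration only; they are not part of the framework itself.

```python
# Hypothetical field names derived from the field list above.
MATCH_KEYS = {
    "name_address": ("given_name", "surname", "street_number",
                     "street_type", "apartment_number"),
    "name_phone": ("given_name", "surname", "telephone_number",
                   "post_code", "city"),
}


def key_values(record, key_fields):
    """Return the record's values for the fields of one match key, in order.

    Missing fields are returned as empty strings (an assumed convention).
    """
    return tuple(record.get(field, "") for field in key_fields)


record = {
    "given_name": "John",
    "surname": "Smith",
    "telephone_number": "555-0100",
    "post_code": "700001",
    "city": "Kolkata",
}

phone_key = key_values(record, MATCH_KEYS["name_phone"])
address_key = key_values(record, MATCH_KEYS["name_address"])
```

Here `phone_key` holds all five values, while `address_key` has empty strings for the street fields that this sample record lacks.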

Suppose there are k match keys defined in the system.

A pair (p) is compared on every defined match key, i.e. for any pair, a comparison is done for each match key, and a number λi(p) (or just λi) is returned to indicate the comparison result for the ith match key.
Here, 0 ≤ λi ≤ 1 for i = 1, 2, …, k.
Let {λ1, λ2, …, λk} be the comparison vector.
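
The post does not say how each λi is computed, so the sketch below assumes one simple rule: λi is the fraction of the match key's fields on which the two records agree exactly. The function names and the sample records are hypothetical; any comparison that returns a value between 0 and 1 would fit the framework equally well.

```python
def compare_on_key(rec_a, rec_b, key_fields):
    """Assumed scoring rule: share of the key's fields that agree exactly.

    A field missing from either record counts as a disagreement.
    """
    agreeing = 0
    for field in key_fields:
        value_a = rec_a.get(field)
        value_b = rec_b.get(field)
        if value_a is not None and value_a == value_b:
            agreeing += 1
    return agreeing / len(key_fields)


def comparison_vector(rec_a, rec_b, match_keys):
    """Return [lambda_1, ..., lambda_k], one value per match key."""
    return [compare_on_key(rec_a, rec_b, fields) for fields in match_keys]


# Two illustrative match keys (k = 2) and one pair of records.
keys = [
    ("given_name", "surname", "street_number"),
    ("given_name", "surname", "telephone_number"),
]
rec_a = {"given_name": "John", "surname": "Smith",
         "street_number": "12", "telephone_number": "555-0100"}
rec_b = {"given_name": "John", "surname": "Smith",
         "street_number": "14", "telephone_number": "555-0100"}

vector = comparison_vector(rec_a, rec_b, keys)
```

For this pair the first key agrees on two of its three fields and the second key agrees on all three.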

Let us define match probability to be the maximum value of these comparison results.
If the match probability of the pair p is denoted by λ(p) (or just λ), then λ = MAX{λ1, λ2, …, λk}             [1]
[1] ensures that, in order to be a match, a pair must match on at least one match key.
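
A minimal sketch of [1], assuming the comparison vector is held as a plain Python list; the function name is illustrative.

```python
def match_probability(vector):
    """Return lambda = MAX{lambda_1, ..., lambda_k} as in [1]."""
    if not vector:
        raise ValueError("at least one match key is required")
    if any(v < 0.0 or v > 1.0 for v in vector):
        raise ValueError("each comparison result must lie in [0, 1]")
    return max(vector)


lam = match_probability([0.2, 0.9, 0.5])
```

Because the maximum is taken, a single match key with a high comparison result is enough to give the pair a high match probability.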

Alternatively, some other single-valued function with values in the range 0 to 1 (for example, a weighted average) can be used instead of MAX in [1].
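
A weighted average could be sketched as follows. The weights are hypothetical (the post does not say how they would be chosen); they only need to be non-negative with a positive sum, so that the result stays between 0 and 1.

```python
def weighted_match_probability(vector, weights):
    """Weighted average of the comparison results (an alternative to MAX)."""
    if len(vector) != len(weights):
        raise ValueError("one weight per match key is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * v for w, v in zip(weights, vector)) / total


# Illustrative weights: the first match key is trusted three times as much.
lam_avg = weighted_match_probability([1.0, 0.5], [3, 1])
```

Unlike MAX, an average lets a poor result on one match key pull the overall value down, so a pair no longer qualifies on the strength of a single key.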

We also define two thresholds c and l with 0 < c < l < 1 such that: if λ(p) > l then we conclude that p ∈ M; if λ(p) < c then p ∈ U; and if c ≤ λ(p) ≤ l then p ∈ S.
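
The threshold rule can be sketched as a small function. Sending values exactly equal to c or l to S is an assumption made here so that every pair receives a label; the threshold values in the example are arbitrary.

```python
def classify(lam, c, l):
    """Assign a pair to M, S or U from its match probability lam."""
    if not 0 < c < l < 1:
        raise ValueError("thresholds must satisfy 0 < c < l < 1")
    if lam > l:
        return "M"  # definite match
    if lam < c:
        return "U"  # non-match
    return "S"      # suspected match, including the boundary values (assumed)


labels = [classify(x, c=0.4, l=0.8) for x in (0.95, 0.6, 0.1, 0.8)]
```

With c = 0.4 and l = 0.8, the four sample values are labelled M, S, U and S respectively.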

The above rule puts every pair of distinct records in A (there are nC2 = n(n-1)/2 such unordered pairs) into one of the three subsets M, U and S.
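
Putting the pieces together, a toy end-to-end sketch might enumerate each unordered pair once and assign it to one of the three sets. The `toy_score` function, the sample records and the thresholds are all hypothetical stand-ins for a real match probability, and boundary values are again assumed to fall in S.

```python
from itertools import combinations
from math import comb


def partition_pairs(records, score, c, l):
    """Split all unordered pairs of record indices into M, S and U."""
    sets = {"M": [], "S": [], "U": []}
    for i, j in combinations(range(len(records)), 2):
        lam = score(records[i], records[j])
        if lam > l:
            label = "M"
        elif lam < c:
            label = "U"
        else:
            label = "S"
        sets[label].append((i, j))
    return sets


def toy_score(rec_a, rec_b):
    """Assumed score: fraction of the two fields (name, phone) that agree."""
    fields = ("name", "phone")
    return sum(rec_a[f] == rec_b[f] for f in fields) / len(fields)


records = [
    {"name": "Ann", "phone": "111"},
    {"name": "Ann", "phone": "111"},
    {"name": "Ann", "phone": "222"},
    {"name": "Bob", "phone": "333"},
]

result = partition_pairs(records, toy_score, c=0.3, l=0.8)
total_pairs = sum(len(pairs) for pairs in result.values())
```

With n = 4 records there are 6 pairs in total, and each one lands in exactly one of the three sets. Comparing every pair grows quadratically with n, which is worth keeping in mind for large files.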
