Data Quality: A basic framework for matching or linking records

Let A be a file containing n records. The Cartesian product A X A, i.e. the set of all possible pairs of records, is ideally composed of three subsets M_P, S_P and U_P.
M_P is the set of pairs of matching records, S_P is the set of pairs of suspected matching records and U_P is the set of non-matching pairs.

Our aim is to find a match rule (L) such that any pair of records in A X A falls in one of the three following sets:
M = set of definite matching pairs.
S = set of suspected matching pairs.
U = set of non-matching pairs.

Any record in A is composed of several fields. Ideally, we build several match keys involving these fields. For example, let us consider the fields on the records to be:

Given Name, Middle Name, Surname, Name Suffix, Street Number, Street Type, Apartment Number, Floor, Post Code, Locality, City, State, Country, Telephone Number, Mobile Number, SSN.
A match key may involve Given Name, Surname, Street Number, Street Type, and Apartment Number.
Another match key may involve Given Name, Surname, Telephone Number, Post Code and City.
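As a minimal sketch of the two match keys above, each key can be written as a tuple of field names. The snake_case field names, the dictionary record layout and the helper key_values are illustrative assumptions, not part of the framework itself.

```python
# Each match key is a tuple of the fields it involves.
# The key names and field spellings below are hypothetical.
MATCH_KEYS = {
    "name_address": ("given_name", "surname", "street_number",
                     "street_type", "apartment_number"),
    "name_phone": ("given_name", "surname", "telephone_number",
                   "post_code", "city"),
}


def key_values(record, key_fields):
    """Return the field values of a record that a match key looks at.

    Fields missing from the record are returned as empty strings.
    """
    return tuple(record.get(field, "") for field in key_fields)


# A made-up record holding only some of the fields listed above.
record = {
    "given_name": "Ann",
    "surname": "Lee",
    "street_number": "12",
    "street_type": "Road",
    "apartment_number": "4B",
    "telephone_number": "5550100",
    "post_code": "700001",
    "city": "Kolkata",
}

phone_key_values = key_values(record, MATCH_KEYS["name_phone"])
```

Keeping the keys as plain data means a new match key can be added without touching the comparison logic.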

Suppose there are k match keys defined in the system.

Every pair p is compared on each defined match key, and each comparison returns a number λ_i(p) (or just λ_i) that indicates the comparison result for the i-th match key.
Here, 0 ≤ λ_i ≤ 1 for i = 1, 2, …, k.

Let {λ₁, λ₂, …, λ_k} be the comparison vector.
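The text does not say how a single comparison result λ_i is computed. The sketch below assumes one possible choice: the average string similarity of the fields in a match key, using SequenceMatcher from Python's standard difflib module. The function names, the field names and the treatment of missing values are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity of two field values in [0, 1]; missing values score 0."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def compare_on_key(rec_a, rec_b, key_fields):
    """Comparison result for one match key: mean similarity of its fields."""
    scores = [field_similarity(rec_a.get(f, ""), rec_b.get(f, ""))
              for f in key_fields]
    return sum(scores) / len(scores)


def comparison_vector(rec_a, rec_b, match_keys):
    """One comparison result per match key, in the order the keys are given."""
    return [compare_on_key(rec_a, rec_b, fields) for fields in match_keys]


# Two hypothetical match keys (k = 2) and two made-up records.
MATCH_KEYS = [
    ("given_name", "surname", "post_code"),
    ("given_name", "surname", "telephone_number"),
]
rec_a = {"given_name": "Ann", "surname": "Lee",
         "post_code": "700001", "telephone_number": "5550100"}
rec_b = {"given_name": "Anne", "surname": "Lee",
         "post_code": "700001", "telephone_number": "5550199"}

vector = comparison_vector(rec_a, rec_b, MATCH_KEYS)
```

Any other comparison function could be swapped in, as long as it returns a value between 0 and 1 for each match key.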

Let us define match probability to be the maximum value of these comparison results.
If the match probability is λ (or λ(p)), then λ = MAX{λ₁, λ₂, …, λ_k} [1]
[1] ensures that, in order to be a match, a pair must match on at least one match key.

Let us denote the match probability corresponding to the pair p by λ(p).

Alternatively, some other single-valued function with values in the range 0 to 1 (for example, a weighted average) can be used instead of MAX in [1].
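The two ways of collapsing a comparison vector into a single match probability can be sketched as follows. The function names and the example weights are assumptions; the weighted average is only one of the alternatives the text allows.

```python
def match_probability_max(vector):
    """Match probability as in [1]: the largest comparison result."""
    return max(vector)


def match_probability_weighted(vector, weights):
    """Alternative: weighted average of the comparison results.

    With non-negative weights that do not all vanish, the result stays
    between 0 and 1 whenever every comparison result does.
    """
    if len(vector) != len(weights):
        raise ValueError("need exactly one weight per match key")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(w * v for w, v in zip(weights, vector)) / sum(weights)


# A made-up comparison vector for k = 3 match keys.
vector = [0.2, 0.9, 0.5]
lam_max = match_probability_max(vector)
lam_avg = match_probability_weighted(vector, [1, 2, 1])
```

With MAX, one strongly matching key is enough to give a high match probability. A weighted average can pull that value down when the other keys disagree.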

We also define two thresholds c and l, with 0 < c < l < 1, such that: if λ(p) > l, we conclude that p ∈ M; if λ(p) < c, then p ∈ U; and if c ≤ λ(p) ≤ l, then p ∈ S.
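The threshold rule can be sketched as a small function. The threshold values 0.4 and 0.8 are made up for the example. In this sketch a pair whose match probability is exactly equal to c or l is placed in S.

```python
def classify(lam, c, l):
    """Assign a pair to 'M', 'S' or 'U' from its match probability lam.

    c is the lower threshold and l the upper one, with 0 < c < l < 1.
    """
    if not 0 < c < l < 1:
        raise ValueError("thresholds must satisfy 0 < c < l < 1")
    if lam > l:
        return "M"  # definite match
    if lam < c:
        return "U"  # non-match
    return "S"      # suspected match, including lam == c and lam == l


# Hypothetical thresholds and match probabilities.
labels = [classify(lam, c=0.4, l=0.8) for lam in (0.95, 0.6, 0.1, 0.8)]
```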

The above rule puts every pair under comparison into one of the three subsets M, U and S. Counting only unordered pairs of distinct records, there are ⁿC₂ = n(n − 1)/2 such pairs in A X A.
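Putting the pieces together, the sketch below enumerates every unordered pair of distinct records in a small file and sorts the pairs into M, S and U. The scoring function toy_score (the fraction of fields that agree exactly), the sample records and the thresholds are all made up for illustration. Any function returning a value between 0 and 1 could take the place of toy_score.

```python
from itertools import combinations
from math import comb


def partition_pairs(records, score, c, l):
    """Split all unordered pairs of distinct records into M, S and U.

    score(rec_a, rec_b) must return a match probability in [0, 1].
    Pairs are reported as (i, j) index tuples with i < j.
    """
    M, S, U = [], [], []
    for i, j in combinations(range(len(records)), 2):
        lam = score(records[i], records[j])
        if lam > l:
            M.append((i, j))
        elif lam < c:
            U.append((i, j))
        else:
            S.append((i, j))
    return M, S, U


def toy_score(rec_a, rec_b):
    """Hypothetical score: share of positions where the fields are equal."""
    agree = sum(1 for x, y in zip(rec_a, rec_b) if x == y)
    return agree / len(rec_a)


# A made-up file A with n = 4 records: (given name, surname, post code).
A = [
    ("Ann", "Lee", "700001"),
    ("Ann", "Lee", "700001"),
    ("Ann", "Lee", "700002"),
    ("Bob", "Ray", "110001"),
]

M, S, U = partition_pairs(A, toy_score, c=0.4, l=0.8)
total_pairs = len(M) + len(S) + len(U)  # equals comb(len(A), 2)
```

Because the number of pairs grows as n(n − 1)/2, a full enumeration like this is only practical for small files.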


Tuesday, May 24, 2011
