Friday, June 3, 2011

Indirect Matching

Consider the following records:
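# | Given Name | Middle Name | Last Name | St. No. | St. Name | St. Type | Apt     | Cell No.   | City       | ZIP
--+------------+-------------+-----------+---------+----------+----------+---------+------------+------------+------
1 | John       | Peter       | Morkel    | 25      | Main     | Street   | Apt 225 | 1234567890 | Kansas     | 11111
2 | Jon        | P           | Morkel    | 25      | Main     | Street   | Ste 225 | 1212121212 | Kansas     | 11111
3 | J          | P           | Morkel    | 1750    | Collins  | Blvd     | 102     | 1212121212 | Richardson | 75068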

In this case, the key-based matching we discussed earlier will declare the first two records a match and the last two records a match. Ideally, though, we want all three records to be treated as a match and to form a single cluster (group) of matched records.
This can only be done by performing an indirect match, according to the rule:
For any three records A, B and C: if A matches B and B matches C, then A indirectly matches C.
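
The indirect-match rule amounts to taking the transitive closure of the direct matches. A minimal sketch of one way to do this, using a union-find structure, is shown below. The function name `cluster_records` and the representation of direct matches as pairs of record numbers are assumptions made for illustration, not part of the matching engine discussed earlier.

```python
def cluster_records(record_ids, direct_matches):
    """Group records into clusters by applying the indirect-match rule.

    record_ids     -- iterable of record identifiers (hypothetical: the '#' column)
    direct_matches -- iterable of (a, b) pairs declared a match by key-based matching
    """
    # Every record starts as the representative of its own cluster.
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    # Merge the clusters of every directly matched pair.
    for a, b in direct_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster under its representative.
    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(clusters.values())


# Records 1-2 and 2-3 match directly, so all three land in one cluster.
print(cluster_records([1, 2, 3], [(1, 2), (2, 3)]))  # [[1, 2, 3]]
```

Each resulting cluster can then be given a single master identifier.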

If n fields are used for linking records, we can think of each record as a point in n-dimensional space and define a distance between two such points.
In effect, our key-based matching considers two records a match when they are close enough, i.e. when the distance between them is smaller than a predefined number.

A distance function for two records can easily be defined from the highest comparison probability returned by the match keys.
Let the highest comparison probability for two records A and B be λ_AB. We can then define the distance function D(A, B) = 1 − λ_AB to measure the distance between A and B.

Now, A and B will match only if D(A, B) < δ, where δ ∈ [0, 1] is a predefined number.
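
As a rough sketch of this definition, the snippet below computes the distance from a list of per-key comparison probabilities and applies the threshold test. The function names, the probability values and the threshold `DELTA = 0.2` are all hypothetical; treating a pair with no comparable key as being at distance 1 is also an assumption.

```python
def distance(key_probabilities):
    """D(A, B) = 1 - lambda_AB, where lambda_AB is the highest
    comparison probability returned by any match key."""
    if not key_probabilities:
        # Assumption: with no comparable key, the records are maximally far apart.
        return 1.0
    return 1.0 - max(key_probabilities)


def is_direct_match(key_probabilities, delta):
    """A and B match directly only if D(A, B) < delta."""
    return distance(key_probabilities) < delta


DELTA = 0.2  # hypothetical threshold in [0, 1]

# Hypothetical per-key probabilities for two pairs of records.
probs_close_pair = [0.95, 0.60]  # highest probability 0.95
probs_far_pair = [0.40, 0.10]    # highest probability 0.40

print(is_direct_match(probs_close_pair, DELTA))  # True
print(is_direct_match(probs_far_pair, DELTA))    # False
```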

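In the example in this section, the distance between the first two records and the distance between the last two records are each less than the predefined number δ, but the distance between the first and third records is more than δ. The first and third records therefore end up in the same cluster only through the indirect match via the second record.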

In data quality in general, and in record linking especially, mathematics plays the central role but is never the ultimate decision maker. The example below shows this:
# | Given Name | Middle Name | Last Name | St. No. | St. Name | St. Type | Apt     | Cell No.   | City   | ZIP
--+------------+-------------+-----------+---------+----------+----------+---------+------------+--------+------
1 | John       | Peter       | Morkel    | 25      | Main     | Street   | Apt 225 | 1234567890 | Kansas | 11111
2 | J          |             | Morkel    | 25      | Main     | Street   | Ste 225 | 1212121212 | Kansas | 11111
3 | Jessie     |             | Morkel    | 25      | Main     | Street   | Ste 225 | 1212121212 | Kansas | 11111

Under the rule of indirect matching, all three records will be put in the same cluster and assigned the same master identifier.
But we have an issue here. Clearly, the first and the third records do not match: John and Jessie are different people, probably members of the same household.
So what went wrong? At least one of the two direct matches must be wrong in reality: either the first two records do not match or the last two do not. Unfortunately, both matches were concluded using the same logic, and even when we review the records manually it is not possible to decide which of the two matches is the correct one.
In practice, we look for other pieces of information, such as date of birth, tax ID, SSN or any other identifier. If nothing works, we simply contact the customers and find out.
Automatic matching cannot resolve situations where even manual review fails.
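
Although such cases cannot be resolved automatically, they can at least be surfaced for review. The sketch below lists the pairs inside a cluster that are linked only indirectly, i.e. whose direct distance is not below δ. The function name, the distance values and the threshold are hypothetical, and the code only flags pairs; it does not decide which match is correct.

```python
from itertools import combinations


def pairs_needing_review(cluster, distances, delta):
    """Return the pairs in a cluster that are connected only indirectly.

    cluster   -- record identifiers placed in the same cluster
    distances -- dict mapping (a, b) with a < b to the distance D(a, b)
    delta     -- the direct-match threshold
    """
    return [
        (a, b)
        for a, b in combinations(sorted(cluster), 2)
        if distances[(a, b)] >= delta  # not a direct match
    ]


# Hypothetical distances for the three records in the second example.
distances = {(1, 2): 0.10, (2, 3): 0.10, (1, 3): 0.70}

print(pairs_needing_review([1, 2, 3], distances, delta=0.2))  # [(1, 3)]
```

A non-empty result means the cluster depends on an indirect match and may deserve a manual look or an additional identifier.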
