Data Quality: Indirect Matching

Consider the following records:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	Cell No.	City	ZIP
1	John	Peter	Morkel	25	Main	Street	Apt 225	1234567890	Kansas	11111
2	Jon	P	Morkel	25	Main	Street	Ste 225	1212121212	Kansas	11111
3	J	P	Morkel	1750	Collins	Blvd	102	1212121212	Richardson	75068

In this case, the key-based matching we discussed earlier will declare the first two records a match and the last two records a match. Ideally, however, we want all three records to be treated as a match and to form a single cluster (group) of matched records.

This can only be done by performing an indirect match according to the following rule:
For any three records A, B and C, if A matches B and B matches C, then A indirectly matches C.
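
The rule amounts to taking the transitive closure of the direct matches. As a minimal sketch (not necessarily how any particular matching tool does it), the clusters can be built with a union-find structure. The function name cluster_records and the list of direct match pairs are assumptions made for illustration; the pairs (1, 2) and (2, 3) stand for the direct matches in the example table.

```python
def cluster_records(record_ids, direct_matches):
    """Group records so that direct and indirect matches share one cluster.

    record_ids: iterable of record identifiers.
    direct_matches: iterable of (a, b) pairs that the key-based matching
    declared a match (hypothetical input for this sketch).
    """
    parent = {r: r for r in record_ids}

    def find(r):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in direct_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the two clusters: every member of one now
            # indirectly matches every member of the other.
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(clusters.values())


# Records 1-2 and 2-3 match directly, so all three end up together.
print(cluster_records([1, 2, 3], [(1, 2), (2, 3)]))  # [[1, 2, 3]]
```

Each resulting cluster can then be assigned one master identifier.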

If n fields are used for linking records, we can treat each record as a point in n-dimensional space and define a distance between any two such points.

In effect, our key-based matching considers two records a match when they are close enough, i.e. when the distance between them is less than a predefined threshold.

A distance function for two records can be defined from the highest comparison probability returned by the match keys.
Let the highest comparison probability for two records A and B be λ_AB. We can then define the distance function D(A, B) = 1 − λ_AB to measure the distance between A and B.

Now, A and B will match only if D(A, B) < δ, where δ ∈ [0, 1] is a predefined threshold.
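
As a small sketch of this definition, the snippet below turns a comparison probability into a distance and applies the threshold test. The function names and the value 0.2 for δ are assumptions chosen for illustration only; the probabilities passed in are made up as well.

```python
def distance(match_probability):
    """D(A, B) = 1 - lambda_AB, where lambda_AB is the highest
    comparison probability returned by the match keys."""
    if not 0.0 <= match_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - match_probability


def is_direct_match(match_probability, delta):
    """Two records match directly only if their distance is below delta."""
    return distance(match_probability) < delta


DELTA = 0.2  # hypothetical threshold, not a recommended value

print(is_direct_match(0.9, DELTA))  # distance is about 0.1, so a match
print(is_direct_match(0.6, DELTA))  # distance is about 0.4, so no match
```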

In the example in this section, the distance between the first two records and the distance between the last two records are both less than the predefined threshold δ, but the distance between the first and third records is greater than δ.
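
To make this concrete, the sketch below lists the pairs inside a cluster that are held together only by indirect matching, i.e. pairs whose own distance is not below δ. The probabilities in LAMBDA and the threshold DELTA are invented for illustration; the source does not give actual values for the three records.

```python
from itertools import combinations

DELTA = 0.2  # hypothetical threshold

# Hypothetical highest comparison probabilities for the example records,
# keyed by (smaller record number, larger record number).
LAMBDA = {(1, 2): 0.9, (2, 3): 0.85, (1, 3): 0.5}


def indirect_only_pairs(cluster, probabilities, delta):
    """Return the pairs in a cluster that do not match directly.

    A pair with no recorded probability is treated as probability 0,
    which is an assumption of this sketch.
    """
    pairs = []
    for a, b in combinations(sorted(cluster), 2):
        d = 1.0 - probabilities.get((a, b), 0.0)  # D(A, B) = 1 - lambda_AB
        if not d < delta:
            pairs.append((a, b))
    return pairs


# Records 1 and 3 share a cluster only because both match record 2.
print(indirect_only_pairs([1, 2, 3], LAMBDA, DELTA))  # [(1, 3)]
```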

In Data Quality in general, and in record linking in particular, mathematics plays the central role but is never the final decision maker. The example below shows why:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	Cell No.	City	ZIP
1	John	Peter	Morkel	25	Main	Street	Apt 225	1234567890	Kansas	11111
2	J		Morkel	25	Main	Street	Ste 225	1212121212	Kansas	11111
3	Jessie		Morkel	25	Main	Street	Ste 225	1212121212	Kansas	11111

As per the rule of indirect matching, all three records will be put in the same cluster and assigned the same master identifier.
But we have an issue here. Clearly, the first and third records do not match: John and Jessie are different given names. More likely, the two records belong to the same household.
So what went wrong? Either the first two records do not match in reality, or the last two records do not match in reality. Unfortunately, both matches were concluded using the same logic. In fact, even when we review the records manually, it is not possible to decide whether the match between the first two records or the match between the last two records is the correct one.
In practice, we look for other pieces of information such as date of birth, tax ID, SSN or any other identifier. If nothing works, we simply contact the customers and find out.
Automatic matching cannot resolve situations where even manual review fails.

Friday, June 3, 2011
