Data Quality: Rome was not built in a day

The basic framework discussed earlier is built on match keys. For each pair of records, the comparison on the i-th match key returns a result λ_i, where 0 ≤ λ_i ≤ 1.
λ_i is called a match probability if it can take any real value in the unit interval, and a match indicator if it can take only the two values 0 and 1.
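
To make the distinction concrete, here is a rough Python sketch. The function names are made up for illustration, and the similarity ratio from the standard library's difflib is only a stand-in for a real match probability; the framework itself does not prescribe how λ_i is computed.

```python
from difflib import SequenceMatcher


def match_indicator(a: str, b: str) -> int:
    """Hypothetical indicator: 1 if the values are equal ignoring case, else 0."""
    return 1 if a.strip().upper() == b.strip().upper() else 0


def match_probability(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]; a similarity ratio used as a stand-in."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


for left, right in [("John", "John"), ("Johnn", "John"), ("Jason", "John")]:
    print(left, right, match_indicator(left, right),
          round(match_probability(left, right), 2))
```

The indicator can only say "agree" or "disagree", while the score can say "close, but not identical", which is what a pair like Johnn/John needs.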

The λ_i's play the pivotal role in determining whether a pair of records should be put in M, U or S.

Earlier, I said that each match key involves several fields. The idea of a match key is that a pair of records should match on it if and only if the records closely resemble each other on the fields that constitute the key.
Let me explain this by an example:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	City	ZIP
1	Johnn	P	Morkel	25	Main	Street	Apt 225	Kansas	11111
2	John		Morkel	25	M		Ste 225	Kansas	11111
3	Jason	Peter	Morkel	25	M	Blvd	A 225		11111

A closer look tells us that the first two records are probably the same, i.e. they represent the same individual.
But it is highly likely that the third record belongs to a different individual.

The logic behind automatic matching should closely follow our thinking process when we say that the first two records are probably the same.

Now let us look at the values on these two records. The Given Names are close: not an exact match, but very close. The Middle Names do not contradict each other. The Last Names are exactly the same. The street numbers are the same. On the street name, well… the initial characters match and there is no contradiction. The numeric digits in the Apartment information are the same, and both the Cities and the ZIP codes are the same.

If we have a match key comprising the Last Name, St. Number, the numeric portion of the Apartment information and the ZIP Code, then the first two records will agree on this key.

But the third record will also agree with the first two records on this match key. That is simply because we have overlooked the Given Name.
To address this issue, we include the Given Name in the match key definition. Unfortunately, on the first two records the Given Names are similar but not exactly the same. So, instead of the raw Given Name value, our match code needs to include a transformed value of the Given Name: a transformation under which JOHNN becomes JOHN but JASON does not become JOHN.
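
The effect of adding a transformed Given Name can be sketched in a few lines of Python. The post does not say which transformation to use, so the one below (upper-case, then collapse repeated letters) is purely an assumption chosen because it is the simplest thing that sends JOHNN to JOHN and leaves JASON alone; the field names and helper functions are hypothetical too.

```python
import re
from itertools import groupby

records = [
    {"given": "Johnn", "last": "Morkel", "st_no": "25", "apt": "Apt 225", "zip": "11111"},
    {"given": "John",  "last": "Morkel", "st_no": "25", "apt": "Ste 225", "zip": "11111"},
    {"given": "Jason", "last": "Morkel", "st_no": "25", "apt": "A 225",   "zip": "11111"},
]


def apt_digits(apt):
    """Keep only the numeric portion of the apartment information."""
    return re.sub(r"\D", "", apt)


def normalize_given_name(name):
    """Assumed transformation: upper-case letters with repeated letters collapsed."""
    letters = re.sub(r"[^A-Z]", "", name.upper())
    return "".join(ch for ch, _ in groupby(letters))


def key_without_given(rec):
    return (rec["last"].upper(), rec["st_no"], apt_digits(rec["apt"]), rec["zip"])


def key_with_given(rec):
    return (normalize_given_name(rec["given"]),) + key_without_given(rec)


# Without the Given Name, all three records share a single key.
print(len({key_without_given(r) for r in records}))   # 1
# With the transformed Given Name, record 3 separates from records 1 and 2.
print([key_with_given(r)[0] for r in records])        # ['JOHN', 'JOHN', 'JASON']
```

A real system would more likely use a phonetic code or a nickname table here; collapsing repeated letters also turns ANNA into ANA, which may or may not be what you want.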

Now let us look at the two records in the following table:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	City	ZIP
1	John	Peter	Morkel	25	Main	Street	Apt 225	Kansas	11111
2	John	Proctor	Morkel	25	Main	Street	Ste 225	Kansas	11111

This pair of records has a good amount of similarity. They probably represent the same household. But unfortunately they probably represent different individuals, since their Middle Names contradict each other.
We probably need to modify our match key so that the full Middle Names are compared when both have length greater than 1, only the initials are compared when the Middle Name has length 1 on at least one record, and a blank Middle Name is allowed to match a non-blank one.
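
The three rules for the Middle Name translate almost directly into code. This is a minimal sketch with a hypothetical function name; stripping periods and ignoring case are my own assumptions, not part of the rule as stated.

```python
def middle_names_agree(a: str, b: str) -> bool:
    """Compare two Middle Name values without changing the stored values."""
    a = a.replace(".", "").strip().upper()
    b = b.replace(".", "").strip().upper()
    if not a or not b:
        return True            # a blank is allowed to match anything
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]    # at least one initial: compare initials only
    return a == b              # both spelled out: compare the full names


print(middle_names_agree("Peter", "Proctor"))  # False
print(middle_names_agree("Peter", "P"))        # True
print(middle_names_agree("", "Peter"))         # True
```

One thing to keep in mind: this comparison is not transitive. "P" agrees with both "Peter" and "Proctor", yet those two do not agree with each other.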

Note that the above match technique for the Middle Name is not a transformation, in the sense that it does not change the underlying values of the Middle Name.

So a match key is a combination of a few transformed field values together with the associated match technique(s).

We defined one match key above. However, we need multiple match keys.
To understand this, let us look at the following records:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	Cell No.	City	ZIP
1	John	Peter	Morkel	25	Main	Street	Apt 225	1234567890	Kansas	11111
2	John	P	Morkel	25	Main	Street	Ste 225	1212121212	Kansas	11111
3	John	Peter	Morkel	1750	Collins	Blvd	102	1212121212	Richardson	75068

A close look at the above records reveals that the first two records match on the match key we discussed earlier, but neither of them matches the third record. The address information on the third record is totally different. But the cell number on the third record matches the cell number on the second record. It looks like the same person lived in different locations at different points in time, but kept the same cell number for at least some of that time.
To capture this match, we have to use a different match key involving the Given Name, Middle Name, Last Name and Cell Number.
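
Here is a small Python sketch of two match keys working side by side on the three records above. Treating a pair as a candidate match when it agrees on at least one key is my reading of the example, and all function and field names are hypothetical.

```python
import re

records = [
    {"given": "John", "middle": "Peter", "last": "Morkel", "st_no": "25",
     "apt": "Apt 225", "cell": "1234567890", "zip": "11111"},
    {"given": "John", "middle": "P", "last": "Morkel", "st_no": "25",
     "apt": "Ste 225", "cell": "1212121212", "zip": "11111"},
    {"given": "John", "middle": "Peter", "last": "Morkel", "st_no": "1750",
     "apt": "102", "cell": "1212121212", "zip": "75068"},
]


def same(a, b):
    return a.strip().upper() == b.strip().upper()


def digits(value):
    return re.sub(r"\D", "", value)


def middle_ok(a, b):
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def name_agrees(r1, r2):
    return (same(r1["given"], r2["given"])
            and middle_ok(r1["middle"], r2["middle"])
            and same(r1["last"], r2["last"]))


def address_key_agrees(r1, r2):
    return (name_agrees(r1, r2)
            and r1["st_no"] == r2["st_no"]
            and digits(r1["apt"]) == digits(r2["apt"])
            and r1["zip"] == r2["zip"])


def cell_key_agrees(r1, r2):
    return name_agrees(r1, r2) and digits(r1["cell"]) == digits(r2["cell"])


def agreeing_keys(r1, r2):
    """Names of the match keys on which the pair agrees."""
    checks = {"address": address_key_agrees, "cell": cell_key_agrees}
    return [name for name, check in checks.items() if check(r1, r2)]


print(agreeing_keys(records[0], records[1]))  # ['address']
print(agreeing_keys(records[1], records[2]))  # ['cell']
print(agreeing_keys(records[0], records[2]))  # []
```

Records 1 and 3 agree on neither key directly; they are only connected through record 2.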

In a real-life scenario, we will have to deal with many more address fields as well as other fields such as SSN, Tax Id etc. So we will need many match keys defined in the system.
Often, one or more fields will be common to two match keys. In such cases, the associated match technique or the transformation may differ from one key to the other.

Again, match keys can be defined in two ways. Suppose the transformed values of n fields are used to define a match key. We may simply concatenate the values to obtain the key; such a key is called a hard match key. Alternatively, we can define the key as an ordered set of n string values; such a key is called a soft match key. Many matching engines are built on hard match keys (sometimes called match codes), but there are tools that use both soft and hard keys.

A soft key has the advantage of flexibility, since each component can be compared with its own match technique, but comparing every component takes longer than comparing a single concatenated key. That is why I prefer using two sets of keys. Records are first passed through the hard match key to select possible matching pairs, and these pairs are then evaluated once more using the soft match key. This is called 2-step matching.
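
A rough Python sketch of 2-step matching follows. The choice of fields for each key, the "|" separator in the hard key, and every function name are assumptions made for illustration; this is not how any particular matching engine does it.

```python
import re
from collections import defaultdict
from itertools import combinations
from operator import eq

records = [
    {"given": "John", "middle": "Peter",   "last": "Morkel", "st_no": "25",   "apt": "Apt 225", "zip": "11111"},
    {"given": "John", "middle": "Proctor", "last": "Morkel", "st_no": "25",   "apt": "Ste 225", "zip": "11111"},
    {"given": "John", "middle": "Peter",   "last": "Morkel", "st_no": "25",   "apt": "A 225",   "zip": "11111"},
    {"given": "John", "middle": "Peter",   "last": "Morkel", "st_no": "1750", "apt": "102",     "zip": "75068"},
]


def hard_key(rec):
    """Hard key: transformed values concatenated into a single string."""
    return "|".join([rec["last"].upper(), rec["st_no"], rec["zip"]])


def soft_key(rec):
    """Soft key: an ordered tuple whose components are compared one by one."""
    return (rec["given"].upper(), rec["middle"].upper(), rec["last"].upper(),
            rec["st_no"], re.sub(r"\D", "", rec["apt"]), rec["zip"])


def middle_ok(a, b):
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


# One comparator per soft-key component, in the same order as soft_key().
COMPARATORS = (eq, middle_ok, eq, eq, eq, eq)


def soft_keys_agree(k1, k2):
    return all(cmp(a, b) for cmp, a, b in zip(COMPARATORS, k1, k2))


def two_step_match(recs):
    # Step 1: group records by hard key; only records in a group become pairs.
    groups = defaultdict(list)
    for idx, rec in enumerate(recs):
        groups[hard_key(rec)].append(idx)
    # Step 2: evaluate each candidate pair component by component.
    matches = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if soft_keys_agree(soft_key(recs[i]), soft_key(recs[j])):
                matches.append((i, j))
    return matches


print(two_step_match(records))  # [(0, 2)]
```

The hard key keeps the fourth record from ever being compared with the others, and the soft key then rejects the Peter/Proctor pairs that the hard key let through.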

We will come back to key based matching and discuss match probability and match indicator that I touched upon at the beginning of this post. But before that we need to examine one more property of a match key and the transformations that were discussed in this post.


Thursday, May 26, 2011

Rome was not built in a day – Key based matching
