Thursday, May 26, 2011

Rome was not built in a day – Key based matching

The basic framework discussed earlier is built on match keys. For each pair of records, the comparison at the i-th match key returns a result λi, where 0 ≤ λi ≤ 1.
λi is called a match probability if it can take any real value in the unit interval, and a match indicator if it can take only the two values 0 and 1.
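
To make the distinction concrete, here is a minimal Python sketch. The function names are hypothetical, and using `difflib.SequenceMatcher` as the similarity score is purely my assumption for illustration; the framework does not prescribe any particular comparison function.

```python
from difflib import SequenceMatcher


def match_indicator(a: str, b: str) -> int:
    """Return 1 if the two values agree exactly (ignoring case), else 0."""
    return int(a.strip().upper() == b.strip().upper())


def match_probability(a: str, b: str) -> float:
    """Return a score in [0, 1]; the difflib ratio is an assumed stand-in."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


# An indicator can only say "agree" or "disagree" ...
print(match_indicator("Johnn", "John"))
# ... while a probability-style score can express "close but not identical".
print(match_probability("Johnn", "John"))
```

The indicator treats Johnn and John as a plain disagreement, whereas the score stays close to 1 for them.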

λi’s play the pivotal role in determining if a pair (of records) should be put in M, U or S.

Earlier, I said that each match key involves several fields. The idea of a match key is that a pair of records should match if and only if the records closely resemble each other on the fields that constitute the match key.
Let me explain this by an example:

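| # | Given Name | Middle Name | Last Name | St. No. | St. Name | St. Type | Apt     | City   | ZIP   |
|---|------------|-------------|-----------|---------|----------|----------|---------|--------|-------|
| 1 | Johnn      | P           | Morkel    | 25      | Main     | Street   | Apt 225 | Kansas | 11111 |
| 2 | John       |             | Morkel    | 25      | M        |          | Ste 225 | Kansas | 11111 |
| 3 | Jason      | Peter       | Morkel    | 25      | M        | Blvd     | A 225   |        | 11111 |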

A closer look tells us that the first two records are probably the same, i.e. they represent the same individual.
But it is highly likely that the third record belongs to a different individual.
The logic behind automatic matching should closely follow our thinking process when we say that the first two records are probably the same.

Now let us look at the values on these two records. Given Names are close…may not be an exact match…but very close. Middle Names do not contradict each other. Last Names are exactly the same. Street numbers are the same. On the street name, well… the initial characters match and there is no contradiction. The numeric digits in the Apartment Information are the same, and so are the Cities and the ZIP codes.

If we have a match key comprising the Last Name, St. Number, the numeric portion of the Apartment Information and the ZIP Code, then the first two records will agree on this key.

But the third record will also agree with the first two records on this match key. That is simply because we have overlooked the Given Name.
To address this issue, we include Given Name in the match key definition. Unfortunately, on the first two records the Given Names are similar but not exactly the same. So, instead of the raw Given Name value, our match key needs to include a transformed value of Given Name… a transformation under which JOHNN becomes JOHN but JASON does not become JOHN.
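
As a rough illustration of such a key, here is a small Python sketch. The transformation I use (uppercase the name and collapse runs of a repeated letter) is only an assumption chosen because it turns JOHNN into JOHN and leaves JASON alone; it is far too crude for real data, where a phonetic code or a nickname table would be a more likely choice. The function names and dictionary field names are hypothetical.

```python
import re


def normalize_given_name(name: str) -> str:
    """Uppercase the name and collapse runs of a repeated letter (assumed rule)."""
    return re.sub(r"(.)\1+", r"\1", name.strip().upper())


def apt_digits(apt: str) -> str:
    """Keep only the numeric portion of the apartment field."""
    return "".join(ch for ch in apt if ch.isdigit())


def match_key(rec: dict) -> tuple:
    """Key on Given Name (transformed), Last Name, St. No., apartment digits, ZIP."""
    return (
        normalize_given_name(rec["given"]),
        rec["last"].strip().upper(),
        rec["st_no"].strip(),
        apt_digits(rec["apt"]),
        rec["zip"].strip(),
    )


# The three records from the table, reduced to the fields the key uses.
records = [
    {"given": "Johnn", "last": "Morkel", "st_no": "25", "apt": "Apt 225", "zip": "11111"},
    {"given": "John", "last": "Morkel", "st_no": "25", "apt": "Ste 225", "zip": "11111"},
    {"given": "Jason", "last": "Morkel", "st_no": "25", "apt": "A 225", "zip": "11111"},
]
keys = [match_key(r) for r in records]

print(keys[0] == keys[1])  # records 1 and 2 agree on the key
print(keys[0] == keys[2])  # record 3 does not, because of the Given Name
```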

Now let us look at the two records in the following table:
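| # | Given Name | Middle Name | Last Name | St. No. | St. Name | St. Type | Apt     | City   | ZIP   |
|---|------------|-------------|-----------|---------|----------|----------|---------|--------|-------|
| 1 | John       | Peter       | Morkel    | 25      | Main     | Street   | Apt 225 | Kansas | 11111 |
| 2 | John       | Proctor     | Morkel    | 25      | Main     | Street   | Ste 225 | Kansas | 11111 |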

This pair of records has a good amount of similarity. They probably represent the same household. But they probably represent different individuals, since their Middle Names contradict each other.
So we probably need to modify our match key as follows: the entire Middle Names are compared when both have a length of more than 1; only the initials are compared when the Middle Name has length 1 on at least one of the records; and a blank Middle Name is allowed to match a non-blank one.
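
The three cases of this rule can be sketched in Python as follows. The function name is hypothetical, and ignoring case and a trailing period (as in "P.") is my own assumption.

```python
def middle_names_agree(a: str, b: str) -> bool:
    """Compare two Middle Names using the blank / initial / full-name rule."""
    a = a.strip(" .").upper()
    b = b.strip(" .").upper()
    if not a or not b:
        # A blank is allowed to match anything.
        return True
    if len(a) == 1 or len(b) == 1:
        # At least one side is only an initial: compare initials.
        return a[0] == b[0]
    # Both sides are full names: compare them in full.
    return a == b


print(middle_names_agree("Peter", "Proctor"))  # full names contradict
print(middle_names_agree("P", "Peter"))        # initial agrees
print(middle_names_agree("", "Peter"))         # blank does not contradict
```

Note that the function compares a pair of values and leaves both values untouched.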

Note that the match technique above for the Middle Name is not a transformation: it does not change the underlying Middle Name values, only the way they are compared.

So match keys are combinations of a few transformed field values together with their associated match technique(s).

We defined one match key above. However, we need multiple match keys.
To understand this, let us look at the following records:
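| # | Given Name | Middle Name | Last Name | St. No. | St. Name | St. Type | Apt     | Cell No.   | City       | ZIP   |
|---|------------|-------------|-----------|---------|----------|----------|---------|------------|------------|-------|
| 1 | John       | Peter       | Morkel    | 25      | Main     | Street   | Apt 225 | 1234567890 | Kansas     | 11111 |
| 2 | John       | P           | Morkel    | 25      | Main     | Street   | Ste 225 | 1212121212 | Kansas     | 11111 |
| 3 | John       | Peter       | Morkel    | 1750    | Collins  | Blvd     | 102     | 1212121212 | Richardson | 75068 |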

A close look at the above records reveals that the first two records match on the match key we discussed earlier, but neither of them matches the third record on that key: the address information on the third record is totally different. However, the cell number on the third record is the same as the cell number on the second record. It looks like the same person lived at a different location at a different point in time but kept the same cell number for at least some of that time.
To capture this match, we have to use a different match key involving Given Name, Middle Name, Last Name and Cell Number.
In a real-life scenario, we will have to deal with many more address fields as well as other fields such as SSN, Tax ID, etc. So we will need many match keys defined in the system.
A field (or more than one field) may well be common to two match keys. In such cases, the associated match technique or the transformation may differ from one key to the other.
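
Here is a hedged Python sketch of two match keys working side by side on the three records above. Each key is written as a function that says whether a pair of records agrees on it. Treating agreement on any one key as enough to call the pair a match is my simplifying assumption for this sketch; all function and field names are hypothetical.

```python
def same(a: str, b: str) -> bool:
    """Exact comparison, ignoring case and surrounding blanks."""
    return a.strip().upper() == b.strip().upper()


def digits(text: str) -> str:
    """Keep only the numeric characters of a field."""
    return "".join(ch for ch in text if ch.isdigit())


def middle_names_agree(a: str, b: str) -> bool:
    """Blank matches anything; an initial matches on the first letter."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def agrees_on_address_key(r1: dict, r2: dict) -> bool:
    return (same(r1["given"], r2["given"])
            and middle_names_agree(r1["middle"], r2["middle"])
            and same(r1["last"], r2["last"])
            and same(r1["st_no"], r2["st_no"])
            and digits(r1["apt"]) == digits(r2["apt"])
            and same(r1["zip"], r2["zip"]))


def agrees_on_cell_key(r1: dict, r2: dict) -> bool:
    return (same(r1["given"], r2["given"])
            and middle_names_agree(r1["middle"], r2["middle"])
            and same(r1["last"], r2["last"])
            and digits(r1["cell"]) == digits(r2["cell"]))


MATCH_KEYS = [agrees_on_address_key, agrees_on_cell_key]


def is_match(r1: dict, r2: dict) -> bool:
    """Assumed rule: a pair matches if it agrees on at least one match key."""
    return any(key(r1, r2) for key in MATCH_KEYS)


rec1 = {"given": "John", "middle": "Peter", "last": "Morkel", "st_no": "25",
        "apt": "Apt 225", "cell": "1234567890", "zip": "11111"}
rec2 = {"given": "John", "middle": "P", "last": "Morkel", "st_no": "25",
        "apt": "Ste 225", "cell": "1212121212", "zip": "11111"}
rec3 = {"given": "John", "middle": "Peter", "last": "Morkel", "st_no": "1750",
        "apt": "102", "cell": "1212121212", "zip": "75068"}

print(is_match(rec1, rec2))  # agree on the address key
print(is_match(rec2, rec3))  # agree on the cell-number key
print(is_match(rec1, rec3))  # agree on neither key
```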


Match keys can be defined in two ways. Suppose the transformed values of n fields are used to define a match key. We may simply concatenate the values to obtain the key; such a key is called a hard match key. Alternatively, we can define the key as an ordered set of n string values; such a key is called a soft match key.

Many matching engines are built on hard match keys (sometimes called match codes), but there are also tools that use both soft and hard keys. A soft key has the advantage of being flexible, but comparing its component values one by one makes matching on the whole key slower. That is why I prefer to use two sets of keys: records are first passed through the hard match key to select candidate pairs, and these pairs are then evaluated once more using the soft match key. This is called 2-step matching.
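
A minimal Python sketch of 2-step matching follows. Which fields go into the hard key and which into the soft key is my assumption for illustration, as are the Given Name transformation (collapsing repeated letters) and all function names. The hard key is a single concatenated string used to group records cheaply; the soft key is a tuple whose components are each compared with their own match technique.

```python
import re
from collections import defaultdict
from itertools import combinations


def hard_key(rec: dict) -> str:
    """Step 1 key: one concatenated string (Last Name, St. No., ZIP)."""
    return "|".join([rec["last"].strip().upper(), rec["st_no"].strip(), rec["zip"].strip()])


def soft_key(rec: dict) -> tuple:
    """Step 2 key: ordered tuple (Given Name, Middle Name, apartment digits)."""
    given = re.sub(r"(.)\1+", r"\1", rec["given"].strip().upper())
    apt = "".join(ch for ch in rec["apt"] if ch.isdigit())
    return (given, rec["middle"].strip().upper(), apt)


def exact(a: str, b: str) -> bool:
    return a == b


def middle_ok(a: str, b: str) -> bool:
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


# One match technique per soft-key component, in the same order.
TECHNIQUES = (exact, middle_ok, exact)


def soft_keys_agree(k1: tuple, k2: tuple) -> bool:
    return all(tech(a, b) for tech, a, b in zip(TECHNIQUES, k1, k2))


def two_step_match(records: list) -> list:
    # Step 1: group records by hard key to get candidate pairs.
    blocks = defaultdict(list)
    for rec in records:
        blocks[hard_key(rec)].append(rec)
    # Step 2: evaluate each candidate pair component by component.
    matches = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            if soft_keys_agree(soft_key(r1), soft_key(r2)):
                matches.append((r1["id"], r2["id"]))
    return matches


records = [
    {"id": 1, "given": "Johnn", "middle": "P", "last": "Morkel",
     "st_no": "25", "apt": "Apt 225", "zip": "11111"},
    {"id": 2, "given": "John", "middle": "", "last": "Morkel",
     "st_no": "25", "apt": "Ste 225", "zip": "11111"},
    {"id": 3, "given": "Jason", "middle": "Peter", "last": "Morkel",
     "st_no": "25", "apt": "A 225", "zip": "11111"},
    {"id": 4, "given": "John", "middle": "Peter", "last": "Morkel",
     "st_no": "1750", "apt": "102", "zip": "75068"},
]

print(two_step_match(records))
```

Record 4 never gets compared with the others because its hard key differs, which is where the time saving comes from. Among the three records that share a hard key, only records 1 and 2 survive the soft-key evaluation.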

We will come back to key based matching and discuss the match probability and the match indicator that I touched upon at the beginning of this post. But before that, we need to examine one more property of a match key and of the transformations that were discussed in this post.
