Monday, May 30, 2011

Building Blocks – String Matching

We discussed earlier how a match key is formed using the transformed values of several fields and the associated match techniques. Basically we need to compare two strings.
Before discussing this, it will only be fair to state here that the transformation mentioned earlier brings to apparently distant strings closer.
One such transformation could be standardization which can bring to strings CALCUTTA and KOLKATA together.

Two strings may be compared for an exact match.
Many match engines are based on such exact matches. Let us consider the following strings:
Base String
Input String
INNOCENT
INNOCEMT
EXPRESSION
EPXRESSION
PRICEWATERHOUSECOOPERS
PRICEWATERHOUSECOOPER


In each row, the two strings are close enough to conclude them to be matching but none of these pairs is an exact match. This kind of situation arises largely from typographical errors. And consequently, any match engine that uses exact match on the match keys will fail to match the corresponding keys.
A distance function may address such issues. It is a function that takes into account two strings and returns a number that represents the distance between the input strings. A distance function needs to have the following properties:


1. Distance between two exactly same strings is 0
2. Distance between two input strings is non-negative
3. Distance between two strings increases (at least non-decreasing) when the similarity between the strings decreases.
Let us define one such distance functions.
Suppose s1 and s2 are two strings each of which are of length l1 and l2. Also call the distance function between s1 and s2 by d (s1, s2). Then the above three rules can be expressed as:
  1. d (s1, s1) = 0
  2. d (s1, s1) ≥ 0
  3. d (s1, s1) ≥ d (s1, s3) when s1 is more similar to s3 than s2
Let us define one such function here.
Let m be the number of position-wise matching characters in s1 and s2
Case1: l1 = l2
Here we can consider d (s1, s2) = 1 – m/l
Note that the maximum possible value for m is l and that happens when s1 and s2 are exactly same. In such a case the distance becomes 0.
Case2: l1 > l2
Here we consider d (s1, s2) = 1 – m / [l2 (l1 – l2 + δ)] where δ > 0 a constant.
Note that the maximum possible value for m is l2 and that happens when s2 is a sub-string of s1.
While fixing a value for δ, it has to be kept in mind if the length of s1 is one more than the length of s2 and s2 is a sub-string of s1 then d (s1, s2) = 1-1/(1 + δ)
If we want two strings as above (lengths differ by one and the smaller one is a sub-string of the other) to differ by 5 unit then 1-1/(1 + δ) = 0.05 => δ = 0.05 (appx.). In fact, we see that in order to have such strings closer, we need to set δ as a very small positive number.

The distance function defined above lies between 0 and 1 and a match probability can be defined using this distance function. For example, the match probability can be p (s1, s2) = 1 - d (s1, s2)
However, the above distance function is only an indicative one and can be further improved.

Friday, May 27, 2011

Restricting False Positives – Cut-Off point for each match-key


While discussing the basic framework, we saw the two cut-off points m and M between 0 and 1 which are applied on the match probability λ(p) in a way that if λ(p)> M then we conclude that p ε M, if λ(p) < m then p ε U and if m < λ(p) < M then p ε S

Here the pre-defined points m and M are called cut-off points.
Cardinality of the set of matching pairs reduces when the cut-off point M increases. So, one way of reducing the number of false positive matches seems to be increasing the value of cut-off point M.
But it could result in some genuine matches to land up in the set of suspected matches and thereby the total cost of error gets increased.
This is a typical issue encountered while implementing data quality solutions.
One way to address this issue is to introduce cut-off points for the individual match-keys.

While discussing match-key, we talked about the probability/indicator returned by each match-key during a comparison involving a pair p. These probabilities/indicators are denoted as λi where the suffix i runs from 1 to k in a system with k match-keys.

Besides having a cut-off point M for the composite probability/indicator, we can define cut-offs for each match-key and call those Mi such that if λi ≥ Mi for all i = 1 to k then only the composite probability/indicator for the underlying pair is calculated otherwise it is set to 0.

By introducing the Mi’s we will be able to restrict false positive matches.

Enhancing Match-Keys - Cross-Matching

Consider the following names:
Sl. #
Given Name
Middle Name
Last Name Prefix
Last Name
1
GABRIEL
MARQUIZ
DE
PEREIRA
2
MARQUIS
GEBRIEL
DE
PERERA
Suppose the address information is matching on these two records. A manual inspection tells us that the records are indeed matching.
How to establish the match in names?
Well, the Last Names do match (well…close enough) and we do not compare the Prefix. But what about the Given Name and Middle Name? Apparently, they have switched places besides initial being used on the second record.
Only way to find the match in Given Name and Middle Name is to perform a cross-matching between these fields. That is, the match keys should be defined in a way that Given Names are also compared with the corresponding Middle Names and vice-versa.

This is a good time to introduce the reader to a matching objective i.e. House-Holding.
Here we try to cluster (or link) the records belonging to the same house-hold. Well…in many cases, we define a house-hold to be comprised of members sharing the same Last Name, Address and the residential Phone Number.
Now, let us see the following two names and consider these for house-holding:
Sl. #
Given Name
Middle Name
Last Name
1
JONATHAN
A
ABOTT
2
ABOTT

MARGARET
Suppose the address information is matching on these two records. A manual inspection tells us that the records belong to the same house-hold. How is the Last Names matching? Well…Last Name on the first record matches to the Given Name on the second record. Again name components have switched places. But there is a little difference in the way we do cross-matching from the earlier example.
In the earlier example, we checked if the Given Name on one record matches to the Middle Name on the other record and the Middle Name on the second record matches to the Given Name on the first record. But for the house-hold matching example, we check if the Last Name on one record matches to the Given Name on the other record.

In DQ terminology these are known to be 2-way cross-matching and 1-way cross-matching.
Let us look at one more example where cross-matching has occurred involving three name fields:
Sl. #
Given Name
Middle Name
Last Name
1
RAJ
SINGH
THAKOR
2
THAKORE
RAJ
SINGH

Though theoretically such component switching can take place between any two (or more) fields in the database, we usually keep the cross-matching involving a few pair of fields.


Cross matching can be implemented when we use soft match-keys as discussed earlier. But there are many data quality tools out there which are built using hard match-keys alone.
So what do we do if one such tool is being used?
There is an alternative route that takes care of the cross matching in a limited way. Suppose we have two similar fields Field1 and Field2 which we want to cross match. In this case, first we make sure that there is no record with non blank Field2 and blank Field1. Then we make two copies of the records where both these fields are non blank. One such set of records is allowed to pass through as it is but for the records in the second set, we copy Field2 and overwrite Field1 with the values in Field2. Finally we append these records together to make a bigger set of records. Obviously, we put some identifier for each record to identify if they came from set1 or set2. Hard-key based match can be performed on this bigger set of records to arrive at the exactly same conclusions.
However, there is one limitation in this approach. For each such pair of fields we have to add new records in the database and this takes time/effort besides disc space.


Thursday, May 26, 2011

Rome was not built in a day – Key based matching

Basic framework as discussed earlier, talks about match keys. It also says that for each pair of records, comparison at each match key returns a result λi where 0 ≤ λi ≤ 1.
λi’s are called match probability if λi’s can assume any real number in the unit interval or it is called a match indicator if it can have only two values, 0 and 1.

λi’s play the pivotal role in determining if a pair (of records) should be put in M, U or S.

Earlier, I said the each match key involves several fields. The idea of a match key is a pair of records should be matching if and only if the underlying records closely resemble each other at the fields which constitute the match key.
Let me explain this by an example:

#
Given Name
Middle Name
Last Name
St. No.
St. Name
St. Type
Apt
City
ZIP
1
Johnn
P
Morkel
25
Main
Street
Apt 225
Kansas
11111
2
John

Morkel
25
M

Ste 225
Kansas
11111
3
Jason
Peter
Morkel
25
M
Blvd
A 225

11111

A closer look tells us that first two records are probably the same i.e. they represent the same individual.
But it is highly likely the third record belongs to a different individual.
The logic behind automatic matching should closely follow our thinking process when we say that the first two records are probably the same.

Now let us look at the values on these two records. Given Names are close…may not be an exact match…but very close. Middle Names are not contradicting each other. Last Names are exactly the same. Street numbers are the same. On the street name, well… the initial characters are matching and there is no contradiction. Numeric digits are the same on Apartment Information while Cities as well as
ZIP codes are the same.

If we have a match key comprising of the Last Name, St. Number, Numeric portion of Apartment Information and ZIP Code then the first two records will agree on this key.

But the third record will also agree with the first two records on this match key. That is simply because; we have overlooked the Given Name.
To address this issue, we include Given Name in the match key definition.  Unfortunately on the first two records, Given Names are similar but not exactly the same. So, instead of the Given Name value, our match code needs to include a transformed value of Given Name… a transformation so that JOHNN becomes JOHN but JASON does not become JOHN.

Now let us look at the two records in the following table:
#
Given Name
Middle Name
Last Name
St. No.
St. Name
St. Type
Apt
City
ZIP
1
John
Peter
Morkel
25
Main
Street
Apt 225
Kansas
11111
2
John
Proctor
Morkel
25
Main
Street
Ste 225
Kansas
11111

This pair of records has a good amount of similarity. Probably these represent the same house-hold. But unfortunately they represent probably different individuals. Since their Middle Names are contradicting.
Probably, we need to modify our match key so that the entire Middle Names are compared when both of the Middle Names have length more than 1, only an initial match on this field is performed when on at least one record, the length of Middle Name is 1 and a blank should be allowed to match a non-blank Middle Name.

So, the above match technique for the middle name is not a transform in the sense that it does not change the underlying values of the Middle Name.

So match keys are combination of a few transformed field values with associated match technique(s).

We defined one match key above. However, we need multiple match keys.
To understand this, let us look at the following records:
#
Given Name
Middle Name
Last Name
St. No.
St. Name
St. Type
Apt
Cell No.
City
ZIP
1
John
Peter
Morkel
25
Main
Street
Apt 225
1234567890
Kansas
11111
2
John
P
Morkel
25
Main
Street
Ste 225
1212121212
Kansas
11111
3
John
Peter
Morkel
1750
Collins
Blvd
102
1212121212
Richardson
75068

A close look at the above records will reveal that the first two records match as per the match key we discussed earlier but none of these will match to the third record. Address information on the third record is totally different. But the cell number on the third record matches to the cell number on the second record. It looks like the same person, at different point in time was in a different location but for at least sometime maintained the same cell number.
To capture this match, we have to use a different match key involving Given Name, Middle Name, Last Name and Cell Number.
In a real life scenario, we will have to deal with many more address fields as well as other fields like SSN, Tax Id etc. So we will have to have many match keys defined in the system.
Probably, a field (or more than one field) is common between two match keys. It may so happen that either the associated match technique or the transformation is different in these cases.


Again, match-keys can be defined in two ways. Suppose transformed values of n fields are used to define a match-key. We may just concatenate the values to obtain the match key. Such a key is called a hard match-key. Alternatively, we can define the key as ordered set of n string values. Such a key is called soft match-key. Though many matching engines are built using hard match-keys (sometimes these are called match codes), there are tools which are built using both soft-key and hard-key. Soft-key has an added advantage that it is flexible enough but comparing each component value in a soft-key takes longer matching time for the entire key. That is why I prefer using two sets of keys. Records are passed through the hard match-key first to select a possible matching pairs and then these pairs are evaluated once more using soft match-key. This is called 2-step matching.

We will come back to key based matching and discuss match probability and match indicator that I touched upon at the beginning of this post.  But before that we need to examine one more property of a match key and the transformations that were discussed in this post.