Data Quality: May 2011

Monday, May 30, 2011

Building Blocks – String Matching

We discussed earlier how a match key is formed using the transformed values of several fields and the associated match techniques. Basically we need to compare two strings.
Before discussing this, it will only be fair to state here that the transformation mentioned earlier brings to apparently distant strings closer.

One such transformation could be standardization which can bring to strings CALCUTTA and KOLKATA together.

Two strings may be compared for an exact match.
Many match engines are based on such exact matches. Let us consider the following strings:

Base String	Input String
INNOCENT	INNOCEMT
EXPRESSION	EPXRESSION
PRICEWATERHOUSECOOPERS	PRICEWATERHOUSECOOPER

In each row, the two strings are close enough to conclude them to be matching but none of these pairs is an exact match. This kind of situation arises largely from typographical errors. And consequently, any match engine that uses exact match on the match keys will fail to match the corresponding keys.

A distance function may address such issues. It is a function that takes into account two strings and returns a number that represents the distance between the input strings. A distance function needs to have the following properties:

1. Distance between two exactly same strings is 0
2. Distance between two input strings is non-negative
3. Distance between two strings increases (at least non-decreasing) when the similarity between the strings decreases.

Let us define one such distance functions.

Suppose s₁ and s₂ are two strings each of which are of length l₁ and l₂. Also call the distance function between s₁ and s₂ by d (s₁, s₂). Then the above three rules can be expressed as:

d (s₁, s₁) = 0
d (s₁, s₁) ≥ 0
d (s₁, s₁) ≥ d (s₁, s₃) when s₁ is more similar to s₃ than s₂

Let us define one such function here.
Let m be the number of position-wise matching characters in s₁ and s₂

Case1: l₁= l₂

Here we can consider d (s₁, s₂) = 1 – m/l

Note that the maximum possible value for m is l and that happens when s₁ and s₂ are exactly same. In such a case the distance becomes 0.

Case2: l₁ > l₂

Here we consider d (s₁, s₂) = 1 – m / [l₂(l₁ – l₂ + δ)] where δ > 0 a constant.

Note that the maximum possible value for m is l₂ and that happens when s₂ is a sub-string of s₁.
While fixing a value for δ, it has to be kept in mind if the length of s₁ is one more than the length of s₂ and s₂ is a sub-string of s₁ then d (s₁, s₂) = 1-1/(1 + δ)
If we want two strings as above (lengths differ by one and the smaller one is a sub-string of the other) to differ by 5 unit then 1-1/(1 + δ) = 0.05 => δ = 0.05 (appx.). In fact, we see that in order to have such strings closer, we need to set δ as a very small positive number.

The distance function defined above lies between 0 and 1 and a match probability can be defined using this distance function. For example, the match probability can be p (s₁, s₂) = 1 - d (s₁, s₂)

However, the above distance function is only an indicative one and can be further improved.

Friday, May 27, 2011

Restricting False Positives – Cut-Off point for each match-key

While discussing the basic framework, we saw the two cut-off points m and M between 0 and 1 which are applied on the match probability λ(p) in a way that if λ(p)> M then we conclude that p ε M, if λ(p) < m then p ε U and if m < λ(p) < M then p ε S

Here the pre-defined points m and M are called cut-off points.
Cardinality of the set of matching pairs reduces when the cut-off point M increases. So, one way of reducing the number of false positive matches seems to be increasing the value of cut-off point M.
But it could result in some genuine matches to land up in the set of suspected matches and thereby the total cost of error gets increased.
This is a typical issue encountered while implementing data quality solutions.
One way to address this issue is to introduce cut-off points for the individual match-keys.

While discussing match-key, we talked about the probability/indicator returned by each match-key during a comparison involving a pair p. These probabilities/indicators are denoted as λ_i where the suffix i runs from 1 to k in a system with k match-keys.

Besides having a cut-off point M for the composite probability/indicator, we can define cut-offs for each match-key and call those M_i such that if λ_i ≥ M_i for all i = 1 to k then only the composite probability/indicator for the underlying pair is calculated otherwise it is set to 0.

By introducing the M_i’s we will be able to restrict false positive matches.

Enhancing Match-Keys - Cross-Matching

Consider the following names:

Sl. #	Given Name	Middle Name	Last Name Prefix	Last Name
1	GABRIEL	MARQUIZ	DE	PEREIRA
2	MARQUIS	GEBRIEL	DE	PERERA

Suppose the address information is matching on these two records. A manual inspection tells us that the records are indeed matching.
How to establish the match in names?
Well, the Last Names do match (well…close enough) and we do not compare the Prefix. But what about the Given Name and Middle Name? Apparently, they have switched places besides initial being used on the second record.
Only way to find the match in Given Name and Middle Name is to perform a cross-matching between these fields. That is, the match keys should be defined in a way that Given Names are also compared with the corresponding Middle Names and vice-versa.

This is a good time to introduce the reader to a matching objective i.e. House-Holding.
Here we try to cluster (or link) the records belonging to the same house-hold. Well…in many cases, we define a house-hold to be comprised of members sharing the same Last Name, Address and the residential Phone Number.

Now, let us see the following two names and consider these for house-holding:

Sl. #	Given Name	Middle Name	Last Name
1	JONATHAN	A	ABOTT
2	ABOTT		MARGARET

Suppose the address information is matching on these two records. A manual inspection tells us that the records belong to the same house-hold. How is the Last Names matching? Well…Last Name on the first record matches to the Given Name on the second record. Again name components have switched places. But there is a little difference in the way we do cross-matching from the earlier example.
In the earlier example, we checked if the Given Name on one record matches to the Middle Name on the other record and the Middle Name on the second record matches to the Given Name on the first record. But for the house-hold matching example, we check if the Last Name on one record matches to the Given Name on the other record.

In DQ terminology these are known to be 2-way cross-matching and 1-way cross-matching.
Let us look at one more example where cross-matching has occurred involving three name fields:

Sl. #	Given Name	Middle Name	Last Name
1	RAJ	SINGH	THAKOR
2	THAKORE	RAJ	SINGH

Though theoretically such component switching can take place between any two (or more) fields in the database, we usually keep the cross-matching involving a few pair of fields.

Cross matching can be implemented when we use soft match-keys as discussed earlier. But there are many data quality tools out there which are built using hard match-keys alone.
So what do we do if one such tool is being used?
There is an alternative route that takes care of the cross matching in a limited way. Suppose we have two similar fields Field1 and Field2 which we want to cross match. In this case, first we make sure that there is no record with non blank Field2 and blank Field1. Then we make two copies of the records where both these fields are non blank. One such set of records is allowed to pass through as it is but for the records in the second set, we copy Field2 and overwrite Field1 with the values in Field2. Finally we append these records together to make a bigger set of records. Obviously, we put some identifier for each record to identify if they came from set1 or set2. Hard-key based match can be performed on this bigger set of records to arrive at the exactly same conclusions.
However, there is one limitation in this approach. For each such pair of fields we have to add new records in the database and this takes time/effort besides disc space.

Thursday, May 26, 2011

Rome was not built in a day – Key based matching

Basic framework as discussed earlier, talks about match keys. It also says that for each pair of records, comparison at each match key returns a result λ_i where 0 ≤ λ_i ≤ 1.
λ_i’s are called match probability if λ_i’s can assume any real number in the unit interval or it is called a match indicator if it can have only two values, 0 and 1.

λi’s play the pivotal role in determining if a pair (of records) should be put in M, U or S.

Earlier, I said the each match key involves several fields. The idea of a match key is a pair of records should be matching if and only if the underlying records closely resemble each other at the fields which constitute the match key.
Let me explain this by an example:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	City	ZIP
1	Johnn	P	Morkel	25	Main	Street	Apt 225	Kansas	11111
2	John		Morkel	25	M		Ste 225	Kansas	11111
3	Jason	Peter	Morkel	25	M	Blvd	A 225		11111

A closer look tells us that first two records are probably the same i.e. they represent the same individual.
But it is highly likely the third record belongs to a different individual.

The logic behind automatic matching should closely follow our thinking process when we say that the first two records are probably the same.

Now let us look at the values on these two records. Given Names are close…may not be an exact match…but very close. Middle Names are not contradicting each other. Last Names are exactly the same. Street numbers are the same. On the street name, well… the initial characters are matching and there is no contradiction. Numeric digits are the same on Apartment Information while Cities as well as
ZIP codes are the same.

If we have a match key comprising of the Last Name, St. Number, Numeric portion of Apartment Information and ZIP Code then the first two records will agree on this key.

But the third record will also agree with the first two records on this match key. That is simply because; we have overlooked the Given Name.
To address this issue, we include Given Name in the match key definition. Unfortunately on the first two records, Given Names are similar but not exactly the same. So, instead of the Given Name value, our match code needs to include a transformed value of Given Name… a transformation so that JOHNN becomes JOHN but JASON does not become JOHN.

Now let us look at the two records in the following table:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	City	ZIP
1	John	Peter	Morkel	25	Main	Street	Apt 225	Kansas	11111
2	John	Proctor	Morkel	25	Main	Street	Ste 225	Kansas	11111

This pair of records has a good amount of similarity. Probably these represent the same house-hold. But unfortunately they represent probably different individuals. Since their Middle Names are contradicting.
Probably, we need to modify our match key so that the entire Middle Names are compared when both of the Middle Names have length more than 1, only an initial match on this field is performed when on at least one record, the length of Middle Name is 1 and a blank should be allowed to match a non-blank Middle Name.

So, the above match technique for the middle name is not a transform in the sense that it does not change the underlying values of the Middle Name.

So match keys are combination of a few transformed field values with associated match technique(s).

We defined one match key above. However, we need multiple match keys.
To understand this, let us look at the following records:

#	Given Name	Middle Name	Last Name	St. No.	St. Name	St. Type	Apt	Cell No.	City	ZIP
1	John	Peter	Morkel	25	Main	Street	Apt 225	1234567890	Kansas	11111
2	John	P	Morkel	25	Main	Street	Ste 225	1212121212	Kansas	11111
3	John	Peter	Morkel	1750	Collins	Blvd	102	1212121212	Richardson	75068

A close look at the above records will reveal that the first two records match as per the match key we discussed earlier but none of these will match to the third record. Address information on the third record is totally different. But the cell number on the third record matches to the cell number on the second record. It looks like the same person, at different point in time was in a different location but for at least sometime maintained the same cell number.
To capture this match, we have to use a different match key involving Given Name, Middle Name, Last Name and Cell Number.

In a real life scenario, we will have to deal with many more address fields as well as other fields like SSN, Tax Id etc. So we will have to have many match keys defined in the system.
Probably, a field (or more than one field) is common between two match keys. It may so happen that either the associated match technique or the transformation is different in these cases.

Again, match-keys can be defined in two ways. Suppose transformed values of n fields are used to define a match-key. We may just concatenate the values to obtain the match key. Such a key is called a hard match-key. Alternatively, we can define the key as ordered set of n string values. Such a key is called soft match-key. Though many matching engines are built using hard match-keys (sometimes these are called match codes), there are tools which are built using both soft-key and hard-key. Soft-key has an added advantage that it is flexible enough but comparing each component value in a soft-key takes longer matching time for the entire key. That is why I prefer using two sets of keys. Records are passed through the hard match-key first to select a possible matching pairs and then these pairs are evaluated once more using soft match-key. This is called 2-step matching.

We will come back to key based matching and discuss match probability and match indicator that I touched upon at the beginning of this post. But before that we need to examine one more property of a match key and the transformations that were discussed in this post.