Thursday, July 21, 2011

Context Sensitivity in Matching


At present there is a serious discussion going on in the LinkedIn group “Matching” (you need to be a member of LinkedIn and of the group “Matching” to access the thread) on the subject of context sensitivity in matching. The subject is closely related to probable errors in matching. Given the richness of the content in this discussion and the vastness of the topic itself, I am tempted to share my understanding in this regard.
Let me begin by sharing an experience from a few years back, while implementing a data quality solution at a private bank. The bank was in the process of implementing a data quality solution for its large customer base. To fine-tune the matching algorithm, it gave us a control/test file (consisting of a few hundred records), and with this we tried various possible algorithms. It took us some time to arrive at a proper match algorithm for the control file. Both the business users and the IT users were happy with the results displayed for the control file. But to our horror, the same algorithm turned into a disaster when a portion of the actual customer data was processed. We finally had to redesign the algorithm from scratch.
Before I explain the scenario, let me give one example of the disparity. Consider the two individual records (only a few fields shown) in the table below:

Name              | Address                           | City  | Tel1       | Tel2
ABHISEK C KOTCHER | C TOWER, UNO 12, JEEVAN MANZIL    | SURAT | 1111111111 | 2222222222
AVISEK C          | C12 OFF MG RD, NEAR JEEVAN MANZIL | SURAT | 3333333333 |


The two records above were matched by the algorithm developed using the control file. But for the customer data integration activity, as we realized later, these records were not a match.

We wanted to know whether this was a one-off case or something fundamentally wrong. To our shock, we found that the control file given by the bank was a portion taken out of their fraud-detection de-duplication database, which had been prepared earlier by another vendor. Unfortunately, that vendor had not made the bank aware of the effect of using the same or a similar match algorithm in a different context.

In case you can spare some time, you may refer to my earlier post “Errors in matching”, posted in May 2011.
In a nutshell, there are two types of possible error when we say there is a match (or no match) between two specific records. When the algorithm says it is a match but the records actually represent two different entities, the error is called a false positive. When the algorithm says there is no match but the records actually represent the same entity, the error is called a false negative. Depending on the context in which the match results will be used, there are two types of match objective. One situation demands that even a slight similarity be captured by the match algorithm, so the objective becomes reducing false negatives. Another demands that two records match only when there is strong similarity, so the objective becomes reducing false positives.
In a fraud-detection type of context, the objective is to capture even a slight similarity so that nothing escapes. But in a typical customer data integration (CDI) type of context, the objective is to allow two records to match only when there is strong evidence that they represent the same entity.

I do not think there is any strategy that improves a match algorithm in a way that reduces both false positives and false negatives (unless, of course, you change the input file or files!). There is no mathematical proof of this, but the experience of people in this field says so. And that is why we have these two possible objectives rather than a single one requiring reduction of both false positives and false negatives.

The idea is that when one adjusts the match algorithm to reduce false positives, as in a typical CDI situation, by making the match settings stricter, one increases the risk of more false negatives. Conversely, when one adjusts the match algorithm to reduce false negatives, as in a typical fraud-detection situation, by relaxing the match settings, one increases the risk of more false positives.
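To make the trade-off concrete, here is a minimal sketch using Python's standard difflib as a stand-in similarity measure. The names and the two thresholds are illustrative assumptions, not settings from any real matching product.

```python
# Illustrative sketch: one similarity score, two context-dependent thresholds.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude 0..1 similarity between two strings (stand-in for a real matcher)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

score = similarity("ABHISEK C KOTCHER", "AVISEK C")

FRAUD_THRESHOLD = 0.4  # relaxed: capture slight similarity, risk false positives
CDI_THRESHOLD = 0.9    # strict: demand strong evidence, risk false negatives

print("score:", round(score, 2))
print("fraud-detection match:", score >= FRAUD_THRESHOLD)  # True
print("CDI match:", score >= CDI_THRESHOLD)                # False
```

With relaxed settings this pair surfaces for fraud review; with strict CDI settings it does not, which mirrors what happened when the bank's control-file algorithm was applied to the live customer data.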
So, before you start working on the match algorithm (setting), be sure of the objective.




Friday, July 1, 2011

Compound Words

[I will use many examples in this discussion. Most of these examples are taken from Indian files but a few are from international files.]

While trying to de-duplicate records, issues with compound words crop up often. A nice post on this was written by Henrik Liliendahl Sørensen. Such issues come up when we need to match two field values, at least one of which consists of more than one word. For ease of discussion, I will split the topic in two. First, we will talk about compound words in name matching.
Let me give a few examples of names:

Name – Record1        | Matching Name – Record2
JOHN P SMITH          | JOHNP SMITH
DADAN BHAI BOTTLEWALA | DADANBHAI BOTTLEWALA
AMAL KANTI SEN        | AMALK SEN

After parsing, these names will be

Record1 First Name | Record1 Middle Name | Record1 Last Name | Record2 First Name | Record2 Middle Name | Record2 Last Name
JOHN               | P                   | SMITH             | JOHNP              |                     | SMITH
DADAN              | BHAI                | BOTTLEWALA        | DADANBHAI          |                     | BOTTLEWALA
AMAL               | KANTI               | SEN               | AMALK              |                     | SEN

Notice that in each of these three cases, the matching name has no middle name. Also, in the first two instances, the concatenation of the first name and middle name of the first record matches the first name of the second record.
The names in the third row, on the other hand, are a bit different. Strictly speaking, the two names do not match exactly. But since initials are frequently used for middle names, we need to allow these two names to match, with a probability of less than 100%. This is because we need to allow the two words KANTI and K to match as middle names with a probability of less than 100%.

Names in this table can be matched using the following rule:
1. For every probable match pair of records:
   1.1. If the middle name is empty in exactly one record of the pair:
      1.1.1. If the two first names, when compared, do not give an adequate match probability, then:
         1.1.1.1. Concatenate the first name and middle name of the record that has a middle name, and compare this string with the first name of the record whose middle name is blank.
         1.1.1.2. If the comparison above still does not give a good result, check whether the first name of the record that has a middle name is a prefix of the other first name. If it is:
            1.1.1.2.1. Take the remaining substring of the first name of the record whose middle name is blank. If its length is 1, check for an initial match between this character and the middle name of the other record.
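The rule above can be sketched in Python. This is a minimal illustration under the simplifying assumption that exact string equality stands in for the probabilistic comparisons a real match engine would use; the function name and field layout are my own.

```python
# Sketch of the middle-name rule; exact equality replaces probabilistic scoring.
def names_match(first1: str, middle1: str, first2: str, middle2: str) -> bool:
    """Match two parsed names when exactly one record lacks a middle name."""
    # The rule only applies when exactly one middle name is blank.
    if bool(middle1) == bool(middle2):
        return first1 == first2 and middle1 == middle2
    # Orient the pair: (f_full, m_full) has the middle name, f_bare does not.
    if middle1:
        (f_full, m_full), f_bare = (first1, middle1), first2
    else:
        (f_full, m_full), f_bare = (first2, middle2), first1
    # 1.1.1: first names compare adequately on their own.
    if f_full == f_bare:
        return True
    # 1.1.1.1: concatenated first + middle vs. the bare first name (JOHNP case).
    if f_full + m_full == f_bare:
        return True
    # 1.1.1.2: first name with a middle is a prefix of the bare first name.
    if f_bare.startswith(f_full):
        rest = f_bare[len(f_full):]
        # 1.1.1.2.1: a length-1 remainder matched as an initial (AMALK case).
        if len(rest) == 1 and m_full.startswith(rest):
            return True
    return False

print(names_match("JOHN", "P", "JOHNP", ""))      # True
print(names_match("AMAL", "KANTI", "AMALK", ""))  # True
```

All three rows of the table above match under this sketch: the first two via the concatenation step, the third via the initial-remainder step.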
There are several types of occurrences of compound words in addresses.
Let us consider the following examples:

Case # | Address – Record1            | Matching Address – Record2
1      | 25 MAIN ROAD NEAR IIT CAMPUS | 25 MAIN ROAD NEAR I I T CAMPUS
2      | 21 MG ROAD BOWBAZAR          | 21 M G ROAD BOWBAZAR
3      | SCHORBACHSTRASSE 9           | SCHORBACH STRASSE 9
4      | NEW YORK                     | NEWYORK

The first two instances (cases 1 and 2) show one style of compound-word issue, where abbreviations of place names using initials are written differently.
The next instance (case 3) shows another style, where the street name and the corresponding street type are combined into a single word.

Let us see what happens to these addresses (cases 1, 2 and 3) after proper parsing:


Original Address               | Hse. No. | St. Nm.          | St. Typ. | Location | Landmark
25 MAIN ROAD NEAR IIT CAMPUS   | 25       | MAIN             | ROAD     |          | IIT CAMPUS
25 MAIN ROAD NEAR I I T CAMPUS | 25       | MAIN             | ROAD     |          | I I T CAMPUS
21 MG ROAD BOWBAZAR            | 21       | MG               | ROAD     | BOWBAZAR |
21 M G ROAD BOWBAZAR           | 21       | M G              | ROAD     | BOWBAZAR |
SCHORBACHSTRASSE 9             | 9        | SCHORBACHSTRASSE |          |          |
SCHORBACH STRASSE 9            | 9        | SCHORBACH        | STRASSE  |          |

In the first case, we need to match IIT CAMPUS with I I T CAMPUS. We can drop the keyword CAMPUS for matching and then remove the whitespace if the remaining field contains only initials.
In the second case, we can adopt the same technique.
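This treatment can be sketched as a small normalization step. The function name and the keyword list below are my own illustration, not part of any particular product:

```python
# Illustrative sketch: drop a noise keyword, then collapse whitespace when the
# remaining tokens are all single-letter initials ("I I T" -> "IIT").
NOISE_KEYWORDS = {"CAMPUS"}  # assumed keyword list for this example

def normalize_initials(value: str) -> str:
    tokens = [t for t in value.upper().split() if t not in NOISE_KEYWORDS]
    if tokens and all(len(t) == 1 for t in tokens):
        return "".join(tokens)
    return " ".join(tokens)

print(normalize_initials("IIT CAMPUS") == normalize_initials("I I T CAMPUS"))  # True
print(normalize_initials("MG") == normalize_initials("M G"))                   # True
```

The same function handles both case 1 (the landmark IIT CAMPUS) and case 2 (the street name MG), since each reduces to comparing a run of initials.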
The third case is unique. This address example is picked from a data file from Germany.
STRASSE is a common street type in that country, which is often glued to the corresponding street name. One way to handle this: if the street type is STRASSE, combine the street name and street type, and compare the result with the street name of the other record in a probable match pair.
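A minimal sketch of that handling, assuming parsed street-name and street-type fields as in the table above (the function name is my own):

```python
# Illustrative sketch: glue the street type STRASSE back onto the street name
# so that "SCHORBACH" + "STRASSE" compares equal to "SCHORBACHSTRASSE".
def street_key(street_name: str, street_type: str) -> str:
    name, stype = street_name.upper(), street_type.upper()
    if stype == "STRASSE":
        return name + stype
    return (name + " " + stype).strip()

print(street_key("SCHORBACHSTRASSE", "") == street_key("SCHORBACH", "STRASSE"))  # True
```

For other street types (ROAD, LANE, and so on) the fields are kept separate, so the normal field-by-field comparison still applies.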
The last instance (case 4) is an example of a city field, which can be tackled using standardization.
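Standardization here can be as simple as a lookup table mapping known variants to a canonical form. The table entries below are illustrative assumptions:

```python
# Illustrative sketch: standardize city variants via a lookup table keyed on
# the whitespace-stripped, uppercased value.
CITY_STANDARD = {"NEWYORK": "NEW YORK"}  # assumed table for this example

def standardize_city(city: str) -> str:
    key = city.upper().strip()
    return CITY_STANDARD.get(key.replace(" ", ""), key)

print(standardize_city("NEWYORK"))  # NEW YORK
print(standardize_city("SURAT"))    # SURAT
```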

Lastly, I have seen many typos that lead to compound-word issues in matching. For these I prefer a separate match technique, built on the earlier technique where we compared two single-word strings. I will briefly discuss this in my next post.