Data Quality: Compound Words

[I will use many examples in this discussion. Most of these examples are taken from Indian files but a few are from international files.]

While trying to de-dupe records, issues with compound words crop up often. Henrik Liliendahl Sørensen has written a nice post on this. Such issues come up when we need to match two field values and at least one of them consists of more than one word. For ease of discussion, I will split the topic into two parts. First, we will talk about compound words in name matching.
Let me give a few examples of names:

Name – Record1	Matching Name-Record2
JOHN P SMITH	JOHNP SMITH
DADAN BHAI BOTTLEWALA	DADANBHAI BOTTLEWALA
AMAL KANTI SEN	AMALK SEN

After parsing, these names become:

Name – Record1			Matching Name-Record2
First Name	Middle Name	Last Name	First Name	Middle Name	Last Name
JOHN	P	SMITH	JOHNP		SMITH
DADAN	BHAI	BOTTLEWALA	DADANBHAI		BOTTLEWALA
AMAL	KANTI	SEN	AMALK		SEN

Notice that in each of these three cases, the matching name has no middle name. In the first two cases, the first name and middle name of the first record, when concatenated, match the first name of the second record exactly.
The names in the third row are a bit different. Strictly speaking, the two names do not match exactly. But since initials are frequently used in place of middle names, we need to let the two words KANTI and K match as middle names, and so let the two names match, with a probability of less than 100%.

Names in this table can be matched by using the following rule:

1. For each probable match pair of records:

1.1. If the middle name is empty in exactly one record of the pair:

1.1.1. If the two first names, when compared, do not give an adequate match probability, then carry out the following:

1.1.1.1. Concatenate the first name and middle name of the other record in the pair and compare this string with the first name of the record whose middle name is blank.

1.1.1.2. If the comparison above does not give a good result, check whether the first name of the record whose middle name is not blank is a leading substring (prefix) of the other first name. If it is:

1.1.1.2.1. Consider the substring that remains in the first name of the record whose middle name is blank. If the length of this substring is 1, check whether this character matches the initial of the middle name on the other record.
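
The rule above can be sketched in a few lines of Python. This is only an illustration, not a production matcher: the `similarity` function (built on `difflib.SequenceMatcher`), the `MATCH_THRESHOLD` cut-off and the `INITIAL_MATCH_SCORE` value are all my assumptions, standing in for whatever comparator and probabilities your matching tool uses.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.9       # hypothetical cut-off for an "adequate" match
INITIAL_MATCH_SCORE = 0.8   # hypothetical score (< 1.0) for an initial-only match


def similarity(a, b):
    """Stand-in string comparator (an assumption): a ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def first_middle_score(rec_a, rec_b):
    """Score first/middle name agreement for two parsed records.

    Each record is a dict with 'first' and 'middle' keys holding
    upper-case strings; an empty string means the field is blank.
    """
    # Step 1.1: the special handling applies only when exactly one
    # of the two middle names is blank.
    if bool(rec_a["middle"]) == bool(rec_b["middle"]):
        return similarity(rec_a["first"], rec_b["first"])

    # 'full' has a middle name, 'short' does not.
    full, short = (rec_a, rec_b) if rec_a["middle"] else (rec_b, rec_a)

    # Step 1.1.1: compare the two first names as they are.
    best = similarity(full["first"], short["first"])
    if best >= MATCH_THRESHOLD:
        return best

    # Step 1.1.1.1: concatenate first + middle and compare again.
    best = max(best, similarity(full["first"] + full["middle"], short["first"]))
    if best >= MATCH_THRESHOLD:
        return best

    # Step 1.1.1.2: is the first name of 'full' a prefix of the other one?
    if short["first"].startswith(full["first"]):
        rest = short["first"][len(full["first"]):]
        # Step 1.1.1.2.1: a single leftover character is treated as an initial.
        if len(rest) == 1 and full["middle"].startswith(rest):
            return INITIAL_MATCH_SCORE

    return best
```

With these assumed values, JOHN P against JOHNP and DADAN BHAI against DADANBHAI are caught by the concatenation step, while AMAL KANTI against AMALK falls through to the initial check and gets the lower score.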

Next, compound words in addresses. They occur in several different ways.
Let us consider the following examples:

Case #	Address Word – Record1	Matching Address Word-Record2
1	25 MAIN ROAD NEAR IIT CAMPUS	25 MAIN ROAD NEAR I I T CAMPUS
2	21 MG ROAD BOWBAZAR	21 M G ROAD BOWBAZAR
3	SCHORBACHSTRASSE 9	SCHORBACH STRASSE 9
4	NEW YORK	NEWYORK

The first two instances (case # 1 and 2) show one kind of compound-word issue, where abbreviations of place names built from initials are written differently.
The next instance (case # 3) shows another kind, where the street name and the corresponding street type are combined into one word.

Let us see what happens to these addresses (case # 1, 2 and 3) after proper parsing:

Original Address	House No.	Street Name	Street Type	Location	Landmark
25 MAIN ROAD NEAR IIT CAMPUS	25	MAIN	ROAD		IIT CAMPUS
25 MAIN ROAD NEAR I I T CAMPUS	25	MAIN	ROAD		I I T CAMPUS
21 MG ROAD BOWBAZAR	21	MG	ROAD	BOWBAZAR
21 M G ROAD BOWBAZAR	21	M G	ROAD	BOWBAZAR
SCHORBACHSTRASSE 9	9	SCHORBACHSTRASSE
SCHORBACH STRASSE 9	9	SCHORBACH	STRASSE

In the first case, we need to match IIT CAMPUS to I I T CAMPUS. We can drop the keyword CAMPUS for matching and then, if the field contains only initials, remove the whitespace characters.
In the second case we can adopt the same technique.
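
As a rough sketch of this technique in Python: the function name `normalize_initials` and the `DROP_KEYWORDS` set are hypothetical, and a real keyword list would come from your own address data.

```python
DROP_KEYWORDS = {"CAMPUS"}  # hypothetical list of keywords ignored for matching


def normalize_initials(field):
    """Drop ignorable keywords, then join the tokens if all are single letters."""
    tokens = [t for t in field.upper().split() if t not in DROP_KEYWORDS]
    # Only a field made up purely of initials has its whitespace removed.
    if tokens and all(len(t) == 1 and t.isalpha() for t in tokens):
        return "".join(tokens)
    return " ".join(tokens)
```

Both IIT CAMPUS and I I T CAMPUS then reduce to IIT, and both MG and M G reduce to MG, while a normal street name such as MAIN is left alone.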
The third case is unique. This address example is picked from a data file from Germany.
STRASSE is a popular street type in that country and is often joined to the corresponding street name. One way to handle this: if the street type is STRASSE, combine the street name and street type and compare this value to the street name of the other record in the probable match pair.
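
A small Python sketch of this idea follows. The name `street_key` and the `FUSED_TYPES` set are my own for illustration, and I am assuming the parser has already split the address into street name and street type.

```python
FUSED_TYPES = {"STRASSE"}  # hypothetical set of street types often joined to the name


def street_key(street_name, street_type):
    """Return the street string to compare, fusing name and type for STRASSE."""
    name = street_name.strip().upper()
    stype = street_type.strip().upper()
    if stype in FUSED_TYPES:
        # SCHORBACH + STRASSE becomes SCHORBACHSTRASSE
        return name + stype
    # Other street types are left out here and compared as a separate field.
    return name
```

With this, `street_key("SCHORBACH", "STRASSE")` and `street_key("SCHORBACHSTRASSE", "")` give the same value, so the two records can be compared on street name alone.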

The last instance (case # 4) is an example from the city field, which can be tackled using standardization.
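
One simple way to picture such standardization is a lookup keyed on the city name with its spaces removed. The `CITY_ALIASES` table and the `standardize_city` function below are hypothetical; in practice the table would come from a reference list of cities.

```python
# Hypothetical reference table: city name without spaces -> standard spelling.
CITY_ALIASES = {"NEWYORK": "NEW YORK"}


def standardize_city(city):
    """Map a city value to its standard spelling, if one is known."""
    words = city.upper().split()
    key = "".join(words)
    # Unknown cities are returned with their whitespace tidied up.
    return CITY_ALIASES.get(key, " ".join(words))
```

Both NEW YORK and NEWYORK standardize to NEW YORK, so the two records agree on city before matching starts.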

Lastly, I have seen many cases where typos lead to compound-word issues in matching. For these I prefer a separate match technique, built on the earlier technique in which we compared two strings that each contained one word. I will briefly discuss this in my next post.

Friday, July 1, 2011
