Friday, July 1, 2011

Compound Words

[I will use many examples in this discussion. Most of these examples are taken from Indian files but a few are from international files.]

While trying to de-dupe records, issues with compound words crop up often. Henrik Liliendahl Sørensen has written a nice post on this. Such issues come up when we need to match two field values and at least one of them consists of more than one word. For ease of discussion, I will split the topic into two parts: names and addresses. First, we will talk about compound words in name matching.
Let me give a few examples of names:

| Name – Record1 | Matching Name – Record2 |
| --- | --- |
| JOHN P SMITH | JOHNP SMITH |
| DADAN BHAI BOTTLEWALA | DADANBHAI BOTTLEWALA |
| AMAL KANTI SEN | AMALK SEN |

After parsing, these names will be:

| Record1 First Name | Record1 Middle Name | Record1 Last Name | Record2 First Name | Record2 Middle Name | Record2 Last Name |
| --- | --- | --- | --- | --- | --- |
| JOHN | P | SMITH | JOHNP | | SMITH |
| DADAN | BHAI | BOTTLEWALA | DADANBHAI | | BOTTLEWALA |
| AMAL | KANTI | SEN | AMALK | | SEN |

Notice that in each of these three cases, the matching name in Record2 does not have a middle name. In the first two instances, the first name and middle name of the first record, when concatenated, match the first name of the second record.
The names in the third row, on the other hand, are a bit different. Strictly speaking, the two names do not match exactly. But since we know that initials are frequently used in place of middle names, we need to allow KANTI and K to match as middle names, and hence allow the two names to match, with a probability of less than 100%.

Names in this table can be matched by using the following rule:
1.      For every probable match pair of records
1.1.   If the middle name is empty in exactly one record of the pair
1.1.1.      If the two first names, when compared, do not give an adequate match probability, then carry out the following
1.1.1.1.            Concatenate the first name and middle name of the other record in the pair and compare this string with the first name of the record where the middle name is blank.
1.1.1.2.            If the comparison above does not give a good result, check whether the first name of the record where the middle name is not blank is a prefix of the other first name. If it is, then
1.1.1.2.1.                  Consider the remaining substring of the first name of the record where the middle name is blank. If the length of this substring is 1, check whether this character matches the initial of the middle name on the other record.
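
The rule above can be sketched roughly in Python. This is only an illustration under stated assumptions: `similarity` uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever single-word comparison is actually in use, and the names `compound_name_score`, `MATCH_THRESHOLD` and `INITIAL_MATCH_SCORE`, along with their values, are hypothetical.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.9      # assumed cut-off for an "adequate" match
INITIAL_MATCH_SCORE = 0.8  # assumed score (< 1.0) for an initial-only match


def similarity(a, b):
    """Stand-in string comparison; returns a value between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def compound_name_score(rec_a, rec_b):
    """Score the first/middle name part of two (first, middle) tuples."""
    first_a, middle_a = rec_a
    first_b, middle_b = rec_b

    # Step 1.1: the rule applies only if exactly one middle name is empty.
    if bool(middle_a) == bool(middle_b):
        return similarity(first_a, first_b)

    # "full" is the record with a middle name, "short" the one without.
    if middle_a:
        full_first, full_middle, short_first = first_a, middle_a, first_b
    else:
        full_first, full_middle, short_first = first_b, middle_b, first_a

    # Step 1.1.1: plain first-name comparison.
    score = similarity(full_first, short_first)
    if score >= MATCH_THRESHOLD:
        return score

    # Step 1.1.1.1: concatenate first and middle name, then compare.
    concat_score = similarity(full_first + full_middle, short_first)
    if concat_score >= MATCH_THRESHOLD:
        return concat_score

    # Steps 1.1.1.2 and 1.1.1.2.1: prefix check, then initial check.
    if short_first.startswith(full_first):
        rest = short_first[len(full_first):]
        if len(rest) == 1 and rest == full_middle[0]:
            return INITIAL_MATCH_SCORE

    return max(score, concat_score)


print(compound_name_score(("JOHN", "P"), ("JOHNP", "")))          # 1.0
print(compound_name_score(("DADAN", "BHAI"), ("DADANBHAI", "")))  # 1.0
print(compound_name_score(("AMAL", "KANTI"), ("AMALK", "")))      # 0.8
```

The last names would still be compared separately, and the fixed score for an initial match is just one way of expressing "less than 100%".
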

Now let us turn to addresses. Compound words occur in addresses in several ways.
Let us consider the following examples:

| Case # | Address Word – Record1 | Matching Address Word – Record2 |
| --- | --- | --- |
| 1 | 25 MAIN ROAD NEAR IIT CAMPUS | 25 MAIN ROAD NEAR I I T CAMPUS |
| 2 | 21 MG ROAD BOWBAZAR | 21 M G ROAD BOWBAZAR |
| 3 | SCHORBACHSTRASSE 9 | SCHORBACH STRASSE 9 |
| 4 | NEW YORK | NEWYORK |

The first two instances (cases 1 and 2) show one kind of compound-word issue, where abbreviations of place names made up of initials are written differently.
The next instance (case 3) shows another kind, where a street name and the corresponding street type are combined into a single word.

Let us see what happens to these addresses (cases 1, 2 and 3) after proper parsing:

| Original Address | Hse. No. | St. Nm. | St. Typ. | Location | Landmark |
| --- | --- | --- | --- | --- | --- |
| 25 MAIN ROAD NEAR IIT CAMPUS | 25 | MAIN | ROAD | | IIT CAMPUS |
| 25 MAIN ROAD NEAR I I T CAMPUS | 25 | MAIN | ROAD | | I I T CAMPUS |
| 21 MG ROAD BOWBAZAR | 21 | MG | ROAD | BOWBAZAR | |
| 21 M G ROAD BOWBAZAR | 21 | M G | ROAD | BOWBAZAR | |
| SCHORBACHSTRASSE 9 | 9 | SCHORBACHSTRASSE | | | |
| SCHORBACH STRASSE 9 | 9 | SCHORBACH | STRASSE | | |

In the first case, we need to match IIT CAMPUS to I I T CAMPUS. We can drop the keyword CAMPUS for matching and then remove the whitespace characters if the field contains only initials.
In the second case, we can adopt the same technique to match the street names MG and M G.
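
As a rough sketch of this technique in Python: the function name `normalize_initials` and the keyword set `LANDMARK_KEYWORDS` are hypothetical, and a real keyword list would be much longer than the single word used here.

```python
# Assumed list of generic keywords to ignore when matching landmarks.
LANDMARK_KEYWORDS = {"CAMPUS"}


def normalize_initials(value, drop_words=LANDMARK_KEYWORDS):
    """Drop generic keywords, then join the tokens if all are single letters."""
    tokens = [t for t in value.upper().split() if t not in drop_words]
    # Remove whitespace only when every remaining token is an initial.
    if tokens and all(len(t) == 1 for t in tokens):
        return "".join(tokens)
    return " ".join(tokens)


print(normalize_initials("IIT CAMPUS") == normalize_initials("I I T CAMPUS"))
print(normalize_initials("MG") == normalize_initials("M G"))
```

Both comparisons come out equal, while a value such as MAIN ROAD is left with its whitespace intact because its tokens are not single letters.
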
The third case is different. This address example is taken from a data file from Germany.
STRASSE is a common street type in Germany and is often joined to the corresponding street name. One way to handle this: if the street type is STRASSE, combine the street name and street type and compare this value to the street name of the other record in a probable match pair.
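
A minimal sketch of this idea in Python follows. It assumes each record has already been parsed into a (street name, street type) pair, and it uses exact string equality where a real matcher would use a fuzzy comparison. The names `COMPOUND_STREET_TYPES`, `combined_street_name` and `street_names_match` are hypothetical.

```python
# Assumed set of street types that are commonly joined to the street name.
COMPOUND_STREET_TYPES = {"STRASSE"}


def combined_street_name(name, street_type):
    """Join name and type when the type is one that is often written joined."""
    if street_type.upper() in COMPOUND_STREET_TYPES:
        return (name + street_type).upper()
    return name.upper()


def street_names_match(rec_a, rec_b):
    """Compare the street-name part of two (street name, street type) pairs."""
    return combined_street_name(*rec_a) == combined_street_name(*rec_b)


print(street_names_match(("SCHORBACHSTRASSE", ""), ("SCHORBACH", "STRASSE")))
```

Street types that are not in the set are left out of the comparison here, so they would still need to be compared on their own.
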
The last instance (case 4) is an example from the city field, which can be tackled using standardization.
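
One simple form of standardization is a lookup table of known variants. The sketch below assumes such a table exists; `CITY_VARIANTS` and `standardize_city` are hypothetical names, and the single entry is only for illustration.

```python
# Hypothetical table mapping known variants to a standard city name.
CITY_VARIANTS = {"NEWYORK": "NEW YORK"}


def standardize_city(city):
    """Upper-case, collapse repeated whitespace, then map known variants."""
    cleaned = " ".join(city.upper().split())
    return CITY_VARIANTS.get(cleaned, cleaned)


print(standardize_city("NEWYORK") == standardize_city("New  York"))
```
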

Lastly, I have seen many cases where typos lead to compound-word issues in matching. For these, I prefer using a separate match technique built on the earlier match technique, in which we compared two strings that each contained one word. I will briefly discuss this in my next post.
