Monday, June 20, 2011

Phonetic Similarity

Consider the following pairs of names taken from India:

SOFIA      SOPHIA
JENA       XENA
SANTANU    SHANTANU
BIKASH     VIKAS
BAIBHAB    BAIVAB

The two words in each row refer to the same name and differ only in a few characters. These differences in spelling reflect how people pronounce the names. The issue becomes more convoluted when the native language of the people involved is not English (i.e. the names are non-English) and the influence of regional languages shows up in the spelling.
There are some standard algorithms to handle such situations. For example, Soundex and Metaphone are two widely known and used algorithms that help bring two similar-sounding names closer. There are many other, even more sophisticated, algorithms available.
But there is an issue with each of these algorithms: they are fixed. We cannot customize their lists or rules. And as I processed data from different countries and regions, I encountered more variations than these standard algorithms cover.
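
To make this concrete, here is a minimal Python sketch of classic Soundex (the standard textbook formulation; Metaphone is more involved):

    def soundex(name):
        # Map consonants to the standard Soundex digit groups.
        codes = {}
        for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                               ("L", "4"), ("MN", "5"), ("R", "6")):
            for ch in letters:
                codes[ch] = digit
        name = "".join(ch for ch in name.upper() if ch.isalpha())
        if not name:
            return ""
        result, prev = name[0], codes.get(name[0], "")
        for ch in name[1:]:
            if ch in "HW":               # H and W do not separate equal codes
                continue
            digit = codes.get(ch, "")    # vowels code to "" and reset prev
            if digit and digit != prev:
                result += digit
            prev = digit
        return (result + "000")[:4]      # pad/truncate to letter + 3 digits

Running this on the pairs above shows both sides of the coin: SOFIA and SOPHIA both become S100, and SANTANU and SHANTANU both become S535, but BIKASH (B220) and VIKAS (V220) stay apart, as do JENA (J500) and XENA (X500), because Soundex always preserves the first letter.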

Let me give you a funny example. I came across a man named DHARMENDER a few years back, and he eventually represented me in a legal matter. When I was going through the initial draft papers of my case, I found that his legal name was DHARMENDRA. Eventually I realized that in some specific regions, personal names ending in DRA are often pronounced, and written, with the ending DER.
To handle this, we stored such phonetically similar syllables in a table, sorted by length in decreasing order. Our algorithm compared these syllables against the values in the desired field of each record and replaced them wherever found. The table could be edited later as new variations were discovered.
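
A minimal Python sketch of such an editable substitution table follows; the six rules shown here are illustrative entries, not our actual table:

    # Illustrative (syllable, replacement) pairs; the real table was
    # maintained separately and could be edited as new variations appeared.
    RULES = [("DRA", "DER"), ("BH", "B"), ("PH", "F"),
             ("SH", "S"), ("V", "B"), ("X", "J")]
    RULES.sort(key=lambda r: len(r[0]), reverse=True)  # longest syllables first

    def normalize(name):
        name = name.upper()
        for syllable, replacement in RULES:
            name = name.replace(syllable, replacement)
        return name

With just these entries, DHARMENDRA/DHARMENDER, SOPHIA/SOFIA, XENA/JENA, SHANTANU/SANTANU, VIKAS/BIKASH and BAIVAB/BAIBHAB all normalize to identical strings.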

Partial Parsing


Again, we are concentrating on address parsing, especially on the addresses that failed the earlier parsing techniques. Note that handling name parsing through this method would be disastrous; a name-parsing solution must instead have an adequate number of name-parsing rules.
But the possible number of variations in address strings is huge and it is practically impossible to have address-parsing rules covering all possible patterns.

Let us look at an address: SHIKHA TERRACE UNO 22A 27 MAGRI C OPP GAS DEPOT OFF M G RD

Suppose we use the following lookup tables:

House/Building (mask character H): TERRACE, BLDG, DEPOT
Landmark (mask character C): OPP, NEAR, OFF
Street Type (mask character T): RD, ST
Apartment (mask character A): APT, UNIT

Suppose, before parsing, we apply the following cleansing rules to the above address:
1. UNO is replaced by UNIT
2. Consecutive numeric characters and initials are separated.
The cleansed address then looks like: SHIKHA TERRACE UNIT 22 A 27 MAGRI C OPP GAS DEPOT OFF M G RD
So, the pattern for this address will be UHANINUICUHCIIT (U denotes an unidentified component and I denotes an initial). Suppose we do not have any parsing rule defined for this pattern. This address will then remain unparsed.
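
Here is a sketch, in Python, of how such a pattern can be generated from the lookup tables above (the table contents are the ones listed; everything else is illustrative):

    LOOKUPS = {"H": {"TERRACE", "BLDG", "DEPOT"},   # house/building
               "C": {"OPP", "NEAR", "OFF"},         # landmark
               "T": {"RD", "ST"},                   # street type
               "A": {"APT", "UNIT"}}                # apartment

    def mask(token):
        if token.isdigit():
            return "N"                              # number
        if len(token) == 1 and token.isalpha():
            return "I"                              # initial
        for ch, words in LOOKUPS.items():
            if token in words:
                return ch
        return "U"                                  # unidentified word

    def pattern(address):
        return "".join(mask(t) for t in address.upper().split())

    # pattern("SHIKHA TERRACE UNIT 22 A 27 MAGRI C OPP GAS DEPOT OFF M G RD")
    # returns "UHANINUICUHCIIT"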
But one look at the identified pattern shows that we can still extract meaningful information from it:
1. The initial sub-pattern UH represents the building.
2. The sub-pattern AN that comes immediately after UH can be treated as the apartment information.
3. The sub-pattern CUH indicates a landmark.
4. The trailing sub-pattern CIIT represents another landmark word (OFF) followed by the street name and type; its IIT portion should be parsed as SST.
Partial parsing is the identification of known sub-patterns from within the identified pattern and subsequent extraction of the relevant components from the original string.

There are a few points to note for partial parsing (a code sketch follows the list).
  1. Several known sub-patterns may be identified within a single identified pattern.
  2. For a given identified pattern ABCDEF, suppose CD is a known sub-pattern. After chopping it off, the remaining pattern is ABEF. Even if BEF is another known sub-pattern, it must not be considered, because the original pattern did not contain BEF as a sub-string. When CD is removed, we are left with two sub-strings, AB and EF, and the search must continue within each of them separately.
  3. Consider the same identified pattern ABCDEF, and suppose our partial-parsing rules contain two known sub-patterns, CD and BCD. If CD comes up first in the search, we will chop CD off the identified pattern, and the sub-pattern BCD, in spite of being present in our rules, will never get a chance to match. One way to handle this is to sort the partial-parsing rules by the length of the sub-pattern in decreasing order.
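
Here is a Python sketch of this search; the sub-patterns and labels are the ones from the example above, and the recursion and the sorting implement points 2 and 3:

    SUB_RULES = [("UH", "building"), ("AN", "apartment"),
                 ("CUH", "landmark"), ("IIT", "street"), ("CIIT", "street")]
    SUB_RULES.sort(key=lambda r: len(r[0]), reverse=True)  # point 3: longest first

    def partial_parse(pat, offset=0):
        # Find the first known sub-pattern, chop it off, then recurse
        # separately into the fragments on its left and right (point 2),
        # never matching across the cut.
        for sub, label in SUB_RULES:
            pos = pat.find(sub)
            if pos >= 0:
                left, right = pat[:pos], pat[pos + len(sub):]
                return (partial_parse(left, offset)
                        + [(label, offset + pos, offset + pos + len(sub))]
                        + partial_parse(right, offset + pos + len(sub)))
        return []

    # partial_parse("UHANINUICUHCIIT") returns
    # [('building', 0, 2), ('apartment', 2, 4),
    #  ('landmark', 8, 11), ('street', 11, 15)]

Note how CIIT, being longer, wins over IIT here, exactly as point 3 prescribes.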

Sunday, June 12, 2011

Improvised Parsing Technique


We have already discussed the basic technique that can be used in parsing, and we have seen some of its limitations. The biggest limitation is the effort required to implement (and maintain) the basic approach in a region or country where data is semi-structured.
For example, the possible number of components in an address is well over 15, so the number of possible patterns is huge. In fact, armed with around 30K address patterns, I could not parse more than 60% of Indian addresses during one implementation.
I focus on addresses here because address parsing seems to be the most challenging: it comes up with far more variation than name parsing. Name parsing has its own challenges, though.

Suppose an address consists of two blocks, BLK1 and BLK2.
That is, an address might look like:
BLK1 or BLK2 or BLK1||BLK2 or BLK2||BLK1 (|| is the concatenation operator).
Now, there are components within each address block. Suppose BLK1 contains the components with mask characters A, B and C, while BLK2 contains the components with mask characters D, E and F.
Here A, B, ..., F are mask characters (these can include U for unidentified words, I for initials and N for numbers).


Suppose we are creating patterns of length 7 where 4 characters correspond to BLK1 and 3 characters correspond to BLK2.
There are not more than 3^4 = 81 possible sub-patterns within BLK1 and not more than 3^3 = 27 within BLK2. So we need to create at most 81 + 27 = 108 parsing rules.
But if we use the basic approach that was discussed earlier, we would create up to 6^7 = 279,936 possible parsing rules.
Why do we have such a huge difference?
Let us have a look at the following table which lists a few possible patterns in BLK1 and BLK2:

Pattern in BLK1    Pattern in BLK2
ABCA               DEF
ABCC               DDF
CBAB               EDF
BACB               DFF

When these are combined as BLK1||BLK2 or BLK2||BLK1, they form 16 + 16 = 32 patterns. Together with the 4 + 4 patterns where the address consists of a single block, the 8 parsing rules above (basically sub-parsing rules) give us 40 parsing rules.
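
A quick Python sketch of this expansion, using the 8 block patterns above:

    BLK1 = ["ABCA", "ABCC", "CBAB", "BACB"]
    BLK2 = ["DEF", "DDF", "EDF", "DFF"]

    # An address may be BLK1, BLK2, BLK1||BLK2 or BLK2||BLK1.
    full_patterns = set(BLK1) | set(BLK2)
    full_patterns |= {p1 + p2 for p1 in BLK1 for p2 in BLK2}
    full_patterns |= {p2 + p1 for p1 in BLK1 for p2 in BLK2}
    print(len(full_patterns))   # 40 patterns from 8 block-level rules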
Wow! This is just wonderful. But let us stop here. Are we covering everything in such improvisation?
Unfortunately not. Let us consider the pattern ADBEFCC.
Note that in this pattern, BLK1 and BLK2 are not separated, so it will not be covered if we stick to the same definition of BLK1 and BLK2.
In practice, we select the blocks (BLK1 and BLK2 here) so that the possibility of such interleaved patterns is low.
But eventually we will be left with some unparsed addresses. How do we handle those? We will see that shortly.
In the meantime, we will see how another limitation can be handled. Earlier, we gave a few examples of situations where a word or abbreviation can have different meanings in different addresses (or names). For example, the word MD at the end of a name string may denote a job title, but it may also be an abbreviation for the word MOHAMMED.

Let us see one such example in an address. Suppose we are using the mask character T to denote a street type and U to denote an unidentified word. If we parse addresses from a region where the street name precedes the street type, we will encounter sub-patterns of the form UT or UUT, in which the U (or UU) refers to the street name and the T to the street type. But we may as well get TUT as a sub-pattern. The initial T here points to a street type, which is followed by what is presumably a street name, and then another street type appears. This looks unusual. In reality, the first T may point to a word with a typographical error, or it can be a situation like ST PETER AVENUE: the first T points to the word ST, a popular abbreviation for STREET, but here it is probably an abbreviation for SAINT. Consequently, TUT should have the same rule as UUT. The lesson is that something that looks like a sure deviation from the general convention may not be a deviation at all. So we need to consider the relative position of a mask character within the pattern before setting a parsing rule for it. This applies to either approach to parsing.
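
A tiny Python sketch of such a position-sensitive adjustment (the rule shape is illustrative, not our actual rule set):

    def adjust(sub_pattern):
        # A T in the leading position of a street sub-pattern is more likely
        # SAINT than STREET (as in ST PETER AVENUE), so re-mask it as U and
        # let the TUT case fall through to the ordinary UUT rule.
        if sub_pattern.startswith("TU") and sub_pattern.endswith("T"):
            return "U" + sub_pattern[1:]
        return sub_pattern

    # adjust("TUT") returns "UUT"; adjust("UUT") is unchanged.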
The last limitation encountered during parsing, also discussed earlier, is the presence of compound words in names (personal names or otherwise).
Let us discuss this, and a possible solution, using the same example that was given earlier.

Name               First Name    Middle Name    Last Name
ATULPRASAD SEN     ATULPRASAD                   SEN
ATUL PRASAD SEN    ATUL          PRASAD         SEN

This issue of compound words surfaces during parsing but actually impacts matching, or identity resolution. In this case, with high probability we will fail to match either the first names or the middle names (even with cross matching).
During matching, we can use an additional value: first name || middle name. Note that this is a derived field for the given name, and the two records above agree on it. Such solutions are purely ad hoc, and their exact nature varies from case to case.
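
As a Python sketch (the field names are hypothetical), the derived field can be as simple as:

    def given_name_key(first, middle):
        # Derived field: first name || middle name with spaces removed, so a
        # compound given name written as one word or two compares equal.
        return ((first or "") + (middle or "")).replace(" ", "").upper()

    # given_name_key("ATULPRASAD", "") == given_name_key("ATUL", "PRASAD")  -> True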