Sunday, June 12, 2011

Improvised Parsing Technique


We have already discussed the basic technique that can be used in parsing. We have also seen some of the limitations in this technique. The biggest limitation is the effort required to implement (and maintain) this basic approach in a region or country where data is semi-structured.
For example, possible number of components in an address is well over 15.
Obviously, the number of possible patterns in the address is huge. In fact, armed with around 30K address patterns, I could not parse more than 60% of Indian addresses during one implementation.
I am mentioning address for parsing here since address parsing seems to be most challenging as it comes- up with lot more variation than name parsing. Name parsing has its challenges though.

Suppose an address can be consisting of the blocks: BLK1 and BLK2
That is, an address might look like:
BLK1 or BLK2 or BLK1||BLK2 or BLK2||BLK1 (|| is the concatenation operator).
Now, there are components within each address block. Suppose these are:

There are two blocks viz. BLK1 and BLK2 in an address. BLK1 contains the components with mask characters A, B and C.  BLK2 contains the components with mask characters D, E and F. 
A, B, …, F are the mask characters (including U and I (initials) and N (numbers)).


Suppose we are creating patterns of length 7 where 4 characters correspond to BLK1 while 3 characters correspond to BLK2.
There are not more than 43 or 64 sub-patterns within BLK1 and not more than 33 or 27 sub-patterns within BLK2. So, we need to create 91 possible parsing rules.
But if we use the basic approach that was discussed earlier, we will create a maximum of 67 or 279936 possible parsing rules.
Why do we have such a huge difference?
Let us have a look at the following table which lists a few possible patterns in BLK1 and BLK2:

Pattern in BLK1
ABCA
ABCC
CBAB
BACB
Pattern in BLK2
DEF
DDF
EDF
DFF

When these are combined into BLK1||BLK2 or BLK2||BLK1 they form 16 + 16 = 32 patterns.
That is, we need to create 8 parsing rules (basically these are sub-parsing rules) to get 40 parsing rules.
Wow! This is just wonderful. But let us stop here. Are we covering everything in such improvisation?
Unfortunately not. Let us consider the pattern:  ADBEFCC
Note that in this pattern, BLK1 and BLK2 are not separated. So, this pattern will not be covered if we stick to the same definition of BLK1 and BLK2.
In reality, we select the blocks (BLK1 and BLK2 here) so that the possibility of such patterns is less.
But eventually, we will be left with some unparsed addresses. How to handle those? We will see that shortly.
In the team time, we will see how the other limitation can be handled. Earlier, we gave a few examples of situations where a word or abbreviation can have different meanings in different addresses (or names). For example, the word MD at the end of a name string may denote a job title. It may also be an abbreviation for the word MOHAMMED.  Let us see one such example in address.  Suppose, we are using the mask character T to denote a street type and U to denote an unidentified word. If we try to parse addresses from a region where street name is preceded by street type, we will encounter
sub-patterns of the form UT or UUT etc. In such cases, the U (or the UU) refers to the street name and T points to the street type. But we may as well get TUT as a sub-pattern. Initial T here points to a street type which is followed by possibly a street name and then another street type appears. This looks a bit unusual. In reality the first T may point to a word that has a typographical error or it can also be a situation like ST PETER AVENUE. Note that the first T points to the word ST which is a popular abbreviation for STREET but in this case, probably, it is an abbreviation for the word SAINT. And consequently the TUT should be having the same rule as UUT. Learning from this is anything that looks like a sure variation from the general convention may not be a variation at all. So, we need to consider the relative position of a mask character in the pattern before setting a parsing rule corresponding to it. This is applicable in either of the approaches for parsing.
The last limitation that is encountered during parsing and discussed earlier is the presence of compound words in names (personal names or otherwise).
Let us discuss this (or the possible solution to this) using the same example that was given earlier.

Name
First Name
Middle Name
Last Name
ATULPRASAD SEN
ATULPRASAD

SEN
ATUL PRASAD SEN
ATUL
PRASAD
SEN

This issue of compound words surfaces during parsing but actually impacts matching or identity resolution. In this case, we will not end up matching either the first names or the middle names (even after using cross matching) with a high probability.
During matching, we can use another value viz. first name||middle name. Note that this refers to the field (derived field) for given name. The above two records will agree on this new field. Note that such solutions are purely ad-hoc and the exact nature of such solutions varies from case to case.

No comments:

Post a Comment