Data Quality: Improvised Parsing Technique

We have already discussed the basic technique that can be used in parsing. We have also seen some of the limitations in this technique. The biggest limitation is the effort required to implement (and maintain) this basic approach in a region or country where data is semi-structured.
For example, possible number of components in an address is well over 15.
Obviously, the number of possible patterns in the address is huge. In fact, armed with around 30K address patterns, I could not parse more than 60% of Indian addresses during one implementation.
I am mentioning address for parsing here since address parsing seems to be most challenging as it comes- up with lot more variation than name parsing. Name parsing has its challenges though.

Suppose an address can be consisting of the blocks: BLK1 and BLK2
That is, an address might look like:
BLK1 or BLK2 or BLK1||BLK2 or BLK2||BLK1 (|| is the concatenation operator).
Now, there are components within each address block. Suppose these are:

There are two blocks viz. BLK1 and BLK2 in an address. BLK1 contains the components with mask characters A, B and C. BLK2 contains the components with mask characters D, E and F.

A, B, …, F are the mask characters (including U and I (initials) and N (numbers)).

Suppose we are creating patterns of length 7 where 4 characters correspond to BLK1 while 3 characters correspond to BLK2.
There are not more than 4³ or 64 sub-patterns within BLK1 and not more than 3³ or 27 sub-patterns within BLK2. So, we need to create 91 possible parsing rules.
But if we use the basic approach that was discussed earlier, we will create a maximum of 6⁷ or 279936 possible parsing rules.
Why do we have such a huge difference?
Let us have a look at the following table which lists a few possible patterns in BLK1 and BLK2:

Pattern in BLK1

ABCA

ABCC

CBAB

BACB

Pattern in BLK2

DEF

DDF

EDF

DFF

When these are combined into BLK1||BLK2 or BLK2||BLK1 they form 16 + 16 = 32 patterns.
That is, we need to create 8 parsing rules (basically these are sub-parsing rules) to get 40 parsing rules.
Wow! This is just wonderful. But let us stop here. Are we covering everything in such improvisation?
Unfortunately not. Let us consider the pattern: ADBEFCC
Note that in this pattern, BLK1 and BLK2 are not separated. So, this pattern will not be covered if we stick to the same definition of BLK1 and BLK2.
In reality, we select the blocks (BLK1 and BLK2 here) so that the possibility of such patterns is less.
But eventually, we will be left with some unparsed addresses. How to handle those? We will see that shortly.
In the team time, we will see how the other limitation can be handled. Earlier, we gave a few examples of situations where a word or abbreviation can have different meanings in different addresses (or names). For example, the word MD at the end of a name string may denote a job title. It may also be an abbreviation for the word MOHAMMED. Let us see one such example in address. Suppose, we are using the mask character T to denote a street type and U to denote an unidentified word. If we try to parse addresses from a region where street name is preceded by street type, we will encounter
sub-patterns of the form UT or UUT etc. In such cases, the U (or the UU) refers to the street name and T points to the street type. But we may as well get TUT as a sub-pattern. Initial T here points to a street type which is followed by possibly a street name and then another street type appears. This looks a bit unusual. In reality the first T may point to a word that has a typographical error or it can also be a situation like ST PETER AVENUE. Note that the first T points to the word ST which is a popular abbreviation for STREET but in this case, probably, it is an abbreviation for the word SAINT. And consequently the TUT should be having the same rule as UUT. Learning from this is anything that looks like a sure variation from the general convention may not be a variation at all. So, we need to consider the relative position of a mask character in the pattern before setting a parsing rule corresponding to it. This is applicable in either of the approaches for parsing.
The last limitation that is encountered during parsing and discussed earlier is the presence of compound words in names (personal names or otherwise).
Let us discuss this (or the possible solution to this) using the same example that was given earlier.

Name	First Name	Middle Name	Last Name
ATULPRASAD SEN	ATULPRASAD		SEN
ATUL PRASAD SEN	ATUL	PRASAD	SEN

This issue of compound words surfaces during parsing but actually impacts matching or identity resolution. In this case, we will not end up matching either the first names or the middle names (even after using cross matching) with a high probability.
During matching, we can use another value viz. first name||middle name. Note that this refers to the field (derived field) for given name. The above two records will agree on this new field. Note that such solutions are purely ad-hoc and the exact nature of such solutions varies from case to case.

Data Quality

Sunday, June 12, 2011

Improvised Parsing Technique

No comments:

Post a Comment