Data Quality: Partial Parsing

Again, we are concentrating on Address Parsing and esp. on the addresses that failed the earlier parsing techniques. Note that handling name parsing through this method will be disastrous and the solution must have adequate number of name parsing rules.
But the possible number of variations in address strings is huge and it is practically impossible to have address-parsing rules covering all possible patterns.

Let us look at an address: SHIKHA TERRACE UNO 22A 27 MAGRI C OPP GAS DEPOT OFF M G RD

Suppose we use the following lookup tables:

House/Building: Mask Character H

TERRACE

BLDG

DEPOT

Landmark: Mask Character C

OPP

NEAR

OFF

Street Type: Mask Character T

Apartment: Mask Character A

APT

UNIT

Suppose, before parsing, we apply the following cleansing rules on the above address:

1. UNO is replaced by UNIT

2. Consecutive numeric characters and initials are separated.

So, the above address looks like: SHIKHA TERRACE UNIT 22 A 27 MAGRI C OPP GAS DEPOT M G RD
So, the pattern for this address will be: UHANINUICUHIIT (U denotes an unidentified component and I denotes an initial). Suppose we do not have any parsing rule defined for this pattern. So this address will remain unparsed.
But one look at the identified pattern and we will be able to identify (and extract) meaningful information out of it.

1. The initial UH represents the building

2. The sub-pattern AN that comes immediately after UH can be treated as the apartment information.

3. The sub-pattern CUH indicates a landmark

4. The sub-pattern IIT represents the street name and type and should be parsed as SST

Partial parsing is the identification of known sub-patterns from within the identified pattern and subsequent extraction of the relevant components from the original string.

There are a few points to note for partial parsing.

Several known sub-patterns may be identified from within an identified pattern.
For a given identified pattern ABCDEF, suppose CD is a known sub-pattern. After chopping that off from the identified pattern, the remaining pattern will be ABEF. Even if BEF is another known sub-pattern, it must not be considered as the original pattern did not contain BEF as a sub-string. When CD is removed, we are left with two sub-strings i.e. AB and EF and our search needs to continue within these sub-patterns.
Let us consider the same identified pattern ABCDEF. Suppose our partial parsing rules contain two known sub-patterns viz. CD and BCD. Well, if CD comes up first in the search then we will be chopping of CD from the identified pattern and eventually, the sub-pattern BCD, in-spite of being present in our solution, will not work. One way to handle this is to sort the partial parsing rules according to the lengths of the sub-patterns in the decreasing way.

Data Quality

Monday, June 20, 2011

Partial Parsing

No comments:

Post a Comment