Data Quality: Limitations in the Parsing approach (updated on 11th. June 2011)

In the earlier post, we saw how parsing can be automated (with the help of name parsing). Why did we use name parsing as example? Well… I used it because a name string has less number of possible components in an address string (usually).
It can be appreciated that the total number of possible combination increases with the number of possible components.
Let us have an idea about the magnitude of such possible combination.
Suppose the original string has n number of tokens and there are m possible components.
We also assume that a valid string will have minimum p number of components.
Case1: Each component appears only once in the string (and m > n).
Here, the total number of possible patterns is: m!/n!
Case2: Each component can be repeated.
Here, the total number of possible patterns is: mⁿ

So, we see that in either case, for fixed n, total number of possible patterns increases with m
Well… in reality, none of the above cases is applicable in full. There are complex relationships among the components though such relationships are kind of probabilistic in nature. Such as: in US a street type usually appears after a street name.

So, we see the first limitation of this kind of automation. It is that the number of possible patterns is potentially high. In fact, in developing countries i.e. where the addressing conventions are not standardized, the number of address components (and thus patterns) is very high. To give an idea, even 50K different patterns were not sufficient to process address data from India.
High number of patterns impacts an automated solution in two ways:

1. It takes more effort to develop the vocabulary and the table containing the parsing rules.

2. High volume of vocabulary and high number of entries in the table containing the parsing rules makes the processing window time (for the automated solution) high.

Apart from the above, there is one more limitation in the parsing approach described earlier.
Let us consider the initial A in an address string. Usually, it is an abbreviation for the word APARTMENT and is used as an apartment identifier. But many times it also appears to denote something else. The initial A might be a part of the house number (or street number), bock or even a street name.
Similarly, the token KUMAR may be a person’s first name or middle name or last name.
The word MD at the end of a name string may denote a job title. It may also be an abbreviation for the word MOHAMMED.

One more issue that is seen during parsing is (actually an issue in matching) the use of compound words in names (personal names as well as names of street, city, locality, company etc.). Let us give one name example:

Name	First Name	Middle Name	Last Name
ATULPRASAD SEN	ATULPRASAD		SEN
ATUL PRASAD SEN	ATUL	PRASAD	SEN

These names are taken from Indian data. Clearly, these Names are equal but a space in between ATUL and PRASAD in the given name is creating a big issue here. Even if we have a soft-key based matching we will end up getting much less than 100% comparison probability for first name comparison between these two records and much less than 100% comparison probability for first name cross middle name comparison between these two records.
But one look at the given names in these records tells us that these should match with a very high probability (close to 100%, if not 100%).

In the next post, we will see how the parsing approach can be improvised to address these limitations.

Data Quality

Friday, June 10, 2011

Limitations in the Parsing approach (updated on 11th. June 2011)

No comments:

Post a Comment