Data Quality: Discovering the hidden dimensions

So far we have discussed matching, or record linking. I said that if n fields from a set of records are used in matching, then these records can be considered as points in an n-dimensional space.
Suppose a record has n fields to start with. Parsing is a process that splits these n fields into m fields, where m > n. In other words, parsing increases the granularity of a record.
For example, a record might come in with a single name field. The parsing process may generate additional fields such as Title, First Name, Middle Name, Last Name Prefix, Last Name and Suffix.
Similarly, an address field might be split into multiple granular fields.

Let us first look at the following examples (three names from a US file) and see how parsing is done when we review records manually.

Name	Title	First Name	Middle Name	Last Name	Suffix
ROBERT CANNING		ROBERT		CANNING
MR. STUART RODGER BINNY	MR.	STUART	RODGER	BINNY
ARNOLD JONES SR.		ARNOLD		JONES	SR.

If we check the first record, we find that

1. ROBERT is a standard given name and CANNING is a standard last name.

2. The general convention is that the last name is written after the first name.

3. Our conclusion is that ROBERT is the first name and CANNING is the last name.

For the second record,

1. We immediately identify MR. as a title and both STUART and RODGER as given names, while BINNY remains unidentified.

2. The general convention is that the title precedes the first name, and the middle name is usually written between the first name and the last name.

3. By that convention, the unidentified word (or token) appears to be the last name.
Two given names follow the title; the first is the first name and the second is the middle name. So the complete parsing is: MR. goes in the title field, STUART in the first name field, RODGER in the middle name field and BINNY in the last name field.

As for the third record,

1. ARNOLD is a standard given name, JONES is a standard last name and SR. is a standard name suffix.

2. The general convention is that the last name follows the first name and is in turn followed by the suffix.

3. Our conclusion is that ARNOLD is the first name, JONES is the last name and SR. is the suffix.

From these examples, we see that name parsing relies on two rules.

1. First, we identify each word or token in the name as one of the name components.

2. We then apply the general conventions of writing names.

Note that both of these rules depend on the region from which the names are taken.

Armed with this idea, let us see how name parsing can be automated.
We will use the following four tables for the automation.

Title	Given Name	Last Name	Suffix
MR.	ROBERT	CANNING	SR.
MRS.	STUART	JONES	JR.
MS.	RODGER
	ARNOLD

For each record, we compare every token of the name against the four tables above, in the order in which the tables are listed (title, given name, last name, suffix). We designate a token that matches the title table by T, the given name table by G, the last name table by L and the suffix table by S. We also use the symbol U to mark any unidentified token.

Once this evaluation is done, we get the patterns displayed in the following table:

Name	Identified Pattern
ROBERT CANNING	GL
MR. STUART RODGER BINNY	TGGU
ARNOLD JONES SR.	GLS
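
The pattern identification step can be sketched in a few lines of Python. This is only an illustration under some assumptions: the names VOCABULARY and identify_pattern are made up for this example, tokens are taken to be separated by whitespace, and the tables contain only the handful of entries listed above.

```python
# Vocabulary tables, checked in this order: title, given name, last name, suffix.
# Each table is paired with its mask character.
VOCABULARY = [
    ("T", {"MR.", "MRS.", "MS."}),
    ("G", {"ROBERT", "STUART", "RODGER", "ARNOLD"}),
    ("L", {"CANNING", "JONES"}),
    ("S", {"SR.", "JR."}),
]


def identify_pattern(name):
    """Return one mask character per whitespace-separated token."""
    masks = []
    for token in name.upper().split():
        for mask, table in VOCABULARY:
            if token in table:
                masks.append(mask)
                break
        else:
            masks.append("U")  # token not found in any table
    return "".join(masks)


for name in ["ROBERT CANNING", "MR. STUART RODGER BINNY", "ARNOLD JONES SR."]:
    print(name, "->", identify_pattern(name))
# ROBERT CANNING -> GL
# MR. STUART RODGER BINNY -> TGGU
# ARNOLD JONES SR. -> GLS
```

Because the tables are checked in a fixed order, a token that appears in more than one table receives the mask of the first table that contains it.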

Once the patterns are identified, we need a rule for each pattern that tells us how the pattern is to be parsed. This approach gives us the ability to parse all names that share a pattern with a single rule.
To create these rules (we need three for now) we require a few more symbols. Let T denote a title, F a first name, M a middle name, L a last name and S a suffix.
We now build the following name parsing rules using the general conventions of writing names in the US:

Identified Pattern	Parsing Rule
GL	FL
TGGU	TFML
GLS	FLS
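
As a rough sketch, the rule table can be held in a dictionary and applied token by token. The names RULES, FIELD_NAMES and parse_name are invented for this example, and the function assumes the pattern has already been identified for the name it receives.

```python
# Parsing rules from the table above: identified pattern -> parsing rule.
RULES = {"GL": "FL", "TGGU": "TFML", "GLS": "FLS"}

# Output field for each symbol used in a parsing rule.
FIELD_NAMES = {
    "T": "Title",
    "F": "First Name",
    "M": "Middle Name",
    "L": "Last Name",
    "S": "Suffix",
}


def parse_name(name, pattern):
    """Split a name into fields using the rule for its identified pattern.

    Returns None when no rule exists for the pattern or when the number
    of tokens does not match the rule.
    """
    rule = RULES.get(pattern)
    tokens = name.split()
    if rule is None or len(rule) != len(tokens):
        return None
    return {FIELD_NAMES[symbol]: token for symbol, token in zip(rule, tokens)}


print(parse_name("MR. STUART RODGER BINNY", "TGGU"))
# {'Title': 'MR.', 'First Name': 'STUART', 'Middle Name': 'RODGER', 'Last Name': 'BINNY'}
```

Returning None for an unknown pattern keeps unparsed names easy to set aside for manual review.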

Using the four tables that identify tokens (the title, given name, last name and suffix tables) together with the table of parsing rules, we can automate name parsing and generate the results shown earlier.

For reference, we will call the tables used to identify tokens the vocabulary, and the symbols that represent tokens matched to any such table (including U) mask characters.
To parse more and more names, we need to identify more and more tokens correctly. That is, we need to add more entries to the tables in our vocabulary. As a result we will identify more patterns, and to process those we need more entries in our rule table.
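
One possible way to decide which rules to add next is to count the identified patterns that have no rule yet. The sketch below is an illustration only: the function name patterns_without_rules and the sample list of patterns are made up, not taken from a real file.

```python
from collections import Counter

# Rules defined so far: identified pattern -> parsing rule.
RULES = {"GL": "FL", "TGGU": "TFML", "GLS": "FLS"}


def patterns_without_rules(patterns):
    """Count patterns that have no parsing rule, most frequent first."""
    missing = Counter(p for p in patterns if p not in RULES)
    return missing.most_common()


# Hypothetical patterns identified in a batch of names.
sample = ["GL", "TGL", "GUL", "TGL", "GLS"]
print(patterns_without_rules(sample))
# [('TGL', 2), ('GUL', 1)]
```

Patterns containing U point to missing vocabulary entries, while fully identified patterns such as TGL only need a new rule.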

As we discussed earlier, name parsing depends largely on the customs and conventions of writing names in the underlying region or country, so we will see different name components in different places. For example, many names in Mexico have a last name prefix field. There are countries where names have two last name fields. Sometimes you will encounter multiple names separated by a delimiter in the name field. For example, you might get names like MR. & MRS. CLARK. One way to handle such data is to break the original record into two records, each with a different name. The remaining information is kept the same on both records.

We can easily use a similar technique to automate address parsing, or the parsing of any other field.
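
The record-splitting idea might be sketched as follows. Everything here is an assumption made for illustration: the record is a dictionary with a "name" key, the "city" field is hypothetical, the delimiter is "&", and the shorter left-hand part is assumed to share the trailing tokens (here the last name) of the right-hand part.

```python
def split_joint_record(record, delimiter="&"):
    """Split a record holding two names into two records with one name each."""
    name = record["name"]
    if delimiter not in name:
        return [record]
    left, right = (part.split() for part in name.split(delimiter, 1))
    # Assumption: tokens missing from the left part are shared from the
    # tail of the right part, e.g. CLARK in "MR. & MRS. CLARK".
    shared = right[len(left):]
    names = [" ".join(left + shared), " ".join(right)]
    # All other fields are copied unchanged onto both records.
    return [{**record, "name": n} for n in names]


for rec in split_joint_record({"name": "MR. & MRS. CLARK", "city": "BOSTON"}):
    print(rec)
# {'name': 'MR. CLARK', 'city': 'BOSTON'}
# {'name': 'MRS. CLARK', 'city': 'BOSTON'}
```

Each resulting record can then go through the usual pattern identification and parsing rules on its own.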

Data Quality

Thursday, June 9, 2011

Discovering the hidden dimensions - Parsing
