Thursday, August 18, 2011

House-holding dilemma with Indian Data

House-holding, or finding the records that belong to the same house-hold, is a typical data quality activity when linking individual records. According to Wikipedia, a house-hold is defined as “the basic residential unit in which economic production, consumption, inheritance, child rearing, and shelter are organized and carried out”. Typically, it refers to a family unit that stays in the same dwelling unit.

Household matches are typically found using these properties:
1.       Last Name (i.e. Family Name) should be the same, and
2.       Address (residential) on the records should be the same
Let us look at the first point, matching on the last name (or family name). This is done under the assumption that family members share the same family name. But this assumption often fails in the Indian context, for example:
1.       Muslim families (well… most of them) do not have a family-name concept.
2.       Traditionally, the family-name concept was not present in South India. Parents in south Indian families bestowed a single name on their child at birth and appended it with initials. The initials could stand for the ancestral village and the father’s first name in Karnataka, the house name in Kerala, the caste name in Tamil Nadu, and the place of family origin in Andhra Pradesh.
I encountered this issue while performing name parsing for south Indian names. If we use a name component called last name instead of the family name (or surname) and use this component for individual matching, the complexity reduces a little when cross-matching across the name components is also used. But for house-holding, this poses a tough challenge.
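To make the dilemma concrete, here is a minimal sketch in Python of a naive household key built from last name plus address. The names and addresses are purely illustrative; the point is that a record with no family name cannot be house-holded at all under this scheme.

```python
# Naive household key: normalized last name + normalized address.
# All names/addresses below are illustrative examples, not real data.
def household_key(last_name, address):
    """Return a key shared by records of one house-hold, or None if a part is missing."""
    if not last_name or not address:
        return None  # cannot house-hold without both components
    def norm(s):
        return " ".join(s.lower().split())  # lower-case, collapse spaces
    return norm(last_name) + "|" + norm(address)

# Two family members key together despite spacing differences...
k1 = household_key("Sharma", "12 MG Road, Bangalore")
k2 = household_key("Sharma", "12  MG  Road, Bangalore")
# ...but a record with no family name cannot be keyed.
k3 = household_key("", "12 MG Road, Bangalore")
```

Every record without a family name falls out of the scheme, which is exactly the problem with much Indian data.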
Let us now look at the issues in address matching. We need to look at this keeping in mind the issues we saw in last name matching. The biggest issue in address matching is incomplete or partial addresses.
Let us look at the following addresses:

[Table: Potentially Matching Addresses]

The addresses in both rows are close. But a detailed inspection reveals that the second address in each row does not include the dwelling number. In fact, if these addresses appeared on two records with matching names, we would accept them as matches. But what happens when there is no family name on the records?
It’s a big question mark. Take, for example, the second address on the second row; it is a close match for the address shown above.
Though residential telephone numbers are of much help, the presence of such incomplete addresses poses big challenges in house-holding. According to Graham Rhind (an expert in handling international data), house-holding should be avoided as far as possible (except in some traditional Anglo-Saxon communities) because it hardly ever works.

Note: Discussion only includes individual house-holds and not corporate house-holds

Saturday, August 6, 2011

Requirements for a Data Quality solution (updated on 6th August)

I have often come across questions from my colleagues, clients, and other associates about the necessary properties of software that links records. The answer largely depends on the objective of the record linking (please read my earlier post on Context Sensitiveness). However, I am listing a few points that can be considered the necessary properties of such software.
For convenience, I am dividing the properties into two exclusive sets: “Consolidation” and “Matching”. “Consolidation” comprises the data preparation steps that are carried out before any matching operation begins. Steps followed in a typical “Consolidation” are:
Basic Cleanup
Here we do an initial cleanup of the records, such as replacing or removing special characters and collapsing multiple consecutive spaces into a single space.
·         Software should allow the users to write grammar rules for carrying out such cleanup. Grammar rules can be defined for any specific field.
·         Users should be able to write grammar rules applicable to the beginning/end of a field. Good matching software should be flexible enough to accept regular expressions.
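As an illustration, such grammar rules could be expressed as pairs of a regular expression and a replacement, applied per field. This is only a sketch of the idea, not any particular product’s rule syntax:

```python
import re

# Illustrative field-level cleanup rules: (pattern, replacement) pairs.
# The specific rules here are assumptions for demonstration only.
CLEANUP_RULES = [
    (re.compile(r"[#*!]"), ""),        # remove selected special characters
    (re.compile(r"\s{2,}"), " "),      # collapse consecutive whitespace
    (re.compile(r"^\s+|\s+$"), ""),    # trim the beginning/end of the field
]

def basic_cleanup(value):
    """Apply each rule in order to a single field value."""
    for pattern, replacement in CLEANUP_RULES:
        value = pattern.sub(replacement, value)
    return value
```

A real tool would let users maintain such rule lists per field through the GUI.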
Adjusting the misfielded information
Often values land in the wrong fields; for example, Job Title values may be present in the name field even though the system has a separate field for Job Title. This routine should be able to identify such occurrences and rectify them.
·         The setting must accept at least two fields: one for the search and one for the destination
·         Users should be able to decide what happens when the destination field is already populated
·         The solution should handle different requirements, such as moving only the search keyword, the portion of the field from the start up to the keyword, or from the keyword to the end
·         Users should be able to add/edit/delete different keywords.
·         Settings should be able to accept different sets of keywords for different searches.
Misfielded data can also be handled using generic parsing techniques, which is why a dedicated routine for it must be fast enough to justify its existence.
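A rough sketch of such a routine follows; the keyword set, field names, and the overwrite option are illustrative assumptions:

```python
# Hypothetical keyword set for detecting job titles misfielded in a name field.
JOB_TITLE_KEYWORDS = {"manager", "director", "engineer"}

def fix_misfielded(record, search_field="name", dest_field="job_title",
                   overwrite=False):
    """Move keyword hits from the search field to the destination field,
    honouring the user's choice when the destination is already populated."""
    words = record.get(search_field, "").split()
    hits = [w for w in words if w.lower() in JOB_TITLE_KEYWORDS]
    if hits and (overwrite or not record.get(dest_field)):
        record[dest_field] = " ".join(hits)
        record[search_field] = " ".join(w for w in words if w not in hits)
    return record
```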
Identification
This routine classifies records into various types based on the presence of certain keywords. An example would be to classify and flag individual customer records (B2C records) versus Corporate/SME records (B2B records).
·         Users must be able to add/edit/delete keywords for each classification/identification.
Branching
This routine is closely related to identification. Different types of records often need to be treated differently down the line and hence need to be put into separate buckets; branching does exactly this.
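Identification and branching together can be sketched as follows; the B2B keyword list is an illustrative assumption (in a real tool it would be user-editable):

```python
# Illustrative keyword list for spotting corporate (B2B) records.
B2B_KEYWORDS = {"ltd", "inc", "pvt", "corp", "llp"}

def classify(record):
    """Identification: tag a record as B2B or B2C based on name keywords."""
    words = {w.strip(".,").lower() for w in record.get("name", "").split()}
    return "B2B" if words & B2B_KEYWORDS else "B2C"

def branch(records):
    """Branching: put each record into the bucket for its classification."""
    buckets = {"B2B": [], "B2C": []}
    for r in records:
        buckets[classify(r)].append(r)
    return buckets
```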

Parsing
Parsing is the process of splitting the words in a field like Name or Address into multiple component fields; it is discussed in detail in my earlier posts from June 2011.
·         Users must be able to add/edit/delete keywords in the lookup tables for parsing
·         Users must be able to add new mask characters, lookup tables
·         Users must be able to add/edit/delete parsing rules
·         Good software must be able to handle partial parsing (see my earlier post on the topic)
·         Users must be able to create new parsing routines for any field
·         Software must be able to generate parsing reports so that users can assess the quality and fine-tune the settings.
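A minimal name-parsing sketch, using a tiny lookup table of titles (a stand-in for the user-maintained tables described above) and leaving components blank when they are absent, which is the partial-parsing case:

```python
# Tiny illustrative lookup table; real tables are user-maintained.
TITLES = {"mr", "mrs", "ms", "dr"}

def parse_name(raw):
    """Split a raw name into components; absent parts stay blank (partial parsing)."""
    words = raw.replace(".", "").split()
    parts = {"title": "", "given_name": "", "last_name": ""}
    if words and words[0].lower() in TITLES:
        parts["title"] = words.pop(0)
    if words:
        parts["given_name"] = words[0]
    if len(words) > 1:
        parts["last_name"] = words[-1]
    return parts
```

A single-word name like “Lakshmi” parses to a given name only, with the last name left blank, which is precisely the situation that complicates house-holding.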
Genderization
Genderization is the process of determining the gender of individual records based on name components such as Title or Given Name.
·         Users should be able to prioritize the name components to be evaluated for Genderization
·         Users should be able to add/edit/delete gender codes corresponding to various name words
·         Software must provide a Genderization report that includes the percentage of records for which a gender code could be assigned, the percentage for which one could not, and samples of names from both groups
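The prioritized evaluation of name components can be sketched like this; both lookup tables are tiny illustrative stand-ins for user-editable tables:

```python
# Illustrative gender-code lookup tables (user-editable in a real tool).
TITLE_GENDER = {"mr": "M", "mrs": "F", "ms": "F"}
GIVEN_NAME_GENDER = {"ravi": "M", "lakshmi": "F"}

def genderize(record, priority=("title", "given_name")):
    """Evaluate name components in the user's priority order;
    return 'U' (undetermined) when no table yields a code."""
    tables = {"title": TITLE_GENDER, "given_name": GIVEN_NAME_GENDER}
    for component in priority:
        code = tables[component].get(record.get(component, "").lower())
        if code:
            return code
    return "U"
```

The share of "U" results is what the Genderization report above would surface.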
Augmentation/Enrichment
This is the process of enriching the data. One of the major data quality concerns is missing values in important fields, and augmentation (or enrichment) is the process by which a data quality solution can take care of a portion of this. Besides filling in missing information, this routine can also validate existing information, e.g. a record might show the state as NJ while the city is Dallas.
This routine can be divided into two sub-routines viz. internal augmentation and external augmentation.
1.       Internal Augmentation: the process of enriching information using the record’s own data values. For example, on a particular record, the name of the city may reside in the address lines instead of the city field.
2.       External Augmentation: It is the process of enriching/validating information using external data such as postal information, telephone directory etc.

·         Users must be able to select the validation/enrichment options from a list
·         Software must flag the records for which validation fails for a particular setting
·         A report must be generated showing the percentage of enriched records, the percentage of records that could not be enriched, a sample of enriched records, etc.
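Both sub-routines can be sketched together; the city/state reference table is a toy stand-in for external data such as postal files, and the NJ/Dallas mismatch mirrors the example above:

```python
# Toy stand-in for an external reference source (e.g. postal data).
CITY_STATE = {"dallas": "TX", "newark": "NJ"}
KNOWN_CITIES = set(CITY_STATE)

def augment(record):
    # Internal augmentation: recover a missing city from the address line.
    if not record.get("city"):
        for word in record.get("address", "").replace(",", " ").split():
            if word.lower() in KNOWN_CITIES:
                record["city"] = word
                break
    # External validation: flag city/state combinations that disagree.
    expected = CITY_STATE.get(record.get("city", "").lower())
    record["state_valid"] = (expected is None or expected == record.get("state"))
    return record
```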

Standardization
Standardization is the process of transforming similar data values into a single standard format. For example, different spellings and abbreviations of city names, state names, etc. are converted to a standard form.
·         Users must be able to select the fields which are going to be standardized
·         For every field to be standardized, users must be able to specify lookup tables that contain the variations and the standard format.
·         Each lookup table containing the possible variations and the standard format must be editable.
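In its simplest form, standardization is a lookup from variant to standard value; the table below is illustrative:

```python
# Illustrative variant-to-standard lookup for city names.
CITY_VARIANTS = {"bangalore": "Bengaluru", "blore": "Bengaluru",
                 "bengaluru": "Bengaluru", "bombay": "Mumbai"}

def standardize(value, lookup=CITY_VARIANTS):
    """Map a value to its standard form; unknown values pass through unchanged."""
    return lookup.get(value.strip().lower(), value)
```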
Rejection Routines
This routine can reject or flag records based on user-defined rules, such as the use of profanity or the presence of information that failed validation.
·         Users must be able to define rejection rules involving one or more fields and one or more lookup tables
·         All the lookup tables must be editable by the users
·         Users must be able to specify if rejection or flagging will be done
·         Software must be able to generate reports containing sample records (that got rejected or flagged) for each setting.
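The reject-versus-flag choice can be sketched as follows; the profanity list is a placeholder:

```python
# Placeholder lookup table; in a real tool this is user-editable.
PROFANITY = {"badword"}

def apply_rejection(record, mode="flag"):
    """Reject (drop) or flag a record when a rejection rule fires."""
    hit = any(w.lower() in PROFANITY for w in record.get("name", "").split())
    if hit and mode == "reject":
        return None            # record dropped from the pipeline
    record["flagged"] = hit    # flag mode keeps the record, marked
    return record
```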
Once the “Consolidation” steps are executed, the steps in the “Matching” process are applied to the surviving records. Steps followed in the “Matching” routine are:
Defining Match Groups/Hierarchy
Users define the match hierarchy here. For example, the matching software can find the address matches and then, for the records with matching addresses, probe further to find house-holds. Defining more than one match group therefore requires a relation among the match groups; some may be unrelated while others are related.
Two-step matching
Good matching software is able to perform matching in two steps. The first step is called primary matching, where hard keys defined for each record are compared to arrive at a match. Readers can refer to my earlier post on key-based matching from May 2011.
1.       Primary Matching
·         Users must be able to define one or more match keys
·         Users must be able to specify, for each key, what would be done in case a key element is missing for a record.
·         Users must be able to specify the string modes in which a field (or a portion of it) should be included in a match key. Two strings can be compared in various string modes such as
o   Consonated Mode: In this case, all the vowels are dropped from the string
o   Vowelized Mode: In this case, all the consonants are dropped from the string
o   Numeric Mode: In this case, all the non-numeric characters are dropped from the string
o   Phonetized Mode: In this case, the entire string is phonetically transformed (for details, please read my earlier post in June 2011)
o   Alpha Mode: In this case, all the numeric and special characters are dropped from the string
o   Alpha-Numeric Mode: In this case, all the special characters are dropped from the string
·         Users must be able to review the primary match results by generating sample reports
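The string modes above can be sketched as simple character-class filters. Phonetized mode is omitted here because it needs a phonetic algorithm (such as the one discussed in the June 2011 post); the rest are one-liners:

```python
import re

# Each function transforms a string into the named comparison mode.
def consonated(s):   return re.sub(r"[aeiouAEIOU]", "", s)      # drop vowels
def vowelized(s):    return re.sub(r"[^aeiouAEIOU]", "", s)     # drop consonants/others
def numeric(s):      return re.sub(r"[^0-9]", "", s)            # keep digits only
def alpha(s):        return re.sub(r"[^A-Za-z]", "", s)         # keep letters only
def alphanumeric(s): return re.sub(r"[^A-Za-z0-9]", "", s)      # drop special chars
```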

2.       Fuzzy Matching
Fuzzy matching works on the probable matches discovered by primary matching. Here, probable match pairs are classified into three subsets: definite matches, definite non-matches, and suspected matches. This type of matching is also called probabilistic matching and can be implemented in many ways.

·         Fuzzy matching should be an optional step. Users may define the primary keys in a satisfactory way and decide to treat the output of primary matching as the final set of matches.
·         This routine must allow the users to specify the action(s) to take when blank values must be compared with other blank or non-blank values. This setting may differ from field to field.
·         Fuzzy matching must allow the users to specify the string modes in which fields will be compared
·         This routine must allow users to specify the possible cross-matching options (for details, please see my earlier post from May 2011)
·         Users must be able to decide, for each fuzzy matching rule, the associated fields to be compared
·         At the end of fuzzy matching, users should be able to generate sample reports containing the matched and/or un-matched records for each group in the hierarchy.
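One common way to implement the three-way classification is to score each probable pair and compare against two thresholds. The scorer and threshold values below are illustrative assumptions, not a prescribed method:

```python
from difflib import SequenceMatcher

def classify_pair(a, b, upper=0.9, lower=0.6):
    """Classify a probable match pair by string similarity:
    >= upper -> definite match, < lower -> definite non-match,
    in between -> suspected match (sent for manual review)."""
    score = SequenceMatcher(None, a, b).ratio()
    if score >= upper:
        return "definite match"
    if score < lower:
        return "definite non-match"
    return "suspected match"
```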
Manual Decision/Review
Manual review is an important part of a data quality tool. System must provide the appropriate interface to the users so that each of the matches can be reviewed and the following decisions can be made:
1.       Mark as match
2.       Mark as non-match (this option ensures that the concerned records are never matched in the future, if so desired)
3.       Hold a case for further verification and review by higher authority

Consolidation of indirect matches
After the final matching (fuzzy matching in a 2-step setup, or primary matching in a 1-step setup), the matched records need to be consolidated. Besides doing that, this routine must be able to assign a unique cluster number to every record such that matching records in a group (or cluster) get the same number. For more detail, please read my earlier post “Indirect Matching” from June 2011.
·         At the end of this consolidation, users must be able to generate sample match/un-match reports corresponding to each match group to review the results.
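Consolidating indirect matches is naturally expressed with a union-find (disjoint-set) structure: if A matches B and B matches C, all three end up with the same cluster number. A compact sketch:

```python
def cluster_numbers(record_ids, matched_pairs):
    """Assign each record a cluster number; records linked directly or
    indirectly by matched_pairs share one number."""
    parent = {r: r for r in record_ids}
    def find(x):                       # find root with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in matched_pairs:         # union each matched pair
        parent[find(a)] = find(b)
    roots, numbers = {}, {}
    for r in record_ids:               # number clusters consecutively
        numbers[r] = roots.setdefault(find(r), len(roots) + 1)
    return numbers
```

An un-matched record simply keeps its own root and forms a cluster of size one.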

Rationalization or selecting the survivor
This routine works after the consolidation of final matches is done and the entire set of records has been put into clusters, where each cluster contains matching records. Naturally, each un-matched record forms a cluster of size one.
Often business requires a single record representing a cluster of matching records. This was discussed in detail in my earlier post “Constructing the Survivor Record” posted in June 2011.
·         Users must be able to write code snippets to build the routine in case it is complicated. The system must provide a code editor along with the available functions, fields, and logical operators.
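As one simple survivorship strategy among many, the representative record can take, field by field, the most complete (longest non-blank) value in the cluster. This is only a sketch of one rule, not the method from the earlier post:

```python
def survivor(cluster_records, fields):
    """Build one representative record per cluster: for each field,
    keep the longest (most complete) value found in the cluster."""
    return {f: max((r.get(f, "") for r in cluster_records), key=len)
            for f in fields}
```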
Data Profiling is an important function of good data quality software. Requirements around it would be:
Column analysis
This routine, given a table, will generate the following reports:
·         Percentages of NULL or blank
·         Percentages of initials
·         Percentages of numeric values
·         Percentages of alpha values
·         Percentages of unique values
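A column-analysis pass over one column can be sketched as follows (here an “initial” is taken to mean a single-character value, an assumption on my part):

```python
def column_profile(values):
    """Profile one non-empty column of string values; returns the
    percentages listed above, rounded to one decimal place."""
    n = len(values)
    pct = lambda k: round(100.0 * k / n, 1)
    return {
        "null_or_blank": pct(sum(1 for v in values if not v.strip())),
        "initials":      pct(sum(1 for v in values if len(v.strip()) == 1)),
        "numeric":       pct(sum(1 for v in values if v.strip().isdigit())),
        "alpha":         pct(sum(1 for v in values if v.strip().isalpha())),
        "unique":        pct(len(set(values))),
    }
```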
Frequency analysis
This routine, given a table, will generate the following reports:
·         Frequency Report
·         Pattern frequency Report
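Pattern frequency is commonly computed by masking each value (letters to 'A', digits to '9') and counting the resulting masks; the masking convention here is an assumption, though a widespread one:

```python
from collections import Counter

def pattern(value):
    """Mask a value: letters -> 'A', digits -> '9', everything else unchanged."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c
                   for c in value)

def pattern_frequency(values):
    """Count how often each mask occurs in a column."""
    return Counter(pattern(v) for v in values)
```

A postal-code column dominated by one mask, with a few stragglers, is a quick visual signal of data quality problems.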
Table analysis
Given two tables, this routine should be able to identify:
·         Common unique key between the tables
·         Orphan records i.e. records present in child table but not in master
Generate alerts
This routine will allow the users to define business rules and will generate compliance reports. Optionally this routine can generate failure alerts and send e-mails.

There are, of course, detailed requirements around each of the topics mentioned above.

So far we have discussed the technical requirements of good matching software. But there are a few more requirements, depending upon the context. These are:

Input and Output
System should be able to read data from a number of sources including
1.       Text files
2.       Delimited files
3.       Excel files
4.       RDBMS tables
Similarly, system should be able to output data in several formats.

The system must have a good GUI. The steps we discussed earlier can be arranged in any order depending upon the requirement, so it is good to have a GUI with a drag-and-drop facility.

I have mentioned requirements corresponding to a number of subject areas for the entire software. Users should be able to create workflows incorporating one or more source files and the required processes/functions and settings. These processes/settings should be flexible enough so that different workflows may use these in different order or even may not use some of these as per the context.

Such a matching tool can be used standalone, or another application may consume its services.

Functional requirements for matching software involve a lot of string manipulation, but the software must still be quick enough to process large volumes of data.