Sunday, September 18, 2011

Discovery Phases


Perhaps the most critical phase of any data quality implementation is “Data Discovery”, where we study the sample data collected from the site with the following goals:
1. Enrich the metadata repository specific to the sample data
2. Profile the sample data to gain insight into the semantics of the data
3. Come up with the set of Data Quality rules for handling the sample data, along with the steps to be followed during the actual implementation
In the title of this post I deliberately used the term “Phases”. This indicates that, in practice, there is more than one such discovery phase. Besides the “Data Discovery” phase that we carry out for each implementation, we also conduct a “Market Discovery” phase when we start Data Quality related practice in a new market (i.e. country/region). “Market Discovery” is usually carried out by Data Quality product development companies, while “Data Discovery” is carried out by the team responsible for data quality implementations.
I find “Market Discovery” to be very fascinating since you have almost nothing to start with. But let me talk about “Data Discovery” first, as this phase is encountered more frequently. We start with a metadata repository that we have prepared during “Market Discovery” and enriched during previous “Data Discovery” and implementation activities.
Let me list the things that we have at the start of the “Data Discovery” phase.
1. Data Quality tool
2. Metadata Repository including:
   a. Master Lookup Tables such as Given Name, Last Name, Street Type, etc.
   b. Supporting Lookup Tables such as Phonetic Sounds
   c. Lookup Tables for parsing
   d. Basic rules for initial cleanup
   e. Understanding of the address correction processes for the underlying market
3. Sample Data from the site
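To make this concrete, here is a minimal sketch (in Python) of how such lookup tables and basic cleanup rules might be represented before any site-specific enrichment. Every entry below is an invented example, not real market vocabulary, and a commercial data quality tool would hold far richer metadata.

```python
# Illustrative only: a tiny, hand-made slice of a metadata repository.
# All entries are invented examples, not a real market vocabulary.

MASTER_LOOKUPS = {
    "given_name":  {"JOHN", "MARY", "RAVI", "AKIKO"},
    "last_name":   {"SMITH", "PATEL", "TANAKA"},
    "street_type": {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD"},
}

# Supporting lookups, e.g. phonetic equivalences used during matching.
PHONETIC_EQUIVALENTS = {"PH": "F", "CK": "K", "KS": "X"}

# Lookup tables used for parsing free-form text into components.
PARSING_MARKERS = {"C/O": "care_of", "APT": "unit", "FLAT": "unit"}

# Basic rules for initial cleanup, applied before parsing and matching.
def initial_cleanup(value: str) -> str:
    """Trim, uppercase and collapse repeated whitespace."""
    return " ".join(value.upper().split())

if __name__ == "__main__":
    print(initial_cleanup("  123  Baker   street "))  # -> "123 BAKER STREET"
```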

The process of “Data Discovery” cannot be fully prescribed and depends on the exact situation, but it has to include the following:
1. The entire sample data needs to be profiled. This will bring up many data quality issues in the sample data that need to be handled. In case there are multiple source systems, profiling should be carried out separately for each system (a minimal profiling sketch follows this list).
2. After the data profiling, workflows should be set up in the data quality tool and samples from all the source systems need to be processed as per the requirements. Here a manual review of the intermediate results after every step in the workflow is necessary.
3. While step 2 is in progress, discussions with the business users must be carried out to finalize the address correction requirements and incorporate the corresponding process in the workflow.
4. At the end of the DQ processes, present the results/reports to the business users and get their feedback. Incorporate the feedback into the solution and re-generate the reports.
Remember the following points:
a. This is an iterative step
b. You may have to make the business users aware of various Data Quality related concepts, including the context sensitivity of matching (refer to my earlier post on this topic in July 2011)
c. Discuss with the client the usage of external lists (such as postal tables or telephone directories) in the enrichment/augmentation of the address information.
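As an illustration of the profiling mentioned in step 1, the sketch below computes a few basic column-level statistics (fill rate, distinct values, frequent patterns) for one source system. The column names and sample rows are hypothetical, and a real profiling tool would report much more.

```python
from collections import Counter
import re

def profile_column(values):
    """Very small column profile: fill rate, distinct count, top patterns."""
    non_blank = [v for v in values if v and v.strip()]
    # Generalize each value into a pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_blank
    )
    return {
        "fill_rate": len(non_blank) / len(values) if values else 0.0,
        "distinct": len(set(non_blank)),
        "top_patterns": patterns.most_common(3),
    }

# Hypothetical sample rows from one source system.
rows = [
    {"name": "John Smith", "phone": "98765 43210"},
    {"name": "MARY  PATEL", "phone": ""},
    {"name": "", "phone": "9876543210"},
]

for col in ("name", "phone"):
    print(col, profile_column([r[col] for r in rows]))
```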
At the end of “Data Discovery” you will have updated all the initial data knowledge you had earlier. But be prepared to fine-tune the settings and the lookup tables during the implementation. If the sample is not a representative one, you might be in for surprises. It is always better practice to start with two independent samples: use the first sample to come up with the optimum settings, then apply them to the second sample and see what kind of gaps you get (a small sketch of this check follows).
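As a rough illustration of that two-sample check, the sketch below applies a vocabulary built from the first sample to the second sample and reports the values it cannot cover. The vocabulary and samples are invented, and in practice the “settings” would include parsing rules, match thresholds and so on, not just lookup entries.

```python
def coverage_gap(vocabulary, sample):
    """Return the values in `sample` that the vocabulary built from the
    first sample does not recognise (a crude measure of the gap)."""
    unknown = [v for v in sample if v.upper() not in vocabulary]
    return unknown, 1 - len(unknown) / len(sample)

# Vocabulary tuned on the first sample (hypothetical).
street_types = {"ST", "STREET", "AVE", "ROAD"}

# Second, independent sample (hypothetical).
second_sample = ["Street", "Rd", "Blvd", "Ave"]

unknown, coverage = coverage_gap(street_types, second_sample)
print(unknown, round(coverage, 2))   # -> ['Rd', 'Blvd'] 0.5
```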

Now let us talk about “Market Discovery”. It is often said that the discipline of data quality is a mix of art and science, and the art seems to be the dominating part during the “Market Discovery” phase. The goals of “Market Discovery” are basically to identify the conventions and nuances in names (including SME and corporate names) and addresses, besides building up the vocabulary and the associated rules. Let me briefly discuss the issue with respect to names:
1. Find out the possible components of a name. Typical components could be First Name, Middle Name, Last Name, Prefixes and Suffixes. But depending on the traditions and conventions of the market, you may have to include other fields like a second Last Name field and/or a Last Name Prefix and/or a Job Title field, etc.
2. For each of these fields, you need to find the vocabulary which will serve as the initial set of Lookup Tables.
3. The next step will be to figure out the standard naming conventions. Usually, names are written as Title/Salutation + First Name + Middle Name(s) + Last Name + Suffix, but this may vary depending on the underlying country; for example, people usually write the Last Name before the First Name in Japan. You may have some sample data to carry out the research, but it is better to take the help of a local expert to understand the nuances. Such research may also include consulting books and other publications.
Before carrying out this research, you may have to ensure that the data quality tool is capable of handling DBCS or MBCS data (if applicable).
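To show how the component list, vocabulary and conventions come together, here is a toy parser that assigns name tokens using a market-specific component order and lookup tables. The lookup entries and the component orders shown are invented examples for illustration, not real market rules.

```python
# Hypothetical lookup tables built during Market Discovery.
TITLES = {"MR", "MRS", "DR"}
SUFFIXES = {"JR", "SR", "III"}

# Invented convention: component order differs by market
# (e.g. the family name is usually written first in Japan).
COMPONENT_ORDER = {
    "US": ["first_name", "middle_name", "last_name"],
    "JP": ["last_name", "first_name"],
}

def parse_name(raw: str, market: str) -> dict:
    tokens = raw.upper().replace(".", "").split()
    parsed = {}
    if tokens and tokens[0] in TITLES:
        parsed["title"] = tokens.pop(0)        # strip leading title
    if tokens and tokens[-1] in SUFFIXES:
        parsed["suffix"] = tokens.pop()        # strip trailing suffix
    for field, token in zip(COMPONENT_ORDER[market], tokens):
        parsed[field] = token
    return parsed

print(parse_name("Dr. John Michael Smith Jr", "US"))
print(parse_name("Tanaka Akiko", "JP"))
```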

In case you will be using distance function based comparison for record linkage, where the relative weight of a character mismatch depends on the position of the character in the string, you need to know the writing convention (left to right or otherwise) in the region.
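As a toy illustration of why the writing convention matters for such comparisons, the sketch below weights character mismatches more heavily near the “start” of the string as written, and simply reverses the strings when the convention is right to left. This is an invented scoring scheme for illustration, not the algorithm of any particular tool.

```python
def positional_similarity(a: str, b: str, right_to_left: bool = False) -> float:
    """Similarity in [0, 1] where mismatches near the start of the string
    (as written) cost more than mismatches near the end."""
    if right_to_left:                       # compare from the visual start
        a, b = a[::-1], b[::-1]
    length = max(len(a), len(b))
    if length == 0:
        return 1.0
    penalty, total = 0.0, 0.0
    for i in range(length):
        weight = length - i                 # earlier positions weigh more
        total += weight
        if i >= len(a) or i >= len(b) or a[i] != b[i]:
            penalty += weight
    return 1 - penalty / total

print(round(positional_similarity("MOHAMMED", "MOHAMMAD"), 2))  # late mismatch, ~0.94
print(round(positional_similarity("MOHAMMED", "BOHAMMED"), 2))  # early mismatch, ~0.78
```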
Address validation/augmentation is another important thing to consider, and we need to figure out the various possible ways of performing it. The kinds of postal tables available for the country, whether there is any connection between the telephone numbering system and the state (or city, etc.), and whether address correction tables are available must all be looked into and documented.
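Here is a minimal sketch of postal-table augmentation, assuming a hypothetical reference table keyed on postal code. Real postal reference data is market-specific and usually licensed, so both the table contents and the field names below are invented.

```python
# Hypothetical postal reference table: postal code -> city and state.
POSTAL_TABLE = {
    "400001": {"city": "MUMBAI", "state": "MAHARASHTRA"},
    "110001": {"city": "NEW DELHI", "state": "DELHI"},
}

def augment_address(record: dict) -> dict:
    """Fill in missing city/state from the postal table and flag conflicts."""
    reference = POSTAL_TABLE.get(record.get("postal_code", ""))
    if not reference:
        record["postal_status"] = "postal code not found"
        return record
    for field in ("city", "state"):
        if not record.get(field):
            record[field] = reference[field]          # enrich missing value
        elif record[field].upper() != reference[field]:
            record["postal_status"] = f"{field} conflicts with postal table"
    record.setdefault("postal_status", "validated")
    return record

print(augment_address({"postal_code": "400001", "city": "", "state": "Maharashtra"}))
```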
Another important activity to be carried out in this phase is to determine the scope of standardization. This is the phase where the fields which need to be standardized must be identified and the associated vocabulary lists built. A related concept is the use of nicknames and aliases.
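The sketch below shows one simple way such standardization vocabulary and nickname/alias lists might be applied. The entries are invented; in a real implementation these lists come out of the discovery work itself.

```python
# Invented standardization vocabulary and nickname/alias list.
STREET_TYPE_STANDARD = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"}
NICKNAMES = {"BOB": "ROBERT", "BILL": "WILLIAM", "LIZ": "ELIZABETH"}

def standardize(value: str, table: dict) -> str:
    """Replace each token with its standard form when the table knows it."""
    return " ".join(table.get(tok, tok) for tok in value.upper().split())

print(standardize("12 Baker St", STREET_TYPE_STANDARD))   # -> 12 BAKER STREET
print(standardize("Bob", NICKNAMES))                      # -> ROBERT
```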
“Phonetic Variation” depends on the culture and history of the underlying market and must be looked into during this phase. If the native language of the market is not the official language of communication, then issues related to “Phonetic Variation” will be rampant. It is important not just to capture a few such examples but to understand whether there is a pattern to such variations.
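As an illustration, the sketch below derives a crude phonetic key by applying a handful of pattern-level substitutions, so that common spelling variants collapse to the same key. The substitution list is invented for illustration and would in practice be built from the variation patterns actually observed in the market.

```python
# Invented substitution patterns for romanized names; a real list would be
# derived from the phonetic variation patterns observed in the market.
PHONETIC_PATTERNS = [
    ("PH", "F"), ("GH", "G"), ("CK", "K"),
    ("EE", "I"), ("OO", "U"), ("W", "V"), ("Z", "S"),
]

def phonetic_key(name: str) -> str:
    """Collapse common spelling variants into one crude comparison key."""
    key = name.upper()
    for pattern, replacement in PHONETIC_PATTERNS:
        key = key.replace(pattern, replacement)
    return key

# Variants of the same name reduce to the same key.
print(phonetic_key("Wasim"), phonetic_key("Vaseem"))   # -> VASIM VASIM
```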
