Data Quality

Tuesday, June 18, 2013

Demographic to Psychographic – Paradigm shift in Data Quality

The relation between Data Quality and Direct Marketing is not just traditional but insurmountable too. With the flood of personal and behavioral information about the consumers, Direct Marketing has changed its focus from mass marketing to target marketing. This write-up explores the reflection of this change in Data Quality practice.

Direct Marketing in bulk, often termed as database marketing, has a Data Quality solution at its centre. This solution scrubs, augments and then de-duplicates a bunch of name and address records. This is done to minimize sending the same offer to two different consumer profiles pointing to the same real life individual. However, demographic information often finds bigger clusters of consumers such as “All male consumers from Kolkata in the age group 18-25”. It is almost impossible to send targeted promotional offers to such a cluster expecting a high taker rate.

In order to define clusters of consumers likely to purchase certain type of products, it is necessary that the clustering technique includes variables based on past purchase history and other related psychographics besides traditional demographic parameters.

Where does big data fit in all these? Big data, besides being big in volume, can be broadly categorized into two groups viz. data collected from various social networking channels such as facebook, twitter or linked-in and the data collected from various devices such as card readers, GPS tools etc. However, there is at least one common characteristic in both these groups. And this is, they contain huge amount of personal and often psychographic information that can be extracted, parsed and mined.

Direct Marketing today, is in a position to make use of this vast collection of personal and psychographic information in clustering consumers into effective and smaller target groups.

Let us have a closer look into the nature of the psychographic parameters used in this kind of clustering.

1. Social Class
It is the single most used variable for research purposes, and divides the population into groups based on the occupation of the 'Chief Income Earner' (CIE), as such it can be seen as a socio-economic scale.

2. Lifestyle
This involves classifying people according to their values, beliefs, opinions, and interests. There is no one standardized lifestyle segmentation model, instead market research firms, and advertising agencies are constantly devising new categories, which will help target possible consumers of their clients products.

3. Behavioral
This kind of parameters divide the market into groups based on their knowledge, attitudes, uses and responses to the products they use.

Keeping this trend in mind, we can realize that the shift of focus in Direct Marketing must have a significant impact on Data Quality practice itself. Indeed, Data Quality solutions now consider this external third party data for augmentation and then for de-duplication. Complexity in Data Quality with this additional information has become manifold making the journey challenging and all the more exciting.

Monday, October 31, 2011

Genderization Issues in India

Deriving gender from name is important for two reasons.

1. For database marketing, using gender information in addressing the offer letter is crucial.
For example, we can address “John Smith” as “Dear Mr. Smith” and “Peggy Smith” as “Dear MS Smith”

2. Gender code can improve matching by restricting false positives.

Usually Genderization is done using the name components. We will discuss this process in brief for Anglo-Saxon names before jumping into various issues of Genderization in Indian context.

Typical name components are: Salutation/Title, First Name, Middle Name, Last Name and Name Suffix.
Among this, a salutation or title can determine gender uniquely. As for example, values like ‘Mr.’, ‘Mrs.’ can be very helpful for gender determination. But there could be values like ‘Prof.’, ‘Dr.’, which do not give the gender information or the value in this field could be blank. In such cases, we check the first name. Typically a first name like ‘Robert’ corresponds to a male. Sometimes a first name cannot determine the gender uniquely. Then we check the middle name if that can uniquely determine gender. Usually the last name component is not used to determine gender. But name suffixes are surely helpful. Suffixes like ‘Sr.’, ‘Jr.’ point to the male gender.

Using the above logic, in most of the cases, we use the following:

1. Determine gender from title (or salutation), if possible.

2. If gender code is blank, check the suffix and assign a gender code, if possible

3. If gender code is blank then check the first name if gender code can be derived

4. If gender code is still blank then check the middle name if gender code can be derived

5. If gender code is still blank, set it to ‘U’

Above is the outline of the Genderization process for a typical Anglo-Saxon name. Now we will see how the above logic can be modified for tackling Indian names.

We will see the challenges in Indian naming system first so that deriving the gender code becomes less complicated.

1. Middle Names should not be evaluated for genderization except for the rules 6 and 7 below.
This is for the fact that people in various parts of the country mention their father’s (husband’s, in the case of a married woman) first name as the middle name.
Therefore for a name like ARUNA PRASHANT IYER, PRASHANT could be her (ARUNA is a female name) father or husband.

2. Sometimes, first names (remember, we will derive the first name after parsing) lead to the wrong gender code. In such cases, first name should be clubbed to the middle name (or the initial part of the middle name) to derive the gender code. Let us check an example of this. Consider the name DEBIKA RANJAN SEN. Our parsing rule will classify DEBIKA as the first name, RANJAN as the middle name and SEN as the last name. Note that in Indian language, the name is DEBIKARANJAN which points to the gender code ‘M’. But, DEBIKA is a female name. So the gender code from the first name will be ‘F’… (incorrect).

3. Last names might come handy in a few cases. This is unlike Anglo-Saxon names, last names like BIBI, BEGUM, DEBI, KAUR, KHATUN, SULTANA etc. indicates a female name.

4. Name Suffix is rarely used in India.

5. Presence of words like MOHD. (or any variation of this), KAZI, HAJI, SAYED etc. anywhere in the name indicates a male name.

6. If first name ends with (or if the first word in the middle name) is BHAI, it is a male name.
Consider the name DADANBHAI KADVE. Here the first name ends with BHAI. So it is likely to be a male name. This name could also be written as DADAN BHAI KADVE. In this case, entire middle name is BHAI. So the gender code derived from the middle name is ‘M’. Another name could be DADAN BHAI NIRMAL BHAI KADVE. Our parsing rule will store DADAN as the first name, BHAI NIRMAL BHAI as the middle name and KADVE as the last name.

7. If first name ends with (or if the first word in the middle name) BEN then it is a female name.
Look at the name SMITABEN V SOLANKI. In this case, the first name ends with BEN and consequently it is a female name.

8. There are some Indian names (first names) that can be used by a male as well as female. Examples of these names would be KAMAL, SUMAN etc.

Sunday, September 18, 2011

Discovery Phases

Perhaps the most critical phase of any data quality implementation is “Data Discovery” where we study the sample data collected from the site with the goals:

1. Enrich metadata repository specific to the sample data

2. Profile the sample data to gain an insight with respect to the semantics of the data

3. Come up with the set of Data Quality rules for handling the sample data though the steps to be followed during the actual implementation

In the title of this post I deliberately used the term “Phases”. This indicates that there are more than one such discovery phases in practice. Besides the “Data Discovery” phase that we carry out for each implementation, we also conduct the “Market Discovery” phase when we start Data Quality related practices in a new market (i.e. country/region). “Market Discovery” is usually carried out by Data Quality product development companies while “Data Discovery” is carried out by the team responsible for data quality implementations.

I find “Market Discovery” to be very fascinating since you have almost nothing to start with. But let me talk about “Data Discovery” first as this phase is encountered frequently. We start with a set of metadata repository that we have prepared out of “Market Discovery” and enriched during previous “Data Discovery” and implementation activities.
Let me list the things that we have at the start of the “Data Discovery” phase.

1. Data Quality tool

2. Metadata Repository including:

a. Master Lookup Tables such as: Given Name, Last Name, Street Type etc.

b. Supporting Lookup Tables such as Phonetic Sounds

c. Lookup Tables for parsing

d. Basic rules for initial cleanup

e. Understanding of the address correction processes for the underlying market

3. Sample Data from the site

The process of “Data Discovery” cannot be specified and depends on the exact situation but it has to include the followings:

1. Entire sample data needs to be profiled. This will bring up many data quality issues in the sample data that needs to be handled. In case there are multiple source systems, profiling should be carried out differently for different system.

2. After the data profiling, workflows should be set up in the data quality tool and samples from all the source system needs to be processed as per the requirement. Here manual review of the intermediate results after every step in the workflow is necessary.

3. While step 2 is in progress, discussions with the business users must be carried out to finalize address correction formalities and incorporate the corresponding process in the workflow.

4. At the end of DQ processes, present the results/reports to the business users and get their feedback. Incorporate the feedback in the solution and re-generate the reports.
Remember the points:

a. This is an iterative step

b. You may have to make the business users aware of various Data Quality related concepts including the context sensitiveness of matching (Refer to my earlier post on this topic in July 2011)

c. Discuss with the client regarding the usage of external lists (such as postal tables or telephone directories etc.) in enrichment/augmentation of the address information.

At the end of “Data Discovery” you will have updated all the initial data knowledge you had earlier. But be prepared to fine tune the settings and the lookup tables during the implementation. In case, sample is not a representative one, you might have surprises. It is always a better practice to have two independent samples to start with. Use the first sample to come up with the optimum settings and apply it on the second sample and see what kind of gaps you are getting.

Now let us talk about “Market Discovery”. It is often said that the discipline data quality is a mix of art and science. The art in data quality seems to be the dominating part during “Market Discovery” phase. Goals for “Market Discovery” are basically to identify the conventions and nuances in names (including SME and Corporate names) and addresses besides building up the vocabulary and the associated rules. Let me briefly discuss the issue with respect to names:

1. Find out what are the possible components in name. Typical components could be First Name, Middle Name, Last Name, Prefixes and Suffixes. But depending on the traditions and conventions of the market, you may have to include other fields like a second Last Name field and/or a Last Name Prefix and/or a Job Title field etc.

2. For each of these fields, you need to find the vocabulary which will serve as the initial set of Lookup Tables.

3. Next step will be to figure out the standard naming conventions. Usually, names are written like Title/Salutation + First Name + Middle Name(s) + Last Name + Suffix. But such conventions may vary depending upon the conventions in the underlying country. For example, people usually write Last Name before First Name in Japan. You may have some sample data to carry out the research. It is better to take help of a local expert to understand the nuances. Such research may include consulting books and other publications.

Before carrying out this research, you may have to ensure the capability of handling DBCS or MBCS in the data quality tool (if applicable).

In case you will be using distance function based comparison for record linkage, where the relative weight of a character-mismatch depends on the position of the character in a string, we need to know the writing convention (left to right or otherwise) in the region.

Address validation/augmentation is another important thing to consider. We need to figure out various possible ways of performing this. Kind of postal tables that are available for the country, if there is any connection between telephone numbering system and state (or city etc.), if address correction tables are available etc. must be looked into and documented.

Another important activity to be carried out in this phase is to find the scope of standardization. This is the phase where the fields which need to be standardized must be identified and associated list of vocabulary should be built. A related concept is the use of nicknames and aliases.

“Phonetic Variation” depends on the culture and history of the underlying market and must be looked into during this phase. If the native language of the market is not the official language for communication then issues related to “Phonetic Variation” will be rampant. It is important not just to capture a few such examples but to understand if there is a pattern of such variations.

Wednesday, September 7, 2011

Addresses in India

Addresses in India are so confusing and follow many patterns from region to region that it makes address matching/parsing/enrichment very challenging. It is really difficult to write an article on this. Someone who has done years of research into this will require an entire volume to come up with the results.
However, many issues that I have faced during various data quality centric implementations prompt me to write something on it.

In this post, I am going to discuss about a few complicated patterns and some of their numerous exceptions.

According to the standard addressing convention, a typical street address has three components where a street number is followed by a street name which is then followed by a street type (there are places where street type appears before the street name). I will begin my discussions with this convention. Yes in many cities in India this is a standard addressing convention.

Ariff Road

A close friend of mine lives in the address “27/2 Ariff Road” which is not very far from my home. When I visited her last, out of my curiosity, I took a tour of the entire Ariff Road and looked at the addresses written on the surrounding houses and shops.

By the way, Ariff Road is a relatively narrow lane in North Kolkata. There are a few lanes and by-lanes that originated from Ariff Road. Interestingly most of these are called Ariff Road too. At least, the address on these houses in such lanes and by-lanes bear Ariff Road name. I was rather surprised to see the numbers on these addresses. These were not just unordered but totally chaotic. The house opposite to “27/2 Ariff Road” was “1H/2A Ariff Road”. I came across another house with the address “12/7/A/1 Ariff Road”

We find such addresses in many areas in North Kolkata. We will have to refer to history of this city to find how such addresses came into existence. It is not a planned city and was formed by the British in the late 17^th. Centaury after the agent J. Charnok purchased three villages from local landlord Sabarna Chowdhury. Slowly but steadily this city as desired by the British, started to grow without any master plan.
So when the postal system was put in place, the numbers were assigned in some order. But subsequently new houses were built and house-holds/families got split requiring separate addresses. Second observation is crucial here. Like in many places, in this part of the country too, the system of joint family (extended family) was prevailing. Obviously they required large houses. But with the passage of time, this system changed and some of the family member moved out while some others continued to live under the same roof but built separate dwelling units. Some of them rented out a portion of their premises. A significant number of these old houses are now sold to the real estate developers and promoter who are building multi-storied apartments. And the entire system is becoming complicated.

Main Road & Cross Road

There are a few places (Bangalore or Bangaluru is one of those) where the entire area is divided by main roads running in one direction and cross roads running perpendicular to it. An address in such a place is described by the nearest main and cross road information besides the house name/number.

Plot/Block/Sector

Marking an address by these (or some of these) is found in the planned cities in the country. Addresses in Chandigarh (capital of two neighboring states as well as a union territory itself) contain sector information.
In Salt lake area (a suburb of Kolkata), the entire region is divided into sectors. There are blocks in each sector and plots in each block. So a typical address here looks like:
“Plot Y 14, Block – EP, Sector 5 Salt Lake”

Laxminarayan Jewelers - different entities with similar names/addresses

Few months back, I took a tour of the city Kolkata. My intention was to observe the addresses on the houses I come across. In one area, I saw a number of shops with the same name
“Laxminarayan Jewelers”. Sometimes I noticed a little variation “Laxminarayan & Sons”. These shops were located in the basement of a huge building.
It took a few weeks for me to find out the history behind this. Someone called “Laxminarayan” established a shop many years ago. But his sons got separated and started their own business under the same roof but as different entities. They all bore matching names and addresses (addresses differed by a number like UNO 1 55 XYZ Road, UNO 2 55 XYZ Road etc.)

“Diagonally Opposite to” -a land of landmarks

Last month I took a new telephone connection. In the process, one executive from this telecom company called me to verify/cross-check the address that I provided in the application. She repeatedly asked for a landmark near my house.
While profiling addresses in India, rampant usage of landmark information is noticed. Some of the identifiers for landmark are “Near”, “Opposite to”, “Behind”, “Beside”, “Next to” and not to forget “Diagonally Opposite to”

Adjacent houses or apartments

Look at this address - Office Space 2 & 5, Paramount Complex, Navelim, Goa – 403707. It is a commercial address that points to a shop. This shop, however, is spread across two shopping units in the same floor of a shopping complex.

Also, one my friend has this address: Apt 5C & 5D, 12 Mandevilla Gardens, Kolkata – 19

This possess a serious challenge for address parsing as we need to have multiple fields for containing similar information like two fields for apartment number, two fields for street number etc.

Personal Names in addresses

Many Indian addresses and esp. the ones from rural areas begin with a personal name. Usually the name of the head of the family is mentioned in the addresses. Many times, post men, in these areas, know people by the name and in-case the letter is addressed to someone else in the family who is not known to the post man, the letter gets delayed. This is the primary reason for using the name of the head of the family in the address. Most popular keyword to identify such names is C/O or “care of”.
There are other variations of C/O where the exact relation is mentioned like “or S/O or son of”, “D/O or daughter of”, “M/O or mother of” etc.

A typical address in this format looks like:

“C/O Ashim Biswas, 33 Govinda Naskar Lane, Sriharipara”

Addresses in Goa

Goa is a famous tourist spot in India. There is another equally interesting fact surrounding this place. India was dominated by the British for over two hundred years. Goa, on the other hand, was dominated by the Portuguese for over four hundred years. India got its independence in 1947 but the operation “Vijay” was carried out by Indian army in 1961 to liberate Goa.
Names including individual names as well as name of places/buildings/roads in this place sometimes follow the Portuguese style.
Here one can find a street named “18^th. June Road”

Now another controversy and debate is going on regarding renaming these streets and buildings!

Roman digits

Usage of roman digits is abundant in Indian addresses. Consider the address:
“C-1 295/296 Rohini Sector-11 Near Bay Japanese Park Back Of welcome Hotel”. In addresses like this, Sector-11 sometimes written as Sector – XI (or Sec XI). Usually sector numbers in Indian addresses at times, are written using roman digits.

New and Old

Yesterday evening, I was walking down a street named “Camac Street”. When I was looking at a new sign board which displaying another name for this famous street, one my friend happened to call me and asked me where I was. I said “Abanindranath Thakur Sarani” and he expressed his concerns that I was in a weird place. I had to tell him that the new name for the “Camac Street”, according to the Kolkata Municipal Corporation, was “Abanindranath Thakur Sarani” to settle things!
Places in India are slowly coming out of its colonial structure and conventions and as a part of this, renaming things is a commonplace now. For this reason, cities like “Bombay” has renamed as “Mumbai” or “Madras” has become “Chennai” and the list continues. Well... “Kolkata” is no different here. Few years back, this city was known as “Calcutta”. It seems that my state “West Bengal” will soon become “Pashchimbanga”.

Also, the postal department has introduced new postal code (or PIN code) numbers recently. PIN code of the place where I live is 700136. Some of the courier companies refused to deliver packages to my residence initially as they were using the old number 700059

How about this address? “Old No 160, New No 111,6th Street Extension, 100 Feet Road, Gandhipuram”

Multiple Locations

Many addresses in this country contains one or two locations. An example will be:
“36, Arakashan Road, Ram Nagar, Paharganj, Diagonally Opposite to New Delhi Railway Station”
This address points to a commercial place in “Paharganj” area of “New Delhi” (capital of India). It also says that the place is in “Ram Nagar” area which is located in “Paharganj”. Usually the first location encountered when the address is read (from left to right) is located within the second location mentioned in the address.

But one will come across plenty of addresses involving more than two locations.
An example will be:
“8A/48, W.E.A. Channa Market, Karol Bagh, Behind Pusa Road”

Incomplete or partial address

Look at the business address: “C/O Star Investments, OPP. Head Post Office Panjim, Tiswadi – 403001”
This address does not contain a street name and number and yet a valid address i.e. addressee can be reached using the postal service.
Most of addresses such as the above can be written in multiple ways which are apparently not similar.

Thursday, August 18, 2011

House-holding dilemma with Indian Data

House-holding or finding the records under the same house-hold is a typical data quality activity as far as linking individual records goes. According to Wikipedia, a house-hold is defined as “the basic residential unit in which economic production, consumption, inheritance, child rearing, and shelter are organized and carried out”. Typically, it refers to a family unit that stays in the same dwelling unit.

Household matches are found out using these properties:

1. Last Name i.e. Family Name should be the same and

2. Address (residential) on the records should be same

Let us look at the first point that is last name (or family name) matching. This is done under the assumption that the family members share the same family name. But this often fails in Indian context such as:

1. Muslim families (well…most of them) do not have a family name concept.

2. Traditionally family name concept was not present in South India. Parents in south Indian families bestowed a single name to their child at birth and appended it with many initials. The abbreviations could stand for the ancestral village and the father’s first name in Karnataka, the house name in Kerala, for the caste name in Tamil Nadu and in Andhra Pradesh, the place of family origin.

I encountered this issue while performing name parsing for south Indian names. However, if we use a name component called last name instead of the family name (or surname) and use this component for individual matching then the complexity reduces a little when cross-matching is also used covering the name components. But for house-holding, this possesses a tough challenge.

Let us now look at the issues in address matching. We need to look at this keeping in mind the issues we saw in last name matching. The biggest issue in address matching is incomplete or partial addresses.

Let us look at the following addresses:

Address	Potentially Matching Address
Y 14, BLOCK EP, SECTOR V, SALT LAKE, KOLKATA, 700091	BLOCK EP, SECTOR V, SALT LAKE, KOLKATA, 700091
16A GARIAHAT ROAD, APT 1C, KOLKATA-19	16A GARIAHAT ROAD, KOLKATA 700019

Addresses on both the rows are close. But a detailed inspection reveals that the second address on these rows do not have the dwelling number. In fact, if these addresses appear on two records where names are matching then we would accept these as matches. But when there is no family name on the records then?
It’s a big question mark. Take for example the second address on the second row. It is a close match for the address 16A GARIAHAT ROAD, APT 2B, KOLKATA 700019 too.
Though residential telephone numbers are of much help, presence of such incomplete addresses possesses big challenges in house-holding. According to Graham Rhind (an expert in handling international data), house-holding should be avoided as far as possible (except some traditional anglo-saxon communities) because it hardly ever works.

Note: Discussion only includes individual house-holds and not corporate house-holds

Saturday, August 6, 2011

Requirements for a Data Quality solution (updated on 6th August)

I have often come across questions like what could be the necessary properties for software that links records from my colleagues and clients and other associates. It largely depends on the objective of the record linking (please read my earlier post on Context Sensitiveness). However, I am giving a few points that can be considered as the necessary properties of such software.
For the ease of my convenience, I am dividing the properties into two exclusive sets of “Consolidation” and “Matching”. “Consolidation” is the data preparation steps that are carried out before any operation for matching begins. Steps followed in a typical “Consolidation” are:

Basic Cleanup
Here we do initial cleanup of the records like replacement or removal of special characters, replacing multiple consecutive spaces by a single whitespace.

· Software should allow the users to write grammar rules for carrying out such cleanup. Grammar rules can be defined for any specific field.

· Users should be able to write grammar rules applicable to the beginning/end of a field. Good matching software should be flexible enough to accept regular expressions.

Adjusting the misfielded information
Often values are put different fields such as Job Title values may be present in the name field even if there is a separate field in the system for Job Title. This routine should be able to identify such occurrences and rectify those.

· Setting must accept at least two fields- one for the search and the other for the destination

· Users will be able to decide in case the destination field is already populated

· Solution can handle different requirements like pickup and adjust the search keyword or the portion of the field from the start till the keyword or the keyword till the end.

· Users should be able to add/edit/delete different keywords.

· Settings should be able to accept different sets of keywords for different search.

Misfielded data can be handled using parsing techniques and that is why a separate routine for handling these must be faster enough.

Identification
This routine can classify records into various types based on the presence of some keywords. An example of this would be to classify and flag individual customer records (B2C records) and Corporate/SME records (B2B records).

· Users must be able to add/edit/delete keywords for each classification/identification.

Branching
This routine is closely related to identification. Often different type of records need to be treated differently down the line and hence need to be put in separate buckets. Branching does exactly this.

Parsing
Parsing is the process of splitting the words in a field like Name or Address into multiple component fields and is discussed in detail in my earlier posts in June 2011.

· Users must be able to add/edit/delete keywords in the lookup tables for parsing

· Users must be able to add new mask characters, lookup tables

· Users must be able to add/edit/delete parsing rules

· A good software must be able to handle partial parsing (see my earlier post on the topic)

· Users must be able to create new parsing routines for any field

· Software must be able to generate reports of parsing so that the quality can be assessed by the users and settings may be fine-tuned.

Genderization
The process of determining the gender of individual records based on various name components like Title or Given Name etc. is called Genderization.

· Users should be able to prioritize the name components to be evaluated for Genderization

· Users should be able to add/edit/delete gender codes corresponding various name words

· Software must provide Genderization report that includes the percentages of records for which gender code could be assigned and the percentages of records for which gender codes could not be assigned, a sample of names where gender codes could be assigned and a sample of names where gender code could not be assigned

Augmentation
It is the process of enriching the data. One of the major data quality concern is the missing values for important fields. Augmentation or Enrichment is the process by which a data quality solution can take care of a portion of this. Besides filling up the missing information, this routine can also be used for validating the existing information e.g. a record might show the state name as NJ while the name of the city may be Dallas.
This routine can be divided into two sub-routines viz. internal augmentation and external augmentation.

1. Internal Augmentation: It is the process of enriching information using the data values itself. For example, on a particular record, the name of the city may reside in the address lines instead of the city field.

2. External Augmentation: It is the process of enriching/validating information using external data such as postal information, telephone directory etc.

· Users must be able to select the validation/enrichment options from a list

· Software will flag the records where validation fails for a particular setting

· A report must be generated showing the percentages of enriched records, percentages of records that could not be enriched, sample of enriched records etc.

Standardization
It is the process of transforming similar data values into a unique format. For example, different spellings and abbreviations for city names, state names etc. are made into a standard format.

· Users must be able to select the fields which are going to be standardized

· For every field to be standardized, users must be able to specify lookup tables that contain the variations and the standard format.

· Each lookup table containing possible variations and the standard format need to be editable.

Rejection Routines
This routine can reject/flag records based on user defined rules like use of profanity, having information on which validation failed etc.

· Users must be able to define rejection rules involving one or more fields and one or more lookup tables

· All the lookup tables must be editable by the users

· Users must be able to specify if rejection or flagging will be done

· Software must be able to generate reports containing sample records (that got rejected or flagged) for each setting.

Once the “Consolidation” steps are executed, steps in the “Matching” process are followed for the surviving records. Steps followed in the “matching” routine are:

Defining Match Groups/Hierarchy
Users define the match hierarchy here. For example, the matching software can find the address matches and then for the records with matching addresses, it can probe further and can find out house-holds etc. So, defining more than one match groups requires a relation among these match groups. Some of the match groups may be unrelated while some others may be related

Two-step matching
A good matching software is able to perform matching in two steps. First step is called primary matching where hard keys defined for each record are compared to arrive at a match. Readers can refer to my earlier post on key-based matching in May 2011.

1. Primary Matching

· Users must be able to define one or more match keys

· Users must be able to specify, for each key, what would be done in case a key element is missing for a record.

· Users must be able to specify the string modes in which a field (or a portion of it) should be included in a match key. Two strings can be compared in various string modes such as

o Consonated Mode: In this case, all the vowels are dropped from the string

o Vowelized Mode: In this case, all the consonants are dropped from the string

o Numeric Mode: In this case, all the non-numeric characters are dropped from the string

o Phonetized Mode: In this case, the entire string is phonetically transformed (for details, please read my earlier post in June 2011)

o Alpha Mode: In this case, all the numeric and special characters are dropped from the string

o Alpha-Numeric Mode: In this case, all the special characters are dropped from the string

· Users must be able to review the primary match results by generating sample reports

1. Fuzzy Matching
Fuzzy Matching works on the probable matches discovered after primary matching. In this matching probable match pairs are classified into three subsets of definite matches, definite non-matches and suspected matches. This type of matching is also called probabilistic matching and can be implemented in many ways.

· Fuzzy matching should be an optional step. Users may define the primary keys in a satisfactory way and may decide to consider the output of primary matching as final set of matches.

· This routine must allow the users to specify the action(s) in case blank values are required to be compared to another blank or non-blank value. This setting may be different in case of different fields.

· Fuzzy matching must allow the users to specify the string modes in which fields will be compared

· This routine must allow users to specify the possible cross-matching options (for details please my earlier post in may 2011)

· Users must be able to decide, for each fuzzy matching rule, the associated fields to be compared

· At the end of fuzzy matching, users should be able to generate sample reports containing the matched and/or un-matched records for each group in the hierarchy.

Manual Decision/Review
Manual review is an important part of a data quality tool. System must provide the appropriate interface to the users so that each of the matches can be reviewed and the following decisions can be made:

1. Mark as match

2. Mark as un-match (This option will make sure that the concerned records are never matched in future, if that is the case)

3. Hold a case for further verification and review by higher authority

Consolidation of indirect matches
After the final matching (Fuzzy matching in case of a 2-step matching or the primary matching in case of a 1-step matching), the matched records need to be consolidated. This routine, besides doing that, must be able to assign unique cluster number to every record such that matching records in a group (or cluster) get the same number. For more detail, please read my earlier post “Indirect Matching” in June 2011.

· At the end of this consolidation, users must be able to generate sample match/un-match reports corresponding to each match group to review the results.

RaRationalization or selecting the survivo
This routine works after the consolidation of final matches is done and entire set of records are put into several clusters where each cluster contains matching records. Obviously, each un-matched record forms a cluster of size one.
Often business requires a single record representing a cluster of matching records. This was discussed in detail in my earlier post “Constructing the Survivor Record” posted in June 2011.

· Users must be able to write code snippets to build the routine in case, it is complicated. System must provide the users with the code editor and available functions and fields and the logical operators.

Data Profiling is an important function of a good data quality software. Requirements around this would be:

Column analysis
This routine, given a table will generate the following reports:

· Percentages of NULL or blank

· Percentages of initials

· Percentages of numeric values

· Percentages of alpha values

· Percentages of unique values

Frequency analysis
This routine, given a table will generate the following reports:

· Frequency Report

· Pattern frequency Report

Table analysis
Given two tables, this routine should be able to identify:

· Common unique key between the tables

· Orphan records i.e. records present in child table but not in master

Generate alerts
This routine will allow the users to define business rules and will generate compliance reports. Optionally this routine can generate failure alerts and send e-mails.

There are, of course, detail level requirements around each topic mentioned above.

We have discussed so far the technical requirements of the good matching software. But there are a few more requirements for this software depending upon the context. These are:

Input and Output
System should be able to read data from a number of sources including

1. Text files

2. Delimited files

3. Excel files

4. RDBMS files

Similarly, system should be able to output data in several formats.

GUI
System must have a good GUI. The steps we discussed earlier can be arranged in any order depending upon the requirement. Good to have a GUI that is drag-drop facility.

Workflow
I have mentioned requirements corresponding to a number of subject areas for the entire software. Users should be able to create workflows incorporating one or more source files and the required processes/functions and settings. These processes/settings should be flexible enough so that different workflows may use these in different order or even may not use some of these as per the context.

Integration
Such a matching tool can be used stand alone or another application may use the services.

Performance
Functional requirements for matching software require a lot of string manipulation. But the software must be quick enough to process large volume of data files.

Thursday, July 21, 2011

Context Sensitiveness in Matching

At present there is a serious discussion going on in the Linekdin group “Matching” (You need to be a member of the networking site Linekdin and also a member of the group “Matching” to access the thread) on the subject of Context Sensitiveness in matching. The subject is closely related to probable errors in matching. Looking at the richness of the content in this discussion and the vastness of the topic itself, I am tempted to share my understanding in this regard.

Let me begin by sharing an experience I had a few years back while implementing a data quality solution in a private bank. This bank was in the process of implementing a data quality solution for its large customer base. In order to fine-tune the matching algorithm, it gave us a control/test file (consisting of a few hundred records) and with this, we tried various possible algorithms. It took us some time before we came up with the proper match algorithm for the control file. Both the business users and the IT users were happy with the result displayed for the control file. But to our horror, the same algorithm became a disaster when a portion of the customer data was processed. We finally had to realign the algorithm from the start.

Before I explain the scenario, let me give one example of the disparity. Consider the two individual records (only a few fields) in the table below:

Name	Address	City	Tel1	Tel2
ABHISEK C KOTCHER	C TOWER, UNO 12, JEEVAN MANZIL	SURAT	1111111111	2222222222
AVISEK C	C12 OFF MG RD, NEAR JEEVAN MANZIL	SURAT	3333333333

Above two records were matched by the algorithm developed using the control file. But for the customer data integration activity these records were not a match as we realized later.

We wanted to know if this one was a one of case or there was something fundamentally wrong. To our shock, we found that the control file given by this bank was a portion taken out from their fraud detection de-duplication database which was prepared by another vendor earlier. Unfortunately this vendor did not make the bank aware of the effect of using the same or similar match algorithm under different context.

In case you can spare some time, you may refer to my earlier post “Errors in matching” posted during May 2011.
In a nutshell, there are two types of possible error when we say; there is a match (or no match) between two specific records. When the algorithm says it’s a match but actually the records represent two different entities, the error is called a false positive. And when the algorithm says that there is no match between the records but actually the records represent the same entity then the error called a false negative. Depending on the context in which the match results will be used there are two types of match objectives. One situation demands that a slight similarity should be captured by the match algorithm and thereby the corresponding objective becomes to reduce false negatives. Another type of scenario demands that two records should match only when there is strong similarity and the corresponding objective in this case becomes to reduce false positives.

In a fraud detection type of context, the objective is to capture a slight similarity so that none is escaped. But in a typical customer data integration type of context, the objective is to allow two records to match only when there is strong evidence that these represent the same entity.

I do not think there is any strategy to improve the match algorithm in a way so that both false positives and false negatives reduce (unless of course you change the input file/files!). Unfortunately there is no mathematical proof of this but experience of people in this field tells so. And that is why we have these two possible objectives rather than just one that requires reduction of both false positives and false negatives.

The idea is when one adjusts the match algorithm to reduce false positives as in the case of a typical CDI type of situation by making the match settings stricter, one increases the risk of having more false negatives. On the other hand, when one adjusts the match algorithm to reduce false negatives as in the case of a typical fraud detection type of situation by making the match settings relaxed, one increases the risk of having more false positive.

So, before you start working on the match algorithm (setting), be sure of the objective.