Sunday, September 18, 2011

Discovery Phases


Perhaps the most critical phase of any data quality implementation is “Data Discovery” where we study the sample data collected from the site with the goals:
1.       Enrich metadata repository specific to the sample data
2.       Profile the sample data to gain an insight with respect to the semantics of the data
3.       Come up with the set of Data Quality rules for handling the sample data though the steps to be followed during the actual implementation
In the title of this post I deliberately used the term “Phases”. This indicates that there are more than one such discovery phases in practice. Besides the “Data Discovery” phase that we carry out for each implementation, we also conduct the “Market Discovery” phase when we start Data Quality related practices in a new market (i.e. country/region). “Market Discovery” is usually carried out by Data Quality product development companies while “Data Discovery” is carried out by the team responsible for data quality implementations.
I find “Market Discovery” to be very fascinating since you have almost nothing to start with. But let me talk about “Data Discovery” first as this phase is encountered frequently. We start with a set of metadata repository that we have prepared out of “Market Discovery” and enriched during previous “Data Discovery” and implementation activities.
Let me list the things that we have at the start of the “Data Discovery” phase.
1.       Data Quality tool
2.       Metadata Repository including:
a.       Master Lookup Tables such as: Given Name, Last Name, Street Type etc.
b.      Supporting Lookup Tables such as Phonetic Sounds
c.       Lookup Tables for parsing
d.      Basic rules for initial cleanup
e.      Understanding of the address correction processes for the underlying market
3.       Sample Data from the site

The process of “Data Discovery” cannot be specified and depends on the exact situation but it has to include the followings:
1.       Entire sample data needs to be profiled. This will bring up many data quality issues in the sample data that needs to be handled. In case there are multiple source systems, profiling should be carried out differently for different system.
2.       After the data profiling, workflows should be set up in the data quality tool and samples from all the source system needs to be processed as per the requirement. Here manual review of the intermediate results after every step in the workflow is necessary.
3.       While step 2 is in progress, discussions with the business users must be carried out to finalize address correction formalities and incorporate the corresponding process in the workflow.
4.       At the end of DQ processes, present the results/reports to the business users and get their feedback. Incorporate the feedback in the solution and re-generate the reports.
Remember the points:
a.       This is an iterative step
b.      You may have to make the business users aware of various Data Quality related concepts including the context sensitiveness of matching  (Refer to my earlier post on this topic in July 2011)
c.       Discuss with the client regarding the usage of external lists (such as postal tables or telephone directories etc.) in enrichment/augmentation of the address information.
At the end of “Data Discovery” you will have updated all the initial data knowledge you had earlier. But be prepared to fine tune the settings and the lookup tables during the implementation. In case, sample is not a representative one, you might have surprises. It is always a better practice to have two independent samples to start with. Use the first sample to come up with the optimum settings and apply it on the second sample and see what kind of gaps you are getting.

Now let us talk about “Market Discovery”. It is often said that the discipline data quality is a mix of art and science. The art in data quality seems to be the dominating part during “Market Discovery” phase. Goals for “Market Discovery” are basically to identify the conventions and nuances in names (including SME and Corporate names) and addresses besides building up the vocabulary and the associated rules. Let me briefly discuss the issue with respect to names:
1.       Find out what are the possible components in name. Typical components could be First Name, Middle Name, Last Name, Prefixes and Suffixes. But depending on the traditions and conventions of the market, you may have to include other fields like a second Last Name field and/or a Last Name Prefix and/or a Job Title field etc.
2.       For each of these fields, you need to find the vocabulary which will serve as the initial set of Lookup Tables.
3.       Next step will be to figure out the standard naming conventions. Usually, names are written like Title/Salutation + First Name + Middle Name(s) + Last Name + Suffix. But such conventions may vary depending upon the conventions in the underlying country. For example, people usually write Last Name before First Name in Japan. You may have some sample data to carry out the research. It is better to take help of a local expert to understand the nuances. Such research may include consulting books and other publications.
Before carrying out this research, you may have to ensure the capability of handling DBCS or MBCS in the data quality tool (if applicable).

In case you will be using distance function based comparison for record linkage, where the relative weight of a character-mismatch depends on the position of the character in a string, we need to know the writing convention (left to right or otherwise) in the region.
Address validation/augmentation is another important thing to consider. We need to figure out various possible ways of performing this. Kind of postal tables that are available for the country, if there is any connection between telephone numbering system and state (or city etc.), if address correction tables are available etc. must be looked into and documented.
Another important activity to be carried out in this phase is to find the scope of standardization. This is the phase where the fields which need to be standardized must be identified and associated list of vocabulary should be built. A related concept is the use of nicknames and aliases.
“Phonetic Variation” depends on the culture and history of the underlying market and must be looked into during this phase. If the native language of the market is not the official language for communication then issues related to “Phonetic Variation” will be rampant. It is important not just to capture a few such examples but to understand if there is a pattern of such variations.

Wednesday, September 7, 2011

Addresses in India

Addresses in India are so confusing and follow many patterns from region to region that it makes address matching/parsing/enrichment very challenging. It is really difficult to write an article on this. Someone who has done years of research into this will require an entire volume to come up with the results.
However, many issues that I have faced during various data quality centric implementations prompt me to write something on it.
In this post, I am going to discuss about a few complicated patterns and some of their numerous exceptions.
According to the standard addressing convention, a typical street address has three components where a street number is followed by a street name which is then followed by a street type (there are places where street type appears before the street name). I will begin my discussions with this convention. Yes in many cities in India this is a standard addressing convention.

Ariff Road

A close friend of mine lives in the address “27/2 Ariff Road” which is not very far from my home. When I visited her last, out of my curiosity, I took a tour of the entire Ariff Road and looked at the addresses written on the surrounding houses and shops.
By the way, Ariff Road is a relatively narrow lane in North Kolkata. There are a few lanes and by-lanes that originated from Ariff Road. Interestingly most of these are called Ariff Road too. At least, the address on these houses in such lanes and by-lanes bear Ariff Road name. I was rather surprised to see the numbers on these addresses. These were not just unordered but totally chaotic. The house opposite to “27/2 Ariff Road” was “1H/2A Ariff Road”. I came across another house with the address “12/7/A/1 Ariff Road”
We find such addresses in many areas in North Kolkata. We will have to refer to history of this city to find how such addresses came into existence. It is not a planned city and was formed by the British in the late 17th. Centaury after the agent J. Charnok purchased three villages from local landlord Sabarna Chowdhury. Slowly but steadily this city as desired by the British, started to grow without any master plan.
So when the postal system was put in place, the numbers were assigned in some order. But subsequently new houses were built and house-holds/families got split requiring separate addresses. Second observation is crucial here. Like in many places, in this part of the country too, the system of joint family (extended family) was prevailing. Obviously they required large houses. But with the passage of time, this system changed and some of the family member moved out while some others continued to live under the same roof but built separate dwelling units. Some of them rented out a portion of their premises. A significant number of these old houses are now sold to the real estate developers and promoter who are building multi-storied apartments. And the entire system is becoming complicated.

Main Road & Cross Road

There are a few places (Bangalore or Bangaluru is one of those) where the entire area is divided by main roads running in one direction and cross roads running perpendicular to it. An address in such a place is described by the nearest main and cross road information besides the house name/number.

Plot/Block/Sector

Marking an address by these (or some of these) is found in the planned cities in the country. Addresses in Chandigarh (capital of two neighboring states as well as a union territory itself) contain sector information.
In Salt lake area (a suburb of Kolkata), the entire region is divided into sectors. There are blocks in each sector and plots in each block. So a typical address here looks like:
“Plot Y 14, Block – EP, Sector 5 Salt Lake”

Laxminarayan Jewelers - different entities with similar names/addresses

Few months back, I took a tour of the city Kolkata. My intention was to observe the addresses on the houses I come across. In one area, I saw a number of shops with the same name
“Laxminarayan Jewelers”. Sometimes I noticed a little variation “Laxminarayan & Sons”. These shops were located in the basement of a huge building.
It took a few weeks for me to find out the history behind this. Someone called “Laxminarayan” established a shop many years ago. But his sons got separated and started their own business under the same roof but as different entities. They all bore matching names and addresses (addresses differed by a number like UNO 1 55 XYZ Road, UNO 2 55 XYZ Road etc.)

“Diagonally Opposite to” -a land of landmarks

Last month I took a new telephone connection. In the process, one executive from this telecom company called me to verify/cross-check the address that I provided in the application. She repeatedly asked for a landmark near my house.
While profiling addresses in India, rampant usage of landmark information is noticed. Some of the identifiers for landmark are “Near”, “Opposite to”, “Behind”, “Beside”, “Next to” and not to forget “Diagonally Opposite to”
Adjacent houses or apartments


Look at this address - Office Space 2 & 5, Paramount Complex, Navelim, Goa – 403707. It is a commercial address that points to a shop. This shop, however, is spread across two shopping units in the same floor of a shopping complex.
Also, one my friend has this address: Apt 5C & 5D, 12 Mandevilla Gardens, Kolkata – 19

This possess a serious challenge for address parsing as we need to have multiple fields for containing similar  information like two fields for apartment number, two fields for street number etc.

Personal Names in addresses

Many Indian addresses and esp. the ones from rural areas begin with a personal name. Usually the name of the head of the family is mentioned in the addresses. Many times, post men, in these areas, know people by the name and in-case the letter is addressed to someone else in the family who is not known to the post man, the letter gets delayed. This is the primary reason for using the name of the head of the family in the address. Most popular keyword to identify such names is C/O or “care of”.
There are other variations of C/O where the exact relation is mentioned like “or S/O or son of”, “D/O or daughter of”, “M/O or mother of” etc.
A typical address in this format looks like:
“C/O Ashim Biswas, 33 Govinda Naskar Lane, Sriharipara”

Addresses in Goa

Goa is a famous tourist spot in India. There is another equally interesting fact surrounding this place. India was dominated by the British for over two hundred years. Goa, on the other hand, was dominated by the Portuguese for over four hundred years. India got its independence in 1947 but the operation “Vijay” was carried out by Indian army in 1961 to liberate Goa.
Names including individual names as well as name of places/buildings/roads in this place sometimes follow the Portuguese style.
Here one can find a street named “18th. June Road”

Now another controversy and debate is going on regarding renaming these streets and buildings!

Roman digits

Usage of roman digits is abundant in Indian addresses. Consider the address:
“C-1 295/296 Rohini Sector-11 Near Bay Japanese Park Back Of welcome Hotel”. In addresses like this, Sector-11 sometimes written as Sector – XI (or Sec XI). Usually sector numbers in Indian addresses at times, are written using roman digits.

New and Old

Yesterday evening, I was walking down a street named “Camac Street”. When I was looking at a new sign board which displaying another name for this famous street, one my friend happened to call me and asked me where I was. I said “Abanindranath Thakur Sarani” and he expressed his concerns that I was in a weird place. I had to tell him that the new name for the “Camac Street”, according to the Kolkata Municipal Corporation, was “Abanindranath Thakur Sarani” to settle things!
Places in India are slowly coming out of its colonial structure and conventions and as a part of this, renaming things is a commonplace now. For this reason, cities like “Bombay” has renamed as “Mumbai” or “Madras” has become “Chennai” and the list continues. Well... “Kolkata” is no different here. Few years back, this city was known as “Calcutta”. It seems that my state “West Bengal” will soon become “Pashchimbanga”.

Also, the postal department has introduced new postal code (or PIN code) numbers recently. PIN code of the place where I live is 700136. Some of the courier companies refused to deliver packages to my residence initially as they were using the old number 700059

How about this address? “Old No 160, New No 111,6th Street Extension, 100 Feet Road, Gandhipuram”

Multiple Locations

Many addresses in this country contains one or two locations. An example will be:
“36, Arakashan Road, Ram Nagar, Paharganj, Diagonally Opposite to New Delhi Railway Station”
This address points to a commercial place in “Paharganj” area of “New Delhi” (capital of India). It also says that the place is in “Ram Nagar” area which is located in “Paharganj”. Usually the first location encountered when the address is read (from left to right) is located within the second location mentioned in the address.
But one will come across plenty of addresses involving more than two locations.
An example will be:
“8A/48, W.E.A. Channa Market, Karol Bagh, Behind Pusa Road”

Incomplete or partial address

Look at the business address: “C/O Star Investments, OPP. Head Post Office Panjim, Tiswadi – 403001”
This address does not contain a street name and number and yet a valid address i.e. addressee can be reached using the postal service.
Most of addresses such as the above can be written in multiple ways which are apparently not similar.