Thursday, August 18, 2011

House-holding dilemma with Indian Data

House-holding, that is, finding the records that belong to the same house-hold, is a typical data quality activity when linking individual records. According to Wikipedia, a house-hold is defined as “the basic residential unit in which economic production, consumption, inheritance, child rearing, and shelter are organized and carried out”. Typically, it refers to a family unit that lives in the same dwelling unit.

House-hold matches are typically found using two properties:
1. The last name (i.e. the family name) should be the same, and
2. The residential address on the records should be the same.
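As a rough sketch, the two rules above amount to grouping records on a (last name, address) key. The field names (`id`, `last_name`, `address`) and the sample records below are assumptions made purely for illustration; they do not come from any particular data quality tool.

```python
from collections import defaultdict


def household_key(record):
    """Build a naive key from last name and address (case and extra spaces ignored)."""
    last_name = " ".join(record["last_name"].upper().split())
    address = " ".join(record["address"].upper().split())
    return (last_name, address)


def group_households(records):
    """Group record ids that share the same (last name, address) key."""
    households = defaultdict(list)
    for record in records:
        households[household_key(record)].append(record["id"])
    return dict(households)


# Hypothetical records, invented for this example.
records = [
    {"id": 1, "last_name": "Sen", "address": "16A Gariahat Road, Apt 1C, Kolkata 700019"},
    {"id": 2, "last_name": "SEN", "address": "16A  GARIAHAT ROAD, APT 1C, KOLKATA 700019"},
    {"id": 3, "last_name": "Sen", "address": "16A Gariahat Road, Apt 2B, Kolkata 700019"},
]

households = group_households(records)
for key, ids in households.items():
    print(key, ids)
```

Records 1 and 2 land in the same group, while record 3 stays separate because of the different apartment. An exact key like this only works when both the last name and the full address are reliably present.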
Let us look at the first point, last name (or family name) matching. It rests on the assumption that family members share the same family name. But this assumption often fails in the Indian context, for example:
1. Muslim families (well… most of them) do not have a family name concept.
2. Traditionally, the family name concept was not present in South India. Parents in South Indian families bestowed a single name on their child at birth and attached one or more initials to it. The initials could stand for the ancestral village and the father’s first name in Karnataka, the house name in Kerala, the caste name in Tamil Nadu and the place of family origin in Andhra Pradesh.
I encountered this issue while parsing South Indian names. If we use a name component called last name instead of the family name (or surname) and use it for individual matching, the complexity reduces a little, provided cross-matching across the name components is also used. For house-holding, however, this poses a tough challenge.
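One way to picture cross-matching is to compare the components of two names regardless of their position, letting an initial pair with any component that starts with the same letter. The sketch below is a simplified, greedy illustration under those assumptions; the function names and the sample names are hypothetical, and a real matching engine would be considerably more careful.

```python
def name_components(name):
    """Split a name into upper-cased components, treating dots as separators."""
    return name.upper().replace(".", " ").split()


def components_agree(a, b):
    """An initial agrees with any component sharing its first letter."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def cross_match(name_a, name_b):
    """True when every component of the shorter name pairs with a distinct
    component of the other name, in any position (greedy pairing)."""
    short, long_ = sorted((name_components(name_a), name_components(name_b)), key=len)
    if not short:
        return False
    remaining = list(long_)
    # Place full words before initials so an initial cannot steal a full word.
    for comp in sorted(short, key=len, reverse=True):
        exact = [c for c in remaining if c == comp]
        loose = [c for c in remaining if components_agree(comp, c)]
        candidates = exact or loose
        if not candidates:
            return False
        remaining.remove(candidates[0])
    return True


# Hypothetical names, invented for this example.
print(cross_match("K. R. Suresh", "Suresh K R"))
print(cross_match("K. R. Suresh", "Suresh Kumar"))
print(cross_match("K. R. Suresh", "M. Suresh"))
```

A check like this helps decide whether two records describe the same individual. It says nothing about whether two different individuals belong to the same house-hold, which is where the missing family name hurts.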
Let us now look at the issues in address matching. We need to look at this keeping in mind the issues we saw in last name matching. The biggest issue in address matching is incomplete or partial addresses.
Let us look at the following addresses:

Address | Potentially Matching Address
Y 14, BLOCK EP, SECTOR V, SALT LAKE, KOLKATA, 700091 | BLOCK EP, SECTOR V, SALT LAKE, KOLKATA, 700091
16A GARIAHAT ROAD, APT 1C, KOLKATA-19 | 16A GARIAHAT ROAD, KOLKATA 700019

The addresses on both rows are close. A detailed inspection, however, reveals that the second address on each row lacks the dwelling number. If these addresses appeared on two records with matching names, we would accept them as matches. But what if there is no family name on the records?
That is a big question mark. Take, for example, the second address on the second row. It is also a close match for the address
16A GARIAHAT ROAD, APT 2B, KOLKATA 700019.
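The ambiguity can be shown with a small token-subset check: an incomplete address "matches" any fuller address that contains all of its tokens. This is only a sketch. The helper names are hypothetical, and the rule that expands `KOLKATA-19` to PIN `700019` is an assumption based on the two forms being treated as equivalent in the example above.

```python
import re


def address_tokens(address):
    """Upper-case the address, expand the short Kolkata zone form, and tokenize."""
    text = address.upper()
    # Assumption: 'KOLKATA-NN' is shorthand for PIN code 7000NN.
    text = re.sub(
        r"KOLKATA-(\d{1,2})\b",
        lambda m: "KOLKATA 7000" + m.group(1).zfill(2),
        text,
    )
    return frozenset(re.findall(r"[A-Z0-9]+", text))


def is_partial_match(partial, full):
    """True when every token of the partial address also appears in the full one."""
    return address_tokens(partial) <= address_tokens(full)


incomplete = "16A GARIAHAT ROAD, KOLKATA 700019"
candidates = [
    "16A GARIAHAT ROAD, APT 1C, KOLKATA-19",
    "16A GARIAHAT ROAD, APT 2B, KOLKATA 700019",
]

matches = [c for c in candidates if is_partial_match(incomplete, c)]
print(matches)
```

Both apartments come back as matches for the incomplete address, so without a shared family name there is nothing left to decide which house-hold the record belongs to.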
Though residential telephone numbers are of much help, the presence of such incomplete addresses poses big challenges in house-holding. According to Graham Rhind (an expert in handling international data), house-holding should be avoided as far as possible (except in some traditional Anglo-Saxon communities) because it hardly ever works.

Note: This discussion covers only individual house-holds, not corporate house-holds.

1 comment:

  1. Excellent explanation. That really cleared up a lot of questions for me. I've seen those different combinations of Indian names, but never quite understood the differences.
