Data Quality: Phonetic Similarity

Consider the following pairs of names taken from India:

SOFIA	SOPHIA
JENA	XENA
SANTANU	SHANTANU
BIKASH	VIKAS
BAIBHAB	BAIVAB

The two words in each row are the same name spelled in two ways, and each pair differs in only a few letters (F/PH, J/X, S/SH, B/V, BH/V). These differences in spelling reflect how people pronounce the names. The issue gets more complicated when the native language of the people concerned is not English (i.e. the names are non-English) and the influence of regional languages shows up in the spelling.
There are some standard algorithms to handle such situations. For example, Soundex and Metaphone are two widely known and widely used algorithms that help bring two similar-sounding names closer together. There are many other, even more sophisticated algorithms available.
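
To make this concrete, here is a minimal Python sketch of the classic American Soundex rules, applied to the names in this post. It is only an illustration written for this post, not the code we used, and the function name soundex is my own choice.

```python
# Minimal sketch of American Soundex (illustrative, not production code).
# Letters that sound alike share a digit; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name (ASCII letters only)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                  # the first letter is kept as it is
    prev = CODES.get(letters[0], "")     # its digit still blocks a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                      # a vowel resets prev to ""
    return (result + "000")[:4]          # pad with zeros, cut to 4 characters


pairs = [("SOFIA", "SOPHIA"), ("JENA", "XENA"), ("SANTANU", "SHANTANU"),
         ("BIKASH", "VIKAS"), ("BAIBHAB", "BAIVAB")]
for a, b in pairs:
    print(a, soundex(a), b, soundex(b))
```

With this sketch, SOFIA and SOPHIA both become S100, but JENA gives J500 while XENA gives X500, and BIKASH gives B220 while VIKAS gives V220. Soundex always keeps the first letter, so a regional variation in the first letter is never matched.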

But there is an issue with each of these algorithms: they are more or less fixed. We cannot customize their lists and rules. And as I processed data from different countries and regions, I encountered more variations than these typical algorithms cover.

Let me give you a funny example. A few years back I came across a man named DHARMENDER, who eventually represented me in a legal matter. When I was going through the initial draft of my case papers, I found that his legal name was DHARMENDRA. Eventually I realized that in some specific regions, personal names ending in DRA are often pronounced and written as the same name ending in DER.

We wrote such phonetically similar syllables in a table, sorted by length in decreasing order. Our algorithm simply compared these syllables with the values of the desired field in each record and replaced them accordingly when found. We could edit this table later as needed.
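
The idea can be sketched in a few lines of Python. This is a reconstruction under assumptions, not our original code: the table entries are hypothetical and inferred from the example names in this post, the names SIMILAR_SYLLABLES and normalize are made up for illustration, and the scan replaces a syllable wherever it occurs rather than only at the end of a name.

```python
# Hypothetical, editable table of phonetically similar syllables.
# Each key is rewritten to its value; the entries are inferred from the
# example names in this post, not taken from the real table.
SIMILAR_SYLLABLES = {
    "DER": "DRA",
    "BH": "B",
    "PH": "F",
    "SH": "S",
    "V": "B",
    "X": "J",
}


def normalize(value, table=SIMILAR_SYLLABLES):
    """Rewrite every syllable found in the table, longest syllables first."""
    text = value.upper()
    # Longest first, so "BH" is tried before any one-letter rule.
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])   # replace the matched syllable
                i += len(key)
                break
        else:
            out.append(text[i])          # no rule matched: keep the letter
            i += 1
    return "".join(out)


for a, b in [("JENA", "XENA"), ("BIKASH", "VIKAS"),
             ("DHARMENDER", "DHARMENDRA")]:
    print(a, b, normalize(a) == normalize(b))
```

Because the table is ordinary data, handling a new regional variation means adding one entry, with no change to the algorithm. With the entries above, all five pairs from the beginning of the post, and DHARMENDER/DHARMENDRA, normalize to the same value.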

Monday, June 20, 2011
