Data Quality: The Small Steps

Earlier, we mentioned transformations while defining a match-key.

What kind of transformations? We want transformations that bring two or more apparently dissimilar strings (a string being a finite sequence of characters) closer together, so that the dissimilarity reduces (if not vanishes!), provided the strings actually match.

Usually, two matching strings differ for one of four reasons:

1. Typographical errors and spelling mistakes

2. Use of different conventions for writing the same thing

3. Words from regional languages transliterated into English, and other regional influences

4. Combination of some of the above.

Let us look at the following examples:

Sl. #	Similar Strings	Remark
1	HYDERABAD, HYDERAGAD, HYDRABAD	Name of a city in India
2	STREET, ST, STR	A common street type
3	CHAVEZ, SAVEZ, CHAVEJ	A popular Family Name
4	ROAD, RD, RAOD	A common street type

The three strings in the first row match but contain obvious spelling mistakes. The matching strings in the second row result from different conventions for writing the same street type. In the third row, a popular Latin American last name is written with different spellings, whereas the matching strings in the fourth row show a mix of different conventions and typographical errors.

Techniques described earlier can handle typographical errors and spelling mistakes. But for handling other types of dissimilarities in matching strings, we need to use various transformations.
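
To make the idea concrete, here is a minimal sketch of one such transformation: a lookup table that maps different conventions for a street type to a single canonical form. The table contents and the function names (STREET_TYPE_CANONICAL, standardize_token, standardize) are my own assumptions for illustration, not a fixed standard; a real table would be built from the data at hand.

```python
# Hypothetical lookup table: variant convention -> canonical form.
# The entries are illustrative assumptions taken from the examples above.
STREET_TYPE_CANONICAL = {
    "ST": "STREET",
    "STR": "STREET",
    "STREET": "STREET",
    "RD": "ROAD",
    "ROAD": "ROAD",
}


def standardize_token(token: str) -> str:
    """Upper-case a token, drop a trailing period, and map it to its
    canonical form if the table knows it; otherwise return it cleaned."""
    cleaned = token.strip().upper().rstrip(".")
    return STREET_TYPE_CANONICAL.get(cleaned, cleaned)


def standardize(text: str) -> str:
    """Apply the token-level transformation to every whitespace-separated token."""
    return " ".join(standardize_token(t) for t in text.split())


if __name__ == "__main__":
    for raw in ["Main St.", "Main Str", "main street", "Station Rd", "Station Raod"]:
        print(raw, "->", standardize(raw))
```

After this transformation the convention variants (ST, STR, STREET) become identical strings, while a typographical error such as RAOD is left untouched and still has to be caught by the approximate matching techniques. Note that a plain lookup like this is context-blind: an abbreviation such as ST can also stand for SAINT, so a production rule would need to look at the position of the token as well.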

I will discuss the transformations that I have used in different situations. But before that, let us see what happens after matching: we will talk about indirect matching and survivor selection, both of which take place after matching.

Data Quality

Friday, June 3, 2011

The Small Steps – Transformations
