Data Quality: Building Blocks

We discussed earlier how a match key is formed from the transformed values of several fields and the associated match techniques. At its core, matching comes down to comparing two strings.
Before discussing this, it is only fair to note that the transformations mentioned earlier bring two apparently distant strings closer.

One such transformation is standardization, which can bring the two strings CALCUTTA and KOLKATA together.
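As a minimal sketch of such a standardization step, assuming a simple lookup table is enough: the table `CITY_SYNONYMS` and the function `standardize_city` are hypothetical names chosen for illustration, and the BOMBAY entry is an extra example not taken from this post.

```python
# Hypothetical synonym table; a real engine would load a much larger reference list.
CITY_SYNONYMS = {
    "CALCUTTA": "KOLKATA",
    "BOMBAY": "MUMBAI",  # extra illustrative entry, not from the post
}


def standardize_city(value: str) -> str:
    """Trim and upper-case the value, then map known old names to the standard name."""
    cleaned = value.strip().upper()
    return CITY_SYNONYMS.get(cleaned, cleaned)


# After standardization the two spellings produce the same key.
print(standardize_city("Calcutta") == standardize_city("KOLKATA"))
```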

Two strings may be compared for an exact match.
Many match engines are based on such exact matches. Let us consider the following strings:

Base String	Input String
INNOCENT	INNOCEMT
EXPRESSION	EPXRESSION
PRICEWATERHOUSECOOPERS	PRICEWATERHOUSECOOPER

In each row, the two strings are close enough to be considered a match, but none of these pairs is an exact match. This kind of situation arises largely from typographical errors. Consequently, any match engine that relies on exact matching of the match keys will fail to match the corresponding keys.
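The failure is easy to reproduce. The following sketch runs a plain equality test over the three pairs from the table; `PAIRS` and `exact_match` are illustrative names, not part of any particular match engine.

```python
# The three pairs from the table above: (base string, input string).
PAIRS = [
    ("INNOCENT", "INNOCEMT"),
    ("EXPRESSION", "EPXRESSION"),
    ("PRICEWATERHOUSECOOPERS", "PRICEWATERHOUSECOOPER"),
]


def exact_match(base: str, candidate: str) -> bool:
    """Return True only when the two keys are character-for-character identical."""
    return base == candidate


for base, candidate in PAIRS:
    # Every pair is rejected, even though a human reader would accept all three.
    print(base, candidate, exact_match(base, candidate))
```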

A distance function may address such issues. It is a function that takes two strings as input and returns a number representing the distance between them. A distance function needs to have the following properties:

1. The distance between two identical strings is 0.
2. The distance between any two input strings is non-negative.
3. The distance between two strings increases (or at least does not decrease) as the similarity between the strings decreases.

Let us first state these properties formally.

Suppose s₁ and s₂ are two strings of lengths l₁ and l₂ respectively, and denote the distance between s₁ and s₂ by d (s₁, s₂). Then the above three rules can be expressed as:

d (s₁, s₁) = 0
d (s₁, s₂) ≥ 0
d (s₁, s₂) ≥ d (s₁, s₃) for any third string s₃, when s₁ is more similar to s₃ than to s₂

Let us define one such function here.
Let m be the number of position-wise matching characters in s₁ and s₂, that is, the number of positions at which both strings hold the same character.

Case 1: l₁ = l₂ = l

Here we can consider d (s₁, s₂) = 1 – m/l

Note that the maximum possible value for m is l, which happens when s₁ and s₂ are identical. In that case the distance becomes 0.
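Case 1 can be sketched in a few lines. The function names `positionwise_matches` and `equal_length_distance` are chosen here for illustration, and returning 0 for two empty strings is an assumption the post does not cover.

```python
def positionwise_matches(s1: str, s2: str) -> int:
    """Count the positions where both strings hold the same character (m in the text)."""
    return sum(a == b for a, b in zip(s1, s2))


def equal_length_distance(s1: str, s2: str) -> float:
    """Case 1: d = 1 - m / l for two strings of the same length l."""
    if len(s1) != len(s2):
        raise ValueError("Case 1 applies only to strings of equal length")
    if not s1:
        return 0.0  # assumption: two empty strings are identical
    return 1 - positionwise_matches(s1, s2) / len(s1)


print(equal_length_distance("INNOCENT", "INNOCEMT"))      # 7 of 8 positions match
print(equal_length_distance("EXPRESSION", "EPXRESSION"))  # 8 of 10 positions match
```

The transposed pair scores worse than the single substitution because a transposition breaks two positions rather than one.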

Case 2: l₁ > l₂

Here we consider d (s₁, s₂) = 1 – m / [l₂(l₁ – l₂ + δ)], where δ > 0 is a constant.

Note that the maximum possible value for m is l₂, which happens when s₂ is a leading sub-string (prefix) of s₁, since m counts position-wise matches.
While fixing a value for δ, keep in mind that if s₁ is one character longer than s₂ and s₂ is a leading sub-string of s₁, then d (s₁, s₂) = 1 – 1/(1 + δ).
If we want two such strings (lengths differing by one, the shorter being a leading sub-string of the longer) to be at a distance of 0.05, then 1 – 1/(1 + δ) = 0.05, which gives δ = 0.05/0.95 ≈ 0.05. In fact, to bring such strings closer, δ must be set to a very small positive number.
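Both cases, together with the choice of δ, can be put into one sketch. The names `distance` and `delta_for_target` are illustrative. Swapping the arguments when the first string is the shorter one, and the handling of empty strings, are assumptions added here so that the function is defined for every input; the post itself only covers l₁ ≥ l₂.

```python
def positionwise_matches(s1: str, s2: str) -> int:
    """Count the positions where both strings hold the same character (m in the text)."""
    return sum(a == b for a, b in zip(s1, s2))


def distance(s1: str, s2: str, delta: float = 0.05) -> float:
    """Indicative distance: Case 1 for equal lengths, Case 2 otherwise."""
    if delta <= 0:
        raise ValueError("delta must be a positive constant")
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # assumption: treat the distance as symmetric
    l1, l2 = len(s1), len(s2)
    if l1 == 0:
        return 0.0  # assumption: two empty strings are identical
    if l2 == 0:
        return 1.0  # assumption: nothing matches an empty string (m = 0)
    m = positionwise_matches(s1, s2)
    if l1 == l2:
        return 1 - m / l1                        # Case 1
    return 1 - m / (l2 * (l1 - l2 + delta))      # Case 2


def delta_for_target(target: float) -> float:
    """Solve 1 - 1/(1 + delta) = target for delta, giving delta = target / (1 - target)."""
    if not 0 < target < 1:
        raise ValueError("target distance must lie strictly between 0 and 1")
    return target / (1 - target)


delta = delta_for_target(0.05)
print(delta)  # close to 0.0526
print(distance("PRICEWATERHOUSECOOPERS", "PRICEWATERHOUSECOOPER", delta))  # close to 0.05
```

Because m is counted position by position, a character dropped in the middle of a string shifts everything after it, so such pairs come out far apart even when they look similar.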

The distance function defined above lies between 0 and 1, so a match probability can be defined from it. For example, the match probability can be p (s₁, s₂) = 1 – d (s₁, s₂).

However, the above distance function is only indicative and can be improved further.
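A small sketch of that conversion, assuming the distance has already been computed and lies in the range 0 to 1; the name `match_probability` is illustrative.

```python
def match_probability(d: float) -> float:
    """Turn a distance in [0, 1] into a match probability p = 1 - d."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("distance must lie between 0 and 1")
    return 1.0 - d


# Distance 0.125 is the Case 1 value for INNOCENT / INNOCEMT (7 of 8 positions match).
print(match_probability(0.125))
```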

Data Quality

Monday, May 30, 2011

Building Blocks – String Matching
