We discussed earlier how a match key is formed using the transformed values of several fields and the associated match techniques. Basically we need to compare two strings.

Before discussing this, it will only be fair to state here that the transformation mentioned earlier brings to apparently distant strings closer.

Before discussing this, it will only be fair to state here that the transformation mentioned earlier brings to apparently distant strings closer.

One such transformation could be standardization which can bring to strings CALCUTTA and KOLKATA together.

Two strings may be compared for an exact match.

Many match engines are based on such exact matches. Let us consider the following strings:

Two strings may be compared for an exact match.

Many match engines are based on such exact matches. Let us consider the following strings:

Base String | Input String |

INNOCENT | INNOCEMT |

EXPRESSION | EPXRESSION |

PRICEWATERHOUSECOOPERS | PRICEWATERHOUSECOOPER |

In each row, the two strings are close enough to conclude them to be matching but none of these pairs is an exact match. This kind of situation arises largely from typographical errors. And consequently, any match engine that uses exact match on the match keys will fail to match the corresponding keys.

A distance function may address such issues. It is a function that takes into account two strings and returns a number that represents the distance between the input strings. A distance function needs to have the following properties:

1. Distance between two exactly same strings is 0

2. Distance between two input strings is non-negative

3. Distance between two strings increases (at least non-decreasing) when the similarity between the strings decreases.

1. Distance between two exactly same strings is 0

2. Distance between two input strings is non-negative

3. Distance between two strings increases (at least non-decreasing) when the similarity between the strings decreases.

Let us define one such distance functions.

Suppose s

_{1}and s_{2}are two strings each of which are of length l_{1}and l_{2}. Also call the distance function between s_{1}and s_{2}by d (s_{1}, s_{2}). Then the above three rules can be expressed as:- d (s
_{1}, s_{1}) = 0 - d (s
_{1}, s_{1}) ≥ 0 - d (s
_{1}, s_{1}) ≥ d (s_{1}, s_{3}) when s_{1}is more similar to s_{3}than s_{2}

Let us define one such function here.

Let m be the number of position-wise matching characters in s

Let m be the number of position-wise matching characters in s

_{1}and s_{2}__Case1:__l

_{1 }= l

_{2 }

Here we can consider d (s

_{1}, s_{2}) = 1 – m/lNote that the maximum possible value for m is l and that happens when s

_{1}and s_{2}are exactly same. In such a case the distance becomes 0.__Case2:__l

_{1}> l

_{2 }

Here we consider d (s

_{1}, s_{2}) = 1 – m / [l_{2 }(l_{1}– l_{2}+ δ)] where δ > 0 a constant.Note that the maximum possible value for m is l

While fixing a value for δ, it has to be kept in mind if the length of s

If we want two strings as above (lengths differ by one and the smaller one is a sub-string of the other) to differ by 5 unit then 1-1/(1 + δ) = 0.05 => δ = 0.05 (appx.). In fact, we see that in order to have such strings closer, we need to set δ as a very small positive number.

The distance function defined above lies between 0 and 1 and a match probability can be defined using this distance function. For example, the match probability can be p (s

_{2}and that happens when s_{2}is a sub-string of s_{1}.While fixing a value for δ, it has to be kept in mind if the length of s

_{1}is one more than the length of s_{2}and s_{2}is a sub-string of s_{1}then d (s_{1}, s_{2}) = 1-1/(1 + δ)If we want two strings as above (lengths differ by one and the smaller one is a sub-string of the other) to differ by 5 unit then 1-1/(1 + δ) = 0.05 => δ = 0.05 (appx.). In fact, we see that in order to have such strings closer, we need to set δ as a very small positive number.

The distance function defined above lies between 0 and 1 and a match probability can be defined using this distance function. For example, the match probability can be p (s

_{1}, s_{2}) = 1 - d (s_{1}, s_{2})However, the above distance function is only an indicative one and can be further improved.