Monday, May 30, 2011

Building Blocks – String Matching

We discussed earlier how a match key is formed using the transformed values of several fields and the associated match techniques. Basically we need to compare two strings.
Before discussing this, it will only be fair to state here that the transformation mentioned earlier brings to apparently distant strings closer.
One such transformation could be standardization which can bring to strings CALCUTTA and KOLKATA together.

Two strings may be compared for an exact match.
Many match engines are based on such exact matches. Let us consider the following strings:
Base String
Input String
INNOCENT
INNOCEMT
EXPRESSION
EPXRESSION
PRICEWATERHOUSECOOPERS
PRICEWATERHOUSECOOPER


In each row, the two strings are close enough to conclude them to be matching but none of these pairs is an exact match. This kind of situation arises largely from typographical errors. And consequently, any match engine that uses exact match on the match keys will fail to match the corresponding keys.
A distance function may address such issues. It is a function that takes into account two strings and returns a number that represents the distance between the input strings. A distance function needs to have the following properties:


1. Distance between two exactly same strings is 0
2. Distance between two input strings is non-negative
3. Distance between two strings increases (at least non-decreasing) when the similarity between the strings decreases.
Let us define one such distance functions.
Suppose s1 and s2 are two strings each of which are of length l1 and l2. Also call the distance function between s1 and s2 by d (s1, s2). Then the above three rules can be expressed as:
  1. d (s1, s1) = 0
  2. d (s1, s1) ≥ 0
  3. d (s1, s1) ≥ d (s1, s3) when s1 is more similar to s3 than s2
Let us define one such function here.
Let m be the number of position-wise matching characters in s1 and s2
Case1: l1 = l2
Here we can consider d (s1, s2) = 1 – m/l
Note that the maximum possible value for m is l and that happens when s1 and s2 are exactly same. In such a case the distance becomes 0.
Case2: l1 > l2
Here we consider d (s1, s2) = 1 – m / [l2 (l1 – l2 + δ)] where δ > 0 a constant.
Note that the maximum possible value for m is l2 and that happens when s2 is a sub-string of s1.
While fixing a value for δ, it has to be kept in mind if the length of s1 is one more than the length of s2 and s2 is a sub-string of s1 then d (s1, s2) = 1-1/(1 + δ)
If we want two strings as above (lengths differ by one and the smaller one is a sub-string of the other) to differ by 5 unit then 1-1/(1 + δ) = 0.05 => δ = 0.05 (appx.). In fact, we see that in order to have such strings closer, we need to set δ as a very small positive number.

The distance function defined above lies between 0 and 1 and a match probability can be defined using this distance function. For example, the match probability can be p (s1, s2) = 1 - d (s1, s2)
However, the above distance function is only an indicative one and can be further improved.

No comments:

Post a Comment