Wednesday, May 25, 2011

Errors in matching

As a continuation of my previous post, here we discuss the possible errors in a matching that fits the framework described there.

If we divide the possible pairs into two groups, M (the group of matching pairs) and U (the group of non-matching pairs), then there are two types of errors:
Type I errors, or false positive matches, and Type II errors, or false negative matches.
A false positive match PF is a pair that the engine places in M although it truly belongs to U; a false negative match NF is a pair that the engine places in U although it truly belongs to M.
Let the average cost of a false positive match be CP and the average cost of a false negative match be CN.
So the total cost of matching error can be defined as:
E(L) = CP·N(PF) + CN·N(NF)                     [2]
where N(PF) denotes the number of false positive matches and N(NF) is the number of false negative matches.
However, in reality, the actual values of N(PF) and N(NF) will be very difficult to obtain. Hence, we will estimate these values by executing the matching on a smaller, representative set of records.
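Estimating those error counts from a labeled sample can be sketched as follows. The sample pairs below are invented for illustration; in practice each pair would come from records whose true match status has been established by review.

```python
# Estimate N(PF) and N(NF) from a small labeled sample: run the matcher
# on pairs whose true match status is known and count the disagreements.

def estimate_error_counts(pairs):
    """pairs: list of (predicted_match, truly_match) boolean tuples."""
    n_fp = sum(1 for pred, true in pairs if pred and not true)   # N(PF)
    n_fn = sum(1 for pred, true in pairs if not pred and true)   # N(NF)
    return n_fp, n_fn

# Invented sample: (what the engine decided, what is actually true)
sample = [(True, True), (True, False), (False, True),
          (False, False), (True, True), (False, True)]
print(estimate_error_counts(sample))  # (1, 2)
```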

In case the match engine divides the possible pairs into three groups M, S and U as mentioned in the framework, there will be one more component in the error expression: the contribution of S.
Suppose the cost of processing/resolving a suspect match (a member of S) is CS and the number of pairs in S is N(PS); then this error component will be CS·N(PS).
And hence, the error expression becomes:
E(L) = CP·N(PF) + CN·N(NF) + CS·N(PS)                     [3]
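Expression [3] is straightforward to evaluate once the counts are estimated. The cost values and counts below are made-up numbers, not figures from any real matching run.

```python
def matching_error_cost(c_p, n_fp, c_n, n_fn, c_s, n_s):
    """E(L) = CP*N(PF) + CN*N(NF) + CS*N(PS), i.e. expression [3]."""
    return c_p * n_fp + c_n * n_fn + c_s * n_s

# Invented example: a false positive costs 5 units, a false negative 2,
# and resolving a suspect pair 1; with 20 FPs, 50 FNs and 300 suspects:
print(matching_error_cost(5, 20, 2, 50, 1, 300))  # 100 + 100 + 300 = 500
```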

Let us concentrate on the expression [3] because as per the basic framework, we have produced three subsets of the possible pairs as the output of matching.
Obviously, we would want to reduce the matching error or E (L). Note that the variables in this expression are N(PF), N(NF), N(PS) i.e. the number of false positive pairs, number of false negative pairs and the number of suspected matching pairs.
The number of false positive matches can be reduced by making the match criteria (or the match keys) more stringent. But this means that some genuine matches are identified as non-matches, i.e. this action increases the number of false negative matches.
Similarly, the number of false negative matches can be reduced by making the match criteria (or the match keys) more relaxed. But this means that some genuine non-matches are identified as matches, i.e. this action increases the number of false positive matches.
So, the match rules are made stringent or relaxed based on the relative values of the cost of a false positive (CP) and the cost of a false negative (CN).


The last variable that contributes to the matching error E(L) is the number of suspected matching pairs i.e. N(PS).
Obviously it depends on the value of (M - m), i.e. the length of the suspect interval. Apart from this, it depends on several other factors.

Before trying to reduce the number of suspect matches, let us stop here and investigate why we have suspect matches in the first place. Matching records should look similar and non-matching records should not look similar. Ideally, yes! But there are reasons why the distinction between a match and a non-match is blurred.
Let us look at some of those reasons:

Accidental closeness of the records (the values in the fields)
As an example, the name strings TIRTHANKAR and DIPANKAR are fairly similar. A good amount of similarity in the surname, combined with similar address information, in two records with the given names TIRTHANKAR and DIPANKAR may very well put the underlying pair into the set of suspected matching pairs.
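The closeness of TIRTHANKAR and DIPANKAR can be demonstrated with any string similarity measure; the snippet below uses Python's standard-library difflib purely as an illustration, since the actual match engine may use a different measure.

```python
from difflib import SequenceMatcher

# Accidentally close given names: they share the long block "ANKAR".
ratio = SequenceMatcher(None, "TIRTHANKAR", "DIPANKAR").ratio()
print(round(ratio, 3))  # roughly 0.667, despite the names being unrelated
```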

Cultural Mix
These days, the effect of this is proving to be costly. Let me give an example, a bit extreme though.
Suppose we are processing data from a country where the popular nickname BILL does not mean WILLIAM. Unfortunately, someone from the USA named WILLIAM has settled in this country.
Since the rule does not allow nickname matching, we do not match BILL and WILLIAM, and hence the two records (both corresponding to the immigrant from the USA), instead of going into the set of definite matching pairs, end up in the set of suspected matching pairs.
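The effect of a culture-specific nickname rule can be sketched as below. The nickname table is a tiny invented sample, not a real nickname resource; the point is only that whether it is consulted depends on the locale of the data.

```python
# Hypothetical, locale-dependent nickname table (invented sample).
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT"}

def names_match(a, b, use_nicknames=True):
    """Exact match, optionally normalising through the nickname table."""
    if a == b:
        return True
    if use_nicknames:
        return NICKNAMES.get(a, a) == NICKNAMES.get(b, b)
    return False

print(names_match("BILL", "WILLIAM", use_nicknames=True))   # True
print(names_match("BILL", "WILLIAM", use_nicknames=False))  # False
```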

Missing Values and Typographical Errors
Missing values at times do not allow otherwise-matching records to be close enough. As an example, consider a pair where one record does not have the given name (or any other critical) field filled in.
In such a case, the match score, instead of being high, will be comparatively lower, which may result in the pair landing in the set of suspected matching pairs. Sometimes, a missing value (or missing values) may bring two otherwise dissimilar records closer, perhaps into the set of suspected matching pairs. Similar observations can be made for typographical errors.
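How a missing field depresses an otherwise high score can be sketched with a toy record scorer. The fields, records and equal-weight scoring below are invented assumptions; a real engine would use weighted, fuzzy field comparisons.

```python
def field_score(a, b):
    """Return 1.0/0.0 for agree/disagree, or None if either is missing."""
    if a is None or b is None:
        return None
    return 1.0 if a == b else 0.0

def record_score(rec1, rec2, fields):
    """Average over all fields, with missing comparisons contributing 0."""
    scores = [field_score(rec1.get(f), rec2.get(f)) for f in fields]
    known = [s for s in scores if s is not None]
    return sum(known) / len(fields)

r1 = {"given": "TIRTHANKAR", "surname": "DAS", "city": "KOLKATA"}
r2 = {"surname": "DAS", "city": "KOLKATA"}  # given name missing
# The same person, but the score drops from 1.0 to 2/3 and the pair
# may land among the suspected matches instead of the definite ones.
print(round(record_score(r1, r2, ["given", "surname", "city"]), 3))
```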
Insufficient match settings
Factors that drive match settings can be discussed at length. But without taking the deep dive, it can be said that incorrect settings (perhaps an incorrect parsing rule) can increase or decrease the match probability and thus bring a pair of records into the set of suspected matching pairs instead of the two other sets.

From the above discussion, we see that a considerable portion of N(PS) is driven by factors beyond our control. Besides, reducing N(PS) may result in an increase in N(PF) and/or N(NF). It is better to reduce N(PF) and/or N(NF) by fine-tuning the match settings instead.
