At present there is a serious discussion going on in the Linekdin group “Matching” (You need to be a member of the networking site Linekdin and also a member of the group “Matching” to access the thread) on the subject of Context Sensitiveness in matching. The subject is closely related to probable errors in matching. Looking at the richness of the content in this discussion and the vastness of the topic itself, I am tempted to share my understanding in this regard.
Let me begin by sharing an experience I had a few years back while implementing a data quality solution in a private bank. This bank was in the process of implementing a data quality solution for its large customer base. In order to fine-tune the matching algorithm, it gave us a control/test file (consisting of a few hundred records) and with this, we tried various possible algorithms. It took us some time before we came up with the proper match algorithm for the control file. Both the business users and the IT users were happy with the result displayed for the control file. But to our horror, the same algorithm became a disaster when a portion of the customer data was processed. We finally had to realign the algorithm from the start.
Before I explain the scenario, let me give one example of the disparity. Consider the two individual records (only a few fields) in the table below:
Name | Address | City | Tel1 | Tel2 |
ABHISEK C KOTCHER | C TOWER, UNO 12, JEEVAN MANZIL | SURAT | 1111111111 | 2222222222 |
AVISEK C | C12 OFF MG RD, NEAR JEEVAN MANZIL | SURAT | 3333333333 | |
Above two records were matched by the algorithm developed using the control file. But for the customer data integration activity these records were not a match as we realized later.
We wanted to know if this one was a one of case or there was something fundamentally wrong. To our shock, we found that the control file given by this bank was a portion taken out from their fraud detection de-duplication database which was prepared by another vendor earlier. Unfortunately this vendor did not make the bank aware of the effect of using the same or similar match algorithm under different context.
In case you can spare some time, you may refer to my earlier post “Errors in matching” posted during May 2011.
In a nutshell, there are two types of possible error when we say; there is a match (or no match) between two specific records. When the algorithm says it’s a match but actually the records represent two different entities, the error is called a false positive. And when the algorithm says that there is no match between the records but actually the records represent the same entity then the error called a false negative. Depending on the context in which the match results will be used there are two types of match objectives. One situation demands that a slight similarity should be captured by the match algorithm and thereby the corresponding objective becomes to reduce false negatives. Another type of scenario demands that two records should match only when there is strong similarity and the corresponding objective in this case becomes to reduce false positives.
We wanted to know if this one was a one of case or there was something fundamentally wrong. To our shock, we found that the control file given by this bank was a portion taken out from their fraud detection de-duplication database which was prepared by another vendor earlier. Unfortunately this vendor did not make the bank aware of the effect of using the same or similar match algorithm under different context.
In case you can spare some time, you may refer to my earlier post “Errors in matching” posted during May 2011.
In a nutshell, there are two types of possible error when we say; there is a match (or no match) between two specific records. When the algorithm says it’s a match but actually the records represent two different entities, the error is called a false positive. And when the algorithm says that there is no match between the records but actually the records represent the same entity then the error called a false negative. Depending on the context in which the match results will be used there are two types of match objectives. One situation demands that a slight similarity should be captured by the match algorithm and thereby the corresponding objective becomes to reduce false negatives. Another type of scenario demands that two records should match only when there is strong similarity and the corresponding objective in this case becomes to reduce false positives.
In a fraud detection type of context, the objective is to capture a slight similarity so that none is escaped. But in a typical customer data integration type of context, the objective is to allow two records to match only when there is strong evidence that these represent the same entity.
I do not think there is any strategy to improve the match algorithm in a way so that both false positives and false negatives reduce (unless of course you change the input file/files!). Unfortunately there is no mathematical proof of this but experience of people in this field tells so. And that is why we have these two possible objectives rather than just one that requires reduction of both false positives and false negatives.
The idea is when one adjusts the match algorithm to reduce false positives as in the case of a typical CDI type of situation by making the match settings stricter, one increases the risk of having more false negatives. On the other hand, when one adjusts the match algorithm to reduce false negatives as in the case of a typical fraud detection type of situation by making the match settings relaxed, one increases the risk of having more false positive.
I do not think there is any strategy to improve the match algorithm in a way so that both false positives and false negatives reduce (unless of course you change the input file/files!). Unfortunately there is no mathematical proof of this but experience of people in this field tells so. And that is why we have these two possible objectives rather than just one that requires reduction of both false positives and false negatives.
The idea is when one adjusts the match algorithm to reduce false positives as in the case of a typical CDI type of situation by making the match settings stricter, one increases the risk of having more false negatives. On the other hand, when one adjusts the match algorithm to reduce false negatives as in the case of a typical fraud detection type of situation by making the match settings relaxed, one increases the risk of having more false positive.
So, before you start working on the match algorithm (setting), be sure of the objective.
No comments:
Post a Comment