Friday, May 27, 2011

Restricting False Positives – Cut-Off point for each match-key


While discussing the basic framework, we saw two cut-off points m and M, with 0 < m < M < 1, which are applied to the match probability λ(p) of a pair p as follows: if λ(p) > M, we conclude that p ∈ M (the set of matches); if λ(p) < m, then p ∈ U (the set of non-matches); and if m < λ(p) < M, then p ∈ S (the set of suspected matches).
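The three-way rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and the string labels are made up for this post, and sending a score that falls exactly on m or M to the suspected set is an assumption, since the rule as stated uses strict inequalities only.

```python
def classify(score, m, M):
    """Classify a pair from its composite match probability λ(p).

    m and M are the lower and upper cut-off points, with 0 <= m < M <= 1.
    """
    if not 0 <= m < M <= 1:
        raise ValueError("cut-off points must satisfy 0 <= m < M <= 1")
    if score > M:
        return "match"        # p belongs to M
    if score < m:
        return "non-match"    # p belongs to U
    # Assumption: scores exactly equal to m or M are treated as suspected.
    return "suspect"          # p belongs to S


print(classify(0.95, m=0.3, M=0.8))
print(classify(0.10, m=0.3, M=0.8))
print(classify(0.50, m=0.3, M=0.8))
```

With cut-offs m = 0.3 and M = 0.8, the three calls land in the match, non-match and suspected sets respectively.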

Here the pre-defined points m and M are called cut-off points.
The set of matching pairs shrinks as the cut-off point M increases, so one way to reduce the number of false positive matches would seem to be raising M.
But raising M can push some genuine matches into the set of suspected matches, which increases the total cost of error.
This is a typical issue encountered while implementing data quality solutions.
One way to address this issue is to introduce cut-off points for the individual match-keys.

While discussing match-keys, we talked about the probability/indicator returned by each match-key during a comparison involving a pair p. These probabilities/indicators are denoted λi, where the subscript i runs from 1 to k in a system with k match-keys.

Besides the cut-off point M for the composite probability/indicator, we can define a cut-off for each match-key and call it Mi. The composite probability/indicator for a pair is then calculated only if λi ≥ Mi for all i = 1 to k; otherwise it is set to 0.
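A minimal sketch of this gating step is shown below. The name `gated_composite` is hypothetical, and so is the use of a plain average to combine the λi; this post does not say how the composite is computed, so the combining function is left as a parameter.

```python
from statistics import mean


def gated_composite(key_scores, key_cutoffs, combine=mean):
    """Return the composite score only if every match-key passes its cut-off.

    key_scores  -- the λi returned by the k match-keys for one pair
    key_cutoffs -- the per-key cut-off points Mi, in the same order
    combine     -- placeholder for the real composite calculation (assumed)
    """
    if len(key_scores) != len(key_cutoffs):
        raise ValueError("need exactly one cut-off per match-key")
    if all(score >= cutoff for score, cutoff in zip(key_scores, key_cutoffs)):
        return combine(key_scores)
    # At least one match-key fell below its cut-off Mi.
    return 0.0


# Every key clears its cut-off, so the composite is calculated.
print(gated_composite([0.9, 0.8], [0.5, 0.5]))
# The second key is below its cut-off, so the composite is set to 0.
print(gated_composite([0.9, 0.4], [0.5, 0.5]))
```

The value returned here would then be compared against m and M in the usual way, so a pair that fails any single match-key can never reach the set of matches.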

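By introducing the Mi’s we can restrict false positive matches without raising M, because a pair can no longer reach the set of matches on the strength of a few high-scoring match-keys while another match-key falls below its cut-off.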
