Data Quality: Restricting False Positives – Cut-Off point for each match-key

While discussing the basic framework, we saw two cut-off points, m and M, both between 0 and 1, which are applied to the match probability λ(p) as follows: if λ(p) > M we conclude that p ∈ M (the set of matches), if λ(p) < m then p ∈ U (the set of non-matches), and if m < λ(p) < M then p ∈ S (the set of suspected matches).

Here the pre-defined points m and M are called cut-off points.
The cardinality of the set of matching pairs shrinks as the cut-off point M increases. So one way of reducing the number of false positive matches would seem to be raising M.
But doing so can push some genuine matches into the set of suspected matches, which increases the total cost of error.
This is a typical issue encountered while implementing data quality solutions.
One way to address this issue is to introduce cut-off points for the individual match-keys.
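
The three-way rule and the effect of raising M can be sketched in a few lines of Python. This is only an illustration: the function names, the sample probabilities and the cut-off values are made up for the example, and treating a probability that falls exactly on a cut-off as a suspect is an assumption, since the rule above does not say what happens on the boundary.

```python
from typing import Sequence


def classify(prob: float, lower: float, upper: float) -> str:
    """Classify one pair from its match probability.

    lower and upper play the roles of the cut-off points m and M.
    """
    if prob > upper:
        return "match"
    if prob < lower:
        return "non-match"
    # Assumption: values exactly on a cut-off are treated as suspects.
    return "suspect"


def count_matches(probs: Sequence[float], lower: float, upper: float) -> int:
    """Count the pairs that land in the set of matches."""
    return sum(1 for p in probs if classify(p, lower, upper) == "match")


# Hypothetical match probabilities for five pairs.
sample_probs = [0.95, 0.88, 0.72, 0.40, 0.10]

for upper in (0.8, 0.9):
    labels = [classify(p, 0.3, upper) for p in sample_probs]
    print(upper, count_matches(sample_probs, 0.3, upper), labels)
```

With these made-up numbers, raising the upper cut-off from 0.8 to 0.9 moves the pair scored 0.88 from the matches to the suspects. If that pair was a genuine match, the stricter cut-off has traded a possible false positive for extra review work.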

While discussing match-keys, we talked about the probability/indicator returned by each match-key during a comparison involving a pair p. These probabilities/indicators are denoted λ_i, where the subscript i runs from 1 to k in a system with k match-keys.

Besides having a cut-off point M for the composite probability/indicator, we can define a cut-off for each match-key and call these M_i. The composite probability/indicator for a pair is calculated only if λ_i ≥ M_i for all i = 1 to k; otherwise it is set to 0.

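By introducing the M_i’s we will be able to restrict false positive matches, because a pair that scores below the cut-off on any single match-key can no longer reach a high composite score.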
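
A minimal sketch of this gating step is shown here. The post does not say how the composite probability/indicator is computed from the λ_i, so the plain average used below (mean_score) is purely a placeholder assumption; the function names and the sample scores are also hypothetical.

```python
from typing import Callable, Sequence


def gated_composite(
    key_scores: Sequence[float],
    key_cutoffs: Sequence[float],
    combine: Callable[[Sequence[float]], float],
) -> float:
    """Return the composite score only if every match-key clears its cut-off.

    key_scores holds the lambda_i values and key_cutoffs the M_i values.
    """
    if len(key_scores) != len(key_cutoffs):
        raise ValueError("need exactly one cut-off per match-key")
    if all(score >= cutoff for score, cutoff in zip(key_scores, key_cutoffs)):
        return combine(key_scores)
    # At least one match-key fell below its cut-off: composite is set to 0.
    return 0.0


def mean_score(scores: Sequence[float]) -> float:
    """Placeholder composite (assumption): the plain average of the key scores."""
    return sum(scores) / len(scores)


cutoffs = [0.6, 0.6, 0.6]      # hypothetical M_i values
balanced = [0.9, 0.8, 0.7]     # every key clears its cut-off
lopsided = [0.95, 0.95, 0.3]   # the third key is weak

print(gated_composite(balanced, cutoffs, mean_score))
print(gated_composite(lopsided, cutoffs, mean_score))
```

Under the placeholder average, the lopsided pair would score about 0.73 without gating, so a composite cut-off of M = 0.7 would accept it as a match even though one match-key barely agrees. With the per-key cut-offs in place its composite is forced to 0, which is the kind of false positive the M_i are meant to block.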

Friday, May 27, 2011