Data Quality: Constructing the Survivor Record

So far, we have discussed many things about matching or record linking. At the end of this process, we get a few groups of matching records besides the bunch of non-matches.
Let us look at the following example:

Id.	Record
1	Record1
2	Record2
3	Record3
4	Record4
5	Record5
6	Record6
7	Record7
8	Record8

Suppose after matching, we see that record1, record4 and record6 are matching while record3 and record5 are matching and record2, record7 and record8 are unmatched.
This is expressed by assigning the same identifier to the matching records as in the following table:

Master Id.	Id.	Record
1	1	Record1
3	2	Record2
2	3	Record3
1	4	Record4
2	5	Record5
1	6	Record6
4	7	Record7
5	8	Record8

Now, if we are told to give the record corresponding to master id. 1, we are at a loss as we have three such records.
What we need is to have one representative record for each master id. This is a non issue in case of an unmatched record like record7 in the above table. But it is really a challenging task for the cases where we have a cluster of records under one master id. Such record is called a surviving record or survivor.
There are ways of building this survivor record corresponding to a cluster of matching records depending upon the situation or context. It is a business decision.
We will briefly discuss one such method.
Suppose records are coming in from several channels. In an environment where these records are customer records captured by a manufacturing firm, channels could be different types of POS (point of sale) or in a mass-mailing scenario, channels could be various third-party organizations etc. Let us also suppose that each record bears a time-stamp representing time date-time when it was last updated.
Let us suppose that there are fields like Name, Address, City, State, Post Code, Phone and E-Mail in a record (in reality there could be much more).

For a cluster, Name field of the survivor record will be built using a logic similar to the following example:

Use the name from the latest channel1 record (if it is not blank) or if there is no channel1 record in the cluster then take the name from the latest channel3 record (if it is not blank) else take the name from the latest record.

This way we can have rules defined for each field on the survivor record. Such rules, obviously, are business decisions and must be defined along with the users of the system.

Data Quality

Monday, June 6, 2011

Constructing the Survivor Record

No comments:

Post a Comment