Consider the following names:
Sl. # | Given Name | Middle Name | Last Name Prefix | Last Name |
1 | GABRIEL | MARQUIZ | DE | PEREIRA |
2 | MARQUIS | GEBRIEL | DE | PERERA |
Suppose the address information is matching on these two records. A manual inspection tells us that the records are indeed matching.
How to establish the match in names?
Well, the Last Names do match (well…close enough) and we do not compare the Prefix. But what about the Given Name and Middle Name? Apparently, they have switched places besides initial being used on the second record.
Only way to find the match in Given Name and Middle Name is to perform a cross-matching between these fields. That is, the match keys should be defined in a way that Given Names are also compared with the corresponding Middle Names and vice-versa.
This is a good time to introduce the reader to a matching objective i.e. House-Holding.
Here we try to cluster (or link) the records belonging to the same house-hold. Well…in many cases, we define a house-hold to be comprised of members sharing the same Last Name, Address and the residential Phone Number.
How to establish the match in names?
Well, the Last Names do match (well…close enough) and we do not compare the Prefix. But what about the Given Name and Middle Name? Apparently, they have switched places besides initial being used on the second record.
Only way to find the match in Given Name and Middle Name is to perform a cross-matching between these fields. That is, the match keys should be defined in a way that Given Names are also compared with the corresponding Middle Names and vice-versa.
This is a good time to introduce the reader to a matching objective i.e. House-Holding.
Here we try to cluster (or link) the records belonging to the same house-hold. Well…in many cases, we define a house-hold to be comprised of members sharing the same Last Name, Address and the residential Phone Number.
Now, let us see the following two names and consider these for house-holding:
Sl. # | Given Name | Middle Name | Last Name |
1 | JONATHAN | A | ABOTT |
2 | ABOTT | MARGARET |
Suppose the address information is matching on these two records. A manual inspection tells us that the records belong to the same house-hold. How is the Last Names matching? Well…Last Name on the first record matches to the Given Name on the second record. Again name components have switched places. But there is a little difference in the way we do cross-matching from the earlier example.
In the earlier example, we checked if the Given Name on one record matches to the Middle Name on the other record and the Middle Name on the second record matches to the Given Name on the first record. But for the house-hold matching example, we check if the Last Name on one record matches to the Given Name on the other record.
In DQ terminology these are known to be 2-way cross-matching and 1-way cross-matching.
Let us look at one more example where cross-matching has occurred involving three name fields:
In the earlier example, we checked if the Given Name on one record matches to the Middle Name on the other record and the Middle Name on the second record matches to the Given Name on the first record. But for the house-hold matching example, we check if the Last Name on one record matches to the Given Name on the other record.
In DQ terminology these are known to be 2-way cross-matching and 1-way cross-matching.
Let us look at one more example where cross-matching has occurred involving three name fields:
Sl. # | Given Name | Middle Name | Last Name |
1 | RAJ | SINGH | THAKOR |
2 | THAKORE | RAJ | SINGH |
Though theoretically such component switching can take place between any two (or more) fields in the database, we usually keep the cross-matching involving a few pair of fields.
Cross matching can be implemented when we use soft match-keys as discussed earlier. But there are many data quality tools out there which are built using hard match-keys alone.
So what do we do if one such tool is being used?
There is an alternative route that takes care of the cross matching in a limited way. Suppose we have two similar fields Field1 and Field2 which we want to cross match. In this case, first we make sure that there is no record with non blank Field2 and blank Field1. Then we make two copies of the records where both these fields are non blank. One such set of records is allowed to pass through as it is but for the records in the second set, we copy Field2 and overwrite Field1 with the values in Field2. Finally we append these records together to make a bigger set of records. Obviously, we put some identifier for each record to identify if they came from set1 or set2. Hard-key based match can be performed on this bigger set of records to arrive at the exactly same conclusions.
However, there is one limitation in this approach. For each such pair of fields we have to add new records in the database and this takes time/effort besides disc space.
No comments:
Post a Comment