Often, while integrating data from different sources to implement a data warehouse, organizations become aware of potential systematic differences or conflicts. Such problems fall under the umbrella term data heterogeneity. Data cleaning, or data scrubbing, refers to the process of resolving such identification problems in the data. There are two types of data heterogeneity: structural and lexical. Structural heterogeneity occurs when the fields of the tuples are structured differently in different databases. For example, in one database the customer address might be recorded in a single field named, say, addr, while in another database the same information might be stored in multiple fields such as street, city, state, and zipcode. Lexical heterogeneity occurs when the tuples have identically structured fields across databases, but the data use different representations to refer to the same real-world object (e.g., StreetAddress=44 W. 4th St. vs. StreetAddress=44 West Fourth Street).
The problem has been known for more than five decades as the record linkage or the record matching problem in the statistics community. The goal of record matching is to identify records in the same or different databases that refer to the same real-world entity, even if the records are not identical. Somewhat ironically, the same problem goes by multiple names across research communities. In the database community, it is described as merge-purge, data deduplication, and instance identification; in the AI community, it is described as database hardening and name matching. The names coreference resolution, identity uncertainty, and duplicate detection are also commonly used to refer to the same task.
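To make the two kinds of heterogeneity concrete, the following is a minimal sketch, not a method prescribed here, of how two records with different schemas and different spellings might be brought to a comparable form. The field names addr, street, city, state, and zipcode come from the example above; the abbreviation table, the helper names `normalize` and `unify_address`, and the city, state, and zip code values are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical abbreviation table; a realistic one would be far larger.
ABBREVIATIONS = {
    "w": "west", "e": "east", "n": "north", "s": "south",
    "st": "street", "ave": "avenue",
    "4th": "fourth",
}

def normalize(text: str) -> str:
    """Address lexical heterogeneity: lowercase the text, drop
    punctuation, and expand the abbreviations listed above."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

def unify_address(record: dict) -> str:
    """Address structural heterogeneity: map either schema (a single
    addr field, or separate street/city/state/zipcode fields) onto one
    normalized address string."""
    if "addr" in record:
        raw = record["addr"]
    else:
        raw = " ".join(record[f] for f in ("street", "city", "state", "zipcode"))
    return normalize(raw)

# Two records for the same address; city, state, and zip are made-up values.
a = {"addr": "44 W. 4th St., New York, NY 10012"}
b = {"street": "44 West Fourth Street", "city": "New York",
     "state": "NY", "zipcode": "10012"}

same_entity = unify_address(a) == unify_address(b)
```

Exact equality after normalization is the simplest possible matching rule and is used here only to keep the sketch short. It fails as soon as a record contains a typo or an abbreviation missing from the table, which is why record matching is generally treated as an approximate comparison problem rather than a lookup.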