“Virtually everything in business today is an undifferentiated commodity, except how a company manages its information. How you manage information determines whether you win or lose. How you use information may be the one factor that determines its failure or success — or runaway success.”
The quote above is from a Bill Gates interview in The Sunday Times in 1999. It is one of the many insights he had into modern business, and it is even more relevant today than it was 16 years ago. What Bill doesn’t state is the important tenet that data must be accurate and correct in order to be useful.
So, how do we ensure our data integrity so that we can head toward that “runaway success” that Bill spoke about? As the Quality Capability leader here at Merkle, I can take you on a tour of the data validation that we do to ensure the highest integrity and quality for marketing data.
Row Count — the most basic of data integrity checks
This one seems obvious, but it’s the first step to ensuring that we are working with a complete set of data. During the data loading process, we may be loading millions of records from dozens (or hundreds) of domains (sources). The sum total of those sources is our universe of data that drives the solution — missing records from any source could manifest as orphans, incorrect aggregates, or improper campaign targeting down the road. The initial step in ensuring data integrity is counting the rows that we’ve loaded into the staging environment and ensuring that our sum total of rows across all domains is represented in staging (including records discarded or flagged as errored, according to ETL rules). An error at this stage should be cause for alarm and be investigated immediately.
Completeness of the data refers to when all of the data that is in the source is present in the target tables. The move from source to target is accompanied by processing (ETL) that frequently manipulates the data in ways that are critical to the client solution. For example, we may see length and type conversions, data substitutions for look-up values (looking up a description based on a numeric code, for example), normalization of phone numbers and zip codes, combining of duplicate records, and more. Completeness checks are driven by the business rules, and also account for mismatched records, missing records, or extra records.
Tests for consistency ensure that the source data and the target data agree with each other. In other words, they do not contain conflicting facts, which may be introduced by flaws in the transformation and integration logic. An example of a failure in consistency might be having two Products in the Product table that have the same description, but different Product_IDs. In this case, data consistency checks would seek out duplicate records (or alternately, ensure the uniqueness of certain fields in the records) within critical tables according to the requirements and design specifications.
Validity checks rely on business knowledge and the intended use of the client solution as a base for data integrity. If the solution contains auto insurance data, for example, a failed validity check might show that Joe Smith is insuring 9,999,999 cars on his policy. Or that customer Joe Smith was born in 1792. In both cases, the data is complete and consistent (numeric, non-negative, etc.), but not valid (falling outside an acceptable range). Strong knowledge of the client’s business and requirements for the solution are necessary for validity checking.
RedundancyTesting for data redundancy means looking for duplicates in the data. These duplicates could be physical duplicates or logical duplicates. Physical duplicates are two or more records in which every field is exactly the same as the corresponding field in a different record. This might be employee table data where all fields (Emp_ID, FirstName, LastName, Department) are the same in two or more records, and this situation violates the solution requirements or design.
Logical duplicates, by contrast, contain duplicate values only in the fields that are used as a business key for that record. Using the physical example with the employee table we might see FirstName, LastName, and Department being the same for two or more records, but Emp_ID is not. In this case we clearly have one employee in the system with TWO Emp_IDs, in violation of the business requirements.
Simply put, a database that passes the referential integrity check is one that has no orphan records. As an example, consider an order processing database where we have Order records (who ordered, and when) and Order_Detail records (what was ordered, quantity, etc) linked by an Order_ID field. Orphans exist when an Order record exists without an associated Order_Detail record, or Order_Detail records exist with out a corresponding Order parent record. A system that contains this type of orphan can lead to invalid aggregation of orders and inconsistency in sales and order reporting.
Throughout a client solution, the data representing domain values must remain valid. If we are loading data from the Department feed (lists all valid departments within a company) into a Department_lookup table, we need to ensure that all of the values that get populated after ETL processing are valid for that domain. If values like “JohnSmith” or “Polo Shirt, blue” end up in this table, we have a clear violation of domain integrity which needs to be fixed. Other domain integrity violations might include insertion of NULL values or truncation of department names in this example.
Accuracy and Usability
All of the above data integrity validations lead us to verifying the accuracy and usability of the client solution. If the data in the solution mirrors real-world business scenarios and satisfies the client’s daily business functional needs, then we can say that we’ve built an accurate and usable system.
Failure in any of the integrity checks can have a dramatic effect on the accuracy and usability of the system, since incorrect data is propagated to downstream processes and systems used in reporting, campaigns, and forecasts.
If you want that runaway success, use your data wisely and ensure its data integrity.