So you want data quality…

Virtually everything in business today is an undifferentiated commodity except how a company manages its information. How you manage your information determines whether you win or lose.

Bill Gates
In a world of big data, identifying patterns in the data is business critical. However, the biggest challenge to finding those patterns is the quality of your data and, more importantly, having a set of early warning systems that help you ensure your data remains of high quality.

What is good data quality?

A trusted adviser is an individual who is considered the go-to person for a business. Like the trusted adviser, the data we manage is provided to our business. Whether the business comes to us depends on whether they consider our raw data and actionable information acceptable for making decisions, projections, and tactics. Only then is our data considered to be of good or high quality, and only then is ours a trusted system.

How is good determined and why is this important?

The measurements are based on levels of completeness, validity, consistency, timeliness, and accuracy (among other things). The impact of bad data quality is that teams spend time reconciling conflicting reports and/or making decisions with outdated or incorrect information and conclusions. The most costly outcome is that additional systems are built because the existing system "does not meet my needs."

This pattern drives up the cost of integrating data sources that are out of sync, and that cost grows even higher as end users work around the system to get the results they want.

What questions should we ask to make a difference?

Physical

    • Completeness
      • Do I have the right number of rows? – Comparison between source and target systems (missing records, extra records, mismatched records)
      • Is the data that I have the same as the data that the source has? – Column-level comparison confirming that data in the source system is faithfully represented in the target system (common examples are zip codes, dates, currency, or use of double-byte character sets)
    • Performance
      • Does the load happen in a reasonable time? (related to timeliness)
      • Do user queries finish in a reasonable time? (related to usability)
    • Redundancy
      • When dealing with multiple sources, are we seeing duplicates coming from a single source or from multiple sources, and how are they being handled? Duplicates can span all columns of a row (physical duplicates) or only business-specific columns (logical duplicates).
    • Stress
      • What is the impact of increasing data loads from their known state of X to 3X? Are known performance levels maintained?
    • Timeliness
      • Is the data getting to the users in time for them to make decisions?
      • Are service levels being met?
      • How old is the data that is available?
      • When was the source last refreshed? The slowest-updating source determines the freshness of the data set.
      • How does our system handle race conditions when data arrives (or does not arrive) on time?
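Several of the physical checks above can be automated with a small amount of code. The following is a minimal sketch in Python, assuming the source and target extracts have been loaded into memory; the record layout and the `id` key are hypothetical, not from any particular system. It covers the row-count comparison, missing/extra records, and physical duplicates in one report:

```python
from collections import Counter

# Hypothetical extracts from a source and a target system.
source_rows = [
    {"id": 1, "zip": "30301"},
    {"id": 2, "zip": "10001"},
    {"id": 3, "zip": "60601"},
]
target_rows = [
    {"id": 1, "zip": "30301"},
    {"id": 2, "zip": "10001"},
    {"id": 2, "zip": "10001"},  # physical duplicate
]

def completeness_report(source, target, key="id"):
    """Compare row counts and find missing, extra, and duplicated records by key."""
    src_keys = Counter(r[key] for r in source)
    tgt_keys = Counter(r[key] for r in target)
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing_in_target": sorted(src_keys.keys() - tgt_keys.keys()),
        "extra_in_target": sorted(tgt_keys.keys() - src_keys.keys()),
        "duplicated_keys": sorted(k for k, n in tgt_keys.items() if n > 1),
    }

print(completeness_report(source_rows, target_rows))
```

Note that the raw counts match here (3 and 3) while a record is still missing and another is duplicated, which is exactly why a count comparison alone is not a sufficient completeness check.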

Logical

    • Consistency
      • Target systems should not have conflicting values that are held to be true in a source system.
      • If discrepancies exist between multiple sources, the true master must be determined. If a true source cannot be determined, variance reporting should be implemented.
    • Referential integrity
      • The target system should not have orphaned records.
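The orphaned-record check lends itself to a simple monitor: scan every child record for a foreign key with no matching parent. This sketch assumes in-memory extracts; the `customers`/`orders` tables and their key names are hypothetical examples, not from any particular schema:

```python
# Hypothetical parent and child tables from a target system.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # orphan: no customer 3 exists
]

def find_orphans(children, parents, fk, pk):
    """Return child records whose foreign key has no matching parent key."""
    parent_keys = {p[pk] for p in parents}
    return [c for c in children if c[fk] not in parent_keys]

orphans = find_orphans(orders, customers, fk="customer_id", pk="customer_id")
```

In a real warehouse the same idea is usually expressed as an anti-join (a LEFT JOIN filtered to NULL parent keys) so the database does the scan.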

Business

    • Accuracy
      • Domain value and record count comparisons to ensure that the values seen by end users in the source system remain true in the target system.
    • Domain integrity
      • How often do null, blank, unknown, or default values occur?
      • Ensure that source domain values are still enforced in the target.
    • Usability
      • Do we get meaningful reports for the business at the end of the process?
      • Is the information relevant, and have supporting data elements been pulled in from the sources?
    • Validity
      • Domain values should be represented in the target data set.
      • How are business rules applied? Are they applied consistently, and can we measure that they are being applied?
      • Do we track modifications made by business rules and their impact on the data set?
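The domain integrity and validity questions reduce to two measurable quantities for a column: how often suspect values (null, blank, unknown, defaults) occur, and which values fall outside the allowed domain. A minimal sketch, where the status column, the allowed domain, and the set of suspect placeholder values are all illustrative assumptions:

```python
# Assumed placeholder values that signal missing or defaulted data.
SUSPECT_VALUES = {None, "", " ", "unknown", "N/A"}
# Hypothetical allowed domain for a status column.
ALLOWED_STATUSES = {"active", "closed", "pending"}

statuses = ["active", "", "pending", "unknown", "archived"]

def domain_report(values, allowed):
    """Measure the suspect-value rate and list out-of-domain values."""
    suspect = [v for v in values if v in SUSPECT_VALUES]
    invalid = [v for v in values if v not in SUSPECT_VALUES and v not in allowed]
    return {
        "suspect_rate": len(suspect) / len(values),
        "invalid_values": invalid,
    }

report = domain_report(statuses, ALLOWED_STATUSES)
```

Tracking the suspect rate over time, rather than just at a point, is what turns this check into the kind of early warning system described earlier.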

These are just a few examples of the questions we could be asking. Using even this basic set, we can begin creating reports and monitors that tell us the ongoing state of our own system.

What kind of questions would you ask?