How to measure the quality of data?

Apr 26, 2024 9:40:18 AM

Data quality issues are present in all data warehouses. But how can you tell if the quality of your data is good? In this blog, I will go through the data quality dimensions commonly referenced in the literature, which give you criteria for recognising and monitoring data quality issues. I will also discuss how tests for these dimensions can be applied in the different zones of a data warehouse.

Data quality dimensions

The first step in data quality management is to define the dimensions appropriate for your use case. This set of data quality dimensions is the foundation for systematic data quality management: with it, you can implement standardized tests and templates for testing similar data quality characteristics across different entities.
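As a minimal sketch of what such standardized test templates could look like, here is one possible approach in Python. The table and column names are hypothetical examples, and any database connection exposing a DB-API cursor would work:

```python
# A sketch of standardized, reusable data quality test templates.
# Table and column names below are hypothetical; the identifiers are
# assumed to come from a trusted, version-controlled test plan.

NOT_NULL_SQL = "SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
DUPLICATES_SQL = (
    "SELECT COUNT(*) FROM ("
    " SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1"
    ") d"
)

def run_test(cursor, template: str, table: str, column: str) -> int:
    """Run one templated test and return the number of violating rows."""
    cursor.execute(template.format(table=table, column=column))
    return cursor.fetchone()[0]

# The same templates can be applied to any entity in the warehouse:
TEST_PLAN = [
    (NOT_NULL_SQL, "staging.customer", "customer_id"),
    (DUPLICATES_SQL, "staging.customer", "customer_id"),
    (NOT_NULL_SQL, "staging.sales_order", "order_id"),
]
```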

Several reference frameworks can help you get started. For example, the UK Government Data Quality Hub lists the following data quality dimensions:

  1. Accuracy - Is the data correct?
  2. Completeness - Are the expected records and attribute values available and complete?
  3. Uniqueness - Are there duplicates?
  4. Consistency - Are there conflicting values within a record or across different data sets?
  5. Timeliness - Is the data available when expected?
  6. Validity - Does the data conform to the expected format, type, and range?

The right set of dimensions depends on your use case and business requirements, and it will naturally evolve over time.
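To make the dimensions above concrete, here is a minimal self-contained sketch of one possible check per dimension, run against a hypothetical orders table. All names, rules, and thresholds are illustrative assumptions, not part of any specific framework:

```python
import sqlite3

# One illustrative check per data quality dimension, each counting the
# rows that violate the expectation (zero violations = pass).
CHECKS = {
    "accuracy":     "SELECT COUNT(*) FROM orders WHERE total_amount < 0",
    "completeness": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "uniqueness":   ("SELECT COUNT(*) FROM (SELECT order_id FROM orders "
                     "GROUP BY order_id HAVING COUNT(*) > 1)"),
    "consistency":  ("SELECT COUNT(*) FROM orders "
                     "WHERE delivery_date < order_date"),
    "timeliness":   ("SELECT COUNT(*) FROM orders "
                     "WHERE loaded_at < DATETIME('now', '-1 day')"),
    "validity":     ("SELECT COUNT(*) FROM orders "
                     "WHERE status NOT IN ('NEW', 'SHIPPED', 'CLOSED')"),
}

conn = sqlite3.connect(":memory:")  # stands in for your warehouse connection
conn.execute("""CREATE TABLE orders (
    order_id INTEGER, customer_id INTEGER, total_amount REAL,
    order_date TEXT, delivery_date TEXT, loaded_at TEXT, status TEXT)""")

for dimension, sql in CHECKS.items():
    violations = conn.execute(sql).fetchone()[0]
    print(f"{dimension}: {violations} violating rows")
```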

 

Measuring data quality in a data warehouse

Typical zones of a data warehouse

The next step is to define how to test these data quality dimensions in your data warehouse. Data warehouses are organised into zones (layers) that serve different purposes. Usually, there are at least the following:

Staging
  • Landing zone for incoming source data
  • Flat tables specific to each source file format, into which the source files are loaded
  • Attributes and data types match the source data

Data warehouse
  • Stores the data and its change history
  • Modeled with e.g. the Data Vault methodology
  • If Data Vault is used, the warehouse could consist of a Raw Data Vault for data storage and a Business Data Vault for applying business rules

Publish
  • Serves the end-use of data
  • Modeled according to the requirements of the use case
  • Often e.g. a star schema for BI tools or flat tables for other purposes
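To make this concrete, the zones could be encoded as a small configuration that data quality tests later target. A sketch, with purely hypothetical schema names:

```python
# Hypothetical mapping of data warehouse zones to database schemas.
ZONES = {
    "staging":        "stg",  # landing zone for incoming source data
    "data_warehouse": "rdv",  # e.g. a Raw Data Vault for storage and history
    "publish":        "pub",  # end-use models such as star schemas
}
```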

 


 

What to test and where?

You can run data quality tests in each zone of a data warehouse, for somewhat different purposes. For example, you could be interested in analysing and comparing the quality of incoming data from different source systems, or in monitoring the status of the publish models used for reporting. The following breakdown gives examples of how you could test each data quality dimension in the different data warehouse zones:

Accuracy
  Staging:
    • Monitoring source system data accuracy
    • Option to discard incorrect data from the data warehouse
  Data warehouse:
    • Business-rule based tests
  Publish:
    • Publish data accuracy
    • Business-rule based tests
    • Option to discard incorrect data from publish

Completeness
  Staging:
    • Monitoring source system data completeness
    • Do expected attributes contain values (e.g. data warehouse key attributes)?
    • Are all records available?
  Data warehouse:
    • Referential integrity
    • Is expected data available for applying business rules?
    • Are values seen in transactional data sources also found in master data?
  Publish:
    • Referential integrity
    • Is expected data available for end-use?
    • Option to discard incomplete data from publish

Uniqueness
  Staging:
    • Monitoring source system duplicates
  Data warehouse:
    • Duplicate tests
    • Logical duplicates (e.g. these two customer ids should be the same customer)
  Publish:
    • Duplicate tests
    • Option to discard duplicates from publish

Consistency
  Staging:
    • Monitoring source system data consistency
  Data warehouse:
    • Business-rule based tests
    • Are values consistent with each other?
    • Are different datasets consistent (e.g. data from separate source systems)?
  Publish:
    • Business-rule based tests
    • Are values consistent with each other?

Timeliness
  Staging:
    • Monitoring that incoming data is loaded as expected
    • Timestamp-based testing to check that data is up to date
    • Pipeline lead time monitoring
  Data warehouse:
    • Pipeline lead time monitoring
  Publish:
    • Pipeline lead time monitoring

Validity
  Staging:
    • Monitoring source system data validity
    • Format validity tests
    • Option to discard invalid data from the data warehouse
  Data warehouse:
    • Monitoring validity of transformed data
  Publish:
    • Monitoring validity of transformed data
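As a concrete illustration, here are minimal sketches of two tests from the breakdown above: a referential integrity (completeness) test in the data warehouse zone and a duplicate (uniqueness) test in the publish zone. All table and column names are hypothetical:

```python
# Completeness / referential integrity in the data warehouse zone:
# every customer key referenced by sales should exist in the customer table.
REFERENTIAL_INTEGRITY_SQL = """
SELECT COUNT(*)
FROM rdv.sales s
LEFT JOIN rdv.customer c ON s.customer_key = c.customer_key
WHERE c.customer_key IS NULL
"""

# Uniqueness / duplicate test in the publish zone:
# the fact table should be unique at its declared grain.
DUPLICATE_SQL = """
SELECT COUNT(*)
FROM (
    SELECT order_id, order_line_no
    FROM pub.fact_sales
    GROUP BY order_id, order_line_no
    HAVING COUNT(*) > 1
) duplicates
"""

def assert_zero(cursor, sql: str, name: str) -> None:
    """Fail the load pipeline if a test finds any violating rows."""
    cursor.execute(sql)
    violations = cursor.fetchone()[0]
    if violations:
        raise AssertionError(f"{name}: {violations} violating rows")
```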

Redundancy in testing adds an extra dimension to the monitoring of data quality

Business requirements should guide how much data quality testing you do and in which zones the tests are applied. Testing consumes resources, so it also has cost and performance implications. Testing the same characteristics across different data warehouse zones might seem excessive. Still, the redundancy can show how effectively data quality improves as data is transformed through the warehouse. This improvement could happen, for example, when cleaned-up data from a master data management system is introduced in the data warehouse zone. Multi-zone testing also often makes it easier to pinpoint the root causes of data quality problems.
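One way to exploit this redundancy is to compute the same metric in every zone and compare the results. Here is a sketch measuring the null rate of one attribute across zones; the schema and table names are hypothetical, and in practice the table would often be named differently per zone:

```python
# Measure the null rate of the same attribute in each zone to see where
# completeness improves. Schemas and tables are hypothetical placeholders.
NULL_RATE_SQL = (
    "SELECT 1.0 * SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
    "/ NULLIF(COUNT(*), 0) FROM {schema}.{table}"
)

def null_rate_by_zone(cursor, zones, table, column):
    """Return {zone_name: null_rate} for the given attribute."""
    rates = {}
    for zone, schema in zones.items():
        cursor.execute(NULL_RATE_SQL.format(schema=schema, table=table,
                                            column=column))
        rates[zone] = cursor.fetchone()[0]
    return rates

# A null rate that drops between staging and publish suggests that the
# transformations (e.g. master data enrichment) improve completeness:
# null_rate_by_zone(cur, {"staging": "stg", "dw": "rdv", "publish": "pub"},
#                   "customer", "country_code")
```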

Typically, you would also build some monitoring outside of the data warehouse. For example, the solution could include monitoring source data integrations for the timeliness and completeness of incoming data, or monitoring end-use data in BI and other tools. When multiple components run tests with some overlap, you can catch critical issues even if one system is temporarily down.
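For instance, a minimal sketch of a timeliness check outside the warehouse could watch the landing directory of a file integration. The path and SLA below are hypothetical assumptions:

```python
import datetime as dt
from pathlib import Path

def newest_file_age(incoming_dir: str) -> dt.timedelta | None:
    """Age of the newest file in the landing directory, or None if empty."""
    files = list(Path(incoming_dir).glob("*"))
    if not files:
        return None  # no files at all is itself worth an alert
    newest = max(f.stat().st_mtime for f in files)
    return dt.datetime.now() - dt.datetime.fromtimestamp(newest)

SLA = dt.timedelta(hours=6)                    # hypothetical agreed SLA
age = newest_file_age("/data/incoming/sales")  # hypothetical landing path
if age is None or age > SLA:
    print(f"ALERT: incoming data is late (newest file age: {age})")
```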

Data quality monitoring with Agile Data Engine

We have seen and built various ways to manage data quality in our customer environments. The solutions range from custom business rules and mapping tables in the data warehouse load logic to implementing dedicated master data management systems. We compiled our best practices for data quality monitoring with Agile Data Engine into a simple guide.

Also see the next blog in the series: Data quality monitoring for short-term and long-term purposes.