Alexander has more than 20 years of experience in data warehousing in various roles and environments. He has extensive experience in insurance and banking.
Data quality in data warehouse systems is becoming increasingly important. Growing regulatory requirements, as well as the increasing complexity of data warehouse solutions, are forcing companies to intensify (or start) a data quality (DQ) initiative.
This article's main focus is "traditional" data warehousing, but data quality is also an issue in more "modern" concepts such as data lakes. It covers the main points to consider and common pitfalls to avoid when implementing a data quality strategy. It does not cover choosing the right technology or tool to build a DQ framework.
One of the most obstructive problems of a DQ project is that, at first sight, it creates a lot of work for the business units without providing any extra functionality. A data quality initiative usually has strong proponents only if there are already severe data quality issues with a visible impact on the business, or if regulatory and compliance requirements demand it.
DQ’s treatment is similar to that of testing in software development—if a project runs out of time and/or budget, this part tends to be reduced first.
This, of course, is not the whole truth. A good data quality system helps detect errors early, thus speeding up the process of delivering data of “good enough” quality to the users.
Before discussing the topic, a common understanding of the terms used is important.
A data warehouse (DWH) is a non-operational system mainly used for decision support. It consolidates the data of the operational systems (all of them or a smaller subset) and provides query-optimized data for the users of the DWH system. The data warehouse should provide “a single version of truth” within the enterprise. A data warehouse is usually built of stages/layers:
The operational data is stored mostly unchanged into a staging layer. The core layer contains consolidated and unified data. The next optional stage is a derivation area, providing derived data (for example, a customer score for sales) and aggregations. The data mart layer contains data optimized for a given group of users. Data marts often contain aggregations and lots of derived metrics. Data warehouse users often work only with the data mart layer.
Between each stage, some kind of data transformation takes place. Usually, a data warehouse is periodically loaded with delta extractions of the operational data and contains algorithms to keep historical data.
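To make the loading and historization idea more concrete, here is a minimal Python sketch of a delta load into a historized core structure, assuming a simple valid_from/valid_to approach. The structures and field names are illustrative assumptions, not any specific product's implementation.

```python
from datetime import date

def apply_delta(core, staging, load_date):
    """Merge a delta extraction into a historized core structure (sketch).

    `core` maps a business key to a list of versions (newest last);
    `staging` maps the same key to the attributes extracted in this load.
    """
    for key, attributes in staging.items():
        versions = core.setdefault(key, [])
        current = versions[-1] if versions else None
        if current is None or current["attributes"] != attributes:
            if current is not None:
                current["valid_to"] = load_date   # close the previous version
            versions.append({
                "attributes": attributes,
                "valid_from": load_date,
                "valid_to": None,                 # open-ended = current version
            })
    return core

# A changed attribute creates a new version; unchanged rows are left alone.
core = {}
apply_delta(core, {"C1": {"city": "Stuttgart"}}, date(2020, 3, 1))
apply_delta(core, {"C1": {"city": "Berlin"}}, date(2020, 4, 1))
```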
Data quality is usually defined as a metric of how well a product meets user requirements. Different users might have different requirements for a product, so the data quality implementation depends on the user's perspective, and it is important to identify these needs.
Data quality does not mean the data has to be completely or almost error-free; it depends on the users' requirements. A "good enough" approach is a good choice to start with. Nowadays, bigger companies have a data (or information) governance policy, and data quality is a part of it. A data governance policy should describe how your company deals with data, how it makes sure that data has the right quality, and how it ensures that data privacy rules are not violated.
Data quality is an ongoing topic. A DQ control loop has to be implemented (see the next chapter). Regulatory requirements and compliance rules also have an impact on the level of data quality needed, such as the TCPA (US Telephone Consumer Protection Act) or the GDPR in Europe for privacy issues, but also industry-specific rules like Solvency II for insurers in the EU, BCBS 239 and others for banking, and so on.
As with all quality topics, DQ is an ongoing activity designed to maintain satisfactory quality. As a result of a DQ project, a control loop has to be implemented: define data quality rules, measure data quality, analyze the findings, and monitor both the data and the process.
The steps within this loop will be described in the next chapters.
Next, we will see how to improve data quality in data warehouse contexts as well as how to implement a successful DQ initiative. To do this, the following roles are needed:
- Data owner: responsible for a data domain and its quality, and the one who decides how findings are handled.
- Data user: works with the data (usually in the data mart layer), defines data quality rules, and analyzes their results.
- IT/data warehouse staff: implement the rules and help evaluate findings in the lower layers of the warehouse.
To ensure success, it is important to have these roles clearly defined and widely accepted within your organization in the early stages of your DQ project. It is equally important to find competent data specialists for these roles who support the project.
The data quality/data warehouse relationship is quite strong as defining the rules of the former requires a good understanding of the latter.
As discussed earlier, data users (and the data owner) are responsible for data use and therefore also for the needed level of data quality. Data users should have a good understanding of their data so they can give the best input for useful data quality rules.
They are also the ones who analyze the results of the data quality rules, so it is always a good idea to let them define their own rules. This further increases the acceptance of checking and rating the results of the DQ rules assigned to a data user unit (see the "Analyze" chapter).
The drawback of this approach is that data users normally only know the data mart layer, not the earlier layers of the data warehouse. If data was corrupted in the “lower” stages, this won’t be detected by checking just the “top” layer of your data warehouse.
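One way to catch such problems early is a reconciliation check between layers. The following Python sketch compares row counts and a control total between two layers; the row sets and the amount field are made up for the example, and a real implementation would run such comparisons inside the database.

```python
def reconcile(staging_rows, core_rows, amount_field="amount"):
    """Compare row count and a control total between two layers.

    Zero differences mean the layers are in sync for these two simple measures.
    """
    result = {
        "row_count_staging": len(staging_rows),
        "row_count_core": len(core_rows),
        "sum_staging": sum(r[amount_field] for r in staging_rows),
        "sum_core": sum(r[amount_field] for r in core_rows),
    }
    result["row_count_diff"] = result["row_count_core"] - result["row_count_staging"]
    result["sum_diff"] = result["sum_core"] - result["sum_staging"]
    return result

# Example: one row lost on the way from staging to core shows up immediately.
staging = [{"amount": 100.0}, {"amount": 50.0}]
core = [{"amount": 100.0}]
print(reconcile(staging, core))
```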
What kind of known errors might occur in a data warehouse?
These problems are often caused by people lacking the appropriate know-how and skills to define, implement, run, and work with a data warehouse solution.
DQ dimensions are a common way to identify and cluster DQ checks. There are many definitions, and the number of dimensions varies considerably: you might find 16 or even more. From a practical perspective, it is less confusing to start with a few dimensions and establish a general understanding of them among your users.
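As an illustration of how a few dimensions translate into concrete checks, here is a small Python sketch with one example rule each for completeness, uniqueness, and timeliness; the sample rows and field names are hypothetical.

```python
from datetime import date, timedelta

rows = [
    {"customer_id": 1, "email": "a@example.com", "last_update": date(2020, 3, 1)},
    {"customer_id": 1, "email": None,            "last_update": date(2019, 1, 1)},
]

def completeness(rows, field):
    """Share of rows where the field is filled."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values among all values of the field."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)

def timeliness(rows, field, max_age_days, today=date(2020, 4, 1)):
    """Share of rows updated within the allowed age."""
    limit = today - timedelta(days=max_age_days)
    return sum(r[field] >= limit for r in rows) / len(rows)

print(completeness(rows, "email"))          # 0.5 -> half the emails are missing
print(uniqueness(rows, "customer_id"))      # 0.5 -> duplicate customer_id
print(timeliness(rows, "last_update", 90))  # 0.5 -> one stale row
```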
Data generated by the data warehouse load process can be helpful as well.
Keep in mind that if errors are found, each data quality check has to be analyzed by at least one data user (see the "Analyze" chapter), so you'll need someone responsible and available to look after every check implemented.
Within a complex data warehouse, you might end up with many (sometimes thousands of) DQ rules. The process that executes the data quality rules should be robust and fast enough to handle this.
Don't check facts that are already guaranteed by the technical implementation. For example, if the data is stored in a relational DBMS, it is not necessary to check whether:
- primary key values are unique when a primary key constraint is defined,
- mandatory fields contain NULL values when the columns are declared NOT NULL,
- parent records exist when a foreign key constraint enforces the relationship.
That said, always keep in mind that a data warehouse is in constant change and that the data definition of fields and tables might change over time.
Housekeeping is very important. Rules defined by different data user units might overlap and should be consolidated. The more complex your organization, the more housekeeping will be needed. Data owners should implement a process of rules consolidation as a kind of “data quality for data quality rules.” Also, data quality checks might become useless if the data is no longer used or if its definition has changed.
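Rule consolidation can start with something as simple as detecting rules that check the same thing. The Python sketch below groups hypothetical rule definitions by target table, column, and check type and reports overlaps; real rule metadata will be richer, but the idea is the same.

```python
from collections import defaultdict

rules = [
    {"id": 15, "owner": "sales",   "table": "customer", "column": "email",   "check": "completeness"},
    {"id": 73, "owner": "finance", "table": "customer", "column": "email",   "check": "completeness"},
    {"id": 80, "owner": "sales",   "table": "contract", "column": "premium", "check": "range"},
]

def find_overlapping_rules(rules):
    """Group rules by what they check and return groups with more than one rule."""
    groups = defaultdict(list)
    for rule in rules:
        groups[(rule["table"], rule["column"], rule["check"])].append(rule["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

# Rules 15 and 73 check the same column in the same way: candidates for consolidation.
print(find_overlapping_rules(rules))
```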
Data quality rules can be classified based on the type of test.
Once you have defined what to check, you'll have to specify how to quantify the identified issues. Information such as "five data rows violate the DQ rule with ID 15" makes little sense on its own.
The following parts are missing:
- the total number of data rows checked, to put the finding into proportion,
- a threshold defining how many violations are still acceptable,
- a rating of the result (for example, a red/yellow/green traffic light) that tells the data users how urgently they need to react.
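A minimal Python sketch of such a quantified result could look like the following; the thresholds are assumptions chosen for the example, not fixed recommendations.

```python
def evaluate_rule(rule_id, rows_checked, rows_failed,
                  yellow_threshold=0.02, red_threshold=0.05):
    """Turn raw counts into a quantified, ratable result.

    Assumed thresholds: up to 2% failing rows is green, up to 5% is yellow,
    anything above is red.
    """
    failure_ratio = rows_failed / rows_checked if rows_checked else 0.0
    if failure_ratio > red_threshold:
        rating = "red"
    elif failure_ratio > yellow_threshold:
        rating = "yellow"
    else:
        rating = "green"
    return {
        "rule_id": rule_id,
        "rows_checked": rows_checked,
        "rows_failed": rows_failed,
        "failure_ratio": failure_ratio,
        "rating": rating,
    }

# "Five rows violate rule 15" only becomes meaningful in relation to the rows checked.
print(evaluate_rule(15, rows_checked=100_000, rows_failed=5))   # green
print(evaluate_rule(15, rows_checked=50, rows_failed=5))        # red
```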
Metadata is important for routing findings to the right data users in the "Analyze" phase and for monitoring the data quality control loop.
Data lineage shows the flow of data between two points. With data lineage, you can find all data elements influencing a given target field within your warehouse.
Using data lineage to assign users to rules can be problematic. As mentioned before, business users usually know only the data mart layer (and the operational source systems), but not the lower layers of the data warehouse. Mapping via data lineage would therefore assign data users rules they're not familiar with. For the lower layers, IT staff may be needed to evaluate a data quality finding. In many cases, a manual mapping or a mixed approach (mapping via data lineage only within the data mart) can help.
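Conceptually, data lineage can be treated as a graph stating which fields a target field is derived from. The sketch below uses a made-up lineage graph to collect every upstream element influencing a data mart field; it only illustrates the idea and does not reflect any particular lineage tool.

```python
# Hypothetical lineage: each target field lists the fields it is derived from.
lineage = {
    "mart.customer_score": ["core.customer.revenue", "core.customer.segment"],
    "core.customer.revenue": ["staging.orders.amount"],
    "core.customer.segment": ["staging.customer.segment_code"],
}

def upstream_fields(target, lineage):
    """Return every field that (directly or indirectly) influences the target."""
    result, stack = set(), [target]
    while stack:
        field = stack.pop()
        for source in lineage.get(field, []):
            if source not in result:
                result.add(source)
                stack.append(source)
    return result

# Every staging and core field feeding the data mart score:
print(upstream_fields("mart.customer_score", lineage))
```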
Measuring data quality means executing the available data quality rules, which should be done automatically, triggered by the load processes of the data warehouse. As we’ve seen before, there might be a remarkable number of data quality rules, so the checks will be time-consuming.
In a perfect world, a data warehouse would be loaded only if all data is error-free. In the real world, this is almost never the case. Depending on the overall loading strategy, the data quality process may or may not (the latter is far more likely) control the load process. It is a good design to have the data quality processes (job networks) run in parallel with, and linked to, the "regular" data warehouse load processes.
If there are defined service-level agreements, make sure the data quality checks do not jeopardize the data warehouse loads. Errors or abends in data quality processes should not stop the regular load process. Unexpected errors within the data quality processes should be reported and surfaced in the "Analyze" phase (see the next chapter).
Keep in mind that a data quality rule might crash because of unexpected errors (maybe the rule itself was wrongly implemented, or the underlying data structure changed over time). It would help if your data quality system provided a mechanism to deactivate such rules, especially if your company has few releases per year.
DQ processes should be executed and reported as early as possible—ideally, right after the data checked was loaded. This helps detect errors as early as possible during the load of the data warehouse (some complex warehouse system loads have a duration of several days).
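A Python sketch of this execution step under the assumptions described above: rules marked inactive are skipped, and a crashing rule is reported for the "Analyze" phase but never stops the run. The rule structure and check functions are hypothetical.

```python
import logging

def run_rules(rules, data):
    """Execute all active DQ rules; never let a broken rule stop the load."""
    results = []
    for rule in rules:
        if not rule.get("active", True):
            continue                       # deactivated rules are skipped
        try:
            results.append({"rule_id": rule["id"], **rule["check"](data)})
        except Exception:
            # An unexpected error in the rule itself is a finding for the
            # "Analyze" phase, not a reason to abort the warehouse load.
            logging.exception("DQ rule %s crashed", rule["id"])
            results.append({"rule_id": rule["id"], "rating": "error"})
    return results

# Example: one well-formed rule and one that crashes on the given data.
rules = [
    {"id": 1, "check": lambda rows: {"rows_failed": sum(r["email"] is None for r in rows),
                                     "rows_checked": len(rows)}},
    {"id": 2, "check": lambda rows: {"rows_failed": rows["missing_key"]}},  # will crash
]
print(run_rules(rules, [{"email": None}, {"email": "a@example.com"}]))
```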
In this context, “analyze” means reacting to data quality findings. This is a task for the assigned data users and the data owner.
The way to react should be clearly defined by your data quality project. Data users should be obligated to comment on a rule with findings (at least rules with a red light), explaining what measures are being taken to handle the finding. The data owner needs to be informed and should decide together with the data user(s).
The following actions are possible:
- Fix the root cause of the finding, for example, in the source system or in the load process.
- Accept and overrule the finding, documenting why the data is still good enough for its intended use.
- Adapt or deactivate the data quality rule if it no longer matches the current data definition.
In a perfect world, every data quality problem would be fixed. However, lack of resources and/or time often results in workarounds.
To be able to react in time, the DQ system must inform the data users about "their" rules with findings. A data quality dashboard (perhaps combined with notification messages when new findings come up) is a good idea. The earlier the users are informed about findings, the better.
The data quality dashboard should contain:
- all data quality rules assigned to the data user, together with their latest results and ratings,
- the comments and measures recorded for each finding,
- an overview (KPIs) per data domain so problem areas are easy to spot.
The dashboard should also show the current status of the recent data warehouse load process, giving the users a 360-degree view of the data warehouse load process.
The data owner is responsible for making sure that every finding was commented on and the status of the data quality (original or overruled) is at least yellow for all data users.
For a quick overview, it helps to build simple KPIs (key performance indicators) for data users and the data owner. Computing an overall traffic light for all associated rules' results is quite easy if each rule is given the same weight.
Personally, I think computing an overall value of data quality for a given data domain is rather complex and tends to be cabalistic, but you could at least show the number of overall rules grouped by result for a data domain (e.g., “100 DQ rules with 90% green, 5% yellow, and 5% red results”).
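Keeping every rule equally weighted, the overall traffic light and the grouped percentages can be computed along these lines (a Python sketch with assumed result values):

```python
from collections import Counter

def domain_kpi(results):
    """Aggregate individual rule ratings into one overview for a data domain."""
    counts = Counter(r["rating"] for r in results)
    total = sum(counts.values())
    shares = {rating: round(100 * n / total, 1) for rating, n in counts.items()}
    # With equal weights, the overall light is simply the worst individual result.
    overall = "red" if counts["red"] else "yellow" if counts["yellow"] else "green"
    return {"overall": overall, "shares": shares, "rules": total}

results = [{"rating": "green"}] * 90 + [{"rating": "yellow"}] * 5 + [{"rating": "red"}] * 5
print(domain_kpi(results))
# {'overall': 'red', 'shares': {'green': 90.0, 'yellow': 5.0, 'red': 5.0}, 'rules': 100}
```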
It is the data owner’s task to ensure that the findings will be fixed and data quality improved.
As the data warehouse processes often change, the data quality mechanism also needs maintenance.
A data owner should always take care of the following points:
Monitoring the entire data quality process helps to improve it over time.
Things worth watching would be:
Many of the following points are important in any kind of project.
Anticipate resistance. As we have seen, if there is no urgent quality issue, data quality is often viewed as an additional burden without offering new functionality. Keep in mind that it might create additional workload for the data users. In many cases, compliance and regulatory demands can help you to convince the users to see it as an unavoidable requirement.
Find a sponsor. As noted above, DQ is not a fast-selling item, so a powerful sponsor/stakeholder is needed—the higher in the management, the better.
Find allies. As with the sponsor, anyone who shares the idea of strong data quality will be most helpful. The DQ control loop is an ongoing process and needs people to keep it alive.
Start small. If there’s been no DQ strategy so far, look for a business unit that needs better data quality. Build a prototype to show them the benefit of better data. If your task is to improve or even replace a given data quality strategy, look at things working well/being accepted in the organization, and keep them.
Don’t lose sight of the whole picture. Although starting small, keep in mind that some points, especially the roles, are prerequisites for a successful DQ strategy.
Once implemented, don’t let go. The data quality process needs to be part of data warehouse use. Over time, focus on data quality tends to get a bit lost, and it’s up to you to maintain it.
Data quality is determined based on data quality rules, which are defined by people who know and work with the data. Data quality checks should be defined for every relevant data object.
Good data quality means the data stored in the data warehouse is of sufficient quality for the user’s needs and applicable regulatory requirements.
Erroneous data causes false estimations and bad business decisions. In addition, a lack of data quality can result in serious regulatory compliance issues.
A data warehouse is a non-operational system mainly used for decision support. It consolidates the data of the operational systems (all of them or a smaller subset) and provides query-optimized data for the users of the data warehouse system.