Good data quality management is expensive. Poor data quality management is “expensiver”.
According to a recent Gartner report, data quality issues cost large organizations an average of USD 12.9 million per year. In other words, having good data is one way to avoid losing money!
What does data quality mean?
Before highlighting the key best practices of data quality management, it’s important to understand what this concept of data quality means in a business scenario.
There is unfortunately no universally accepted definition of data quality. This is perhaps because quality is itself contextual. Think about what “high quality” could mean for a satellite imagery dataset. High resolution, lack of cloud cover, frequency of data acquisition, number of bands — these are only a few dimensions in which we could consider just this one type of data to display quality or lack thereof. But depending on application, only a subset of these dimensions may be relevant: timeliness for change detection, resolution and lack of obstruction for accurate mapping, or availability of specific bands for specialized remote sensing applications.
In the business context we may therefore argue that data quality is the degree to which data contributes to the goals of an organization. Some companies, governments or non-profits might look at the timeliness and relevancy of the data they gather, while others might focus on consistency and discoverability. What quality means depends on each entity and its objectives. Each organization faces a potentially daunting amount of choice when it comes to crafting their own bespoke definition of quality: there are actually more than 60 dimensions which can be taken into consideration!
Best practices in data quality management
The experience of bad data quality management is ubiquitous. Newsletters that fail to deliver, or that annoy recipients because email addresses are duplicated or incorrect. Wrong products shipped to customers because addresses or preferences were never updated. Sales strategies whose profitability cannot be gauged because the data is not traceable. Did we mention standards? The list can go on and on.
All these types of situations can add up to huge, and arguably needless, losses of revenue, time and employee neurons. So, what can you do to avoid them? Here are some of the key best practices that you can put into place to improve your data quality management.
Define clear metrics
If something is not measured, it cannot be improved. The first and most important step when it comes to data quality management is to have a clear set of metrics to assess the data you collect. As we have said before, this depends on the dimensions you deem important for your business. For example, you can look at operationalizing concepts such as timeliness, accuracy, consistency, and discoverability.
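As a minimal sketch of what "operationalizing" can look like, the snippet below turns two dimensions (completeness and timeliness) into numbers for a toy record set. The field names (`email`, `updated_at`) and the seven-day freshness window are illustrative assumptions, not a prescribed schema.

```python
from datetime import datetime, timedelta

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def timeliness(records, field, max_age, now):
    """Fraction of records whose `field` timestamp is within `max_age` of `now`."""
    if not records:
        return 0.0
    fresh = sum(1 for r in records if now - r[field] <= max_age)
    return fresh / len(records)

now = datetime(2024, 1, 10)
records = [
    {"email": "a@example.com", "updated_at": datetime(2024, 1, 9)},
    {"email": "", "updated_at": datetime(2023, 6, 1)},
    {"email": "b@example.com", "updated_at": datetime(2024, 1, 8)},
]
# Two of three records have an email; two were updated in the last week.
print(completeness(records, "email"))
print(timeliness(records, "updated_at", timedelta(days=7), now))
```

Once dimensions are expressed as ratios like these, they can be tracked over time and given target thresholds, which is what makes them actionable.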
Timeliness and reliability
This is an area where recent developments in the world of data infrastructure come in quite handy. In particular, the standardization of observability metrics through systems like Prometheus is very promising in providing a unified view of data streams, an idea which only a few years ago may have seemed a mere pipe dream.
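The kind of per-stream number such a system would track is easy to sketch. Below, a "freshness lag" is computed for each stream: the seconds since it last produced a record. In a real deployment this value could be exported as a gauge metric (for example via the `prometheus_client` library) and alerted on; the stream names and timestamps here are invented.

```python
from datetime import datetime

def freshness_lag_seconds(last_seen, now):
    """Seconds since a stream last produced a record.

    A per-stream freshness number an observability system can alert on."""
    return (now - last_seen).total_seconds()

now = datetime(2024, 1, 10, 12, 0, 0)
last_seen_by_stream = {
    "orders": datetime(2024, 1, 10, 11, 59, 0),      # 1 minute ago
    "clickstream": datetime(2024, 1, 10, 9, 0, 0),   # 3 hours ago
}
for stream, last_seen in last_seen_by_stream.items():
    print(stream, freshness_lag_seconds(last_seen, now))
```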
Measuring the reliability of data streams is only one key area to consider. There is also an issue of quantifying “inherent” data quality. Good data should be both accurate and consistent.
Accuracy and consistency
When it comes to accuracy, we could think of putting a number to the extent to which a dataset can be trusted to be an accurate representation of some externally verifiable fact: for instance, we could consider how much error is involved in obtaining data from one set of cheap-but-cheerful sensors, compared to accurate-but-expensive alternatives.
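One simple way to put a number to this is mean absolute error against the trusted reference. The sensor readings below are made up for illustration.

```python
def mean_absolute_error(measured, reference):
    """Average absolute deviation of cheap-sensor readings from a trusted reference."""
    assert len(measured) == len(reference)
    return sum(abs(m - r) for m, r in zip(measured, reference)) / len(measured)

cheap_sensor = [20.4, 21.1, 19.8, 22.0]
reference = [20.0, 21.0, 20.0, 21.5]
print(mean_absolute_error(cheap_sensor, reference))
```

An error figure like this lets you decide whether the cheap-but-cheerful option is accurate enough for the application at hand, rather than arguing about "trust" in the abstract.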
Consistency refers to the underlying semantics of datasets. We may ask, are data points good representations of the same construct? This is harder than it looks. Consider for instance the problem of comparing regions in different countries. Germany and France are roughly similar in size, but one is divided into 16 Bundesländer, and the other into 101 departments. How should we define consistent regional areas in both countries? For this particular problem there are admirable efforts at harmonization, such as the Nomenclature of Territorial Units for Statistics (NUTS), or the GADM database. Such definitional problems abound wherever humans rely on high-level concepts to make sense of the world: “region”, “discipline”, “syndrome”, etc. — all incredibly difficult to nail down. We may say that all data is ultimately inconsistent to some extent, and ask whether we can quantify the bias due to inconsistency (e.g. by measuring the influence of outliers).
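The parenthetical idea of measuring the influence of outliers can be sketched directly: compare a summary statistic with and without the most extreme point. The sample values below are invented, and this is only one of many possible influence measures.

```python
def outlier_influence(values):
    """Absolute shift in the mean caused by removing the single most extreme point."""
    mean_all = sum(values) / len(values)
    extreme_idx = max(range(len(values)), key=lambda i: abs(values[i] - mean_all))
    rest = values[:extreme_idx] + values[extreme_idx + 1:]
    mean_rest = sum(rest) / len(rest)
    return abs(mean_all - mean_rest)

# One region reports income on a very different basis from the others.
regional_incomes = [30_000, 32_000, 31_000, 29_000, 250_000]
print(outlier_influence(regional_incomes))
```

A large shift signals that a single inconsistent data point is dominating the summary, which is exactly the kind of definitional mismatch the NUTS and GADM efforts try to prevent upstream.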
Discoverability
Discoverability is another dimension of data quality, related to the human factors involved in making data valuable. This is perhaps the area where more efforts at quantification can help the most. The development of internal data platforms and the rise of people analytics open an entire new world for the modelling and quantification of internal business processes that increasingly rely on data. In particular, when internal data platforms are websites, and employees are also website users, an entire suite of well-established constructs (daily actives, click-through rates, session length, etc.) become accessible as useful metrics.
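Applied to an internal data platform, those product-style constructs are straightforward to compute from an event log. The log format below (user, action, day) is a hypothetical example of what such a platform might record.

```python
from datetime import date

events = [
    {"user": "ana", "action": "view", "day": date(2024, 1, 9)},
    {"user": "ana", "action": "click", "day": date(2024, 1, 9)},
    {"user": "bo", "action": "view", "day": date(2024, 1, 9)},
    {"user": "bo", "action": "view", "day": date(2024, 1, 10)},
]

def daily_active_users(events, day):
    """Number of distinct users with at least one event on `day`."""
    return len({e["user"] for e in events if e["day"] == day})

def click_through_rate(events):
    """Clicks divided by views across the whole log."""
    views = sum(1 for e in events if e["action"] == "view")
    clicks = sum(1 for e in events if e["action"] == "click")
    return clicks / views if views else 0.0

print(daily_active_users(events, date(2024, 1, 9)))  # ana and bo: 2 users
print(click_through_rate(events))
```

Low engagement numbers on a dataset's catalog page, for example, can be read as a discoverability problem: the data exists, but nobody is finding it.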
Look at failures as opportunities
Bad data happens, no matter how good or foolproof you think your system is. Databases get corrupted, pipelines stall, hard drives fail, sensors age. Bad data is often a symptom of an underlying failure that has bigger consequences, and detecting bad data can provide the opportunity to fix bigger problems.
Data auditing, broadly defined as the systematic examination of current data flows, assets and processes, is the best form of prevention when it comes to quality management. Auditing allows you to spot potential problems before they happen and take the necessary steps to mitigate them. As your data volume and sources increase, so too should the frequency of your audits. Audits can serve as a particularly useful backstop for problems that arise in poorly-measured areas of data quality. In this case, the frequency of audits can have a decisive effect on an organization’s ability to solve lingering data quality issues.
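Parts of an audit can be automated as simple checks that run between the scheduled deep dives. The sketch below flags two common problems, high null rates and duplicate keys; the threshold, field names and sample data are all illustrative assumptions.

```python
def audit(records, key_field, required_fields, max_null_rate=0.1):
    """Return a list of human-readable findings for common quality problems."""
    findings = []
    n = len(records)
    # Check 1: required fields should rarely be empty.
    for field in required_fields:
        nulls = sum(1 for r in records if not r.get(field))
        if n and nulls / n > max_null_rate:
            findings.append(f"{field}: null rate {nulls / n:.0%} exceeds threshold")
    # Check 2: the key field should uniquely identify each record.
    keys = [r.get(key_field) for r in records]
    duplicates = {k for k in keys if keys.count(k) > 1}
    if duplicates:
        findings.append(f"duplicate {key_field}: {sorted(duplicates)}")
    return findings

customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "c@example.com"},
]
for finding in audit(customers, "id", ["email"]):
    print(finding)
```

Checks like these will not replace a full audit, but they catch the mechanical failures early so that human auditors can focus on the semantic ones.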
Focus on culture
We have talked about the importance of culture before when it comes to data science initiatives. In the end, the success of your data quality management efforts is only as good as the people involved. That is why it is important to invest in employee development and education.
We have argued that data quality is contextual. People handling data in the organization need to be familiar not only with the basic concepts of data quality, but also with the meaning of data quality for your organization, and with the risks inherent in what you see as bad data. They need to be able to spot poor data and fix it.
The impact of data quality management
To paraphrase Tolstoy, all excellent datasets are alike, but each bad dataset is bad in its own way. There are clearly many ways to think about data quality, especially when it is lacking. This diversity of concerns makes the topic particularly ambiguous and challenging. It is therefore essential to prioritize the most important aspects of what quality may mean to your organization, and we hope this examination has helped with some ideas of where your journey might begin.
Looking for an efficient open-source solution to manage data in your project? Aorist is a tool for managing data for your ML project. It produces readable, intuitive code that you can inspect, edit, and run yourself. You can then focus on the hard parts while automating the repetitive parts. To get this, you just need a description of how your data is formatted and organized, and where it needs to go.
Image by pch.vector on Freepik