Data drives the business decisions that determine how well organizations perform in the real world. Vast volumes of data are generated every day, but not all of it is reliable enough in its raw form to drive a mission-critical business decision.
Today, data has a credibility problem. Business leaders and decision makers need to understand the impact of data quality, especially within their own organizations.
In this article, we will discuss what data quality means, particularly in the world of enterprise IT. Then we’ll look at some best practices that help ensure high data quality.
What is data quality?
Data quality refers to the utility of data as a function of attributes that determine its fitness and reliability to satisfy the intended use.
These attributes—in the form of metrics, KPIs, and any other qualitative or quantitative requirements—may be subjective and justifiable for a unique set of use cases and context.
If that feels unclear, that’s because a single formal definition of data quality doesn’t exist. (The way you define a quality dinner, for instance, may differ from how a Michelin-starred chef defines one.) Instead, data quality is perceived differently depending on the perspective of whoever is using the data.
To understand the quality of a dataset, a good place to start is measuring how closely it matches a desired state.
For example, a dataset that is free of errors, consistent in its format, and complete in its features may meet all the requirements and expectations that determine data quality.
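As a minimal sketch of that idea, the snippet below checks records against a hypothetical "desired state": required fields present (completeness), dates in one format (consistency), and values in a plausible range (error-free). The field names and rules are illustrative assumptions, not a standard.

```python
# Check records against a hypothetical desired state.
# Field names, the date format, and the age range are assumptions
# chosen for illustration only.
import re

REQUIRED_FIELDS = {"id", "name", "birth_date", "age"}
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # ISO 8601 dates

def check_record(record: dict) -> list[str]:
    """Return a list of quality issues found in one record."""
    issues = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        issues.append(f"incomplete: missing {sorted(missing)}")
    date = record.get("birth_date", "")
    if date and not DATE_FORMAT.match(date):
        issues.append(f"inconsistent date format: {date!r}")
    age = record.get("age")
    if age is not None and not (0 <= age <= 130):
        issues.append(f"implausible age: {age}")
    return issues

records = [
    {"id": 1, "name": "Ada", "birth_date": "1990-04-01", "age": 35},
    {"id": 2, "name": "Bo", "birth_date": "04/01/1990", "age": -3},
]
for r in records:
    print(r["id"], check_record(r) or "ok")
```

A record that passes every rule returns an empty issue list; the checks themselves would be tailored to each organization's own requirements.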
(Understand how data quality compares to data integrity.)
Defining data quality in enterprise IT
Now let’s discuss data quality from a standards perspective, as the term is widely used in the domains of:
- Database management
- Big data
- Enterprise IT
Let’s first look at the definition of ‘quality’ according to the ISO 9000:2015 standard:
Quality is the degree to which inherent characteristics of an object meet requirements.
We can apply this definition to data and the way it is used in the IT industry. In the domain of database management, the term ‘dimensions’ describes the characteristics or measurable features of a dataset.
The quality of data is also subject to external and extrinsic factors, such as availability and compliance. So, here’s a holistic, standards-based definition of quality data in big data applications:
Data quality is the degree to which dimensions of data meet requirements.
It’s important to note that the term dimensions does not refer to the categories used in datasets. Instead, it refers to the measurable features that describe particular characteristics of the dataset. Compared against the desired state of the data, these characteristics let you understand and quantify data quality in measurable terms.
For instance, some of the common dimensions of data quality are:
- Accuracy. The degree of closeness to real data.
- Availability. The degree to which the data can be accessed by users or systems.
- Completeness. The degree to which all data attributes, records, files, values, and metadata are present and described.
- Compliance. The degree to which data complies with applicable laws.
- Consistency. The degree to which data across multiple datasets or ranges complies with defined rules.
- Integrity. The degree of absence of corruption, manipulation, loss, leakage or unauthorized access to the dataset.
- Latency. The delay in production and availability of data.
- Objectivity. The degree to which data is created and can be evaluated without bias.
- Plausibility. The degree to which a dataset is relevant to real-world scenarios.
- Redundancy. The presence of logically identical information in the data.
- Traceability. The ability to verify the lineage of data.
- Validity. The degree to which data complies with existing rules.
- Volatility. The degree to which dataset values change over time.
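To make a few of these dimensions concrete, here is a hedged sketch that scores completeness, redundancy, and validity over a toy dataset. The field names and the email validity rule are assumptions for illustration; real scoring would use each organization's own rules.

```python
# Score three data quality dimensions over a toy dataset.
# The fields and the "@ in email" validity rule are illustrative
# assumptions, not a standard metric definition.

def dimension_scores(rows: list[dict], fields: list[str]) -> dict:
    total_cells = len(rows) * len(fields)
    filled = sum(1 for row in rows for f in fields
                 if row.get(f) not in (None, ""))
    unique = {tuple(row.get(f) for f in fields) for row in rows}
    valid_emails = sum(1 for row in rows if "@" in (row.get("email") or ""))
    return {
        # Completeness: share of non-empty cells
        "completeness": filled / total_cells,
        # Redundancy: share of rows that duplicate another row
        "redundancy": 1 - len(unique) / len(rows),
        # Validity: share of rows whose email passes a simple rule
        "validity": valid_emails / len(rows),
    }

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "a@example.com"},  # duplicate row
]
print(dimension_scores(rows, ["id", "email"]))
```

Expressing each dimension as a ratio between 0 and 1 makes it easy to compare datasets and to track whether quality improves over time.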
DAMA-NL provides a detailed list of 60 Data Quality Dimensions, available in PDF.
Best practices for data quality
Data quality can be improved in many ways.
First and foremost, data quality depends on how you’ve selected, defined, and measured the quality attributes and dimensions.
In a business setting, there are many ways to measure and enforce data quality. IT organizations can take the following steps to ensure that data quality is objectively high and is used to train models that produce a profitable business impact:
- Find the most appropriate data quality dimensions from a business, operational, and user perspective. Not all 60 data quality dimensions are necessary for every use case. Likely, even the 12 included above are too many for one use case.
- Relate each data quality dimension to a greater objective and goal. This goal can be intangible, like user satisfaction and brand loyalty. The dimensions can be highly correlated to several objectives—IT should determine how to optimize each dimension in order to maximize the larger set of objectives.
- Establish the right KPIs, metrics, and indicators to accurately measure against each data quality dimension. Choose the right metrics, and understand how to benchmark them properly.
- Improve data quality at the source. Enforce data cleanup practices at the edge of the network where data is generated (if possible).
- Eliminate the root causes that introduce errors and lapses in data quality. You might take a shortcut when you find a bad data point, correcting it manually, but that means you haven’t prevented what caused the issue in the first place. Root cause analysis is a necessary and worthwhile practice for data.
- Communicate with the stakeholders and partners involved in supplying data. Data cleanup may require a shift in responsibility at the source that may be external to the organization. By getting the right messages across to data creators, organizations can find ways to source high quality data that favors everyone in the data supply pipeline.
- Finally, identify and understand the patterns, insights, and abstractions hidden within the data instead of deploying models that churn raw data into predefined features with limited relevance to real-world business objectives.
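The third practice above, establishing KPIs per dimension, can be sketched as a simple benchmark check. The dimensions, measured values, and targets below are illustrative assumptions; each organization would set its own.

```python
# Flag data quality dimensions that miss their KPI target.
# Targets and measured values are hypothetical examples.

KPI_TARGETS = {"completeness": 0.98, "validity": 0.95, "latency_seconds": 60}

def evaluate_kpis(measured: dict, targets: dict) -> dict:
    """Return pass/fail per KPI; latency is lower-is-better."""
    results = {}
    for name, target in targets.items():
        value = measured.get(name)
        if value is None:
            results[name] = "not measured"
        elif name == "latency_seconds":
            results[name] = "pass" if value <= target else "fail"
        else:
            results[name] = "pass" if value >= target else "fail"
    return results

measured = {"completeness": 0.992, "validity": 0.91, "latency_seconds": 42}
print(evaluate_kpis(measured, KPI_TARGETS))
```

A report like this makes quality gaps visible per dimension, so teams can trace a failing KPI back to its root cause rather than patching individual bad records.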
It’s easy to use SaaS options that come with predefined data features, but they can hinder a full and deep understanding of the data and the business.