Designing a Data quality index.
Updated: Jul 12
To get dataset score that matters to Users
Measuring data quality is not something new. there are many data profiling tools available on the market that help data analysts understand gaps in their data and dig into root - causes.
You can read here how to create a data quality dashboard
Despite having various sets of tools it's not easy to answer questions like :
- How good overall data quality of this set is?
- Did data quality improved over the last month?
- How trustworthy is this data set?
With data Lakes and warehouse's high importance and a growing number of activities around data, data quality is something that not only experts users should be aware of. Emerging of modern BI and Self-service analytics roles like data analyst, data scientist, or data engineer that are not into data quality details and could use simple metrics to get a quality overview of dataset they want to use.
How to design a good data quality score
Should be seen from different angles and covered different dimensions the formula is not so clear. Let's see the requirements it shall fulfill:
1. Simple to understand. A user looking in to catalog of large number data sets should quickly get an initial understanding of how trustworthy it is without drilling down to details
2. Scaling proof - if the score was run on a smaller but representative sample it should more or less similar.
3. Comparable with other data quality scores. Metrics can be different for different datasets but it should give users high-level comparison even if that sets are much different in size.
4. Normalized - Clearly provided highest and lowest score and benchmark to see what can be expected and how far we are from perfect
Data quality indicator should show how the dataset is matching the expectations of the user. What it means is that we can find a lot of missing values for attributes but we may expect this or even don' care about it.
This is the foundation that we build Data quality scores on.
Expectations for columns/attributes :
- marked as mandatory to be completed
- completed inline with the definition of valid value
- completed with values defined in reference data source
- relation between data sets setting the dependencies or correlations between column
How to define Data quality issues - report of data quality problem type on attribute or record or group of elements, if 15 out of 100 mandatory values are missing then we can say data quality is 85%. Confidence represents the probability that data quality issue is a real business problem
Data quality for a single attribute in the record (for one cell)
- it's True or false value is either fulfilling standard or not.
Data Quality index for attribute
- certain attribute or column score based on rules set for these attributes.
Data Quality index of a record :
There are few approaches that can be used and they should reflect data usage by businesses:
- Basic average of attributes score, if 1 out of 5 required attributes is violating data rules we can say data is 80% compliant
- Calculate only fully compliant records, so if the record has 0 incompliant attributes then the record is ok otherwise its score is 0.
-Minimum required qualifications for a record to use
For example :
Record with 10 attributes but 4 of them Minimum Quality Requirements
if one of those 4 is wrong then the record is flag wrong.
- weighted average so some attributes are more important than others
Data Quality index for data seat should be based on the business use case but the most common approach would be the average value of all records score
In the end, a non-data summarized quality score can give a useful overview of how reliable the data set is to non-data quality experts. Quality data leads to quality information and making quality decisions. Indicators can show an analyst where he stands from the start.
Subscribe for more and follow me on twitter.