How to Build an Automated Data Quality Scoring Pipeline


Have you ever spent hours analysing a dataset only to find that your insights were wrong because of data quality issues? Or perhaps you've built a machine learning model that performed poorly in production because the training data wasn't representative of real-world conditions. These scenarios happen all too often in data science projects.

Take a simple example: think about the last time you worked with a spreadsheet of customer information. You might have noticed missing phone numbers, dates entered in different formats (12/05/2023 vs. 05/12/2023), or duplicate entries for the same person with slightly different spellings of their name ("John Smith" vs "Jon Smith"). When you tried calculating something simple like average customer age or total purchases per region, these inconsistencies led to incorrect results.

Or maybe you've built a basic sales prediction model using historical data, only to discover later that some store locations reported their numbers in thousands while others reported exact figures. Your model learned patterns from this inconsistent data, making its predictions unreliable when applied to new data.

These everyday data quality issues can significantly impact even straightforward analyses. As you move on to more complex projects, it becomes essential to focus on data preprocessing in machine learning: cleaning, standardising, and transforming your data into a format that the machine learning or deep learning model you feed it into can consume reliably.

In this guide, we'll build a practical data quality scoring pipeline that helps identify and quantify quality issues before they affect your analysis. We’ll look at what makes a good quality pipeline and go over a code snippet to learn how to implement one key component in detail.

Why Data Quality Often Gets Overlooked

Data quality is often treated as an afterthought in many organisations. It's not as exciting as building models or creating visualisations, but it's actually the foundation that makes those activities worthwhile.

Think about it: if you're building a house, you wouldn't skip checking the foundation just to get to the more interesting parts like designing the kitchen. Yet, in data science, we often jump straight to analysis without verifying our data's integrity.

This happens because:

  1. Quality issues aren't immediately visible (unlike a broken visualisation)
  2. The consequences of poor quality often appear downstream
  3. Data quality work can feel tedious compared to analysis

Imagine a scenario where you calculate the average home price in a neighbourhood as $300,000, but that figure is skewed by one $3,000,000 mansion incorrectly entered as $30,000,000. Without outlier detection, your estimate is simply wrong.

Or let's say you're analysing customer support tickets to improve response times. Your dataset contains several duplicate entries where the same customer issue was logged multiple times due to system glitches or repeated customer follow-ups. When you calculate metrics like "average time to resolution," these duplicates make it appear that you're handling more tickets than you actually are, and they skew your resolution time statistics.

The truth is, as a data scientist, you can spend up to 80% of your time cleaning data. Having a good quality scoring pipeline as part of your workflow can dramatically reduce this percentage by identifying issues early and systematically.

The Seven Critical Dimensions of Data Quality

When we talk about "data quality," we're actually referring to a collection of different characteristics that together determine whether data is fit for its intended purpose. Think of it like evaluating a car. You wouldn't just check one aspect like fuel efficiency, right? You'd also look at safety ratings, reliability, comfort, and performance.

In the same way, when we talk about data quality, it isn't a single attribute but a combination of distinct dimensions that must be measured separately. Each dimension reveals different types of potential issues that could affect your analysis or business operations.

For example, your customer dataset might be 100% complete (no missing values), but contain invalid email addresses. Or your sales data might be perfectly accurate but include duplicate transactions that inflate your revenue figures. By examining each dimension individually, you can pinpoint specific problems and address them appropriately.

And always remember this when you're working on data quality: quality problems translate into direct business impact. For example, poor completeness might mean missing out on potential customers. Invalid data could cause marketing emails to bounce. Inconsistent data might lead to shipping products to outdated addresses.

Let's explore the seven critical dimensions that form the foundation of our quality scoring pipeline. Understanding these will help you develop a comprehensive view of your data's health and identify specific areas for improvement.

1. Completeness

What it measures: Whether required data is present (not missing or null)

Real-world examples:

  • A healthcare dataset where 30% of patients are missing blood pressure readings, making it impossible to accurately assess cardiovascular risk
  • A marketing dataset where high-value customers have incomplete demographic profiles because they opted out of providing certain information
  • An IoT sensor stream with intermittent data gaps during peak usage hours, creating blind spots in monitoring
  • E-commerce product data where luxury items have missing specifications because they were imported from a different catalogue system

Why it matters: Missing data forces difficult trade-offs. Imputation might introduce bias, while dropping records reduces sample size and representativeness.

2. Validity

What it measures: Whether data follows the expected format and rules

Real-world examples:

  • A CRM system containing birthdates in the future ("2030-05-15") because users misunderstood the date format
  • Temperature readings showing values of "-999" where sensors malfunctioned, rather than actual measurements
  • Survey responses with satisfaction scores of "11" on a 1-10 scale due to data entry errors
  • Website analytics containing impossible session durations (e.g., 86400 seconds) due to tracking bugs

Why it matters: Invalid data can silently corrupt calculations and produce analysis results that appear plausible but are fundamentally flawed.
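
To make this concrete, here's a minimal sketch of what a validity check could look like. The column names, the sample survey data, and the rules below are purely illustrative assumptions; the idea is simply to score each column against a rule that returns True for a valid value:

import pandas as pd

def check_validity(df, rules):
    """Score each column against a rule that returns True for a valid value."""
    results = {}
    for column, rule in rules.items():
        if column not in df.columns:
            continue
        values = df[column].dropna()  # missing values belong to completeness, not validity
        if values.empty:
            results[column] = 100.0
            continue
        results[column] = values.apply(rule).mean() * 100
    return results

# Illustrative data: a satisfaction score of 11 and a birthdate in the future
survey = pd.DataFrame({
    "satisfaction": [7, 11, 3, 9],
    "birthdate": pd.to_datetime(["1990-05-01", "2030-05-15", "1985-11-23", "2001-02-14"]),
})
rules = {
    "satisfaction": lambda v: 1 <= v <= 10,
    "birthdate": lambda v: v <= pd.Timestamp.today(),
}
print(check_validity(survey, rules))  # {'satisfaction': 75.0, 'birthdate': 75.0}

In practice you would typically load these rules from a configuration file so that business users can adjust them without touching the pipeline code.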

3. Consistency

What it measures: Whether related data points agree with each other

Real-world examples:

  • A retail database showing order delivery dates that precede order placement dates for 5% of international orders
  • Customer profiles showing contradictory information across systems (married in the CRM, single in the marketing database)
  • Financial data where the sum of departmental budgets doesn't match the company-wide total due to currency conversion issues
  • Product dimensions listed in centimetres in one table but in inches in another without clear labelling

Why it matters: Inconsistencies create logical contradictions that undermine the credibility of your entire dataset and any analysis built upon it.
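
A consistency check can also be expressed as a set of rules, except that each rule looks at several fields of the same record at once. Here's a minimal sketch under that assumption; the orders table and the single rule are made up for illustration:

import pandas as pd

def check_consistency(df, rules):
    """Each rule is (name, row-wise function returning True when a record is consistent)."""
    results = {}
    for name, rule in rules:
        passed = df.apply(rule, axis=1)
        results[name] = passed.mean() * 100
    return results

# Illustrative orders where one delivery date precedes the order date
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-05", "2023-01-10", "2023-01-12"]),
    "delivery_date": pd.to_datetime(["2023-01-09", "2023-01-08", "2023-01-15"]),
})
rules = [
    ("delivery_not_before_order", lambda row: row["delivery_date"] >= row["order_date"]),
]
print(check_consistency(orders, rules))  # roughly 66.7% of orders pass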

4. Accuracy

What it measures: How well data values reflect real-world entities

Real-world examples:

  • GPS location data showing customers shopping in the middle of the ocean due to coordinate errors
  • A customer listed as 25 years old but with a 30-year credit history
  • Inventory system showing negative stock levels for popular items due to theft or scanning errors
  • Website traffic attributed to Internet Explorer on iPhone devices (an impossible browser/OS combination)

Why it matters: Inaccurate data leads to false insights and business decisions based on conditions that don't actually exist.
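
Accuracy is the hardest dimension to automate because it usually requires an external source of truth. One pragmatic approach, sketched below with a made-up list of official store codes, is to cross-check values against a reference dataset you trust:

import pandas as pd

def check_accuracy_against_reference(df, column, reference_values):
    """Return the percentage of non-null values found in a trusted reference set."""
    values = df[column].dropna()
    if values.empty:
        return 100.0
    return values.isin(reference_values).mean() * 100

# Illustrative sales records checked against an authoritative list of store codes
sales = pd.DataFrame({"store_code": ["LDN-01", "LDN-02", "LDN-99", "MCR-01"]})
official_store_codes = {"LDN-01", "LDN-02", "MCR-01", "MCR-02"}
print(check_accuracy_against_reference(sales, "store_code", official_store_codes))  # 75.0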

5. Timeliness

What it measures: Whether data is current enough for its intended purpose

Real-world examples:

  • A recommendation system using 6-month-old purchase history to suggest winter coats in summer
  • Risk models using pre-pandemic customer financial data to assess current creditworthiness
  • A fraud detection system using behavioural patterns from last year's transactions to flag suspicious activity
  • Supply chain optimisation using last month's supplier lead times during rapidly changing market conditions

Why it matters: Even perfectly accurate data can lead to wrong decisions if it doesn't reflect current conditions.
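
A timeliness check usually boils down to measuring how old each record is and comparing that against a freshness requirement. Here's a minimal sketch; the column name and the 90-day threshold are illustrative assumptions:

import pandas as pd

def check_timeliness(df, timestamp_column, max_age_days):
    """Return the percentage of records no older than max_age_days."""
    age = pd.Timestamp.now() - pd.to_datetime(df[timestamp_column])
    return (age <= pd.Timedelta(days=max_age_days)).mean() * 100

# Illustrative purchase history: anything older than 90 days counts as stale
purchases = pd.DataFrame({
    "last_purchase": [pd.Timestamp.now() - pd.Timedelta(days=d) for d in (10, 45, 200, 400)],
})
print(check_timeliness(purchases, "last_purchase", max_age_days=90))  # 50.0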

6. Uniqueness

What it measures: Absence of duplicates in data that should be unique

Real-world examples:

  • Customer records duplicated with slight variations ("John Smith" vs "John A Smith"), leading to redundant marketing efforts
  • Sales transactions recorded twice when syncing between online and in-store systems
  • The same user feedback appearing multiple times in the analysis because it was collected across different channels
  • Website visitors counted multiple times due to cookie resets or cross-device usage

Why it matters: Duplicates artificially inflate metrics and can lead to double-counting in financial or performance reporting.
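
For exact duplicates, pandas gives you most of what you need out of the box. The sketch below scores uniqueness on a set of key columns; the transactions table is an illustrative assumption, and near-duplicates such as "John Smith" vs "Jon Smith" would additionally need fuzzy matching, which is beyond this snippet:

import pandas as pd

def check_uniqueness(df, key_columns):
    """Return the percentage of rows that are unique on the given key columns."""
    if df.empty:
        return 100.0
    duplicate_count = df.duplicated(subset=key_columns, keep="first").sum()
    return ((len(df) - duplicate_count) / len(df)) * 100

# Illustrative transactions synced from two systems, with one order recorded twice
transactions = pd.DataFrame({
    "transaction_id": ["T001", "T002", "T002", "T003"],
    "amount": [19.99, 54.00, 54.00, 7.50],
})
print(check_uniqueness(transactions, key_columns=["transaction_id"]))  # 75.0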

7. Referential Integrity

What it measures: Whether relationships between data entities are maintained

Real-world examples:

  • Analytics dashboards showing "Unknown" as the top product category because product IDs in sales data don't match the current catalogue
  • Customer support tickets referencing order numbers that don't exist in the order management system
  • Employee performance reports that include results for departments that have been reorganised or no longer exist
  • Mobile app events referencing user IDs that aren't in the user database due to account deletion

Why it matters: Broken data relationships create "orphaned" records that can't be properly joined or analysed, leading to incomplete insights.
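
A referential integrity check verifies that every foreign key in a child table exists in its parent table. Here's a minimal sketch using two illustrative DataFrames standing in for a sales feed and a product catalogue:

import pandas as pd

def check_referential_integrity(child_df, child_key, parent_df, parent_key):
    """Return the % of child rows whose key exists in the parent table, plus the orphan keys."""
    valid_keys = set(parent_df[parent_key].dropna())
    matched = child_df[child_key].isin(valid_keys)
    orphan_keys = child_df.loc[~matched, child_key].tolist()
    score = matched.mean() * 100 if len(child_df) else 100.0
    return score, orphan_keys

# Illustrative sales feed referencing a product catalogue
catalogue = pd.DataFrame({"product_id": ["P1", "P2", "P3"]})
sales = pd.DataFrame({"product_id": ["P1", "P2", "P9", "P3"], "units_sold": [3, 1, 2, 5]})
score, orphans = check_referential_integrity(sales, "product_id", catalogue, "product_id")
print(f"{score:.1f}% of sales rows match the catalogue; orphaned keys: {orphans}")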

How to Implement a Completeness Check

Let's see how to implement a robust completeness check for our quality pipeline. You can build checks for the other dimensions we discussed above in a similar manner.

Before diving into building our data quality pipeline, let's make sure you have the right tools installed. These packages form the foundation of our quality assessment framework; once they're installed, your environment is ready for implementing the pipeline.

Step 1: Install the dependencies

 


# Install the required packages
pip install pandas numpy great_expectations pytest schedule pymongo sqlalchemy
# For visualisation capabilities
pip install matplotlib seaborn plotly
	 

 

First, we'll create a function to assess completeness:

Step 2: Creating a function that’ll check completeness

 

import pandas as pd
import numpy as np
import logging

def check_completeness(df, required_columns=None, row_threshold=None):
    """
    Check data completeness and calculate detailed metrics.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The dataset to check
    required_columns : list, optional
        Specific columns to check; if None, checks all columns
    row_threshold : float, optional
        If specified, flags rows with completeness below this threshold
        
    Returns:
    --------
    dict
        Detailed completeness metrics
    """
    # Set up logging
    logger = logging.getLogger(__name__)
    
    # Default to all columns if not specified
    if required_columns is None:
        required_columns = df.columns.tolist()
        
    # Basic statistics
    total_rows = len(df)
    total_cells = total_rows * len(required_columns)
    
    results = {
        "total_rows": total_rows,
        "total_columns_checked": len(required_columns),
        "column_details": {},
        "missing_patterns": {},
        "problem_rows": []
    }

 

In this section, we set up the function with imports and parameters. It initialises a logger to track issues and creates a results dictionary to store our findings. The function accepts three inputs: the DataFrame to check, which columns to verify, and an optional threshold for flagging problematic rows. The initialisation gives us a foundation for the analysis and creates a structured results object that makes it easy to understand and act on the findings. We're preparing a "report card" for our data's completeness.

Step 3: Checking for missing columns

 


    # Check which required columns are missing entirely
    missing_columns = [col for col in required_columns if col not in df.columns]
    for col in missing_columns:
        logger.warning(f"Required column '{col}' not found in dataset")
        results["column_details"][col] = {
            "present_in_dataset": False,
            "completeness_score": 0,
            "missing_count": total_rows
        }
    
    # Calculate completeness for existing columns
    existing_columns = [col for col in required_columns if col in df.columns]

	

 

In this section, we’re identifying any columns that are completely missing from the dataset (not just containing null values, but absent entirely). It logs a warning for each missing column and marks it with a completeness score of 0 in the results.

Missing columns represent a structural issue with your data. If your analysis requires a "customer_email" column but it's not in the dataset at all, you need to know this before proceeding. This check acts as an early warning system for schema mismatches or data extraction errors.

Step 4: Checking for column-level completeness

 


    # Column-level completeness
    for column in existing_columns:
        missing_count = df[column].isna().sum()
        completeness_score = ((total_rows - missing_count) / total_rows) * 100
        
        results["column_details"][column] = {
            "present_in_dataset": True,
            "completeness_score": completeness_score,
            "missing_count": missing_count,
            "missing_percentage": (missing_count / total_rows) * 100
        }
        
        # Detailed logging for columns with issues
        if missing_count > 0:
            logger.info(f"Column '{column}' missing {missing_count} values ({results['column_details'][column]['missing_percentage']:.2f}%)")

	

 

For each column that exists in the dataset, this code counts how many null values it contains using the isna() function. It then calculates a completeness score as a percentage (100% means no missing values) and logs information about columns with missing data.

Column-level completeness tells you which specific fields have data quality issues. For example, you might find that your "phone_number" column is only 60% complete while your "email" column is 95% complete. This helps you prioritise data collection or cleaning efforts and understand potential biases in your analysis.

Step 5: Calculating the overall score

 



    # Calculate overall completeness score
    present_columns = [col for col in required_columns if col in df.columns]
    if present_columns:
        column_scores = [results["column_details"][col]["completeness_score"] 
                        for col in present_columns]
        results["overall_column_score"] = sum(column_scores) / len(column_scores)
    else:
        results["overall_column_score"] = 0
	

 

In the above code snippet, we calculate the overall completeness score by averaging the individual completeness scores of all columns. If no columns are present, it returns a score of 0. The overall column score gives you a single metric to track data quality over time. Instead of monitoring dozens of individual columns, you can watch this summary statistic to detect general improvements or degradations in your data completeness.

Step 6: Identifying common missing value patterns

 


    # Identify common missing value patterns
    if existing_columns:
        # Get missing value indicators (True where value is missing)
        missing_indicators = df[existing_columns].isna()
        
        # Count rows with different numbers of missing values
        missing_counts = missing_indicators.sum(axis=1).value_counts().sort_index()
        
        for count, frequency in missing_counts.items():
            if count > 0:  # Only include patterns with missing values
                percentage = (frequency / total_rows) * 100
                results["missing_patterns"][int(count)] = {
                    "rows_affected": int(frequency),
                    "percentage": percentage
                }
                
                # Log significant patterns
                if percentage > 5:
                    logger.info(f"{frequency} rows ({percentage:.2f}%) are missing exactly {count} values")

	

 

In this snippet, we're looking for patterns in how values are missing across rows. It creates a matrix of missing value indicators, then counts how many rows are missing exactly 1, 2, 3, etc. values. It highlights significant patterns that affect more than 5% of your data.

Pattern identification is crucial for understanding the structure of your missing data. If you find that 20% of your rows are missing exactly 3 values, this suggests a systematic issue rather than random missing data. For example, you might discover that whenever a customer's address is missing, their phone number and email are usually missing too.
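
If you want to dig into these patterns further, one quick trick is to correlate the missing-value indicators of your columns: values close to 1 mean two fields tend to be missing in the same rows. The tiny contacts table below is just an illustration:

import pandas as pd

# Illustrative records where address and phone tend to go missing together
contacts = pd.DataFrame({
    "address": ["12 High St", None, "9 Elm Rd", None, "3 Park Ln", None],
    "phone":   ["555-0101",  None, "555-0102", None, "555-0103", None],
    "email":   ["a@x.com", "b@x.com", None, None, "c@x.com", None],
})

# Correlate the missing-value indicators: a value near 1 means two fields
# are usually missing in the same rows
print(contacts.isna().astype(int).corr())

This kind of co-missingness analysis isn't part of the check_completeness function itself, but it's a useful companion when you're investigating why data is missing.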

Step 7: Checking row-level completeness

 


    # Row-level completeness assessment
    if row_threshold is not None and existing_columns:
        row_completeness = (1 - missing_indicators.mean(axis=1)) * 100
        problem_rows = df.index[row_completeness < row_threshold].tolist()
        
        results["problem_rows"] = {
            "threshold": row_threshold,
            "count": len(problem_rows),
            "percentage": (len(problem_rows) / total_rows) * 100,
            "indices": problem_rows[:100]  # Limit to first 100 for practicality
        }
        
        if problem_rows:
            logger.warning(
                f"Found {len(problem_rows)} rows ({results['problem_rows']['percentage']:.2f}%) "
                f"with completeness below {row_threshold}%"
            )
	

 

If you specified a row_threshold parameter (e.g., 80%), this code identifies rows that have too many missing values. It calculates a completeness percentage for each row and flags those below the threshold as "problem rows." It returns the counts and indices of these problematic records.

Some records might be so incomplete that they should be excluded from analysis or flagged for review. For instance, a customer record with only an ID but missing all other information isn't useful for most analyses. This check helps you identify and potentially filter out these low-quality records.

Step 8: Checking overall dataset completeness

 


    # Calculate overall dataset completeness
    total_missing_cells = sum(results["column_details"][col]["missing_count"] 
                             for col in existing_columns)
    results["overall_dataset_completeness"] = ((total_cells - total_missing_cells - len(missing_columns) * total_rows) 
                                              / total_cells) * 100
    
    return results

	

 

This final section calculates an overall completeness score for the entire dataset by considering all cells (row × column combinations). It counts the total number of missing values across all columns and computes the percentage of cells that contain data.

The overall dataset completeness provides a single, comprehensive metric of your data's quality. This is particularly useful for tracking quality over time or comparing different datasets. A score of 99% might be acceptable for some applications, while critical systems might require 99.9% or higher.

Let's put it all together. Our completeness check gives us four complementary views:

  • Column-level analysis: Which fields have missing values?
  • Row-level analysis: Which records are problematic?
  • Pattern identification: Are there systematic issues with missing data?
  • Overall metrics: How complete is the dataset as a whole?

By breaking down completeness in this structured way, you can:

  • Prioritise which data quality issues to fix first
  • Understand the root causes of missing data
  • Track improvements over time
  • Set appropriate thresholds for data acceptance

This is just one dimension of data quality, but implementing a thorough check like this for each dimension gives you a comprehensive view of your data's health.

Step 9: Creating a dummy dataset

 


import pandas as pd
import numpy as np
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, 
                   format='%(asctime)s - %(levelname)s - %(message)s')

# Create sample customer data with various completeness issues
customer_data = {
    "customer_id": ["C001", "C002", "C003", "C004", "C005", None, "C007", "C008", "C009", "C010"],
    "name": ["John Smith", "Jane Doe", None, "Robert Johnson", "Sarah Williams", "Michael Brown", "Emily Davis", None, "David Miller", "Lisa Wilson"],
    "email": ["john@example.com", None, "bob@example.com", "rob@example.com", None, None, "emily@example.com", "james@example.com", None, "lisa@example.com"],
    "phone": ["555-1234", "555-5678", None, None, "555-9012", None, "555-3456", None, "555-7890", None],
    "address": ["123 Main St", "456 Oak Ave", None, None, "789 Pine Rd", None, None, "321 Maple Dr", None, "654 Cedar Ln"],
    "signup_date": ["2022-01-15", "2022-02-20", "2022-03-10", "2022-04-05", None, "2022-06-25", "2022-07-12", "2022-08-30", None, "2022-10-05"],
    "last_purchase": ["2023-01-10", "2022-12-15", None, "2023-02-01", "2022-11-20", None, "2023-03-05", None, None, "2023-01-25"]
}

# Convert to DataFrame
customer_df = pd.DataFrame(customer_data)

# Display the first few rows to see what we're working with
print("Sample Customer Dataset:")
print(customer_df.head())
		

 

Here we create a dummy customer dataset, deliberately seeded with missing values, so we can test our function.

Step 10: Execute

 


# Run the completeness check
completeness_results = check_completeness(
    customer_df,
    required_columns=["customer_id", "name", "email", "phone", "address", "signup_date", "last_purchase"],
    row_threshold=70  # Flag rows with less than 70% completeness
)

# Print summary results in a user-friendly format
print("\n==== COMPLETENESS CHECK RESULTS ====")
print(f"Overall Dataset Completeness: {completeness_results['overall_dataset_completeness']:.2f}%")
print(f"Average Column Completeness: {completeness_results['overall_column_score']:.2f}%")

print("\nColumn Completeness Breakdown:")
column_details = completeness_results["column_details"]
for column in sorted(column_details.keys(), 
                     key=lambda x: column_details[x].get("completeness_score", 0)):
    details = column_details[column]
    if details["present_in_dataset"]:
        print(f"  {column:12}: {details['completeness_score']:6.2f}% complete ({details['missing_count']} missing)")
    else:
        print(f"  {column:12}: MISSING COLUMN")

print("\nMissing Value Patterns:")
for missing_count, pattern in sorted(completeness_results["missing_patterns"].items()):
    print(f"  {pattern['rows_affected']} rows ({pattern['percentage']:.2f}%) missing exactly {missing_count} values")

if completeness_results.get("problem_rows", {}).get("count", 0) > 0:
    problem_count = completeness_results["problem_rows"]["count"]
    problem_pct = completeness_results["problem_rows"]["percentage"]
    problem_threshold = completeness_results["problem_rows"]["threshold"]
    print(f"\nFound {problem_count} problematic rows ({problem_pct:.2f}%) with less than {problem_threshold}% completeness")
    
    # Show the first few problematic rows
    problem_indices = completeness_results["problem_rows"]["indices"][:3]
    if problem_indices:
        print("\nExample problematic rows:")
        print(customer_df.loc[problem_indices])
	

 

When you run this script, it prints a formatted summary of the completeness results, which we'll analyse next.

Output and Analysis

Let's analyse what these completeness check results tell us:

  1. Overall Dataset Health:
    • The dataset is 67.14% complete overall
    • This indicates a substantial amount of missing data that needs attention
  2. Column-by-Column Analysis:
    • Phone numbers and addresses are the most problematic fields (50% missing each)
    • Customer IDs are fairly complete (90%)
    • The breakdown helps prioritise which fields to focus on first
  3. Missing Value Patterns:
    • 30% of rows are missing exactly one value
    • Another 20% of rows are missing exactly 2 values, and 20% are missing 4
    • There's one row missing 5 of its 7 values, which is concerning
  4. Problematic Rows:
    • 4 rows (40% of the dataset) fall below our 70% completeness threshold
    • The example output shows that customer C003, the record with no customer ID, and customer C008 all have significant completeness issues
    • These rows might need special handling or exclusion from certain analyses

Based on these findings, here are actions you might take:

  1. Immediate Data Quality Fixes:
    • Investigate why phone numbers and addresses are so often missing (50% of records each)
    • Check whether there's a data collection issue for emails and last purchase dates (only 60% complete)
  2. Review Process Improvements:
    • Consider making customer_id and name mandatory fields in your data collection process
    • Implement validation for email addresses at the point of entry
  3. Analysis Adjustments:
    • For analyses requiring complete contact information, filter out rows with missing phone/email
    • Create a "data quality tier" field to mark records as gold/silver/bronze based on completeness
  4. Monitoring Setup:
    • Create alerts if completeness drops below 70% for critical fields

Now that you've seen how to implement a robust completeness check, you can apply similar approaches to build checks for the other six dimensions we discussed. Each dimension will require its own specialised logic, but the overall pattern remains the same: assess the data, calculate scores, identify problematic areas, and report detailed findings.
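
Once you have a score per dimension, the last piece of the pipeline is folding them into a single quality score you can track and alert on. Here's a minimal sketch of one way to do that with a weighted average; the dimension names, weights and placeholder scores are illustrative assumptions:

def score_dataset(dimension_scores, weights=None):
    """
    Combine per-dimension scores (each on a 0-100 scale) into one weighted quality score.

    dimension_scores : dict like {"completeness": 67.1, "validity": 92.0}
    weights          : optional dict of relative importance; defaults to equal weighting
    """
    if weights is None:
        weights = {dimension: 1.0 for dimension in dimension_scores}
    total_weight = sum(weights.get(dimension, 0.0) for dimension in dimension_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * weights.get(dimension, 0.0)
                       for dimension, score in dimension_scores.items())
    return weighted_sum / total_weight

# Illustration: completeness roughly as computed above, placeholder scores elsewhere
scores = {"completeness": 67.1, "validity": 92.0, "uniqueness": 100.0}
weights = {"completeness": 2.0, "validity": 1.0, "uniqueness": 1.0}  # completeness matters most here
print(f"Overall quality score: {score_dataset(scores, weights):.1f}/100")

The weights are where your business context comes in: a marketing dataset might weight validity of email addresses heavily, while a financial dataset might prioritise uniqueness and consistency.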

Conclusion

As you've seen from our completeness check example, building a data quality pipeline doesn't have to be complicated. You can take a similar approach to check the other quality dimensions we discussed:

  • Validity: Are your values in the correct format and range?
  • Consistency: Do related fields make logical sense together?
  • Uniqueness: Are there duplicates where there shouldn't be?
  • And others that matter to your specific needs

You don't need to tackle all dimensions at once. Start with what impacts your work most directly, then expand your pipeline over time. Remember, it’s quite rare to have data that’s 100% perfect! Your goal is to understand where your data has issues so you can work around those limitations or fix the most important problems first.

 
