Handling Missing Data in Machine Learning: Patterns and Techniques
September 08, 2025

As every analyst knows, the quality of your data can make or break your analysis. A data quality scoring pipeline helps you evaluate completeness, validity, consistency, and other critical dimensions to understand your data’s health.

Let’s look deeper into one of the major aspects of data quality: data completeness. You've experienced this before. You're all set to run your carefully crafted model when you suddenly get an error because a column contains null values. Or perhaps you've been forced to postpone a critical analysis because you discovered gaps in a key variable before a presentation. Missing data creates those painful workflow interruptions, right?

Picture this scenario: A marketing team has collected customer purchase data for a campaign effectiveness study. They’re ready to calculate ROI by segment when they discover that 15% of records are missing purchase amounts and 8% lack campaign attribution codes. What do they do now? They can't just ignore a significant chunk of this data!

Or consider this healthcare example: when analysing patient recovery times, 20% of follow-up visit dates are missing. Without those dates, the entire analysis could be biased toward patients who completed all check-ups, potentially leading to an overestimate of treatment effectiveness.

In this guide, we’ll work on building a practical, automated missing value handler to tackle one of the most common data quality issues you'll face in your real-world projects.

Understanding Missing Data Patterns in Your Dataset

Before you start fixing missing values, you need to understand why they're missing in the first place. This step is a key part of data preprocessing in machine learning. The pattern of missingness dramatically affects how you should handle it. Let's explore the three fundamental patterns with clear examples that you'll likely encounter in your own work.

Missing Completely at Random (MCAR)

When your data is Missing Completely at Random, it means exactly what it sounds like: the missing values occur entirely by chance. There's no relationship between whether a value is missing and any other values in your dataset (either observed or missing).

Fig 1: Example of MCAR

In this MCAR example (See Fig 1), the missing values (marked as ???) occur completely randomly. If you analyse the patterns, you'd find no relationship between which values are missing and the patients' other characteristics. MCAR is the "best" kind of missing data you can hope for because it doesn't introduce bias. With MCAR, you can often use simpler imputation methods like mean or median substitution without skewing your analysis results too much.

Fig 2: Visualising MCAR

Fig 2 shows a grid of blue dots with missing values (hollow circles) scattered randomly throughout the entire grid. There's no discernible pattern to the missing values as they appear to be distributed evenly across all areas of the grid. This illustrates how in MCAR, missing values occur purely by chance with no relationship to other variables.

Simple imputation methods like mean, median, or mode often work well here. For example, you might replace missing blood pressure values with the average blood pressure from your dataset.

When dealing with MCAR data in your work, you can:

  • Use mean imputation for normally distributed numerical data (like height, weight)
  • Use median imputation for skewed numerical data (like income, cost)
  • Use mode imputation for categorical data (like product category, customer segment)
  • Consider simple interpolation for time series data
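
As a quick sanity check before assuming MCAR, you can compare missingness rates across groups of another observed variable. A minimal sketch, assuming a pandas DataFrame df with hypothetical blood_pressure and age_group columns:

# True where blood pressure is missing, False where present
missing_flag = df['blood_pressure'].isna()

# Roughly equal rates across groups are consistent with MCAR;
# large differences suggest the data may be MAR instead
print(missing_flag.groupby(df['age_group']).mean())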

Missing at Random (MAR)

Missing at Random is a bit misleadingly named. When you encounter MAR in your data, missing values aren't completely random. They're related to other observed variables in your dataset but not to the missing value itself. Think of MAR like a survey where younger people in your sample are less likely to answer income questions. The missingness depends on age (which you can observe), not on the actual income (which is missing).

Here's a MAR example in a sales dataset you might be analysing:

Fig 3: Example of MAR

In this hypothetical customer satisfaction survey (See Fig 3), male customers are less likely to answer the income question, so income data is missing primarily for men. The missingness isn't related to the actual income amounts (which you don't know) but to gender (which you do know).

MAR data can introduce bias if you don't account for the relationships between observed variables and the missingness pattern. If you simply use means across your entire dataset, you'll ignore the underlying structure and potentially skew your results.

Fig 4: Visualising MAR

Fig 4 shows a grid of green dots where missing values (hollow circles) gradually increase from left to right. The left side of the grid has very few missing values, while the right side has significantly more. This pattern illustrates how in MAR, missing values are related to other observed variables (represented by position on the x-axis) but not to the value itself. You’ll also see how the gradient pattern suggests a relationship between missingness and some other measured variable.

Methods that leverage relationships between variables work better here, like KNN imputation or regression-based approaches. In our example, you might use other customer attributes to predict the missing income values more accurately.

When handling MAR data in your projects:

  • Group your data based on the related variables (like separate imputations for men and women; see the sketch after this list)
  • Use KNN imputation to find similar records for filling missing values
  • Consider regression models that can predict the missing values based on other variables
  • Use multiple imputation methods that account for uncertainty
  • Maintain the relationship between the variables that influence the missingness
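
For the first point, a grouped imputation might look like the following sketch, assuming a pandas DataFrame df with gender and income columns as in the survey example:

# Impute income separately within each gender group, so the
# filled values respect the structure driving the missingness
df['income'] = df.groupby('gender')['income'].transform(
    lambda s: s.fillna(s.median())
)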

The central challenge with MAR is recognising which variables are related to the missingness pattern. Look for strong correlations between missing values and other variables in your dataset, which provide clues about the MAR structure.
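
One simple way to surface those clues is to correlate a missingness indicator with the other numeric variables; a minimal sketch, again assuming a pandas DataFrame df:

# 1 where income is missing, 0 where it is present
income_missing = df['income'].isna().astype(int)

# Strong correlations hint that missingness depends on those
# observed variables, which is the signature of MAR
print(df.select_dtypes('number').corrwith(income_missing))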

Missing Not at Random (MNAR)

Missing Not at Random is the most problematic pattern you'll encounter. Here, the missingness is directly related to the value that would have been provided. Think of MNAR like a survey where people with very high incomes in your sample simply refuse to report their income. The missingness directly depends on the value you're trying to collect.

Here's an MNAR example in a credit risk dataset you might be evaluating:

Fig 5: Example of MNAR

In this example (See Fig 5), credit scores are missing specifically for high-risk applicants with high debt levels in your dataset. The missingness is directly related to having a poor credit score (which you don't know).

MNAR creates serious challenges because the missing mechanism itself contains information. Standard imputation methods will likely introduce significant bias in your results.

Fig 6: Visualising MNAR

Fig 6 shows a grid of red dots where missing values (hollow circles) increase from top to bottom. The top portion of the grid has a few missing values, while the bottom portion has many more. This pattern illustrates how, in MNAR, the likelihood of a value being missing is directly related to what that value would have been. The concentration of missing values at the bottom suggests that higher values (if the y-axis represents value magnitude) are more likely to be missing.

This requires careful modelling and often external information; sometimes your best approach is to flag the missingness itself as a feature. In your credit risk example, the fact that a credit score is missing might be just as informative as the score itself.

For MNAR data in your analyses:

  • Create "missing value indicators" as new features in your dataset
  • Use domain knowledge to understand why values might be missing
  • Consider multiple imputation with sensitivity analysis
  • When possible, try to collect additional data to better understand the missingness mechanism
  • Be transparent about potential biases in your analysis results

MNAR is particularly challenging because the missing data mechanism itself contains information that you don't have. You should be especially cautious about conclusions drawn from datasets with MNAR patterns and consider how the missingness might impact your findings.
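
As a concrete starting point, here is a minimal sketch of the missing-value-indicator idea from the list above, assuming a pandas DataFrame df with a credit_score column as in the example:

# For MNAR data, the fact that a value is absent can be as
# informative as the value itself, so keep it as a feature
df['credit_score_missing'] = df['credit_score'].isna().astype(int)

# Fill the original column so models can use it; the indicator
# preserves the information that the value was missing
df['credit_score'] = df['credit_score'].fillna(df['credit_score'].median())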

Visualising Missing Data Patterns in Your Analysis

Before diving into the specifics of each visualisation, let's understand why these visual tools are so critical for your data analysis:

  1. Early Problem Detection: Visualisations reveal the extent and pattern of missing data before you waste time on analyses that might be compromised.
  2. Imputation Strategy Selection: Different missing data patterns (MCAR, MAR, MNAR) require different handling approaches, and visualisations help you identify which pattern you're dealing with.
  3. Bias Assessment: Visual patterns help you determine if missingness might introduce bias into your analysis, affecting the validity of your conclusions.

Bar Chart of Missing Values

Fig 7: Using Bar Charts as a visualisation technique

What it shows: The count and percentage of missing values for each variable in your dataset, sorted from highest to lowest missingness.

Why it's important: This visualisation serves as your first step in missing data analysis, helping you:

  • Quickly identify which variables have the most completeness issues
  • Prioritise which variables need attention or special handling
  • Decide if certain variables should be excluded entirely due to excessive missingness
  • Set expectations about potential information loss when removing incomplete records

In our example (See Fig 7), Purchase History shows the highest rate of missingness (21.5%), while Housing has minimal missing data (3%). Variables with very high missingness rates might need to be excluded or require more sophisticated imputation techniques.
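
A chart like this takes only a few lines to produce; a sketch, assuming a pandas DataFrame df and matplotlib:

import matplotlib.pyplot as plt

# Percentage of missing values per column, highest first
missing_pct = df.isna().mean().sort_values(ascending=False) * 100

missing_pct.plot(kind='bar', ylabel='% missing', title='Missing values by variable')
plt.tight_layout()
plt.show()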

Missing Value Patterns (Heatmap)

Fig 8: Using Heatmaps as a visualisation technique

What it shows: A detailed record-by-record view of which values are present (green) versus missing (red) across your variables.

Why it's important: This visualisation reveals:

  • Whether missingness occurs in isolation or in groups of variables
  • If certain records have systematic patterns of missingness
  • Potential relationships between variables that tend to be missing together
  • Records that might need to be excluded due to excessive missing values

Look for horizontal patterns (records with many missing values) and vertical patterns (variables frequently missing together). In our example (See Fig 8), we can see that Purchase History and Income often have missing values in the same records (Records 7 and 8), suggesting a potential relationship in how these data are collected or reported.
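
You can build a comparable view directly from the boolean missingness mask; a sketch using seaborn, assuming a pandas DataFrame df:

import matplotlib.pyplot as plt
import seaborn as sns

# Each row is a record, each column a variable; missing cells (True)
# render in red, present cells (False) in green
sns.heatmap(df.isna(), cbar=False, cmap='RdYlGn_r')
plt.title('Missing value patterns by record')
plt.show()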

Missing Value Matrix

Fig 9: Using Matrix as a visualisation technique

What it shows: A broader view of missingness across a larger sample of records, with white spaces indicating missing values and a sparkline showing completeness by row.

Why it's important: This visualisation helps you:

  • Identify systematic patterns across a larger sample
  • Spot clusters or streaks of missing values that might indicate data collection issues
  • Assess overall data completeness across your dataset
  • Identify problematic records with many missing values

In our example (See Fig 9), you can see distinct patterns where certain variables frequently have missing values together. The sparkline on the right shows which records have more complete data (fewer missing values) to help you identify potentially problematic cases.
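
The third-party missingno library (pip install missingno) produces exactly this kind of matrix, sparkline included, in one call:

import matplotlib.pyplot as plt
import missingno as msno

# White gaps mark missing values; the sparkline on the right
# summarises completeness per row
msno.matrix(df)
plt.show()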

Correlation of Missing Values

Fig 10: Using Correlation as a visualisation technique

What it shows: How missingness in one variable relates to missingness in others, with colour intensity indicating correlation strength.

Why it's important: This visualisation is crucial for:

  • Determining if your data follows MAR patterns (missing at random but correlated with other variables)
  • Identifying which variables could potentially be used to predict missing values
  • Planning imputation strategies that leverage relationships between variables
  • Assessing potential bias in your dataset

Strong positive correlations (shown in red; see Fig 10) indicate variables that tend to be missing together. In our example, the 0.75 correlation between Income and Purchase History suggests that when Income is missing, Purchase History is likely to be missing as well. The 0.64 correlation between Marital Status and Children likewise reveals a strong relationship in their missingness patterns.
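
Because isna() yields a boolean mask, the missingness correlation matrix is a one-liner; a sketch with seaborn, assuming a pandas DataFrame df:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation between the missingness indicators of each pair of columns
missing_corr = df.isna().corr()

sns.heatmap(missing_corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation of missingness between variables')
plt.show()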

Missing Data by Variable Groups

Fig 11: Using Grouping as a visualisation technique

Essential Techniques for Your Missing Data Toolkit

After understanding and visualising your missing data patterns, here are the key techniques you should have in your missing value handler pipeline:

For Numeric Data

1. Mean Substitution: Replace missing values with the average of available values.

# Replace missing incomes with the column mean
df['income'] = df['income'].fillna(df['income'].mean())

2. Median Substitution: Better for skewed distributions.

# Replace missing incomes with the column median (robust to outliers)
df['income'] = df['income'].fillna(df['income'].median())

3. K-Nearest Neighbours (KNN) Imputation: Impute based on similar records.

from sklearn.impute import KNNImputer

# Impute age and income together so similar records inform the estimates
imputer = KNNImputer(n_neighbors=5)
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])

KNN imputation works like asking the most similar people for help when you're missing information. Here's how it works in everyday terms:

Imagine you're missing someone's income in your dataset, but you know their age, education level, and job title. Instead of just using the average income for everyone, KNN finds the most similar people based on the information you do have, and uses their incomes to make an educated guess.

For example, if you're missing the income for a 35-year-old software engineer with a master's degree, KNN would find other 35-year-old software engineers with master's degrees in your dataset and use their average income as your estimate.

When implementing KNN imputation, you'll want to:

  1. Include all relevant variables that might relate to the missing values
  2. Choose an appropriate number of neighbours (K) - typically 5 to 10 works well
  3. Apply the imputation to groups of related variables together to maintain their relationships

This approach creates customised estimates for each person in your dataset rather than using one-size-fits-all values.

For Categorical Data

Mode Imputation: Use the most common category

# Replace with the most frequent value
df['product_category'] = df['product_category'].fillna(
    df['product_category'].mode()[0]
)

For Time Series Data

Forward/Backward Fill: Copy the last/next known value

# Useful for stock holdings, account balances, etc.
df['daily_balance'] = df['daily_balance'].ffill()
# Or fill backwards from the next known value instead:
# df['daily_balance'] = df['daily_balance'].bfill()

Forward fill replaces missing values with the last known value that came before them. It's based on the assumption that values persist until they change.

For example, if you're tracking inventory levels:

  • Monday: 500 units
  • Tuesday: ??? (missing)
  • Wednesday: ??? (missing)
  • Thursday: 450 units

Forward fill would assume the inventory stayed at 500 until Thursday:

  • Monday: 500 units
  • Tuesday: 500 units (filled forward)
  • Wednesday: 500 units (filled forward)
  • Thursday: 450 units

You should use forward/backward fill when:

  • The most recent known value is the best predictor of the current state
  • Changes occur as discrete events rather than continuous processes
  • You need to preserve the exact reported values without interpolation
  • Missing values occur in short streaks between known values

Interpolation: Estimate values based on surrounding points

# Linear interpolation for time series
df['temperature'] = df['temperature'].interpolate(method='linear')

Interpolation estimates missing values by drawing a line (or curve) between known values on either side of the gap.

For example, if you're tracking temperature readings:

  • 9:00 AM: 72°F
  • 10:00 AM: ??? (missing)
  • 11:00 AM: 78°F

Linear interpolation would estimate the 10:00 AM value as 75°F, i.e., exactly halfway between the known points.

Again, depending on your problem, you can choose between different kinds of interpolation (a short sketch follows this list):

  • Linear interpolation: Draws straight lines between points. Best for short gaps and relatively steady changes.
  • Polynomial interpolation: Fits curves that can capture more complex patterns. Useful for smoother processes.
  • Spline interpolation: Creates smooth curves with continuous derivatives. Excellent for natural physical processes.
  • Time-weighted interpolation: Accounts for irregular time intervals between observations.
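
Each of these maps to a method argument in pandas. A sketch of the alternatives (polynomial and spline fits require SciPy, and 'time' requires a DatetimeIndex):

# These are alternatives, not steps to run in sequence:
df['temperature'] = df['temperature'].interpolate(method='linear')

# Curved fits that can capture more complex patterns (require SciPy)
# df['temperature'] = df['temperature'].interpolate(method='polynomial', order=2)
# df['temperature'] = df['temperature'].interpolate(method='spline', order=3)

# Weighs estimates by the actual gaps between timestamps
# (requires a DatetimeIndex)
# df['temperature'] = df['temperature'].interpolate(method='time')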

You can use interpolation when:

  • Values change gradually over time (temperatures, prices, rates)
  • You have known values on both sides of the gap
  • The underlying process is continuous rather than discrete
  • The gap isn't too large relative to the rate of change

Conclusion and Next Steps

Now that you've explored these visualisation techniques, you can see how each pattern revealed by these charts points toward specific handling strategies that strengthen your analysis.

The next steps would be to take what you've learned about missing data patterns and apply it to your current project. Begin with the bar chart to understand the scope, explore relationships using the correlation heatmap, and examine record-level patterns with the detailed visualisations.

That said, the real value of these visualisations goes well beyond understanding what's missing. They shape how you approach cleaning, imputation, and transformation, and ultimately the conclusions you can draw from your data to add real business value.

 
