How To Use Data Visualization to Validate Imputation Tasks
TL;DR
Imputation is a useful tool for machine learning, but validating results can be difficult. We can improve imputation tuning by applying more advanced data visualization techniques as shown in this article.
Imputation in Data Science
Data imputation is a common practice in machine learning. At a basic level, imputation is the practice of replacing a missing value with an estimated value, usually through mathematical inference. Within machine learning, there are many useful applications for imputation, including:
- Validating an existing model
- Filling in missing values in raw data (data cleaning)
- Generating arbitrary amounts of synthetic data that resembles a small original dataset
- Smoothing noisy features in a dataset
For more details on how to apply imputation, check out this post. In this article we’ll be looking at different methods of visualizing imputation in practice. Data visualization can provide a deeper understanding of how accurately your imputation results mirror raw data features.
Visualizing missing values
While many options exist for visualizing data in Python, we like to use Altair for data exploration. We use Altair for a number of reasons: it relies on the simplicity of the Vega-Lite visualization grammar, has built-in interactivity, can be shared as HTML files, and uses a modular approach to creating subplots and dashboards.
The following examples will walk through a few methods of visualizing imputation using Altair plots. Let’s take a look at a basic example first: say you have a set of raw data features that you want to use to train a classification model. We’ll use weather data for simplicity. Most features have consistent data, but a few have missing or messy values. We can use imputation to fill these in and increase the accuracy of the model.
But before we can create brand new values, we want to make sure that our imputation can consistently predict values based on input data. To do this, we can redact rows within the dataset and then “fill them in” with imputation. We will compute these values using an HMM (for more on HMM imputation, see Imputation and its Applications).
After imputation, we can use a standard scatter plot to compare the new imputed values against the “true” values that were redacted. Here’s what we get when we use Matplotlib to plot imputation results for a set of weather features: temperature, cloud cover, and energy produced.
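The original figure isn’t reproduced here, but a minimal sketch of that kind of comparison plot looks like this; the data below is a synthetic stand-in for a single feature (temperature), since the real dataset isn’t shown:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: in practice, `actual` holds the redacted true
# values and `imputed` holds the model's estimates for those timestamps.
rng = np.random.default_rng(0)
times = np.arange(48)                    # 48 hourly timestamps
actual = 20 + 5 * np.sin(times / 8)      # "true" temperature values
imputed = actual + rng.normal(0, 1, 48)  # imputed estimates with some error

fig, ax = plt.subplots()
ax.scatter(times, actual, label="actual", alpha=0.6)
ax.scatter(times, imputed, label="imputed", alpha=0.6)
ax.set_xlabel("time (hours)")
ax.set_ylabel("temperature")
ax.legend()
plt.show()
```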
This provides a general idea of how your imputed values compare to reality, but it’s difficult to identify any real pattern in the data. Moreover, the results get more difficult to interpret when we apply them to non-quantitative features such as weather summaries like “rainy” or “clear”.
As we can see above, the plot starts to get even more cluttered. The overlapping of “actual” and “imputed” values makes it difficult to tell how well the imputation performed.
Using Altair to visualize imputation
Now let’s try updating our Matplotlib figures with Altair instead. We know that these features are all indexed by time. When visualizing timeseries data, it can help to maintain the continuous nature of the data by using lines instead of points.
That being said, if we were to connect every point exactly with a line, we would likely generate a lot of visual noise. Instead, we can rely on Altair’s interpolation feature to add a line to the plot that focuses more on the trend of the data, and less on the exact points.
As we can see, our new version gives us a few advantages: the timeseries nature of the data is now apparent, and we can focus on the overall “signal” in our data rather than focusing too much on outliers. At the same time, retaining the dots at reduced opacity keeps the exact data points visible while drawing the viewer’s eye to the line.
To construct this plot, we rely on the layering features of the Altair library. Our scatter plot and line plot are effectively two separate charts overlaid onto one another. Here’s the feature dataset:
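The original table isn’t shown here, so the frames below are small hypothetical stand-ins: wide-format data with one row per timestamp and one column per feature, where `actual_df` holds the redacted true values and `imputed_df` the HMM-imputed estimates.

```python
import pandas as pd

# Hypothetical wide-format frames (all values are illustrative).
times = pd.date_range("2021-06-01", periods=4, freq="h")
actual_df = pd.DataFrame({
    "time": times,
    "temperature": [18.2, 18.9, 20.1, 21.4],
    "cloud_cover": [0.32, 0.28, 0.25, 0.30],
    "energy_produced": [1.1, 1.4, 1.9, 2.2],
})
imputed_df = pd.DataFrame({
    "time": times,
    "temperature": [18.0, 19.2, 19.8, 21.0],
    "cloud_cover": [0.35, 0.30, 0.27, 0.28],
    "energy_produced": [1.0, 1.5, 1.8, 2.3],
})
```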
Our first step is to format the data in a way that Altair can read. Altair is designed to receive data in long-form format where each row represents a single observation. We will have to reshape our dataframes accordingly, since most machine learning tasks use data in the above wide-form format where each row contains measurements of multiple independent variables (for more on the difference between long-format and wide-format data, see here).
Using our imputed and redacted datasets, we can use the Pandas function pd.melt() to reshape a wide-format dataset into long format:
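A sketch of that reshaping, using the hypothetical `actual_df` and `imputed_df` from above:

```python
# Melt each wide-format frame so every row is a single observation,
# then tag each row's origin and combine into one long-format frame.
actual_long = pd.melt(actual_df, id_vars=["time"], var_name="var", value_name="value")
imputed_long = pd.melt(imputed_df, id_vars=["time"], var_name="var", value_name="value")

actual_long["source"] = "actual"
imputed_long["source"] = "imputed"

long_df = pd.concat([actual_long, imputed_long], ignore_index=True)
```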
Each row of `long_df` now holds a single observation: a timestamp, the feature name (`var`), its value, and a `source` label marking it as actual or imputed.
We’re ready to start plotting! To create our scatter plot, we start with a simple Altair object using `mark_circle()`. Crucially, we only want to look at one feature in this plot, so we can use the built-in `transform_filter()` in Altair to grab a single variable like so:
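A minimal version of that chart, assuming the `long_df` built above (the column names are ours, not an Altair convention):

```python
import altair as alt

# Scatter plot for a single feature: keep only the temperature rows
# and color each point by whether it is actual or imputed.
circles = (
    alt.Chart(long_df)
    .mark_circle(opacity=0.4)
    .transform_filter(alt.datum.var == "temperature")
    .encode(x="time:T", y="value:Q", color="source:N")
)
```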
A scatter plot created with Altair.
Now we need to add the interpolation line in order to better highlight the signal in these patterns.
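One way to build that layer; it shares the same data and encodings as the scatter plot, and `"monotone"` is just one of Altair’s interpolation options:

```python
# A smoothed line over the same filtered data emphasizes the trend
# rather than the exact points.
lines = (
    alt.Chart(long_df)
    .mark_line(interpolate="monotone", size=2)
    .transform_filter(alt.datum.var == "temperature")
    .encode(x="time:T", y="value:Q", color="source:N")
)
```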
We now have two separate Altair objects stored in `circles` and `lines`. Because both charts use the same dataset, we can use Altair’s layering feature to simply combine the plots into a new variable by stacking them together.
```python
chart = circles + lines
```
By calling your new chart, you should see the layered result. We can also append a title to our chart object with `properties(title="My Title")`:
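For example (the title text here is ours):

```python
chart = (circles + lines).properties(title="Temperature: actual vs. imputed")
chart  # renders in a notebook; chart.save("chart.html") exports a shareable file
```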
Visualizing categorical imputation
So what happens when a feature contains categorical information instead of quantitative values? Our original weather dataset contains a column titled “summary” with strings such as “rain”, “clear”, and “cloudy” to describe the weather at each timestamp.
Instead of treating this data the same way, we can measure our overall accuracy by aggregating across the time range of the dataset. Again, we care more here about our overall imputation performance, and less about the difference at each timestamp between “actual” and “imputed”.
First we need to reshape our categorical data. We can do this in three steps (sketched after the list):
- isolating our DataFrame to only rows with `var == 'summary'`
- using a Pandas pivot table to count instances of each weather summary for “actual” and “imputed” respectively
- melting this pivot table into long format for Altair
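A sketch of those three steps; the `summary_df` below is a stand-in for `long_df[long_df['var'] == 'summary']`, with made-up labels:

```python
import pandas as pd

# Stand-in for the filtered categorical rows of long_df.
summary_df = pd.DataFrame({
    "time": list(range(6)) * 2,
    "source": ["actual"] * 6 + ["imputed"] * 6,
    "value": ["clear", "clear", "rain", "cloudy", "clear", "rain",
              "clear", "clear", "clear", "cloudy", "clear", "rain"],
})

# Pivot to count each summary label per source...
counts = (
    pd.pivot_table(summary_df, index="value", columns="source",
                   values="time", aggfunc="count", fill_value=0)
    .reset_index()
    .rename(columns={"value": "summary"})
)

# ...then melt back into long format for Altair.
bar_df = counts.melt(id_vars="summary", var_name="source", value_name="count")
```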
And now we construct the bar chart. This will require using Altair’s row feature to effectively create “mini bar charts”, one for each category, and then stack them on top of each other.
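A minimal version of that faceted bar chart, using `bar_df` from above:

```python
import altair as alt

# One pair of horizontal bars per summary label, stacked via `row`.
bars = (
    alt.Chart(bar_df)
    .mark_bar()
    .encode(
        x="count:Q",
        y="source:N",
        color="source:N",
        row="summary:N",  # creates a mini bar chart for each label
    )
)
```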
Calling `bars` will now give you this:
As we can see, a clear comparison emerges between our “actual” and “imputed” counts. Unlike the first type of plot, this method lets us see, for example, how our imputation might be favoring the “Clear” label more than others.
When validating imputation results, it’s useful to generate metrics that measure success. Above, we validate based on raw counts, but we can also score our imputation with calculated metrics. Our bar plots show how many times we imputed the correct summary label, but not necessarily how accurately (or, in this case, at which timestamps) we labeled the data.
At a basic level, we want to ask the question: how well did I impute compared to if I had just guessed? We can then compute a ratio of raw accuracy to expected accuracy, which measures how well the imputation performed relative to simply filling the most common value into each empty spot.
When combined with our bar plot, this new metric can give us the context we need to better validate our imputation results. We can also normalize our metric from a score of 0 to 1 for simplicity.
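One way to compute such a score; this helper is our own sketch, treating most-common-label accuracy as the expected baseline and rescaling so that baseline maps to 0 and a perfect match maps to 1:

```python
import pandas as pd

def normalized_accuracy(actual_labels, imputed_labels):
    """Score categorical imputation against a most-common-value baseline.

    Returns 0 at baseline-level accuracy and 1 for a perfect match
    (clipped at 0 if the imputation does worse than the baseline).
    """
    actual = pd.Series(list(actual_labels))
    imputed = pd.Series(list(imputed_labels))
    raw_accuracy = (actual == imputed).mean()
    expected_accuracy = actual.value_counts(normalize=True).max()
    if expected_accuracy == 1.0:  # degenerate case: only one label exists
        return float(raw_accuracy)
    return max(0.0, (raw_accuracy - expected_accuracy) / (1 - expected_accuracy))
```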
As we can see, the subplot at the bottom now reveals more information. Our normalized score measures against “random guessing” as a worst-case baseline, so we put this at the zero mark. And since these metrics are all relative, we remove the number labels at the ticks for simplicity. Adding labels for a “minimum acceptable” and “best possible” score also provides helpful context when sharing this plot with team members unfamiliar with the data. Our model performed considerably better than filling in these summary labels at random.
We can apply this same validation plot technique to our numeric variables too. For these features, we can measure success with a metric based on the average z-score and another based on the average log-likelihood.
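The exact formulas aren’t spelled out above, so the helpers below are one plausible interpretation: the average imputation error in units of the true values’ spread, and the average log-likelihood of each imputed value under a normal fit to the true values.

```python
import numpy as np
from scipy import stats

def average_z_score(actual, imputed):
    """Mean |error| measured in standard deviations of the actual values."""
    actual, imputed = np.asarray(actual), np.asarray(imputed)
    return np.mean(np.abs(imputed - actual) / np.std(actual))

def average_log_likelihood(actual, imputed):
    """Mean log-likelihood of imputed values under a normal fit to actuals."""
    actual, imputed = np.asarray(actual), np.asarray(imputed)
    return np.mean(stats.norm.logpdf(imputed, loc=np.mean(actual), scale=np.std(actual)))
```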
Again, we see that our model performed considerably better than random in both metrics.
The dot range plot displaying our metrics is a useful tool for outputting validations in a more visual format. Adding the labels requires a bit more lifting in Altair, but can be done by layering each element of the plot (just as we layered the charts earlier) using Altair’s `mark_text` method. Here’s how to create the basic dot range plot using Altair:
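A minimal sketch of such a plot; the metric names and scores below are placeholders, and the text layer shows how `mark_text` slots into the same layering pattern:

```python
import altair as alt
import pandas as pd

# Placeholder scores: 0 = worst-case baseline, 1 = best possible.
metrics_df = pd.DataFrame({
    "metric": ["summary accuracy", "avg z-score", "avg log-likelihood"],
    "worst": [0.0, 0.0, 0.0],
    "best": [1.0, 1.0, 1.0],
    "score": [0.78, 0.65, 0.71],
})

# Gray rule spanning the worst-to-best range for each metric;
# tick labels are hidden since the scores are all relative.
rules = (
    alt.Chart(metrics_df)
    .mark_rule(color="lightgray", size=3)
    .encode(
        x=alt.X("worst:Q", axis=alt.Axis(labels=False, title=None)),
        x2="best:Q",
        y="metric:N",
    )
)

# A dot marking each normalized score along its range.
dots = alt.Chart(metrics_df).mark_circle(size=150).encode(x="score:Q", y="metric:N")

# Score labels layered on with mark_text.
labels = (
    alt.Chart(metrics_df)
    .mark_text(align="left", dx=8)
    .encode(x="score:Q", y="metric:N", text="score:Q")
)

dot_range = (rules + dots + labels).properties(title="Imputation validation scores")
```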
Conclusion
Imputation is a valuable technique that can be applied across a wide variety of tasks. Smart visualization of these results can help you better understand and improve your model results. Moreover, when you design plots for others as well as yourself, you can increase collaboration across the team and reinforce confidence in your model among stakeholders.
For more articles on using imputation, check out our post Imputation and its Applications.