Understanding Data Drift
Let's begin with an example to understand data drift. Imagine a bookstore that uses machine learning to optimise its book inventory, where the key variable is the age distribution of its customers.
They trained their model on sales and customer data where the predominant age group of customers was between 50-70 years, as shown in the leftmost plot. This influenced the selection of books, focusing more on genres and topics that appealed to this demographic.
However, post-deployment of the model in a live setting, a notable shift in the customer age profile was observed. The middle plot reveals that the primary age group of bookstore visitors has shifted to between 20-50 years. This change leads to the model underpredicting the book preferences and quantities for this younger demographic, thereby indicating a decline in the model's accuracy and effectiveness.
The rightmost plot highlights a notable difference in the customer age distribution. The mean age in the original dataset (blue plot) is around 58, whereas in the new data (orange plot), it is closer to 32. This change represents data drift.
More formally, data drift occurs when the statistical properties of the data used to train a machine-learning model change over time. This means that the data the model is currently processing differs from the data it was trained on, potentially impacting its accuracy.
Why is Data Drift Important?
Data drift can significantly impact a model's performance, similar to a professional football player suddenly being asked to play cricket without any additional training. The model, like the athlete, struggles when the type of data it 'understands' changes, leading to decreased accuracy and effectiveness.
Understanding and monitoring this drift is crucial for maintaining the efficacy of machine-learning models in dynamic environments. Giskard offers tools to identify these changes, ensuring that models remain reliable and accurate over time.
Before diving deeper, let’s preview what this tutorial will cover:
- Types of Data Drift – Understanding the different types of data drift.
- Drift Tests available in Giskard – Exploring the drift tests available in Giskard.
- Building Test Suites for Various Use Cases – Creating test suites to monitor data drift in different scenarios.
- Input Feature vs Target Feature Drift Detection – Understanding the relationship and differences between input feature drift and prediction drift.
Types of Data Drift
There are two primary types of drifts in data:
1. Concept Drift
Concept drift occurs when the relationship between the input data and the output changes over time. In other words, for a given input X, is the output Y still the same as it was before, or has it changed?
This phenomenon can be categorised into four types, as outlined in "Learning under Concept Drift: A Review" by Jie Lu et al.:
- Sudden Drift: This is characterised by an abrupt change in data patterns, where the new data distribution differs significantly from the old one.
- Gradual Drift: Here, the old and new data distributions coexist for a transition period, with instances from the new distribution appearing more and more frequently until it replaces the old one.
- Incremental Drift: This involves a progressive shift in which the distribution itself moves from the old state to the new one in many small, intermediate steps rather than switching abruptly.
- Recurring Concepts: In this case, previously seen data patterns re-emerge over time, requiring the model to re-adapt to known conditions.
2. Covariate Drift
Covariate drift refers to changes in the distribution of the input variables (covariates) of the model, without any change in the relationship between input and output.
Conceptual Difference
- Change in Relationship vs. Change in Data: The key difference lies in what changes. In concept drift, the fundamental relationship between input and output changes. In covariate drift, the relationship stays the same, but the type or range of input data the model sees has changed.
- Adaptation Strategies: Addressing concept drift often requires retraining the model with new data that reflects the changed relationships. For covariate drift, ensuring that the model is exposed to and trained on a representative range of input data is crucial.
In this tutorial, we'll be focusing on exploring various tests used for detecting both covariate and concept drift.
Different Data Drift Tests in Giskard
There are four types of tests available in Giskard for both input features and target variables:
1. Kolmogorov-Smirnov Test (KS test)
The KS Test is a non-parametric method used to compare two sample distributions to assess if they originate from the same underlying population distribution. Its non-parametric nature makes it suitable for all distribution types, without needing the sample data to adhere to, for instance, a normal distribution.
This test is only applicable to numeric data types, such as float and int.
A key outcome of the KS test is the p-value. A p-value below 0.05 usually indicates strong evidence against the null hypothesis, suggesting that the samples are drawn from different distributions. However, it's important to remember that the p-value is an indicator of the test's statistical significance and not a direct measure of drift magnitude.
In our experiments, we use the default threshold of 0.05 on the p-value (equivalent to a 95% confidence level). Therefore, a p-value less than 0.05 is treated as an indication of data drift. You can change this threshold based on your use case.
2. Chi-Square Test
The Chi-Square Test checks whether the actual data (observed frequencies) matches what we would typically expect (expected frequencies). It helps us determine whether any differences are just due to chance or are statistically significant.
The Chi-Square Test can be particularly useful for evaluating if changes in the distributions of categorical data are significant. It compares the new data with what was expected from the old data.
3. Population Stability Index (PSI)
The Population Stability Index (PSI) is a widely used tool for detecting data drift in numerical and categorical data. Unlike tests that yield a p-value, PSI generates a number that begins at 0 and can increase indefinitely. This number is useful not just for identifying the presence of drift, but also for gauging its magnitude - the greater the PSI value, the more pronounced the difference in distributions.
For drift detection, we identify a notable drift in features when the PSI exceeds 0.2.
4. Earth-Mover Distance (Wasserstein distance)
The Earth Mover's Distance (EMD) is used to measure the difference between two distributions, or how one distribution differs from another. This test applies to numeric data types. To understand it in simple terms, let's use an analogy:
Imagine you have two piles of dirt representing two different distributions. Each pile has various amounts of dirt at different locations. Now, your task is to reshape the first pile to make it look exactly like the second pile.
The Earth Mover's Distance is like measuring the least amount of work you need to do to accomplish this. In this context, "work" refers to how much dirt you move and how far you move it. If the piles are very similar to begin with, you won't have to move much dirt very far, so the EMD will be small. But if they're very different, you'll have to do a lot more work, moving lots of dirt over greater distances, resulting in a larger EMD.
For drift detection, a threshold of 0.1 is set, meaning that a shift of 0.1 standard deviations or more between the two distributions is significant enough to be noted.
Implementing Data Drift Monitoring
Let's begin by exploring the dataset we're going to use in this tutorial. First, we'll learn about reference and current datasets. After that, we'll dive into implementing individual tests and then put together test suites, which are collections of these tests.
About the Dataset
We'll be using the Bike Sharing Dataset from the UCI Machine Learning Repository. It contains the hourly and daily counts of rental bikes between 2011 and 2012 in the Capital Bikeshare system, along with the corresponding weather and seasonal information. Capital Bikeshare is a bicycle-sharing system that offers short-term bike rentals to individuals in the Washington, D.C. metropolitan area. In this tutorial, we'll be using the daily dataset.
Understanding Reference and Current Datasets
Understanding data drift involves observing how data changes over time. To make this comparison effective, we essentially need two key elements: an old dataset, which serves as a historical baseline, and a new dataset, reflecting the most recent data.
The old dataset, or the Reference Dataset, is essentially the training dataset used to train the model. Why focus on this dataset? It's simple: this is the dataset from which the model learned its initial patterns and behaviours.
The new dataset is commonly known as the Current Dataset. It is the new data that our model sees after we start using it in the real world. It shows us what's happening now, and it might be different from the data the model was trained on.
The question arises: how does this new dataset compare to our original training dataset? This comparison helps us identify and understand the shifts and trends in the data over time.
For simplicity, since we don't have a system in place to get production data or new data points, we'll split our bike-sharing dataset into two to mimic these two datasets.
However, in the real world, we would use the training dataset as the Reference Dataset and extract the dataset from production for the Current Dataset.
Setup
If you haven't installed giskard already, you can do so by running the following command: `pip install giskard -U`
Importing Libraries
We'll be using the following libraries for this tutorial:
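The exact import list isn't shown here; a plausible set, given what the rest of the tutorial uses, is:

```python
import numpy as np
import pandas as pd

import giskard
from giskard import Dataset, Model, Suite, testing

from sklearn.linear_model import LinearRegression
```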
Downloading the Dataset
We'll be downloading the dataset from the UCI Machine Learning Repository using the following code:
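One way to do this is sketched below; the archive URL and the `day.csv` file name inside the zip are assumptions based on the UCI repository layout, so adjust them if the dataset has moved.

```python
import io
import urllib.request
import zipfile

import pandas as pd

# Assumed location of the Bike Sharing Dataset archive on the UCI repository.
URL = "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"

with urllib.request.urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

# The daily data is assumed to live in "day.csv" inside the archive.
df = pd.read_csv(archive.open("day.csv"))
```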
Let’s look at the top 5 rows in the dataset:
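Continuing from the dataframe loaded above:

```python
print(df.head())
```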
The dataset includes information about the date, season, and whether the day is a holiday or a working day. It also includes weather information and environmental factors such as temperature, humidity, and wind speed. Plus, it shows how many people are renting bikes, separating them into casual riders and those who are registered. For more information, check out the Bike Sharing Dataset page.
Preprocessing the Dataset
For simplicity, we'll drop the following columns: `instant`, `dteday`, `casual`, and `registered`.
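For example, assuming the dataframe from the previous step is named `df`:

```python
# Drop identifier columns and the columns that leak the target (casual + registered = cnt).
df = df.drop(columns=["instant", "dteday", "casual", "registered"])
```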
Since we're using giskard for monitoring, we must convert our pandas dataframe to a giskard `Dataset`.
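A sketch of the split-and-wrap step, assuming a simple 50/50 chronological split to mimic reference vs. current data and treating the integer-coded columns as categorical (the exact `Dataset` arguments may vary slightly between giskard versions):

```python
from giskard import Dataset

# Mimic "training" vs. "production" data with a simple chronological split.
split_point = len(df) // 2
ref_df = df.iloc[:split_point].reset_index(drop=True)
curr_df = df.iloc[split_point:].reset_index(drop=True)

cat_columns = ["season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit"]

reference_dataset = Dataset(ref_df, target="cnt", cat_columns=cat_columns, name="reference_dataset")
current_dataset = Dataset(curr_df, target="cnt", cat_columns=cat_columns, name="current_dataset")
```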
Now that we have our dataset ready, we can start implementing the tests.
Covariate Drift Tests
Let's begin by exploring how to conduct covariate drift tests using giskard. After that, we'll progress to understanding and performing concept drift tests.
Performing Drift Test on a Single Feature
We'll start by implementing the test on a single feature, since each test in giskard works on a single feature at a time. We'll be using the `temp` feature and performing the KS test on it.
Each test in giskard is under the `testing` module. We'll be using the `test_drift_ks` method to perform the KS test. It takes the following arguments:
- `actual_dataset`: The current dataset
- `reference_dataset`: The reference dataset
- `column_name`: The column name on which the test will be performed
- `threshold`: The threshold value for the test (default value is 0.05). We can adjust this value based on the significance level we want.
If the threshold is set to 0.05, the test fails when the p-value of the KS test of the numerical variable between the actual and reference datasets is less than 0.05.
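Here's a minimal sketch of the call, assuming the wrapped datasets from the preprocessing step are named `current_dataset` and `reference_dataset` (in recent giskard versions a fully parameterised test can be run directly with `execute()`):

```python
from giskard import testing

ks_test = testing.test_drift_ks(
    actual_dataset=current_dataset,
    reference_dataset=reference_dataset,
    column_name="temp",
    threshold=0.05,
)

# Run the test and inspect the outcome and the measured metric.
result = ks_test.execute()
print(result.passed, result.metric)
```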
Performing Drift Tests on Multiple Features
We've successfully conducted a drift test on one feature. Next, we'll apply the drift test to multiple features. To demonstrate, we'll use the `temp` and `workingday` features.
To manage and run these tests efficiently, we'll use the Test Suite in Giskard. It allows the grouping of various tests and runs them simultaneously.
We'll utilise the `Suite` class from the `giskard` library to implement the test suite. To run the test suite, we can use the `run` method.
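A sketch of what this could look like for our two features, assuming the datasets wrapped earlier (note that the dataset arguments are supplied at run time rather than when the tests are created):

```python
from giskard import Suite, testing

two_feature_suite = (
    Suite(name="temp_and_workingday_drift_suite")
    # KS test for the numeric feature, chi-square for the categorical one.
    .add_test(testing.test_drift_ks(column_name="temp", threshold=0.05))
    .add_test(testing.test_drift_chi_square(column_name="workingday", threshold=0.05))
)

two_feature_results = two_feature_suite.run(
    actual_dataset=current_dataset,
    reference_dataset=reference_dataset,
)
```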
The test suite's output displays the outcome of each test it contains. It shows us the type of each test, whether the feature tested is numeric or categorical, and the specific metric measured by the test. Additionally, each test is marked with a status: either 'pass' or 'fail'. The overall status of the test suite is determined by these individual test results, which depend on the measured metrics and their respective threshold values.
Performing Drift Tests on All Features in the Dataset
We can also perform the test on all the features in our dataset. We'll define some utility functions to help us with this.
The `_check_test` function checks the feature's type and returns the suitable test based on this type. We've also included a `prediction_flag` argument to decide whether to perform the test on the model predictions. Additionally, we've implemented conditional logic that selects the test depending on the dataset's size, in terms of the number of rows. The reasoning behind this logic will be explained in the following section.
The `create_drift_test_suite` function is used to create the test suite. It takes the following arguments:
- `suite_name`: The name of the test suite
- `wrapped_ref_dataset`: The wrapped reference dataset
- `cols`: The list of columns on which the test will be performed (default value is an empty list)
- `prediction_col_type`: The type of the target variable (default value is None)
- `prediction_col`: The name of the target variable (default value is None)
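The exact implementation isn't reproduced here, so below is a minimal sketch of what these two utilities might look like, built from the Giskard tests introduced above. The 1,000-row cut-off and the use of the wrapped dataset's `column_types` mapping are illustrative assumptions.

```python
from giskard import Suite, testing

def _check_test(column_type, num_rows, prediction_flag=False):
    """Return a suitable Giskard drift test for a column (or for model predictions)."""
    if prediction_flag:
        # Prediction drift tests differ from the input-feature tests.
        if column_type == "numeric":
            return testing.test_drift_prediction_ks if num_rows < 1000 else testing.test_drift_prediction_psi
        return testing.test_drift_prediction_chi_square if num_rows < 1000 else testing.test_drift_prediction_psi
    if column_type == "numeric":
        # KS works well on smaller samples; PSI scales better on larger ones (illustrative choice).
        return testing.test_drift_ks if num_rows < 1000 else testing.test_drift_psi
    # Categorical feature.
    return testing.test_drift_chi_square if num_rows < 1000 else testing.test_drift_psi

def create_drift_test_suite(suite_name, wrapped_ref_dataset, cols=None, prediction_col_type=None, prediction_col=None):
    """Build a Suite with one drift test per selected column (and optionally one for the prediction)."""
    cols = cols or []
    num_rows = len(wrapped_ref_dataset.df)
    # Assumption: giskard's inferred column types, e.g. {"temp": "numeric", "season": "category"}.
    column_types = wrapped_ref_dataset.column_types

    suite = Suite(name=suite_name)
    for col in cols:
        col_type = "numeric" if column_types.get(col) == "numeric" else "categorical"
        test_fn = _check_test(col_type, num_rows)
        suite = suite.add_test(test_fn(column_name=col))

    if prediction_col is not None:
        test_fn = _check_test(prediction_col_type, num_rows, prediction_flag=True)
        suite = suite.add_test(test_fn())

    return suite
```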
Now that we have our utility functions ready, we can create our test suite.
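Using these (hypothetical) helpers on every input column might look like this:

```python
feature_columns = [c for c in reference_dataset.df.columns if c != "cnt"]

full_suite = create_drift_test_suite(
    suite_name="all_features_drift_suite",
    wrapped_ref_dataset=reference_dataset,
    cols=feature_columns,
)

suite_results = full_suite.run(
    actual_dataset=current_dataset,
    reference_dataset=reference_dataset,
)
```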
Below you'll see the output of the test suite (cropped for brevity).
In the output shown above, we notice that the PSI test for the categorical variable `season` failed. This happened because the measured metric (17.84403) is above the threshold value for the PSI test, which is 0.2. In the same way, we can inspect the results of the other tests in the test suite.
It's important to mention that when creating the tests, we don't need to specify the `actual_dataset` and `reference_dataset` arguments, as seen in the `_check_test` method. Instead, we provide these arguments later, when we run the test suite using the `run` method.
Why is this useful? Let's say we want to periodically run the test suite on our production data. We can simply create the test suite and save it somewhere. Then we can load the test suite and run it on every batch of data we get from production. We are not restricted to running the test suite on the original dataset we used to create the test suite.
Isn't that cool?
Performing Drift Test at Dataset Level
Individual tests and tests in a test suite tell us if there is a drift in a particular feature. But what about detecting drift across the entire dataset?
Firstly, let's understand what data drift means for an entire dataset. It happens when the number of features experiencing drift is above a certain threshold. For instance, imagine we have a dataset with 10 features. If 7 out of these 10 features show data drift, that's 70% (7/10) of the dataset affected. If our threshold for significant drift is 50%, then having 70% of features drifting indicates data drift at the dataset level.
We can use the following function to implement this logic:
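A possible sketch of such a function, assuming it receives the object returned by `suite.run()` and that its `results` attribute is a list of `(test_name, test_result, test_params)` tuples whose second element exposes a `passed` flag:

```python
def dataset_drift_test(suite_result, threshold=0.5):
    """Flag dataset-level drift when the share of failed (drifted) feature tests exceeds `threshold`."""
    tests = suite_result.results
    num_drifted = sum(1 for _, test_result, _ in tests if not test_result.passed)
    drift_share = num_drifted / len(tests)

    print(f"{num_drifted}/{len(tests)} features drifted ({drift_share:.0%})")
    return drift_share > threshold
```

With the `suite_results` object from the previous section, `dataset_drift_test(suite_results, threshold=0.5)` would flag dataset-level drift whenever more than half of the tested features fail their drift test.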
We can view the test suite's results through the `results` attribute. This gives us a list of tuples, where each tuple includes the test's name, its outcome, and other relevant details.
We can adjust the `threshold` value based on our use case. If we want to be more strict we can set the threshold to 0.8 or 0.9. If we want to be more lenient we can set the threshold to 0.3 or 0.4.
Determining the Right Test for Each Feature
We have multiple tests available for both numeric and categorical features. But how do we determine which test to use for a particular feature?
Test performance can vary based on the number of rows in the dataset, the type of the feature, assumptions about the underlying distribution of the feature, etc. For instance, the KS test is non-parametric and works across all types of distributions, but it only applies to numeric features. The Chi-Square test, on the other hand, is designed for categorical features and relies on the expected frequency in each category being large enough to give reliable results.
To deal with this, we defined the `_check_test` function earlier in this tutorial. The function returns the appropriate test based on the type of the feature and the number of rows in the reference dataset. We can also add more logic to this function based on the use case.
Performing Drift Tests on Important Features
Earlier, we learned how to conduct drift tests on every feature in our dataset to see if there's any overall drift. But do we need to test all features for drift? Are all features equally informative? What if we just want to check for drift in the most important features?
Let's try this out by building a simple linear regression model using the bikeshare dataset. We'll identify the key features and then create a test suite specifically to monitor data drift in these important features.
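As a sketch, a scikit-learn linear regression trained on the reference split (assuming the `ref_df` dataframe created earlier and `cnt` as the target) might look like:

```python
from sklearn.linear_model import LinearRegression

feature_columns = [c for c in ref_df.columns if c != "cnt"]

regressor = LinearRegression()
regressor.fit(ref_df[feature_columns], ref_df["cnt"])
```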
Let's check the importance of the features based on the model we trained earlier.
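One simple way to gauge importance for a linear model, used here purely as an illustration, is to compare the magnitudes of the coefficients after scaling them by each feature's standard deviation:

```python
import pandas as pd

# Scale each coefficient by its feature's spread so the magnitudes are comparable.
importance = pd.Series(
    regressor.coef_ * ref_df[feature_columns].std().values,
    index=feature_columns,
).abs().sort_values(ascending=False)

print(importance)
```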
We can see that the `temp`, `atemp`, and `windspeed` features are the most important features. Let's create a test suite with only these features. We can also use the `dataset_drift_test` function to check what percentage of drift occurred in the important features.
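Putting it together, a suite restricted to these features might look like this (reusing the hypothetical helpers defined earlier):

```python
important_cols = ["temp", "atemp", "windspeed"]

important_suite = create_drift_test_suite(
    suite_name="important_features_drift_suite",
    wrapped_ref_dataset=reference_dataset,
    cols=important_cols,
)

important_results = important_suite.run(
    actual_dataset=current_dataset,
    reference_dataset=reference_dataset,
)

dataset_drift_test(important_results, threshold=0.5)
```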
Concept Drift
In this section, we'll explore how to run a drift test on model predictions to determine if there's been any drift. To quickly recap, concept drift happens when the relationship between input and output changes.
Performing Drift Test on Model Predictions
Why test model predictions for drift? We aim to see if the outcomes are significantly different when the model is used on current data compared to reference data.
For prediction drift tests, we require both the reference and current datasets, as well as the trained model in use in production or real-world settings. Why the model? Let's look at the basic steps of what happens in this test:
- The process starts with the reference dataset. The `model.predict()` method is used to generate predictions for this data, which we'll call `ref_pred`.
- Next, it processes the current dataset, again using `model.predict()` to obtain predictions for the new data. These will be referred to as `curr_pred`.
- The test then compares `ref_pred` and `curr_pred` to identify any drift. It uses standard tests like PSI, Chi-Square, or others, depending on whether the model is for classification or regression. This process checks for changes in the distribution of the two sets of prediction values.
An additional point to note is that for model prediction tests, we use `test_drift_prediction_ks` instead of the standard `test_drift_ks` we used for input features. This distinction can be found in the `_check_test` method we discussed earlier.
We already have the reference and current datasets prepared in the Giskard `Dataset` format. For this test, we'll utilise the same linear regression model developed earlier. To integrate this model into our testing framework, it must be wrapped using the `Model` class in giskard, with the `prediction_function` option for wrapping.
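A sketch of the wrapping step, assuming the `regressor` and `feature_columns` defined earlier (the exact `Model` arguments may differ slightly between giskard versions):

```python
import pandas as pd
from giskard import Model

def prediction_function(df: pd.DataFrame):
    # Giskard passes a pandas DataFrame; return one prediction per row.
    return regressor.predict(df[feature_columns])

wrapped_model = Model(
    model=prediction_function,
    model_type="regression",
    feature_names=feature_columns,
    name="bike_rentals_linear_regression",
)
```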
Let's create a test suite to perform the test.
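For example, a suite with a single prediction drift test could be assembled like this; the model is attached to the test here, and the datasets are supplied when the suite is run:

```python
prediction_suite = Suite(name="prediction_drift_suite").add_test(
    testing.test_drift_prediction_ks(model=wrapped_model, threshold=0.05)
)

prediction_results = prediction_suite.run(
    actual_dataset=current_dataset,
    reference_dataset=reference_dataset,
)
```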
Below you'll see the output of the test suite (cropped for brevity).
The prediction drift test is easy to spot in the report: its result includes model information, its title differs from the other tests, and, unlike the feature tests, no column name is listed in the results.
We can also pass the model to the `run` method. This is useful when we want to run the same test suite on an updated model.
Input Features vs Target Feature Drift Detection
Understanding the implications of drift test results on both input features and the target variable is crucial. What does it mean when these tests indicate different types of drift?
Scenario 1: The input features have drifted but the target variable hasn't drifted.
Should we be worried if the input features have drifted but the target variable hasn't drifted? There are two ways to interpret this:
- The Model is Doing Fine: It might be that the model is strong enough to handle these changes in the input features. Machine learning models are built to find and use patterns, so they can often deal with some changes. Also, the changes might be happening in parts of the data that don’t affect the model's predictions. This could potentially mean our test for drift in the input is too sensitive to less important features.
- There’s a Problem with the Model: The other possibility is that the model should be adjusting to these changes in the input features, but it’s not. This could mean the model isn’t strong enough to handle the changes. For example, the model might not be good at dealing with new kinds of data. In this case, we might need to retrain the model or make a new one.
Scenario 2: The input features have not drifted but the target variable has drifted.
When the target variable changes but the input features don't, it's a clear sign that we need to investigate further. This situation usually points to a potential issue like a bug, a problem with data quality, or a drift detector that's not set up correctly.
In such cases, the model might become less accurate. This is because it's still using old patterns to make predictions, but the actual relationship between the inputs and outputs has changed. For instance, consider a model trained to predict video views based on the game played and video length. If the game and video length stay the same but the number of views changes, it suggests that something else is influencing views now, which the model isn't catching.
Scenario 3: The input features and the target variable have drifted.
When both the input features and the target variable show signs of drift, it usually suggests the model might be struggling with these changes. However, there are two possible interpretations:
- The Model is Handling the Drift Well: In this case, even though there's been a change in the input features, the model's output is adjusting appropriately. This indicates that the model is robust enough to handle the drift. While this is a positive sign, continuous monitoring of the model's performance is still necessary to ensure ongoing accuracy.
- The Model is Struggling with the Drift: Here, the model's predictions become unreliable and erratic because it can't cope with the changes in the input features. This suggests the model isn't robust enough. In such situations, it's crucial to investigate the cause of the drift. We may need to retrain or even rebuild the model to make it more adaptable to these changes.
Conclusion
In this tutorial, we delved into various aspects of data drift, including its types and the tests available in Giskard to detect it. We learned how to build test suites tailored for different scenarios and the importance of selecting appropriate tests based on specific use cases. Additionally, we covered methods to test drift in the target variable and key features, along with strategies for interpreting test results effectively.
We encourage you to further explore Giskard and see how it can improve your model validation and testing processes.
If you found this helpful, consider giving us a star on GitHub and becoming part of our Discord community. We appreciate your feedback and hope Giskard becomes an indispensable tool in your quest to create superior ML models.