Data is one of the most valuable assets in today's world, with a staggering 328.77 million terabytes generated daily, encompassing a diverse array of content, from videos and text to spoken words. This data, spanning personal and supplementary information about individuals, can reveal profound insights about a person's identity.
However, these data patterns often lack the crucial context behind our behaviors and interpersonal interactions as human beings. When fed into machine learning algorithms, they perpetuate societal assumptions, generating predictions that raise legitimate concerns about privacy and fairness. These concerns extend to their impact on diverse groups of people across various facets of life, affecting things like how insurance costs are calculated, how credit scores are assigned to different groups, and even health predictions.
Because of these concerns, it's important to make sure that machine learning models built with this data are fair and don't favor one group over another in ways that can cause long-term harm.
This article will:
- Introduce unethical practices in machine learning.
- Develop a model for salary predictions to introduce classical model evaluation.
- Talk about the problems with traditional model evaluations that don't consider fairness.
- Introduce Giskard as a tool to ensure machine learning models are fair in their predictions.
Now, let's dive in!
☯️ Machine Learning bias and unethical practices: Ensuring AI Fairness
To understand the concept of data bias and its potential impact on individuals, let’s examine a scenario involving insurance payments.
Insurance provides financial security in unforeseen situations like accidents, but insurers have faced criticism globally for how they distribute premium payments across people in different geographical locations. When you pay premiums to an insurance company, it uses various factors to calculate coverage, including age, gender, location, and health status, and increasingly relies on ML to scale the process of calculating premiums and coverage.
Since the data that insurers use contains human bias, their models may carry assumptions through correlations between protected and proxy attributes (see Table 1) in the dataset without understanding the context. This can lead to discrimination, potentially disadvantaging certain income or racial groups.
For instance, an insurance company may use ZIP codes as proxy variables to determine premium costs. However, these algorithms lack the understanding that ZIP codes can often be linked to socioeconomic factors like race and income.
Fairness Implication — This potentially leads to unfair and discriminatory pricing practices that may violate anti-discrimination laws wherever these algorithms are deployed. This also reduces transparency and accountability because it makes it difficult for people to understand why they are paying higher than other people from different zip codes.
This can happen in other domains like banking, hiring and salary distribution, and education, and it might have already affected you in one way or another. It means there's a need to evaluate ML models to understand how they make predictions and to improve their fairness.
🧪 Overview of classical Machine Learning model evaluation
Traditional model evaluation techniques tend to focus on assessing the overall predictive performance of a model without delving deeply into the fairness or potential biases associated with specific variables, including protected and proxy variables.
In many cases, classic model evaluation primarily emphasizes metrics such as accuracy, precision, recall, F1-score, and ROC AUC, among others, to gauge how well a model performs in making predictions. These metrics generally evaluate the model's overall effectiveness in terms of correctly classifying outcomes but may not thoroughly examine how the model treats different subgroups or the fairness of its predictions with respect to protected or proxy variables.
However, when some of these metrics are combined and analyzed alongside demographic or protected variables, they can provide a more comprehensive view of a model's behavior and fairness.
Table 1. Showing the difference between protected and proxy variables based on definition, examples and usage
💰 Use Case: Evaluating a model trained on adult income data for salary prediction
To show how models might seem to perform well using standard evaluation methods but exhibit biases when fairness is taken into account, the adult income dataset from Kaggle is used.
This dataset is notorious for its inherent bias (just as in the insurance use case discussed earlier), particularly due to its imbalanced nature. It serves as an ideal example to underscore the critical significance of fairness, particularly in a sensitive domain like predicting salaries. Much like how insurance predictions can significantly influence people's lives, this particular use case provides a good basis for discussing fairness considerations.
It is also a popular dataset for building a binary classifier that predicts whether a person makes over $50,000 a year, given their demographic attributes.
With this background on the data and the aim of training the model, you can kick off by installing and importing relevant libraries. Note that Python is the primary programming language, and Google Colab Notebooks is the coding environment for this walkthrough.
📚 Install Giskard and import libraries to evaluate a model
If you do not have the giskard library and its dependencies installed, you can do that with the following:
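```python
# In a Colab cell, the leading "!" runs the command in the notebook's shell
!pip install giskard
```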
What’s Giskard? Testing framework for model evaluation
Giskard is a testing framework created to reduce the risks associated with biases, performance issues, and errors, improving the reliability of machine learning models. With the aid of this tool, you can find hidden flaws in your machine learning models like performance bias, unrobustness, data leakage, overconfidence, stochasticity, and unethical behavior. Giskard helps you automatically scan your models for vulnerabilities and offers concise descriptions of these risks if they are present.
Let’s import the relevant libraries including giskard:
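```python
# Data handling and modeling
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Giskard wrappers, scanner, and test catalog
import giskard
from giskard import Dataset, Model, scan
import giskard.testing as testing
```

The exact set of imports depends on your preprocessing choices; this set matches the sketches used throughout the rest of this walkthrough.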
Load and Preprocess the dataset
Download the CSV file programmatically and proceed to preprocessing in the Google Colab coding environment.
The data used in this exercise was sourced from here. It contains some rows with missing values, poorly formatted column names, and certain columns that won't be necessary for our purposes.
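A minimal loading-and-cleaning sketch might look like the following, assuming you have downloaded the Kaggle CSV as `adult.csv` (the file name, missing-value encoding, and column formatting depend on the version you download):

```python
# Load the adult income data; replace the path with wherever you stored the Kaggle CSV
df = pd.read_csv("adult.csv")

# Standardize column names (e.g. "marital.status" -> "marital_status")
df.columns = [c.strip().lower().replace(".", "_").replace("-", "_") for c in df.columns]

# Missing values are typically encoded as "?" in this dataset; drop those rows
df = df.replace(["?", " ?"], pd.NA).dropna()
```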
You can take a look at information about the dataset by using the df.info() pandas command to determine the following (see the sketch after this list):
- Variables you won’t need
- Categorical variables
- Numerical variables
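For example, with illustrative column choices (the names below are assumptions; adjust them to the columns actually present in your CSV):

```python
# Inspect dtypes, non-null counts, and column names
df.info()

# Illustrative choices; adjust to your CSV
TARGET_COLUMN = "income"
CATEGORICAL_COLUMNS = ["workclass", "marital_status", "occupation",
                       "relationship", "race", "sex", "native_country"]
NUMERICAL_COLUMNS = ["age", "hours_per_week", "capital_gain", "capital_loss"]

# Keep only the columns needed for training and evaluation
df = df[CATEGORICAL_COLUMNS + NUMERICAL_COLUMNS + [TARGET_COLUMN]]
```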
After specifying the variables you can start preprocessing and preparing data for training.
Set constant variables you will need for splitting the data for simplicity and then split your data into training and testing.
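A minimal sketch of that split, reusing the constants and column names defined above (the values are illustrative):

```python
# Constants for a reproducible split
TEST_SIZE = 0.2
RANDOM_STATE = 42

X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
```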
This appears to be a typical method for splitting data into training and testing sets. However, it's important to recognize that this method might not necessarily enhance the model's performance on subgroups, especially when dealing with imbalanced data. Even when various techniques, such as oversampling, undersampling, or employing SMOTE (Synthetic Minority Over-sampling Technique), are utilized to address imbalanced data, there can still be lingering data sampling bias that traditional model evaluations may not detect.
Classic Model Evaluation Pitfall #1: Addressing data sampling bias
Data sampling bias occurs when the data collected doesn't accurately represent the entire population you want to make predictions about. For the "Adult" dataset, one might assume that this is a fair representation of society, but when you take a closer look, you might find that certain racial groups are underrepresented or overrepresented.
In other words, the data we've collected doesn't accurately reflect the diversity of the population. This is where sampling bias creeps in. Also, splitting the dataset might exacerbate this bias by creating an imbalance in subgroup representation, leading to unintended bias in your model's evaluation and predictions.
The consequences of such bias can be profound. Imagine a scenario where a model with this bias is used to determine eligibility for loans or job opportunities. It could disproportionately deny opportunities to certain racial groups, perpetuating social disparities.
Initiate bias mitigation: Wrap your Dataset with Giskard for ML model evaluation
Wrapping your dataset with Giskard is the first step towards preparing to scan your model for performance issues. Datasets are a major potential source of bias for ML models, and bias mitigation can start with carefully selecting the features to train on.
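A sketch of the wrapping step, reusing the test split and column lists from the earlier sketches (those names are assumptions from this walkthrough, not fixed by Giskard):

```python
# Wrap the evaluation data (features + target) so Giskard knows the target and categorical columns
giskard_dataset = Dataset(
    df=pd.concat([X_test, y_test], axis=1),
    target=TARGET_COLUMN,
    name="Adult income dataset",
    cat_columns=CATEGORICAL_COLUMNS,
)
```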
Classic Model Evaluation Pitfall #2: Difficulty in enabling efficient feature engineering based on fairness
Traditional model evaluations are blind to disparities in the predictions made for different demographic groups. A model may achieve high accuracy while still treating certain groups unfairly, and the evaluation tells you little about the data the model uses to reach that generalization.
When working with classical model evaluations, many ML practitioners check the dataset used for developing the model only when their model performance is low or too high. This leaves you with a blindspot and implicitly reduces the interpretability of the model. Measures you might take here include but are not limited to manually slicing the data, evaluating the specific subsets through the model and writing logs before fixing the problems.
As we'll see later, Giskard automatically slices the data and tests each slice by subgroup, giving you valuable information for feature engineering. You can investigate perturbations that might be harming the quality and performance of your model on the subgroups you are interested in, which helps you enhance the model when you make the right changes.
Train the ML model
Here, the categorical data is encoded with OneHotEncoder, passed into a scikit-learn Pipeline, and trained with RandomForestClassifier.
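A sketch of that pipeline, using the column lists and split defined earlier:

```python
# One-hot encode the categorical columns and pass the numerical ones through unchanged
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLUMNS)],
    remainder="passthrough",
)

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=RANDOM_STATE)),
])

clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```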
Output:
A test accuracy of 0.82 might give confidence to most data scientists; they might be happy to move on and consider their model good enough to perform in production. However, looking at accuracy metrics alone may not unveil potential fairness, robustness, over/underconfidence, spurious correlation, or a whole host of other issues that can surface when the model is confronted with the real world.
Ensure AI fairness: Wrap and Scan your model for ML model evaluation
Just like the dataset, the model is wrapped with the giskard library to prepare it for the scan.
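A sketch of the model wrapping, assuming the fitted pipeline and dataframes from the previous steps:

```python
# Wrap the fitted scikit-learn pipeline so Giskard can run it during the scan
giskard_model = Model(
    model=clf,
    model_type="classification",
    name="Adult income classifier",
    classification_labels=list(clf.classes_),
    feature_names=list(X.columns),
)

# The wrapped model delegates to the same pipeline, so the accuracy is unchanged
print(f"Wrapped Test accuracy: {clf.score(X_test, y_test):.2f}")
```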
Output:
Result: Wrapped Test accuracy: 0.82
After wrapping the model, you can scan it to check for vulnerabilities.
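Running the scan is a single call; displaying the result renders the interactive report in a Colab notebook:

```python
# Scan the wrapped model against the wrapped dataset for vulnerabilities
scan_results = scan(giskard_model, giskard_dataset)
display(scan_results)
```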
Output:
The giskard library produces a report to help understand the different vulnerabilities of our model. Here we notice that the main issues detected are performance biases and underconfidence.
This report highlights the significance of evaluating machine learning model fairness by combining classic metrics with their global counterparts. It provides insights into whether the model's overall performance aligns with its performance across various demographic subgroups.
This is critical because it offers a proactive approach to addressing fairness concerns in machine learning models. Instead of data scientists having to manually investigate their data to identify these issues, which can be time-consuming and error-prone, the giskard scan report streamlines the process. Data scientists and stakeholders can quickly pinpoint potential fairness concerns or irregularities tied to specific subgroups in the data and act on them, saving the headache of sifting through piles of data and blindly trying out different methods to anticipate and mitigate a model's performance biases.
Table 2. Shows how metrics for a subgroup combined with their global metrics can reveal fairness concerns
Classic Model Evaluation Pitfall #3: Lack of Fairness and Interpretability
Traditional model evaluation relies on common metrics like accuracy, precision, recall, and F1-score to assess a model's performance. These are handy for telling us how well a model is doing in general, but they don't really explain whether the model is being fair to different groups of people. This is a problem because it means we might not realize when our models are being unfair, which can lead to unfair treatment based on things like race, gender, or age without us even knowing it.
Giskard empowers users with actionable insights for improving fairness. The vulnerabilities detected suggest potential interventions and adjustments to reduce disparities and promote equitable outcomes. In essence, Giskard bridges the gap in fairness interpretability by providing a transparent and actionable framework for assessing and addressing fairness concerns, ensuring that machine learning models adhere to ethical and equitable standards.
Generate a test suite from the Scan
The results generated from the scan can serve as building blocks to create a comprehensive test suite that incorporates domain-specific challenges and considerations, enhancing the overall testing process of your ML model.
Understanding Test Suites in ML model training and evaluation
Test suites are organized collections of reusable components designed to make the evaluation and validation processes for machine learning models more efficient and consistent.
They include various test cases, each tailored to assess specific aspects of a model's performance. The main goal of using such test suites is to improve the efficiency and consistency of testing.
Additionally, they help you maintain consistent testing practices, enable model comparisons, and quickly identify any unexpected changes or issues in the behavior of your machine learning models.
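With that context, a sketch of generating and running the suite from the scan results (the suite name here is arbitrary):

```python
# Convert the scan findings into a reusable test suite and execute it
test_suite = scan_results.generate_test_suite("Adult income test suite")
test_suite.run()
```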
Output:
After running your first test suite, this report tells you that when the model's performance is assessed for individual groups, for example people labelled "Unmarried" in the relationship column, the Recall metric is quite low (0.2449) and does not meet the expected performance level indicated by the threshold (i.e. the test failed).
This suggests that the model may not be effectively capturing all relevant positive cases for this subgroup, indicating a potential area for improvement.
This report hints that you might want to tweak your data or try out a different model to make sure it meets fairness standards.
Customize your suite by loading objects from the Giskard catalog
Test suites can be customized in the giskard library. Giskard’s catalog provides you with the capability to import various elements like:
- Tests, encompassing types like metamorphic, performance, prediction, and data drift tests, as well as statistical assessments.
- Slicing functions, which include detectors for attributes like toxicity, hate speech, and emotion.
- Transformation functions, offering features such as generating typos, paraphrasing, and fine-tuning writing styles.
The code below adds an F1 test in the suite.
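```python
# Add a global F1 test from the Giskard catalog and re-run the suite
test_suite.add_test(
    testing.test_f1(model=giskard_model, dataset=giskard_dataset, threshold=0.7)
).run()
```

The threshold value here is illustrative; pick one that matches your own performance requirements.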
The report below shows that our model failed this newly added test.
Output:
Moving forward, the logical progression involves addressing and rectifying the issues identified by the giskard scan to reduce model bias. To achieve this, you can start the Giskard Hub, a platform that allows you to debug your failing tests, thoroughly examine and assess your data and models for fairness, and compare different versions of your model to ensure you choose the best one, eliminating the guesswork involved in the process.
✅ AI Safety with Giskard Hub: Enhance Machine Learning model evaluation and deployment
Giskard Hub serves as a collaborative platform designed for debugging failing tests, curating specialized tests tailored to your specific domain, facilitating model comparisons, and gathering domain expert feedback on your machine learning models. It plays a pivotal role in enhancing AI safety and accelerating the deployment process.
The Giskard Hub can be deployed through HuggingFace Spaces, on-premise, on the cloud, or through Giskard's custom SaaS offering. In real-world scenarios where sensitive information must stay on-premise, you would opt to run the Giskard Hub on-premise, as we do here.
1. Install giskard server and all its requirements on your local machine
Ensure Docker and all other requirements are installed on your system. Check here for how to install them on Linux, macOS, or WSL2 (Windows Subsystem for Linux) on Windows.
2. Start the server on your local machine by inputting the following command on your terminal:
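```bash
# Starts the Giskard server (assumes Docker is running; check the Giskard docs
# if the CLI differs in your installed version)
giskard server start
```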
When it runs, it will provide you with a localhost address http://localhost:19000/.
If you are new to using giskard, you will need to set up an ngrok account which creates a layer of security when uploading objects from your Colab notebook to the Giskard server.
3. Request a free trial license
Launch Giskard!
You are set to upload all your project assets from your Colab notebook.
4. Set up an ngrok account and generate your ngrok API token, then expose the giskard server to the internet to allow you upload objects to the Giskard Hub.
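A sketch of the expose command; the flag name is an assumption, so verify it against the Giskard documentation for your version:

```bash
# Expose the local Giskard server through an ngrok tunnel
giskard server expose --ngrok-token <your_ngrok_api_token>
```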
Output:
5. Go to your Colab notebook and upload the first test suite you just ran with all its assets
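A sketch of the upload step from the notebook, assuming a client-based workflow; the project key, project name, and placeholder values below are illustrative:

```python
from giskard import GiskardClient

# Connect to the Hub through the exposed URL and the API key generated in the Hub settings
client = GiskardClient("<ngrok_external_server_link>", "<your_giskard_api_key>")

# Create the project once, then upload the suite together with its wrapped model and dataset
client.create_project("adult_income", "Adult income", "Fairness evaluation of a salary prediction model")
test_suite.upload(client, "adult_income")
```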
Use the external server link –<ngrok_external_server_link>– to see the uploaded test suite on Giskard Hub.
6. While the other terminal runs the server, open a new terminal and execute this command on your local machine to start the ML worker.
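A sketch of that command; the daemon, key, and URL flags are assumptions, so check the Giskard documentation for the exact invocation in your version:

```bash
# Start an ML worker that connects back to the local Giskard server
giskard worker start -d -k <your_giskard_api_key> -u http://localhost:19000/
```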
The ML worker allows the giskard server to execute your model and all its associated operations directly from the Python environment where you trained it. This prevents dependency headaches and makes sure you can run test suites and models directly from the server.
7. Curate your tests, debug, and re-run tests
After running your initial test suite, add new tests, run them on newer versions of your models, and debug them, by reviewing failing subgroups to identify and address issues detected by Giskard.
Play around with the Giskard platform to maintain a centralized Hub for all your tests and model experiments in the most comprehensive way possible.
Through this guide, you've learned to scan your model and develop comprehensive test suites using the Giskard Python library. You've also seen how the Giskard Hub can prove to be the ideal debugging and model evaluation companion when training new ML models. Giskard's tools simplify and automate many tasks that are often missed when creating ML models.
If you found this helpful, consider giving us a star on Github and becoming part of our Discord community. We appreciate your feedback and hope Giskard becomes an indispensable tool in your quest to create superior ML models.