Tutorials
February 1, 2023
5 min read

How to test the fairness of ML models? The 80% rule to measure the disparate impact

This article provides a step-by-step guide to detecting ethical bias in AI models, using a customer churn model built with the LightGBM library as an example. We show how to calculate the disparate impact metric with respect to gender and age, and demonstrate how to implement this metric as a fairness test within Giskard's open-source ML testing framework.

Rabah Abdul Khalek

Picture illustrating gender bias generated by DALL-E2

In a previous article, we outlined possible structural biases that are most often present in AI. In this article, we propose the disparate impact test to measure if an AI model is affected by such biases.

Introduction

Disparate impact in AI refers to the phenomenon wherein a machine learning model disproportionately harms certain groups of people. This can happen when a model is trained on data that is biased against certain groups, and as a result, the model makes decisions that discriminate against them. This is a serious issue that can have real-world consequences, particularly in areas such as lending, employment, and criminal justice, where machine learning models are increasingly used to make decisions affecting people's lives.

One way to concretely measure disparate impact is the 80% rule. The 80% rule is a principle stating that if the selection rate for a protected group (such as a minority group) is less than 80% of the selection rate of the group with the highest selection rate, the selection process may be considered discriminatory. This rule has been used in employment discrimination cases in the United States, but it has also been applied to machine learning models. It was established to limit bad practices in employment, housing, and other areas that affected a minority (with respect to a sensitive characteristic), even when the rules applied by employers or landlords are formally neutral. Most American federal civil rights laws consider characteristics such as race, colour, religion, national origin, and sex, among others, to be sensitive.
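For a toy illustration with made-up numbers: if the group with the highest selection rate is hired at 60% and a protected group at 45%, the ratio is 0.45 / 0.60 = 0.75, which falls below the 0.8 threshold, so the process would be flagged for review. A minimal sketch of this check in Python (the function name and figures are ours, purely for illustration):

```python
def passes_four_fifths_rule(protected_rate: float, reference_rate: float) -> bool:
    """Check the 80% (four-fifths) rule: the protected group's selection rate
    should be at least 80% of the highest group's selection rate."""
    return protected_rate / reference_rate >= 0.8

# Hypothetical selection rates, for illustration only
print(passes_four_fifths_rule(protected_rate=0.45, reference_rate=0.60))  # 0.75 -> False
```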

In the context of machine learning, the 80% rule can be used as a benchmark to identify whether a model has a disparate impact on certain groups. This can be used as a flag for further examination of the model, and as a way to identify potential biases in the data used to train the model.

The disparate impact test — the 80% rule

Several metrics can be considered when measuring the disparity between two statistical samples, such as the 80% rule or statistical significance tests, among others. This article is dedicated to the 80% rule; the statistical significance tests currently implemented in Giskard will be discussed at length in a future article.

The 80% rule was first published in the State of California Guidelines on Employee Selection Procedures in October 1972. Ever since, it has become a common rule used by companies to ensure fair representation of protected minorities. It can be generally stated as follows:

The rate of positive outcomes for a protected (minority) group, with respect to that of an unprotected (majority) one, should be more than 80%.

Which can be reformulated in an AI context to be:

The ratio of the positive-outcome probability predicted by an AI model on the protected subset to that on the remaining (unprotected) subset should be more than 80%.

Mathematically, we can formulate it as:

$$DI = \frac{\Pr(y = 1 \mid D_{\text{protected}})}{\Pr(y = 1 \mid D_{\text{unprotected}})} \geq 0.8$$

where Pr is the probability of a positive outcome (here we suppose it is y = 1) given a data subset D. In the context of an AI model whose output ŷ is obtained by applying a classification threshold t to the predicted score, the probability Pr can be expressed as follows:

$$\Pr(\hat{y} = 1 \mid D) = \frac{\left|\{\, x \in D \;:\; \hat{p}(x) \geq t \,\}\right|}{|D|}$$

where $\hat{p}(x)$ is the model's predicted probability of the positive class for input $x$.
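Concretely, given model scores and a boolean mask identifying the protected subset, the two quantities above can be estimated as in the following sketch (a minimal illustration of the formulas, not Giskard's implementation; the function and variable names are ours):

```python
import numpy as np

def positive_rate(scores: np.ndarray, threshold: float = 0.5) -> float:
    """Estimate Pr(y_hat = 1 | D): the share of samples in the subset whose
    predicted probability of the positive class reaches the threshold t."""
    return float(np.mean(scores >= threshold))

def disparate_impact(scores: np.ndarray, protected_mask: np.ndarray,
                     threshold: float = 0.5) -> float:
    """Ratio of positive-outcome rates: protected subset over the rest."""
    return (positive_rate(scores[protected_mask], threshold)
            / positive_rate(scores[~protected_mask], threshold))
```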

Now that the disparate impact metric is formally defined, we need to define what a “positive outcome” is in Machine Learning. Does it only mean that the target variable is positive? Not necessarily.

The disparate impact metric can be classified as a fairness metric in AI Ethics. It measures the unfair allocation of a given good (resource, service, or practice) to the detriment of a minority. To link the Machine Learning target variable with this “positive outcome”, we need to understand the business action that follows the prediction given by the Machine Learning model. For instance, in credit scoring, applicants who have a high default score (target) are usually denied access to the loan (output). An HR ML model that predicts high-potential candidates will propose a job position (output) to candidates who have a high score of being high-potential (target). In some famous cases like the COMPAS algorithm, judges choose to take a defendant into custody (output) if their recidivism score (target) is too high.

As a result, to define the disparate impact test properly, one should understand the link between the ML target variable being predicted and the output that should be allocated fairly between groups. Let’s see how it works for a common churn model.

Application on an AI model — Churn

It’s generally easier to keep an existing customer than to gain a new one. It’s also much easier to save a customer before they leave than to convince them to come back. For these reasons and more, understanding and preventing customer churn is critical to the long-term success of a company. Churn prediction is the process of using historical data to identify customers who are likely to cancel their subscriptions or stop using a service. Machine learning algorithms are often used to analyse customer data and build a model that can predict which customers are at risk of churning. The goal of churn prediction is to identify at-risk customers early so that the company can take action to retain them. This can include targeted marketing campaigns, special promotions, or personalised services.

What is the link between churn and an ethical metric like disparate impact? Fairness considerations are not always relevant for common marketing use cases. Machine Learning is all about creating effective discriminators, and there is no real point in forcing a model to produce the same decision across groups if there is no precise ethical justification behind it. But for churn, things can be a bit different.

Offering incentives is one strategy that a company can use to try to retain customers with high churn scores. These incentives can take many forms, such as discounts, rewards, or special promotions. For example, a company might offer a discount on a customer's next purchase if they agree to stay with the service for an additional period of time. Similarly, the company might offer a reward to customers who refer their friends and family to the service.

What if a company offers special promotions and gifts only to white and wealthy individuals? There is no doubt that this company would be viewed as discriminating against minorities. To avoid this, let’s see how to measure such a bias with a practical churn code example.

In this section, we consider the Telco Customer Churn Kaggle dataset. This dataset contains demographic information about customers – gender, age range, whether they have partners and dependents – and whether they stopped being a customer within the last month, in other words, whether they churned.

An AI model (in our case, a LightGBM model) has been trained to predict how likely a customer is to churn; a minimal training sketch follows the list below. Its accuracy on the test dataset is 79.6%. Our goal is to determine whether this model is affected by social biases. We will look into two features:

  • Gender
  • Age/Seniority
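The following is a minimal sketch of how such a model might be trained. The column names (gender, SeniorCitizen, Churn, customerID, TotalCharges) come from the Kaggle dataset; the file name, preprocessing, split and hyperparameters are simplified assumptions, not the exact setup used for the article, so the reported accuracy may differ slightly.

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Kaggle Telco Customer Churn data (local file name may differ)
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges is stored as text in the raw file; coerce it to numeric
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Target: 1 if the customer churned, 0 otherwise
y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LGBMClassifier(random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```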

The dataset contains a balanced number of male (50.5%) and female (49.5%) customers. However, it contains only 16.2% seniors versus 83.8% non-seniors.

Gender balance in the dataset
Age balance in the dataset

The main question is then: how are these features (gender and seniority) distributed with respect to the positive outcome? We’ll suppose that the positive outcome corresponds to the target value “did not churn”. In the following diagrams, we plot the proportions of males and females who did not churn in both the training and testing datasets, together with the predictions of the trained model. In each case we calculate the following disparate impact metric (that we call DI), which can be expressed as:

$$DI_{\text{gender}} = \frac{\Pr(y = \text{did not churn} \mid \text{gender} = \text{female})}{\Pr(y = \text{did not churn} \mid \text{gender} = \text{male})}$$

It is clear that in all cases DI ≈ 1, so no disparity is detected. This is clear evidence that the model’s gender representation (on the right) is representative of the data (on the left).

Correlation between target and gender in the data
Correlation between prediction and gender in the model

When it comes to the seniority feature, the metric becomes:

$$DI_{\text{seniority}} = \frac{\Pr(y = \text{did not churn} \mid \text{senior})}{\Pr(y = \text{did not churn} \mid \text{non-senior})}$$

Most of the customers are younger people and, as might be expected, the model does not respect the 80% rule, with DI = 0.72 and 0.76 for the training and testing datasets respectively. Of course, this disproportion is also reflected in the data, but it means that the model is more likely to predict that a customer will churn if they are a senior. Although this is not necessarily an unethical bias, the company might choose to eliminate it, and thus penalise its models during training so that they respect DI ≥ 0.8.

Correlation between target and age seniority in the data
Correlation between prediction and age seniority in the model
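Continuing the earlier sketches, these DI values could be reproduced on the model's predictions along the following lines (again an illustration using the disparate_impact helper defined above; the exact numbers depend on the preprocessing and the train/test split):

```python
# Predicted probability of the positive outcome "did not churn" (class 0)
scores_no_churn = model.predict_proba(X_test)[:, 0]

# Gender: protected = female customers, unprotected = male customers
female_mask = (df.loc[X_test.index, "gender"] == "Female").to_numpy()
print("DI (gender):", disparate_impact(scores_no_churn, female_mask))

# Seniority: protected = senior citizens, unprotected = everyone else
senior_mask = (df.loc[X_test.index, "SeniorCitizen"] == 1).to_numpy()
print("DI (seniority):", disparate_impact(scores_no_churn, senior_mask))
```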

Code implementation — example of test

In Giskard, we have implemented this test, allowing the user to define the protected slices of the data and the positive outcome, and to provide a min_threshold and a max_threshold that the DI has to respect.
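To make the pass/fail logic explicit, here is a rough sketch of what such a test does conceptually. This is our own simplified illustration, not Giskard's actual API or implementation: the run_disparate_impact_test name, the result container and the default thresholds (0.8 and a symmetric 1.25 upper bound) are assumptions, and it reuses the disparate_impact helper and the variables from the previous sketches.

```python
from dataclasses import dataclass

@dataclass
class DisparateImpactResult:
    metric: float
    passed: bool

def run_disparate_impact_test(scores, protected_mask,
                              min_threshold: float = 0.8,
                              max_threshold: float = 1.25,
                              classification_threshold: float = 0.5) -> DisparateImpactResult:
    """Pass when the DI ratio falls within [min_threshold, max_threshold]."""
    di = disparate_impact(scores, protected_mask, classification_threshold)
    return DisparateImpactResult(metric=di, passed=min_threshold <= di <= max_threshold)

result = run_disparate_impact_test(scores_no_churn, senior_mask)
print(f"DI = {result.metric:.2f}, passed: {result.passed}")
```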

We can see in the following screenshots the two investigations carried out above (on gender and seniority) reproduced in the Giskard interface:

  • Gender:
Disparate impact test result on gender
  • Seniority:
Disparate impact test results on seniority

Conclusion

Among the metrics that measure ethical biases in an AI model, we discussed the 80% rule for gauging disparate impact. The disparate impact test helps detect when an AI model that appears to be neutral results in a disproportionate impact on a protected group. We saw that AI fairness metrics are relevant not only for critical use cases in HR, banking, or public services: in simple marketing use cases like churn, disparate impact tests can be relevant if they are implemented with a real understanding of the output of the AI system in the business world.

This test is part of a large collection implemented in Giskard. Try our demo projects preloaded in the app to learn more about testing in Machine Learning.

Dashboard of test results in Giskard

Check out our GitHub!
