November 23, 2021
3 min read

Where do biases in ML come from? #4 📊 Selection

Selection bias happens when your data is not representative of the situation you want to analyze, introducing risk into AI/ML systems

[Cover image: Orange picking]

Jean-Marie John-Mathews, Ph.D.

In this post, we focus on selection biases. 📊

Selection bias occurs when the training dataset is not representative of the population intended to be analyzed. Theoretically, it appears when data selection is not properly randomized.

Here are some well-known examples of selection biases:

❌ Attrition

This bias happens when the training dataset only includes the subjects that “survived” a process. For example, imagine you want to predict the efficiency of a dieting program. To do so, suppose you exclude everyone who drops out of the program from your dataset. This creates a bias, since people may drop out precisely because the program is not working for them.
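To see the distortion concretely, here is a minimal simulation with invented numbers, in which participants who lose less weight are more likely to drop out; estimating the program’s effect from completers alone inflates it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cohort: true average weight loss of 2 kg (std 3 kg).
true_loss = rng.normal(loc=2.0, scale=3.0, size=100_000)

# Dropout is more likely when weight loss is low: P(dropout) = sigmoid(-loss).
p_dropout = 1 / (1 + np.exp(true_loss))
stayed = rng.random(100_000) > p_dropout

print(f"True mean loss:       {true_loss.mean():.2f} kg")
print(f"Completers-only mean: {true_loss[stayed].mean():.2f} kg")  # inflated
```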

❌ Cherry-picking

This bias happens when one deliberately chooses specific subsets of data to support a conclusion. For example, imagine that you want to prove that air travel is dangerous. To do that, suppose you use a training set that focuses on plane accident cases, ignoring the far more common examples of flights that complete safely.
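A tiny illustration with synthetic data shows how far a curated subset can push the estimate; the accident rate below is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical flight log: roughly 1 accident per 50,000 flights.
accident = rng.random(500_000) < 2e-5

# Cherry-picked "study set": every accident, plus only a sliver of safe flights.
picked = accident | (rng.random(500_000) < 1e-4)

print(f"True accident rate:          {accident.mean():.6f}")
print(f"Cherry-picked accident rate: {accident[picked].mean():.6f}")  # orders of magnitude higher
```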

❌ Observer bias

This bias happens when data collection depends on the target variable. For example, suppose you are collecting data from users’ mobile phones to predict their purchase behavior. You thereby exclude from your training dataset all the people who don’t own a mobile phone. If these people have different purchasing behaviors, you will end up with a biased prediction algorithm.
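Here is a minimal sketch, with hypothetical rates, of how training only on phone owners miscalibrates the predicted purchase rate for the whole population:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 50_000
owns_phone = rng.random(n) < 0.8          # assume 80% phone ownership
p_buy = np.where(owns_phone, 0.30, 0.10)  # assumed per-group purchase rates
bought = rng.random(n) < p_buy

print(f"Population purchase rate:   {bought.mean():.3f}")              # ~0.26
print(f"Phone-owners-only estimate: {bought[owns_phone].mean():.3f}")  # ~0.30
# A model trained only on owners inherits the higher baseline and
# over-predicts purchases for the full population.
```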

How to mitigate selection bias? You need to use exogenous information, such as:

✅ Third-party features

Add features to your model that are correlated with the sampling mechanism. In our example of measuring diet program efficiency, including a variable that measures people’s propensity to drop out of the program (a measure of perseverance, for example) can mitigate the selection bias.
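One way to operationalize this (not the only one) is inverse-propensity weighting: fit a model for the probability of staying in the program from the perseverance feature, then reweight completers by the inverse of that probability. A sketch on simulated data, where the `perseverance` feature and all numbers are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

n = 50_000
perseverance = rng.normal(size=n)  # hypothetical third-party feature
weight_loss = 2.0 + 1.5 * perseverance + rng.normal(scale=2.0, size=n)
stayed = rng.random(n) < 1 / (1 + np.exp(-perseverance))  # persistent people stay

# The feature is observed for everyone, so we can model P(stay)
# and reweight completers by 1 / P(stay) to recover the full-cohort mean.
X = perseverance.reshape(-1, 1)
p_stay = LogisticRegression().fit(X, stayed).predict_proba(X)[:, 1]

naive = weight_loss[stayed].mean()
ipw = np.average(weight_loss[stayed], weights=1 / p_stay[stayed])
print(f"True mean: {weight_loss.mean():.2f}  naive: {naive:.2f}  reweighted: {ipw:.2f}")
```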

✅ Third-party examples

Find more data sources to include as many heterogeneous examples as possible in your training dataset. In our example of purchase prediction, you can draw on additional data sources from other channels, such as desktop computers or user surveys.
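As a minimal sketch with pandas, pooling channels and tagging each row with its source lets you recover users invisible to any single channel and audit per-channel base rates afterwards; the channel names and records here are hypothetical:

```python
import pandas as pd

# Hypothetical purchase records from three separate channels.
mobile = pd.DataFrame({"user_id": [1, 2], "bought": [1, 0]}).assign(source="mobile")
web    = pd.DataFrame({"user_id": [3, 4], "bought": [0, 0]}).assign(source="web")
survey = pd.DataFrame({"user_id": [5, 6], "bought": [1, 1]}).assign(source="survey")

training = pd.concat([mobile, web, survey], ignore_index=True)
print(training.groupby("source")["bought"].mean())  # per-channel base rates
```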

✅ Prior representations

Resample your dataset so that it matches the prior representation you have of the population. In our previous example on air travel, you can conduct a qualitative survey on air travel to build a prior representation of your population. Based on this representation, you can then implement a quota sampling method to obtain a more representative training dataset.
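Here is a sketch of quota resampling with pandas, assuming the prior proportions come from such a survey; all numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical training set in which accidents are massively over-represented.
df = pd.DataFrame({"accident": [1] * 600 + [0] * 400})

# Assumed prior representation of the population (e.g., from a survey).
prior = {1: 0.02, 0: 0.98}

n_target = 500
quota = {k: int(p * n_target) for k, p in prior.items()}  # {1: 10, 0: 490}

# Sample each stratum to its quota (with replacement if the stratum is too small).
resampled = pd.concat(
    [df[df["accident"] == k].sample(n, replace=n > (df["accident"] == k).sum(),
                                    random_state=0)
     for k, n in quota.items()],
    ignore_index=True,
)
print(resampled["accident"].mean())  # ~0.02, matching the prior
```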

At Giskard, we help AI professionals detect selection biases by enriching the modeling process with exogenous information.
