Tutorials

March 11, 2025

7 min read

How to implement LLM as a Judge to test AI Agents? (Part 1)

Testing AI agents effectively requires automated systems that can evaluate responses across several scenarios. In this first part of our tutorial, we introduce a systematic approach using LLM as a judge to detect hallucinations and security vulnerabilities before deployment. Learn how to generate synthetic test data and implement business annotation processes for exhaustive AI agent testing.

Implementing LLM as a Judge to test AI agents

Jean-Marie John-Mathews, Ph.D.

Implementing LLM as a Judge to test AI agents

Introduction

Testing generative AI models presents unique challenges due to the infinite number of potential test cases and the domain-specific nature of these cases. Implementing an LLM-based evaluation (LLM as a judge) enables automation by allowing an LLM to assess various cases. Additionally, LLMs can generate synthetic test cases to enhance test coverage. However, these automated processes for evaluation and generation come with significant limitations:

Without human guidance, generated test cases tend to be too generic and unrealistic.
LLM-based evaluations often produce false positives due to the high cost of aligning human judgment with LLM evaluations.
As the world evolves, the process of test case generation must be continuously refined to prevent model drift.

In this two-part tutorial, we outline a four-step workflow for implementing LLM as a judge evaluation for AI agent testing:

Generation of synthetic data – Automate test case generation with a focus on legitimate and adversarial queries.
Business annotation – Use domain knowledge to review and refine test cases through annotation tools.
Test execution automation – Run evaluations and set up alerts for detected vulnerabilities.
Continuous red teaming – Detect emerging vulnerabilities through proactive monitoring.

This first part will cover the initial two steps of our four-step workflow: generating synthetic data and business annotation. The subsequent two steps—test execution automation and continuous red teaming—will be covered in the second part of this tutorial, which will be published soon.

We’ll illustrate these four steps using the Giskard LLM Evaluation Hub, the solution developed by Giskard for continuous testing of AI agents.

Step 1: Generation of Synthetic Data

Since generative AI models can face infinite test cases, automated test case generation is necessary. The goal is to create business-specific legitimate queries and adversarial queries to thoroughly test AI agent responses.

a. Legitimate Queries

Legitimate queries represent standard user inputs without malicious intent. Failures often indicate hallucinations or incorrect answers. Internal data like RAG knowledge bases can seed expected bot responses. A well-structured synthetic data process should be:

Exhaustive: Create diverse test cases by ensuring coverage of all documents and / or topics used by the bot.
Designed to trigger failures: Synthetic test cases should not be trivial queries, otherwise the chance that your tests fail becomes very low. Apply perturbation techniques to increase the likelihood of incorrect responses from the bot.
Automatable: Generate both queries and expected outputs for automatic comparison by the evaluation judge. This is essential for the LLM-as-a-judge setup.
Domain-specific: Synthetic test cases should not be generic queries; otherwise, they won’t be truly representative of real user queries. Include model metadata like bot descriptions to generate realistic, representative test cases.

As an illustration, the Giskard LLM Evaluation Hub provides an interface for the synthetic generation of legitimate queries with expected outputs. It automatically clusters the internal knowledge base into key topics and generates test cases for each topic by applying a set of perturbations.

Automatically generate domain-specific test cases of hallucinations by connecting your company’s internal knowledge bases

b. Adversarial Queries

Legitimate queries alone are not sufficient for synthetic test case generation. Some users may deliberately attack your bot using adversarial queries, often leveraging prompt injections. Toxic content generation can damage a company's reputation. Prompt injections can also be used for information disclosure, exposing sensitive data.

It's crucial to address these security flaws by generating both adversarial queries and their corresponding evaluation rules. Effective adversarial test generation should:

Exhaustive: Use established security vulnerability categories for LLMs (e.g., OWASP Top 10) to cover the most well-known issues.
Designed to trigger failures: Generate novel variations that bypass security patches (added by model providers). For example, for most prompt injection techniques (e.g., DAN), generating variants increases the likelihood of failures.
Automatable: Generate both adversarial queries and evaluation rules, so the evaluation judge verifies the bot's compliance automatically. This is essential for the LLM-as-a-judge setup.
Domain-specific: Include bot metadata to create realistic adversarial queries (e.g., including the bot's description in the generation process) and specific rules, increasing test effectiveness.

As an illustration, the Giskard Evaluation Hub offers customization to target security vulnerabilities such as:

Stereotypes & Discrimination
Harmful Content
Personal Information Disclosure
Off-topic Queries
Financial Advice
Medical Advice
Prompt Injection

Example: Personal Information Disclosure

The LLM Evaluation Hub detects and transforms a personal information disclosure vulnerability into a reproducible test case

Step 2: Business Annotation

Synthetic test case generation is the first step in the LLM-as-a-judge process: it enables the generation of queries, expected outputs, and rules so that test cases can be automatically evaluated by an LLM. However, synthetic generation alone is not sufficient:

Generated test cases often need refinements to expected outputs, domain-specific rules, and evaluation parameters. A human review is necessary to refine these test cases.
Security taxonomies and internal databases don't ensure complete coverage. Domain experts must create additional test cases based on unpredictable real-world usage.

To address these challenges, the LLM Evaluation Hub provides an Annotation Studio and a Red Teaming Playground.

1. Annotation Studio

The Annotation Studio provides an interface for reviewing and assigning evaluation criteria (checks) to conversations. This step is essential for implementing the LLM-as-a-judge approach, as it allows for the adaptation of expected outputs and rules for automatic evaluation. Since bots produce variable (non-deterministic) outputs, developing effective test cases requires iteratively generating multiple responses and refining evaluation criteria against these variations.

Iteratively design your test cases using a business-centric & interactive interface

As an illustration, various checks are available at Giskard:

Conformity Check: Given a rule or criterion, check whether the model answer complies with this rule. This can be used to check business specific behavior or constraints.
Groundedness Check: Check whether the model answer only contains facts that are included in a reference context. There might be omissions in the model answer compared to the context, but all information given by the model must be grounded in the context. The Groundedness check is useful to test for potential hallucinations in the model answer.
Correctness Check: Check whether the model answer completely agrees with the reference answer. This means that all information provided inside the model answer is found in the reference answer and the other way around. Compared to groundedness, correctness is stricter as the model answer must adhere completely to the reference answer without omission.
String Match: Checks if a specific keyword or sentence appears in the model's answer

2. Red Teaming Playground

Refining pre-generated test cases is often not enough. Sometimes, new test cases need to be crafted from scratch to better reflect the business and usage context of the bot. However, business users may face difficulties in:

Coming up with ideas for meaningful conversations.
Transforming the conversation into an LLM-as-a-judge test case (e.g., adding requirements, validation criteria, etc.).
Saving it in the right test database.

The Red Teaming Playground enables users to address these three challenges to manually craft test cases based on human interactions. This manual approach complements the automatic synthetic data generation.

Craft new test cases by interacting with your LLM app and send them to test dataset

Next steps

We've now explored the first two crucial phases of implementing an LLM as a judge for AI agent testing. By generating synthetic data that combines both legitimate and adversarial queries, we create a foundation for comprehensive testing. Through business annotation, we refine these test cases with domain expertise to ensure they're relevant and effective.

In the second part of this tutorial, we'll build upon this foundation by diving into Test Execution Automation and Continuous Red Teaming. These next phases will show you how to automate the evaluation process, interpret test results effectively, and establish ongoing monitoring to detect new vulnerabilities as they emerge.

Reach out to our team to discuss how the LLM Evaluation Hub can address your specific AI security challenges.

Integrate | Scan | Test | Automate

Giskard: Testing platform to secure LLM Agents

Get alerted of new vulnerabilities

Protect agaisnt AI risks

Identify security vulnerabilities & hallucination

Enable cross-team collaboration

GET STARTED

How to implement LLM as a Judge to test AI Agents? (Part 1)

Introduction

Without human guidance, generated test cases tend to be too generic and unrealistic.
LLM-based evaluations often produce false positives due to the high cost of aligning human judgment with LLM evaluations.
As the world evolves, the process of test case generation must be continuously refined to prevent model drift.

In this two-part tutorial, we outline a four-step workflow for implementing LLM as a judge evaluation for AI agent testing:

Generation of synthetic data – Automate test case generation with a focus on legitimate and adversarial queries.
Business annotation – Use domain knowledge to review and refine test cases through annotation tools.
Test execution automation – Run evaluations and set up alerts for detected vulnerabilities.
Continuous red teaming – Detect emerging vulnerabilities through proactive monitoring.

We’ll illustrate these four steps using the Giskard LLM Evaluation Hub, the solution developed by Giskard for continuous testing of AI agents.

Step 1: Generation of Synthetic Data

a. Legitimate Queries

Exhaustive: Create diverse test cases by ensuring coverage of all documents and / or topics used by the bot.
Designed to trigger failures: Synthetic test cases should not be trivial queries, otherwise the chance that your tests fail becomes very low. Apply perturbation techniques to increase the likelihood of incorrect responses from the bot.
Automatable: Generate both queries and expected outputs for automatic comparison by the evaluation judge. This is essential for the LLM-as-a-judge setup.
Domain-specific: Synthetic test cases should not be generic queries; otherwise, they won’t be truly representative of real user queries. Include model metadata like bot descriptions to generate realistic, representative test cases.

b. Adversarial Queries

It's crucial to address these security flaws by generating both adversarial queries and their corresponding evaluation rules. Effective adversarial test generation should:

Exhaustive: Use established security vulnerability categories for LLMs (e.g., OWASP Top 10) to cover the most well-known issues.
Designed to trigger failures: Generate novel variations that bypass security patches (added by model providers). For example, for most prompt injection techniques (e.g., DAN), generating variants increases the likelihood of failures.
Automatable: Generate both adversarial queries and evaluation rules, so the evaluation judge verifies the bot's compliance automatically. This is essential for the LLM-as-a-judge setup.
Domain-specific: Include bot metadata to create realistic adversarial queries (e.g., including the bot's description in the generation process) and specific rules, increasing test effectiveness.

As an illustration, the Giskard Evaluation Hub offers customization to target security vulnerabilities such as:

Stereotypes & Discrimination
Harmful Content
Personal Information Disclosure
Off-topic Queries
Financial Advice
Medical Advice
Prompt Injection

Example: Personal Information Disclosure

Step 2: Business Annotation

Generated test cases often need refinements to expected outputs, domain-specific rules, and evaluation parameters. A human review is necessary to refine these test cases.
Security taxonomies and internal databases don't ensure complete coverage. Domain experts must create additional test cases based on unpredictable real-world usage.

To address these challenges, the LLM Evaluation Hub provides an Annotation Studio and a Red Teaming Playground.

1. Annotation Studio

As an illustration, various checks are available at Giskard:

Conformity Check: Given a rule or criterion, check whether the model answer complies with this rule. This can be used to check business specific behavior or constraints.
Groundedness Check: Check whether the model answer only contains facts that are included in a reference context. There might be omissions in the model answer compared to the context, but all information given by the model must be grounded in the context. The Groundedness check is useful to test for potential hallucinations in the model answer.
Correctness Check: Check whether the model answer completely agrees with the reference answer. This means that all information provided inside the model answer is found in the reference answer and the other way around. Compared to groundedness, correctness is stricter as the model answer must adhere completely to the reference answer without omission.
String Match: Checks if a specific keyword or sentence appears in the model's answer

2. Red Teaming Playground

Coming up with ideas for meaningful conversations.
Transforming the conversation into an LLM-as-a-judge test case (e.g., adding requirements, validation criteria, etc.).
Saving it in the right test database.

Next steps

Reach out to our team to discuss how the LLM Evaluation Hub can address your specific AI security challenges.

Get Free Content

Download our guide and learn What the EU AI Act means for Generative AI Systems Providers.

You will also like

Increasing trust in foundation language models through multi-lingual security, safety and robustness testing

News

Giskard announces Phare, a new open & multi-lingual LLM Benchmark

During the Paris AI Summit, Giskard launches Phare, a new open & independent LLM benchmark to evaluate key AI security dimensions including hallucination, factual accuracy, bias, and potential for harm across several languages, with Google DeepMind as research partner. This initiative is meant to provide open measurements to assess trustworthiness of Generative AI models in real applications.

Matteo Dora

View post

News

New course with DeepLearningAI: Red Teaming LLM Applications

Our new course in collaboration with DeepLearningAI team provides training on red teaming techniques for Large Language Model (LLM) and chatbot applications. Through hands-on attacks using prompt injections, you'll learn how to identify vulnerabilities and security failures in LLM systems.

Blanca Rivera Campos

View post

News

DeepSeek R1: Complete analysis of capabilities and limitations

In this article, we provide a detailed analysis of DeepSeek R1, comparing its performance against leading AI models like GPT-4o and O1. Our testing reveals both impressive knowledge capabilities and significant concerns, particularly regarding the model's tendency to generate hallucinations. Through concrete examples, we examine how R1 handles politically sensitive topics.

Matteo Dora

View post