Introduction
In Part 1 of our tutorial, we explored the initial phases of implementing an LLM-based evaluation system for testing AI agents. We covered the challenges of testing generative AI applications and introduced a four-step workflow for implementing LLM as a judge.
The first part focused on:
- Generating synthetic data - Creating both legitimate and adversarial queries to test AI agent responses across diverse scenarios
- Business annotation - Refining test cases with domain knowledge through the Annotation Studio and Red Teaming Playground
Now in Part 2, we'll build on these foundations to complete the implementation process. We'll explore how to automate test execution, interpret results, and establish continuous monitoring through red teaming. These steps are crucial for maintaining robust AI agent security over time, especially as new vulnerabilities emerge due to changes in company content, evolving news, cybersecurity research advances, and model updates.
Let's dive into the next steps!
Step 3: Automate testing LLM agents
Once the test cases are written and stored in a single dataset (the golden dataset), it’s time to execute them and interpret the results.
Having the right interface to execute tests (UI or API) and analyze the results is crucial for applying test outcomes in different contexts:
- Development time: Compare model versions during development and identify the right correction strategies for developers.
- Deployment time: Perform non-regression testing in the CI/CD pipeline for DevOps.
- Production time: Provide high-level reporting for business executives to stay informed about key vulnerabilities in a running bot.
1. Running evaluations of AI agents: When and How?
Depending on the evaluation context, test execution can be manual (during development), triggered (enabling automated non-regression checks in CI/CD), or scheduled (at predefined intervals to monitor performance over time).
Executing a test case requires:
- A test dataset – This should contain all test cases (synthetic or manually crafted) with the proper requirements and expected outputs for an LLM judge to evaluate.
- A model version – The specific version of the AI model being tested.
- Tags (optional) – Used to trigger evaluations for specific subsets of test cases.
Depending on the evaluation scenario, different execution methods may be more suitable. For instance, programmatic evaluation allows seamless integration into various workflows, such as CI/CD pipelines.
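To make this concrete, here is a minimal sketch of a programmatic evaluation step that could run inside a CI/CD job. It assumes a hypothetical `run_agent` function for the model version under test and a `judge` function wrapping an LLM judge; neither is part of a specific product API, and the golden dataset is assumed to be a simple JSON file of test cases with requirements and tags.

```python
import json
import sys

# Hypothetical helpers: call the agent version under test and an LLM judge.
# Replace these with your own client code (e.g., your agent's API and a
# judge prompt that returns PASS/FAIL against the stated requirement).
from my_eval_helpers import run_agent, judge  # assumed, not a real package

def evaluate(dataset_path: str, model_version: str, tags: set[str] | None = None) -> int:
    """Run every test case (optionally filtered by tag) and return the number of failures."""
    with open(dataset_path) as f:
        test_cases = json.load(f)  # [{"query": ..., "requirement": ..., "tags": [...]}]

    failures = 0
    for case in test_cases:
        if tags and not tags.intersection(case.get("tags", [])):
            continue  # only run the requested subset
        answer = run_agent(case["query"], model_version=model_version)
        verdict = judge(answer=answer, requirement=case["requirement"])
        if verdict != "PASS":
            failures += 1
            print(f"FAIL [{','.join(case.get('tags', []))}] {case['query'][:60]}...")
    return failures

if __name__ == "__main__":
    # A non-zero exit code makes the CI/CD pipeline fail on regressions.
    n_failures = evaluate("golden_dataset.json", model_version="agent-v2")
    sys.exit(1 if n_failures > 0 else 0)
```

In a triggered setup, this script can run on every merge request; for scheduled monitoring, the same entry point can be invoked by a cron job.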

2. Interpreting the test results
The dashboard used to interpret test results depends on the context of use.
During development, the priority is to diagnose issues and apply corrections that improve the bot’s performance.
- Failure rate per check: Identifying the checks with the highest failure rate makes it easier to apply targeted corrections. For example, if you created a custom check to verify whether the bot starts with “I’m sorry,” it is useful to know how many conversations fail this requirement. If the failure rate is high, you can develop mitigation strategies such as prompt engineering, implementing guardrails, or using routers to address the issue.
- Failure rate per tag: Measuring failure rates across different vulnerability categories (e.g., hallucination, prompt injection) helps prioritize mitigation strategies for the AI agent.
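As a rough illustration of these two views, the snippet below aggregates raw evaluation results into failure rates per check and per tag. It assumes results are available as a flat list of records, one per test case and check, which is not tied to any particular tool’s export format.

```python
from collections import defaultdict

# Assumed result format: one record per (test case, check) pair.
results = [
    {"check": "starts_with_apology", "tags": ["conformity"], "passed": False},
    {"check": "groundedness",        "tags": ["hallucination"], "passed": True},
    {"check": "no_prompt_leak",      "tags": ["prompt_injection"], "passed": False},
    # ... loaded from your evaluation run
]

def failure_rates(records, key):
    """Compute the failure rate grouped by a record field ("check") or by tag."""
    totals, failed = defaultdict(int), defaultdict(int)
    for r in records:
        groups = r["tags"] if key == "tags" else [r[key]]
        for g in groups:
            totals[g] += 1
            failed[g] += not r["passed"]
    return {g: failed[g] / totals[g] for g in totals}

print("Failure rate per check:", failure_rates(results, "check"))
print("Failure rate per tag:  ", failure_rates(results, "tags"))
```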
As an illustration, the LLM Evaluation Hub provides overall success metrics, which can be further decomposed into:
- Correctness
- Conformity
- Groundedness
- Custom checks (e.g., scam warnings, discrimination detection)
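For custom checks like the scam-warning example, a check can be as simple as a rule or a small LLM-judge prompt. The sketch below shows a rule-based variant; the keyword list and function name are illustrative assumptions, not a prescribed implementation.

```python
def scam_warning_present(answer: str) -> bool:
    """Custom check: the agent should warn the user when a scam pattern is discussed.

    A keyword heuristic is shown for simplicity; in practice this could be
    replaced by an LLM-judge prompt that evaluates the requirement directly.
    """
    warning_markers = ("scam", "fraud", "phishing", "do not share your password")
    return any(marker in answer.lower() for marker in warning_markers)

# Example usage against one agent response:
answer = "This looks like a phishing attempt - do not share your password."
print("Custom check passed:", scam_warning_present(answer))
```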

At production time, it is important to provide regular reports to business executives to identify key vulnerabilities in the bot. Test results may vary with each execution due to the stochastic nature of the bot. Therefore, it is crucial to run tests regularly (e.g., once a week) to monitor for new vulnerabilities over time. Regular reports can be sent via email to notify teams of newly detected vulnerabilities.
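A minimal way to implement such a recurring report is to schedule the evaluation script (e.g., with cron or a CI scheduler) and email a summary when it finishes. The sketch below uses Python’s standard smtplib; the SMTP host, addresses, and the `evaluate` helper from the earlier sketch are assumptions.

```python
import smtplib
from email.message import EmailMessage

def send_report(failure_rate: float, recipients: list[str]) -> None:
    """Email a short summary of the latest scheduled evaluation run."""
    msg = EmailMessage()
    msg["Subject"] = f"Weekly AI agent evaluation: {failure_rate:.1%} of checks failing"
    msg["From"] = "ai-quality@example.com"      # assumed sender address
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        "Failure rates per check and per tag are available in the evaluation dashboard.\n"
        "Newly detected vulnerabilities should be triaged and added to the golden dataset."
    )
    with smtplib.SMTP("smtp.example.com") as server:  # assumed SMTP host
        server.send_message(msg)

# Typically invoked by a weekly cron job after the evaluation run, e.g.:
# 0 7 * * 1  python run_weekly_eval.py   (assumed script name)
```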

Step 4: Continuous AI Red Teaming
Once your test cases are generated, refined with business knowledge, and automatically executed, it is essential to maintain them over time. As AI applications interact with real-world data, new vulnerabilities emerge, and your test dataset may miss critical test cases. New vulnerabilities can arise when:
- Company content changes: Updates to the RAG knowledge base or modifications to the company’s products.
- News evolves: Events not included in the foundational model’s training data (e.g., the 2024 Olympic Games, a new CEO appointment, U.S. elections, etc.).
- Cybersecurity research advances: Newly discovered prompt injections or other vulnerabilities identified by the scientific community.
- New model versions are introduced: Changes in prompts, updates to foundational models, or modifications in AI behavior.
The Giskard LLM Evaluation Hub performs continuous red teaming by automatically enriching test cases with:
- Internal data (e.g., RAG knowledge base)
- External data (e.g., social media, news articles)
- Security research
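One simple building block for this kind of enrichment is detecting which knowledge-base documents have changed since test cases were last generated, so new test cases are produced only for new or updated content. The sketch below hashes document contents and compares them against a stored snapshot; the paths and file names are illustrative assumptions.

```python
import hashlib
import json
from pathlib import Path

SNAPSHOT = Path("kb_snapshot.json")  # hashes saved at the last test-generation run (assumed file)

def changed_documents(kb_dir: str) -> list[str]:
    """Return knowledge-base files that are new or modified since the last snapshot."""
    previous = json.loads(SNAPSHOT.read_text()) if SNAPSHOT.exists() else {}
    current, changed = {}, []
    for doc in Path(kb_dir).glob("**/*.md"):
        digest = hashlib.sha256(doc.read_bytes()).hexdigest()
        current[str(doc)] = digest
        if previous.get(str(doc)) != digest:
            changed.append(str(doc))
    SNAPSHOT.write_text(json.dumps(current, indent=2))
    return changed

# Documents returned here can be fed back into synthetic test-case generation
# (Step 1 in Part 1) so the golden dataset keeps up with the knowledge base.
print(changed_documents("rag_knowledge_base/"))
```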

Combining email alerts with continuous red teaming allows you to be promptly notified when new vulnerabilities emerge within your AI agents.
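To alert only on newly emerging vulnerabilities, rather than re-reporting known ones, each run can be diffed against the previous run’s failures before sending an email. A minimal sketch, assuming failures are identified by a stable test-case ID:

```python
import json
from pathlib import Path

PREVIOUS = Path("previous_failures.json")  # failing test-case IDs from the last run (assumed file)

def new_failures(current_failing_ids: set[str]) -> set[str]:
    """Return failures absent from the previous run, then update the baseline."""
    previous = set(json.loads(PREVIOUS.read_text())) if PREVIOUS.exists() else set()
    PREVIOUS.write_text(json.dumps(sorted(current_failing_ids)))
    return current_failing_ids - previous

fresh = new_failures({"tc-101", "tc-204", "tc-317"})
if fresh:
    # Reuse the send_report helper from the earlier sketch, or your own alerting channel.
    print("New vulnerabilities detected:", sorted(fresh))
```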
Conclusion
Implementing LLM as a judge for AI agent testing provides organizations with automated, systematic evaluation that identifies vulnerabilities before they reach production. This approach significantly reduces hallucinations, ensures compliance with business requirements, and creates a foundation for continuous security improvement.
The LLM Evaluation Hub provides a systematic approach to:
- Generate business-specific test cases.
- Use annotation tools to refine and validate responses.
- Automate evaluations and receive alerts on failures.
- Continuously monitor and red team AI agents to detect new threats.
Reach out to our team to discuss how this approach can address your specific AI security challenges.