News

April 30, 2025

5 min read

Good answers are not necessarily factual answers: an analysis of hallucination in leading LLMs

We're sharing the first results from Phare, our multilingual benchmark for evaluating language models. The benchmark research reveals leading LLMs confidently produce factually inaccurate information. Our evaluation of top models from eight AI labs shows they generate authoritative-sounding responses containing completely fabricated details, particularly when handling misinformation.

Phare LLM Benchmark - an analysis of hallucination in leading LLMs

Matteo Dora

Phare LLM Benchmark - an analysis of hallucination in leading LLMs

In February, we announced our work on Phare (Potential Harm Assessment & Risk Evaluation), a comprehensive multilingual benchmark designed to evaluate the safety and security of leading LLMs across four critical domains: hallucination, bias & fairness, harmfulness, and vulnerability to intentional abuse through techniques like jailbreaking.

In the coming weeks, we will share in-depth analysis for each of these categories. Today, we start with hallucination—a challenge with serious implications for production applications. In our recent RealHarm study, we reviewed all documented incidents affecting LLM applications and found that hallucination issues accounted for more than one-third of all reviewed incidents in deployed LLM applications. This finding underscores the practical relevance of understanding and mitigating hallucination risks.

What makes hallucination particularly concerning is its deceptive nature: responses that sound authoritative can completely mislead users who lack the expertise to identify factual errors. As organizations increasingly deploy LLMs in critical workflows, understanding these limitations becomes an essential risk management consideration.

In this first post, we'll explore our base methodology and discuss three critical aspects of hallucinations revealed by the Phare benchmark: how hallucination can manifest, which are the factors that influence the tendency to hallucinate, and which models are most susceptible.

Methodology

The Phare benchmark implements a systematic evaluation process to ensure consistent and fair assessment across language models:

Source gathering: We collect language-specific content and seed prompts that reflect authentic usage patterns of LLMs (currently in English, French, and Spanish).
Sample generation: We transform the source materials into evaluation test cases that comprise both the test prompt (question or multi-turn scenario) that will be presented to the language model and specific evaluation criteria depending on the task.
Human review: All samples undergo human annotation and quality verification to ensure accuracy and relevancy for the evaluation.
Model evaluation: We let the language models answer our test scenario and then score their responses against the defined criteria.

Figure 1: Phare LLM Benchmark test generation process.

The hallucination module

The hallucination module evaluates models across multiple task categories designed to capture different ways models may generate misleading or false information. The assessment framework currently includes four of these tasks: factual accuracy, misinformation resistance, debunking capabilities, and tool reliability.

Factuality is tested through structured question-answering tasks, measuring how precisely models can retrieve and communicate established information.

Figure 2: Factuality test showing model confidently providing incorrect information.

Misinformation resistance examines models' capabilities to correctly refute ambiguous or ill-posed questions rather than fabricating narratives that support them.

Figure 3: Misinformation test showing how models can fabricate detailed, authoritative-sounding responses to questions containing false premises. The model creates a fictional controversy involving the U.S. Fish and Wildlife Service rather than identifying and correcting the false assumption in the question.

Debunking tests whether the models can identify and debunk pseudoscientific claims, conspiracy theories, or urban legends, rather than reinforcing or amplifying them.

Figure 4: An example of an LLM answer supporting an urban legend about Phil Collins’ song “In the Air Tonight”.

Tool reliability measures how well LLMs can leverage external functions (like APIs or databases) to perform their tasks accurately. In particular, we assess how LLMs can interface with tools under non-ideal conditions, such as partial information, misleading contexts, or ambiguous queries. For example, when a tool normally requires a person's first name, surname, and age, we simulate a user request that only provides the name and surname and check how the model responds—whether it asks for the missing age, or proceeds by fabricating a fake value for it. This approach provides a more realistic measure of how models perform when facing the types of imperfect inputs they encounter in actual deployments.

Key findings

Figure 5: Hallucination resistance scores (higher is better)

1. Model popularity doesn't guarantee factual reliability

Our research reveals a concerning disconnect between user preference and hallucination resistance. Models ranking highest in popular benchmarks like LMArena—which primarily measure user preference and satisfaction—are not necessarily the most resistant to hallucination. Optimization for user experience can sometimes come at the expense of factual accuracy.

Consider the following example where a model produces an eloquent, authoritative response that would likely score highly on user preference metrics, despite containing entirely fabricated information:

Figure 6: This response demonstrates how models can generate responses that appear comprehensive and authoritative while containing completely fabricated information. The question refers to a non-existent agreement between France and Italy, yet the model confidently elaborates on specific details rather than correcting the false premise.

Models optimized primarily for user satisfaction consistently provide information that sounds plausible and authoritative despite questionable or nonexistent factual bases. Users without domain expertise cannot detect these inaccuracies, making these hallucinations particularly problematic in real-world applications.

2. Question framing significantly influences debunking effectiveness

Our evaluation reveals a direct relationship between the perceived confidence or authority in a user's query and the model's willingness to refute controversial claims. This phenomenon is known as "sycophancy".

Our tests reveal that when users present controversial claims with high confidence or cite perceived authorities, most models are significantly less likely to debunk these claims. Presenting claims in a highly confident manner (e.g. “I’m 100% sure that …” or “My teacher told me that …”) can cause debunking performances to drop by up to 15% with respect to a neutral framing (e.g. “I’ve heard that …”).

The sycophancy effect could be a byproduct of RLHF training processes that encourage models to be agreeable and helpful to users. This creates a tension between accuracy and alignment with user expectations, particularly when those expectations include false premises.

On a positive note, some models show resistance to sycophancy (Anthropic models and Meta’s Llama in their largest versions), suggesting that it is possible to tackle the issue at the model training level.

Figure 7: Comparative model performance charts on hallucination resistance and debunking capabilities. The left chart shows model accuracy in debunking controversial claims under different user tones (unsure to very confident). The right chart illustrates models' resistance to hallucination with different system instructions (neutral vs concise answer).

3. System instructions dramatically impact hallucination rates

Our data shows that simple changes to system instructions dramatically influence a model's tendency to hallucinate. Instructions emphasizing conciseness (e.g. “answer this question briefly”) specifically degraded factual reliability across most models tested. In the most extreme cases, this resulted in a 20% drop in hallucination resistance.

This effect seems to occur because effective rebuttals generally require longer explanations. When forced to be concise, models face an impossible choice between fabricating short but inaccurate answers or appearing unhelpful by rejecting the question entirely. Our data shows models consistently prioritize brevity over accuracy when given these constraints.

This finding has important implications for deployment, as many applications prioritize concise outputs to reduce token usage, improve latency, and minimize costs. Our research suggests that such optimization should be thoroughly tested against the increased risk of factual errors.

Conclusion

The Phare benchmark reveals some eye-opening patterns about hallucination in LLMs. Your favorite model might be great at giving you answers you like—but that doesn't mean those answers are true. Our testing shows that models ranking highest in user satisfaction often produce responses that sound authoritative but contain fabricated information.

The way questions are framed dramatically affects what models say back. They're surprisingly susceptible to the confidence in the user's tone. When information is presented tentatively ("I heard that..."), the model might correct it. Present the same false information confidently ("My teacher told me..."), and suddenly the model is much more likely to go along with it.

Perhaps most importantly for developers, seemingly innocent system prompts like "be concise" can sabotage a model's ability to debunk misinformation. When forced to keep it short, models consistently choose brevity over accuracy—they simply don't have the space to acknowledge the false premise, explain the error, and provide accurate information.

In the coming weeks, we'll share additional findings from our Bias & Fairness and Harmfulness modules as we continue developing comprehensive evaluation frameworks for safer, more reliable AI systems.

We invite you to explore the complete benchmark results at phare.giskard.ai. For organizations interested in contributing to the Phare initiative or testing their own models, please reach out to the Phare research team at phare@giskard.ai.

Phare is a project developed by Giskard with Google DeepMind, the European Union, and Bpifrance as research and funding partners.

Integrate | Scan | Test | Automate

Giskard: Testing platform to secure LLM Agents

Get alerted of new vulnerabilities

Protect against AI risks

Identify security vulnerabilities & hallucination

Enable cross-team collaboration

GET STARTED

Good answers are not necessarily factual answers: an analysis of hallucination in leading LLMs

Methodology

The Phare benchmark implements a systematic evaluation process to ensure consistent and fair assessment across language models:

Source gathering: We collect language-specific content and seed prompts that reflect authentic usage patterns of LLMs (currently in English, French, and Spanish).
Sample generation: We transform the source materials into evaluation test cases that comprise both the test prompt (question or multi-turn scenario) that will be presented to the language model and specific evaluation criteria depending on the task.
Human review: All samples undergo human annotation and quality verification to ensure accuracy and relevancy for the evaluation.
Model evaluation: We let the language models answer our test scenario and then score their responses against the defined criteria.

The hallucination module

Factuality is tested through structured question-answering tasks, measuring how precisely models can retrieve and communicate established information.

Misinformation resistance examines models' capabilities to correctly refute ambiguous or ill-posed questions rather than fabricating narratives that support them.

Debunking tests whether the models can identify and debunk pseudoscientific claims, conspiracy theories, or urban legends, rather than reinforcing or amplifying them.

Key findings

1. Model popularity doesn't guarantee factual reliability

2. Question framing significantly influences debunking effectiveness

3. System instructions dramatically impact hallucination rates

Conclusion

Phare is a project developed by Giskard with Google DeepMind, the European Union, and Bpifrance as research and funding partners.

Get Free Content

Download our guide and learn What the EU AI Act means for Generative AI Systems Providers.

You will also like

Increasing trust in foundation language models through multi-lingual security, safety and robustness testing

News

Giskard announces Phare, a new open & multi-lingual LLM Benchmark

During the Paris AI Summit, Giskard launches Phare, a new open & independent LLM benchmark to evaluate key AI security dimensions including hallucination, factual accuracy, bias, and potential for harm across several languages, with Google DeepMind as research partner. This initiative is meant to provide open measurements to assess trustworthiness of Generative AI models in real applications.

Matteo Dora

View post

Implementing LLM as a Judge to test AI agents

Tutorials

How to implement LLM as a Judge to test AI Agents? (Part 1)

Testing AI agents effectively requires automated systems that can evaluate responses across several scenarios. In this first part of our tutorial, we introduce a systematic approach using LLM as a judge to detect hallucinations and security vulnerabilities before deployment. Learn how to generate synthetic test data and implement business annotation processes for exhaustive AI agent testing.

Jean-Marie John-Mathews, Ph.D.

View post

News

DeepSeek R1: Complete analysis of capabilities and limitations

In this article, we provide a detailed analysis of DeepSeek R1, comparing its performance against leading AI models like GPT-4o and O1. Our testing reveals both impressive knowledge capabilities and significant concerns, particularly regarding the model's tendency to generate hallucinations. Through concrete examples, we examine how R1 handles politically sensitive topics.

Matteo Dora

View post