In February, we announced our work on Phare (Potential Harm Assessment & Risk Evaluation), a comprehensive multilingual benchmark designed to evaluate the safety and security of leading LLMs across four critical domains: hallucination, bias & fairness, harmfulness, and vulnerability to intentional abuse through techniques like jailbreaking.
In the coming weeks, we will share in-depth analysis for each of these categories. Today, we start with hallucination—a challenge with serious implications for production applications. In our recent RealHarm study, we reviewed all documented incidents affecting LLM applications and found that hallucination issues accounted for more than one-third of all reviewed incidents in deployed LLM applications. This finding underscores the practical relevance of understanding and mitigating hallucination risks.
What makes hallucination particularly concerning is its deceptive nature: responses that sound authoritative can completely mislead users who lack the expertise to identify factual errors. As organizations increasingly deploy LLMs in critical workflows, understanding these limitations becomes an essential risk management consideration.
In this first post, we'll explore our base methodology and discuss three critical aspects of hallucinations revealed by the Phare benchmark: how hallucination can manifest, which are the factors that influence the tendency to hallucinate, and which models are most susceptible.
Methodology
The Phare benchmark implements a systematic evaluation process to ensure consistent and fair assessment across language models:
- Source gathering: We collect language-specific content and seed prompts that reflect authentic usage patterns of LLMs (currently in English, French, and Spanish).
- Sample generation: We transform the source materials into evaluation test cases that comprise both the test prompt (question or multi-turn scenario) that will be presented to the language model and specific evaluation criteria depending on the task.
- Human review: All samples undergo human annotation and quality verification to ensure accuracy and relevancy for the evaluation.
- Model evaluation: We let the language models answer our test scenario and then score their responses against the defined criteria.

The hallucination module
The hallucination module evaluates models across multiple task categories designed to capture different ways models may generate misleading or false information. The assessment framework currently includes four of these tasks: factual accuracy, misinformation resistance, debunking capabilities, and tool reliability.
Factuality is tested through structured question-answering tasks, measuring how precisely models can retrieve and communicate established information.

Misinformation resistance examines models' capabilities to correctly refute ambiguous or ill-posed questions rather than fabricating narratives that support them.

Debunking tests whether the models can identify and debunk pseudoscientific claims, conspiracy theories, or urban legends, rather than reinforcing or amplifying them.

Tool reliability measures how well LLMs can leverage external functions (like APIs or databases) to perform their tasks accurately. In particular, we assess how LLMs can interface with tools under non-ideal conditions, such as partial information, misleading contexts, or ambiguous queries. For example, when a tool normally requires a person's first name, surname, and age, we simulate a user request that only provides the name and surname and check how the model responds—whether it asks for the missing age, or proceeds by fabricating a fake value for it. This approach provides a more realistic measure of how models perform when facing the types of imperfect inputs they encounter in actual deployments.
Key findings

1. Model popularity doesn't guarantee factual reliability
Our research reveals a concerning disconnect between user preference and hallucination resistance. Models ranking highest in popular benchmarks like LMArena—which primarily measure user preference and satisfaction—are not necessarily the most resistant to hallucination. Optimization for user experience can sometimes come at the expense of factual accuracy.
Consider the following example where a model produces an eloquent, authoritative response that would likely score highly on user preference metrics, despite containing entirely fabricated information:

Models optimized primarily for user satisfaction consistently provide information that sounds plausible and authoritative despite questionable or nonexistent factual bases. Users without domain expertise cannot detect these inaccuracies, making these hallucinations particularly problematic in real-world applications.
2. Question framing significantly influences debunking effectiveness
Our evaluation reveals a direct relationship between the perceived confidence or authority in a user's query and the model's willingness to refute controversial claims. This phenomenon is known as "sycophancy".
Our tests reveal that when users present controversial claims with high confidence or cite perceived authorities, most models are significantly less likely to debunk these claims. Presenting claims in a highly confident manner (e.g. “I’m 100% sure that …” or “My teacher told me that …”) can cause debunking performances to drop by up to 15% with respect to a neutral framing (e.g. “I’ve heard that …”).
The sycophancy effect could be a byproduct of RLHF training processes that encourage models to be agreeable and helpful to users. This creates a tension between accuracy and alignment with user expectations, particularly when those expectations include false premises.
On a positive note, some models show resistance to sycophancy (Anthropic models and Meta’s Llama in their largest versions), suggesting that it is possible to tackle the issue at the model training level.

3. System instructions dramatically impact hallucination rates
Our data shows that simple changes to system instructions dramatically influence a model's tendency to hallucinate. Instructions emphasizing conciseness (e.g. “answer this question briefly”) specifically degraded factual reliability across most models tested. In the most extreme cases, this resulted in a 20% drop in hallucination resistance.
This effect seems to occur because effective rebuttals generally require longer explanations. When forced to be concise, models face an impossible choice between fabricating short but inaccurate answers or appearing unhelpful by rejecting the question entirely. Our data shows models consistently prioritize brevity over accuracy when given these constraints.
This finding has important implications for deployment, as many applications prioritize concise outputs to reduce token usage, improve latency, and minimize costs. Our research suggests that such optimization should be thoroughly tested against the increased risk of factual errors.
Conclusion
The Phare benchmark reveals some eye-opening patterns about hallucination in LLMs. Your favorite model might be great at giving you answers you like—but that doesn't mean those answers are true. Our testing shows that models ranking highest in user satisfaction often produce responses that sound authoritative but contain fabricated information.
The way questions are framed dramatically affects what models say back. They're surprisingly susceptible to the confidence in the user's tone. When information is presented tentatively ("I heard that..."), the model might correct it. Present the same false information confidently ("My teacher told me..."), and suddenly the model is much more likely to go along with it.
Perhaps most importantly for developers, seemingly innocent system prompts like "be concise" can sabotage a model's ability to debunk misinformation. When forced to keep it short, models consistently choose brevity over accuracy—they simply don't have the space to acknowledge the false premise, explain the error, and provide accurate information.
In the coming weeks, we'll share additional findings from our Bias & Fairness and Harmfulness modules as we continue developing comprehensive evaluation frameworks for safer, more reliable AI systems.
We invite you to explore the complete benchmark results at phare.giskard.ai. For organizations interested in contributing to the Phare initiative or testing their own models, please reach out to the Phare research team at phare@giskard.ai.
Phare is a project developed by Giskard with Google DeepMind, the European Union, and Bpifrance as research and funding partners.