
All Knowledge

Articles, tutorials & news on AI Quality, Security & Compliance

Recent content

News

Giskard announces Phare, a new open & multi-lingual LLM Benchmark

At the Paris AI Summit, Giskard launches Phare, a new open and independent LLM benchmark that evaluates key AI security dimensions, including hallucination, factual accuracy, bias, and potential for harm, across several languages, with Google DeepMind as research partner. The initiative aims to provide open measurements of the trustworthiness of generative AI models in real applications.

Matteo Dora - Machine Learning Researcher
News

DeepSeek R1: Complete analysis of capabilities and limitations

In this article, we provide a detailed analysis of DeepSeek R1, comparing its performance against leading AI models like GPT-4o and o1. Our testing reveals both impressive knowledge capabilities and significant concerns, particularly the model's tendency to hallucinate. Through concrete examples, we examine how R1 handles politically sensitive topics.

Matteo Dora - Machine Learning Researcher
News

[Release notes] Giskard integrates with LiteLLM: Simplifying LLM agent testing across foundation models

Giskard's integration with LiteLLM lets developers test their LLM agents across multiple foundation models. The integration extends Giskard's core features, LLM Scan for vulnerability assessment and RAGET for RAG evaluation, to any supported LLM provider: major cloud providers like OpenAI and Anthropic, local deployments through Ollama, or open-source models like Mistral. A minimal configuration sketch follows.
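As a rough illustration of what this looks like in practice, the sketch below points Giskard's evaluation LLM at LiteLLM-style model strings and runs LLM Scan on a wrapped agent. The model identifiers shown are examples that depend on your provider setup, and `my_agent` is a hypothetical stand-in for the agent under test.

```python
import giskard

# Point Giskard's evaluation LLM at any LiteLLM-supported provider.
# These model strings are illustrative; pick one matching your setup.
giskard.llm.set_llm_model("gpt-4o-mini")                     # OpenAI (API key from env)
# giskard.llm.set_llm_model("ollama/llama3")                 # local deployment via Ollama
# giskard.llm.set_llm_model("mistral/mistral-large-latest")  # Mistral API

# Wrap the agent under test. `my_agent` is a hypothetical function
# that takes a question string and returns the agent's answer.
def predict(df):
    return [my_agent(q) for q in df["question"]]

model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="Support agent",
    description="Answers customer questions about billing and accounts.",
    feature_names=["question"],
)

# Run LLM Scan to probe the agent for vulnerabilities
# such as prompt injection, hallucination, and harmful output.
report = giskard.scan(model)
```

Because the provider is set globally, swapping the foundation model behind a scan is a one-line change rather than a rewrite of the test harness.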

Blanca Rivera Campos