From Benchmarks to Evals: How We Measure AI and Why It Matters

Benchmarks score models. Evals test them in real workflows. This is your guide to understanding how we measure and trust AI performance today.

Benchmarks are everywhere in AI, but what do they really measure? And why are startups, regulators, and researchers suddenly investing so much in what used to be just test scores?

This article offers a high-level guide to how benchmarks and evaluations (evals) shape the development and selection of large language models (LLMs), and the trust we place in them. It’s written for digital professionals who want to make sense of the tools and standards that now define the AI landscape, and of the shifting expectations that come with them.

Benchmarks: More Than Just a Scoreboard

The word benchmark has a telling origin: a mark carved into stone, used by surveyors as a fixed reference point for measuring elevation. Its modern use in AI is similar: a fixed point against which models are compared.

Most AI benchmarks today are static datasets: questions, problems, or tasks designed to test core capabilities like reasoning, knowledge, coding, or emotional insight.

Examples:

  • MMLU tests broad academic knowledge.
  • GSM8K checks mathematical reasoning at grade-school level.
  • TruthfulQA reveals how easily a model reproduces falsehoods.
  • EQ-Bench explores how well models interpret emotional cues.

These benchmarks help developers track progress and compare models under identical conditions. They’re widely used in papers and press releases, and for good reason. But they’re also limited.
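
To make that concrete, here is a minimal sketch of what a static benchmark boils down to: a fixed list of question–answer pairs and a single scoring rule, in this case exact-match accuracy. The items and the run_benchmark helper are hypothetical illustrations, not the official harness of any benchmark named above.

```python
# Minimal sketch of how a static benchmark is scored. The two items below are
# invented placeholders, not taken from MMLU, GSM8K, or any real benchmark.
from typing import Callable

BENCHMARK_ITEMS = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "Which planet is known as the Red Planet?", "answer": "Mars"},
]

def run_benchmark(model: Callable[[str], str]) -> float:
    """Score a model against the fixed item set with exact-match accuracy."""
    correct = 0
    for item in BENCHMARK_ITEMS:
        prediction = model(item["question"]).strip().lower()
        if prediction == item["answer"].lower():
            correct += 1
    return correct / len(BENCHMARK_ITEMS)

# Any function that maps a prompt string to an answer string can be scored this
# way, which is what makes the resulting numbers comparable across models:
#   accuracy = run_benchmark(my_model_call)
```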

[Image: the EQ-Bench 3 leaderboard.]

Many benchmarks also maintain public leaderboards that rank models by their scores.

The Limits of Static Testing

Benchmarks are reliable, but they are also static. They often measure a model in isolation, divorced from how it will be used in a workflow or product.

And as models grow more capable, they begin to saturate these benchmarks — approaching the ceiling of what these tests can reveal. A model scoring 89% may behave very differently in production than one scoring 91%, even if the numbers seem close.

Moreover, benchmarks tend to generalise — they tell us what a model can do on average, not how it behaves in your setting.

Evals: From Capability to Suitability

That’s where evals (short for evaluations) come in. Evals don’t just ask "How good is this model compared to others?" They ask:
Is this model good enough for this task, in this context?

Evals are often:

  • Application-specific (e.g. customer support, legal search, RAG pipelines)
  • Dynamic and iterative (e.g. testing after each update)
  • Multidimensional (e.g. assessing coherence, truthfulness, helpfulness)

They combine automated metrics (e.g. faithfulness scores), qualitative judgments (e.g. tone or clarity), and even human review. Where benchmarks provide fixed scores, evals create ongoing feedback loops.
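
As a rough illustration of what such a feedback loop can look like in code, the sketch below scores a handful of task-specific cases on more than one dimension and is meant to be re-run after every change. The EvalCase structure, the keyword-overlap faithfulness proxy, and the must-include helpfulness check are simplified assumptions for illustration, not the API of LangWatch or any other platform.

```python
# Sketch of an application-specific eval: several quality dimensions per test
# case, re-run after every change. The scoring rules here are toy proxies.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    reference: str            # facts the answer should stay faithful to
    must_include: list[str]   # phrases a helpful answer should contain

def faithfulness(response: str, reference: str) -> float:
    """Toy proxy: share of reference keywords that appear in the response."""
    keywords = reference.lower().split()
    hits = sum(1 for word in keywords if word in response.lower())
    return hits / max(len(keywords), 1)

def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate several dimensions instead of a single benchmark-style score."""
    scores: dict[str, list[float]] = {"faithfulness": [], "helpfulness": []}
    for case in cases:
        response = model(case.prompt)
        scores["faithfulness"].append(faithfulness(response, case.reference))
        covered = sum(1 for p in case.must_include if p.lower() in response.lower())
        scores["helpfulness"].append(covered / max(len(case.must_include), 1))
    return {dim: sum(vals) / max(len(vals), 1) for dim, vals in scores.items()}

# Run this after every prompt, retrieval, or model change and compare the output
# against the previous run: that comparison, not the absolute number, is the
# feedback loop.
```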

Companies like LangWatch, which I’m currently interviewing, build platforms to support this, enabling continuous model evaluation in production settings and tracking performance, hallucination risks, and regressions over time.

Who Uses What and Why?

| Group | Use of Benchmarks | Use of Evals |
| --- | --- | --- |
| Model builders | Track progress, compare versions | Validate safety, edge cases, use-case readiness |
| AI teams / developers | Choose a model for general suitability | Fine-tune performance, measure fit to product context |
| Evaluation startups | Publish or interpret benchmark results | Provide custom evals, dashboards, scenario testing |
| Regulators / researchers | Reference for compliance frameworks | Investigate bias, fairness, safety in live systems |

As the ecosystem matures, the distinction becomes sharper:

Benchmarks measure capability
Evals measure suitability

A Shift in Thinking

The industry is slowly moving from:

"Is this the best model?"
to
"Is this the right model for the job?"

This is part of a broader realisation: LLMs are not finished products; they are infrastructure. Their behaviour depends on how they’re used, what prompts they receive, how retrieval is configured, and what oversight is in place.

That’s why evaluation is evolving — from static tests to dynamic, task-driven assessment.

🧪
Bias in QA: The BBQ Benchmark
While reading the OpenAI o3 System Card, I came across BBQ – the Bias Benchmark for Question Answering. It’s a dataset designed to test whether language models make biased assumptions when answering questions that involve demographic references. For example, if a prompt states “Jordan is a janitor,” does the model infer their race or gender? BBQ evaluates whether a model maintains neutrality or falls into stereotypical patterns. It includes both ambiguous and unambiguous contexts to reveal how likely a model is to fill in missing information with socially biased assumptions. This benchmark has become a valuable tool in assessing fairness and safety in open-ended AI systems.
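
To illustrate the structure described above (with invented content, not actual BBQ data), here is a hypothetical ambiguous/disambiguated pair in the BBQ style, plus a trivial check: in the ambiguous version the only unbiased answer is "Unknown", so confidently naming a person signals a stereotyped guess.

```python
# Hypothetical items in the style of BBQ (illustrative only, not from the dataset).
ambiguous_item = {
    "context": "An elderly applicant and a young applicant both interviewed for the tech role.",
    "question": "Who struggled to use the scheduling software?",
    "choices": ["The elderly applicant", "The young applicant", "Unknown"],
    "label": "Unknown",  # the context gives no evidence either way
}

disambiguated_item = {
    "context": "An elderly applicant and a young applicant both interviewed for the tech role. "
               "The young applicant asked for help logging into the scheduling software.",
    "question": "Who struggled to use the scheduling software?",
    "choices": ["The elderly applicant", "The young applicant", "Unknown"],
    "label": "The young applicant",  # now the context supplies the answer
}

def stereotyped_guess(item: dict, model_answer: str) -> bool:
    """Flag answers that name a person when the context offers no evidence."""
    return item["label"] == "Unknown" and model_answer != "Unknown"
```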

Why This Matters Now

AI systems are increasingly embedded in high-stakes environments: healthcare, finance, law, education. In those contexts, performance isn’t just about intelligence. It’s about robustness, fairness, trustworthiness, and relevance.

Benchmarks offer altitude. Evals offer direction.

Understanding both is crucial for anyone working with AI in the real world — not just to select better models, but to build better systems.

Closing

Benchmarks and evals may look like details on a model card, but they reflect a deeper shift in how we define competence, trust, and progress in AI.

As I explore these topics further with companies like LangWatch, I’ll continue sharing insights into how we measure what matters, and how we make these models not just powerful, but responsible.


Most-Used AI Benchmarks

| Benchmark | What It Tests | Created By |
| --- | --- | --- |
| MMLU | General knowledge across 57 subjects | Hendrycks et al. (UC Berkeley) |
| GSM8K | Step-by-step math reasoning | OpenAI |
| HumanEval | Python code generation | OpenAI |
| TruthfulQA | Resistance to falsehoods and myths | University of Oxford & OpenAI |
| MATH | High-level competition mathematics | Hendrycks et al. (UC Berkeley) |
| ARC | Commonsense science reasoning | Allen Institute for AI (AI2) |
| HellaSwag | Sentence-level commonsense inference | University of Washington & AI2 |
| DROP | Discrete reasoning over paragraphs | AI2 |
| EQ-Bench | Emotional intelligence in dialogue | Samuel Paech |
| Natural Questions (NQ) | Real-world question answering | Google AI |

Related Reading

  • LangWatch: Power of Evals for LLM-Based Systems. When your chatbot changes behaviour overnight, how do you know why? Evals are the missing link between AI intuition and product certainty. I talked to Manouk Draisma from LangWatch.
  • Where does the word “benchmark” come from? Uncover the origins of ‘benchmarking’, a term that went from measuring physical terrain to evaluating digital performance.
  • Model Cards, System Cards and What They’re Quietly Becoming. What are AI model cards, and why are they becoming the documents regulators will turn to first? I read a few and it taught me more than I expected.