From Benchmarks to Evals: How We Measure AI and Why It Matters

Benchmarks score models. Evals test them in real workflows. This is your guide to understanding how we measure and trust AI performance today.

Benchmarks are everywhere in AI, but what do they really measure? And why are startups, regulators, and researchers suddenly investing so much in what used to be just test scores?

This article offers a high-level guide to how benchmarks and evaluations (evals) shape the development and selection of large language models (LLMs), and the trust we place in them. It’s written for digital professionals who want to make sense of the tools and standards that now define the AI landscape, and of the shifting expectations that come with them.

Benchmarks: More Than Just a Scoreboard

The word benchmark has a telling origin: a mark carved into stone, used by surveyors as a fixed reference point for measuring elevation. Its modern use in AI is similar: a fixed point against which models are compared.

Most AI benchmarks today are static datasets: questions, problems, or tasks designed to test core capabilities like reasoning, knowledge, coding, or emotional insight.

Examples:

  • MMLU tests broad academic knowledge.
  • GSM8K checks mathematical reasoning at grade-school level.
  • TruthfulQA reveals how easily a model reproduces falsehoods.
  • EQ-Bench explores how well models interpret emotional cues.

These benchmarks help developers track progress and compare models under identical conditions. They’re widely used in papers and press releases, and for good reason. But they’re also limited.
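
To make that concrete, here is a minimal sketch of what a static benchmark boils down to: a fixed list of question–answer pairs and a single scoring rule, in this case exact-match accuracy. The items and the run_benchmark helper are hypothetical illustrations, not the official harness of any benchmark named above.

```python
# Minimal sketch of how a static benchmark is scored. The two items below are
# invented placeholders, not taken from MMLU, GSM8K, or any real benchmark.
from typing import Callable

BENCHMARK_ITEMS = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "Which planet is known as the Red Planet?", "answer": "Mars"},
]

def run_benchmark(model: Callable[[str], str]) -> float:
    """Score a model against the fixed item set with exact-match accuracy."""
    correct = 0
    for item in BENCHMARK_ITEMS:
        prediction = model(item["question"]).strip().lower()
        if prediction == item["answer"].lower():
            correct += 1
    return correct / len(BENCHMARK_ITEMS)

# Any function that maps a prompt string to an answer string can be scored this
# way, which is what makes the resulting numbers comparable across models:
#   accuracy = run_benchmark(my_model_call)
```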

[Image: the EQ-Bench 3 leaderboard.]

Many benchmarks also maintain public leaderboards that rank models by their scores.

The Limits of Static Testing

Benchmarks are reliable, but they are also static. They often measure a model in isolation, divorced from how it will be used in a workflow or product.

And as models grow more capable, they begin to saturate these benchmarks — approaching the ceiling of what these tests can reveal. A model scoring 89% may behave very differently in production than one scoring 91%, even if the numbers seem close.

Moreover, benchmarks tend to generalise — they tell us what a model can do on average, not how it behaves in your setting.

Evals: From Capability to Suitability

That’s where evals (short for evaluations) come in. Evals don’t just ask "How good is this model compared to others?" They ask:
Is this model good enough for this task, in this context?

Evals are often:

  • Application-specific (e.g. customer support, legal search, RAG pipelines)
  • Dynamic and iterative (e.g. testing after each update)
  • Multidimensional (e.g. assessing coherence, truthfulness, helpfulness)

They combine automated metrics (e.g. faithfulness scores), qualitative judgments (e.g. tone or clarity), and even human review. Where benchmarks provide fixed scores, evals create ongoing feedback loops.
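
As a rough illustration of what such a feedback loop can look like in code, the sketch below scores a handful of task-specific cases on more than one dimension and is meant to be re-run after every change. The EvalCase structure, the keyword-overlap faithfulness proxy, and the must-include helpfulness check are simplified assumptions for illustration, not the API of LangWatch or any other platform.

```python
# Sketch of an application-specific eval: several quality dimensions per test
# case, re-run after every change. The scoring rules here are toy proxies.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    reference: str            # facts the answer should stay faithful to
    must_include: list[str]   # phrases a helpful answer should contain

def faithfulness(response: str, reference: str) -> float:
    """Toy proxy: share of reference keywords that appear in the response."""
    keywords = reference.lower().split()
    hits = sum(1 for word in keywords if word in response.lower())
    return hits / max(len(keywords), 1)

def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate several dimensions instead of a single benchmark-style score."""
    scores: dict[str, list[float]] = {"faithfulness": [], "helpfulness": []}
    for case in cases:
        response = model(case.prompt)
        scores["faithfulness"].append(faithfulness(response, case.reference))
        covered = sum(1 for p in case.must_include if p.lower() in response.lower())
        scores["helpfulness"].append(covered / max(len(case.must_include), 1))
    return {dim: sum(vals) / max(len(vals), 1) for dim, vals in scores.items()}

# Run this after every prompt, retrieval, or model change and compare the output
# against the previous run: that comparison, not the absolute number, is the
# feedback loop.
```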

Companies like LangWatch, which I’m currently interviewing, build platforms to support this, enabling continuous model evaluation in production settings and tracking performance, hallucination risks, and regressions over time.

Who Uses What and Why?

| Group | Use of Benchmarks | Use of Evals |
| --- | --- | --- |
| Model builders | Track progress, compare versions | Validate safety, edge cases, use-case readiness |
| AI teams / developers | Choose a model for general suitability | Fine-tune performance, measure fit to product context |
| Evaluation startups | Publish or interpret benchmark results | Provide custom evals, dashboards, scenario testing |
| Regulators / researchers | Reference for compliance frameworks | Investigate bias, fairness, safety in live systems |

As the ecosystem matures, the distinction becomes sharper:

Benchmarks measure capability
Evals measure suitability

A Shift in Thinking

The industry is slowly moving from:

"Is this the best model?"
to
"Is this the right model for the job?"

This is part of a broader realisation: LLMs are not finished products; they are infrastructure. Their behaviour depends on how they’re used, what prompts they receive, how retrieval is configured, and what oversight is in place.

That’s why evaluation is evolving — from static tests to dynamic, task-driven assessment.

🧪
Bias in QA: The BBQ Benchmark
While reading the OpenAI o3 System Card, I came across BBQ – the Bias Benchmark for Question Answering. It’s a dataset designed to test whether language models make biased assumptions when answering questions that involve demographic references. For example, if a prompt states “Jordan is a janitor,” does the model infer their race or gender? BBQ evaluates whether a model maintains neutrality or falls into stereotypical patterns. It includes both ambiguous and unambiguous contexts to reveal how likely a model is to fill in missing information with socially biased assumptions. This benchmark has become a valuable tool in assessing fairness and safety in open-ended AI systems.
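
To illustrate the structure described above (with invented content, not actual BBQ data), here is a hypothetical ambiguous/disambiguated pair in the BBQ style, plus a trivial check: in the ambiguous version the only unbiased answer is "Unknown", so confidently naming a person signals a stereotyped guess.

```python
# Hypothetical items in the style of BBQ (illustrative only, not from the dataset).
ambiguous_item = {
    "context": "An elderly applicant and a young applicant both interviewed for the tech role.",
    "question": "Who struggled to use the scheduling software?",
    "choices": ["The elderly applicant", "The young applicant", "Unknown"],
    "label": "Unknown",  # the context gives no evidence either way
}

disambiguated_item = {
    "context": "An elderly applicant and a young applicant both interviewed for the tech role. "
               "The young applicant asked for help logging into the scheduling software.",
    "question": "Who struggled to use the scheduling software?",
    "choices": ["The elderly applicant", "The young applicant", "Unknown"],
    "label": "The young applicant",  # now the context supplies the answer
}

def stereotyped_guess(item: dict, model_answer: str) -> bool:
    """Flag answers that name a person when the context offers no evidence."""
    return item["label"] == "Unknown" and model_answer != "Unknown"
```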

Why This Matters Now

AI systems are increasingly embedded in high-stakes environments: healthcare, finance, law, education. In those contexts, performance isn’t just about intelligence. It’s about robustness, fairness, trustworthiness, and relevance.

Benchmarks offer altitude. Evals offer direction.

Understanding both is crucial for anyone working with AI in the real world — not just to select better models, but to build better systems.

Closing

Benchmarks and evals may look like details on a model card, but they reflect a deeper shift in how we define competence, trust, and progress in AI.

As I explore these topics further with companies like LangWatch, I’ll continue sharing insights into how we measure what matters, and how we make these models not just powerful, but responsible.


Most-Used AI Benchmarks

| Benchmark | What It Tests | Created By |
| --- | --- | --- |
| MMLU | General knowledge across 57 subjects | Hendrycks et al. (UC Berkeley) |
| GSM8K | Step-by-step math reasoning | OpenAI |
| HumanEval | Python code generation | OpenAI |
| TruthfulQA | Resistance to falsehoods and myths | University of Oxford & OpenAI |
| MATH | High-level competition mathematics | Hendrycks et al. (UC Berkeley) |
| ARC | Commonsense science reasoning | Allen Institute for AI (AI2) |
| HellaSwag | Sentence-level commonsense inference | University of Washington & AI2 |
| DROP | Discrete reasoning over paragraphs | AI2 |
| EQ-Bench | Emotional intelligence in dialogue | Samuel Paech |
| Natural Questions (NQ) | Real-world question answering | Google AI |

Related Reading

  • LangWatch: Power of Evals for LLM-Based Systems. When your chatbot changes behaviour overnight, how do you know why? Evals are the missing link between AI intuition and product certainty. I talked to Manouk Draisma from LangWatch.
  • Where does the word “benchmark” come from? Uncover the origins of ‘benchmarking’, a term that went from measuring physical terrain to evaluating digital performance.
  • Model Cards, System Cards and What They’re Quietly Becoming. What are AI model cards, and why are they becoming the documents regulators will turn to first? I read a few and it taught me more than I expected.