---
title: "From Benchmarks to Evals: How We Measure AI and Why It Matters"
description: "Benchmarks score models. Evals test them in real workflows. This is your guide to understanding how we measure and trust AI performance today."
url: "https://hoeijmakers.net/ai-benchmarks-and-evals/"
date: 2025-08-03
updated: 2025-08-05
author: "Rob Hoeijmakers"
site: "hoeijmakers.net"
language: "en"
tags: ["AI"]
---

# From Benchmarks to Evals: How We Measure AI and Why It Matters

Benchmarks are everywhere in AI, but what do they really measure? And why are startups, regulators, and researchers suddenly investing so much in what used to be just test scores?

This article offers a high-level guide to how benchmarks and evaluations (evals) shape the development and selection of large language models (LLMs), and the trust we place in them. It's written for digital professionals who want to make sense of the tools and standards that now define the AI landscape, and the shifting expectations that come with them.

## Benchmarks: More Than Just a Scoreboard

The word [*benchmark* has a telling origin](https://hoeijmakers.net/where-does-the-word-benchmark-come-from/): a carved mark in stone, used by surveyors to measure altitude. Its modern use in AI is similar: a fixed reference point used to compare models.

Most AI benchmarks today are static datasets: questions, problems, or tasks designed to test core capabilities like reasoning, knowledge, coding, or emotional insight.

**Examples:**

- **MMLU** tests broad academic knowledge.
- **GSM8K** checks mathematical reasoning at grade-school level.
- **TruthfulQA** reveals how easily a model reproduces falsehoods.
- **EQ-Bench** explores how well models interpret emotional cues.

These benchmarks help developers track progress and compare models under identical conditions. They’re widely used in papers and press releases, and for good reason. But they’re also limited.
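
At their core, most static benchmarks boil down to the same loop: pose a fixed set of questions, compare the model's answers against reference answers, and report a score. A minimal sketch, where `ask_model` is a hypothetical stand-in for a real LLM call and the two-item dataset is purely illustrative:

```python
# Minimal sketch of static benchmarking: score a model against a fixed
# question set. `ask_model` is a hypothetical stand-in for a real API call.
from typing import Callable

BENCHMARK = [  # tiny illustrative dataset, GSM8K-style
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "A book costs $8. How much do 3 books cost?", "answer": "24"},
]

def ask_model(question: str) -> str:
    # Placeholder: a real harness would call an LLM here.
    return {"What is 12 * 7?": "84",
            "A book costs $8. How much do 3 books cost?": "24"}[question]

def run_benchmark(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Return exact-match accuracy over a static dataset."""
    correct = sum(model(item["question"]) == item["answer"] for item in dataset)
    return correct / len(dataset)

print(run_benchmark(ask_model, BENCHMARK))  # 1.0 on this toy set
```

The fixed dataset is exactly what makes benchmarks comparable across models, and exactly what makes them static.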

## The Limits of Static Testing

Benchmarks are reliable, but they are also static. They often measure a model in isolation, divorced from how it will be used in a workflow or product.

And as models grow more capable, they begin to saturate these benchmarks — approaching the ceiling of what these tests can reveal. A model scoring 89% may behave very differently in production than one scoring 91%, even if the numbers seem close.
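
How close is "close"? A back-of-the-envelope sketch makes the point, assuming a hypothetical 1,000-question benchmark and a normal approximation to the binomial standard error (both assumptions mine, not from any specific benchmark):

```python
# How distinguishable are two nearby benchmark scores? Assumes a
# hypothetical 1,000-question benchmark and a normal approximation
# to the binomial standard error.
import math

def score_ci(accuracy: float, n_questions: int, z: float = 1.96) -> tuple:
    """95% confidence interval for a benchmark accuracy (normal approx.)."""
    se = math.sqrt(accuracy * (1 - accuracy) / n_questions)
    return (accuracy - z * se, accuracy + z * se)

ci_a = score_ci(0.89, 1000)  # roughly (0.871, 0.909)
ci_b = score_ci(0.91, 1000)  # roughly (0.892, 0.928)
print(ci_a, ci_b)            # the intervals overlap: a two-point gap
                             # may not be statistically meaningful
```

Overlapping intervals don't prove the models are equivalent, but they do show how little a two-point leaderboard gap can tell you on its own.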

Moreover, benchmarks tend to generalise — they tell us what a model can do on average, not how it behaves in *your* setting.

## Evals: From Capability to Suitability

That’s where **evals** — short for evaluations — come in. Evals don’t just ask *how good is this model compared to another?* They ask: **Is this model good enough for this task, in this context?**

Evals are often:

- **Application-specific** (e.g. customer support, legal search, RAG pipelines)
- **Dynamic and iterative** (e.g. testing after each update)
- **Multidimensional** (e.g. assessing coherence, truthfulness, helpfulness)

They combine automated metrics (e.g. faithfulness scores), qualitative judgments (e.g. tone or clarity), and even human review. Where benchmarks provide fixed scores, evals create ongoing feedback loops.

Companies like **Langwatch**, who I’m currently interviewing, build platforms to support this — enabling continuous model evaluation in production settings, tracking performance, hallucination risks, and regressions over time.

## Who Uses What and Why?

| Group | Use of Benchmarks | Use of Evals |
| --- | --- | --- |
| Model builders | Track progress, compare versions | Validate safety, edge cases, use-case readiness |
| AI teams / developers | Choose a model for general suitability | Fine-tune performance, measure fit to product context |
| Evaluation startups | Publish or interpret benchmark results | Provide custom evals, dashboards, scenario testing |
| Regulators / researchers | Reference for compliance frameworks | Investigate bias, fairness, safety in live systems |

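To make the eval column above concrete: unlike a single benchmark score, an eval typically checks one answer against several task-specific criteria at once. A minimal sketch, assuming a customer-support use case; all names here (`check_grounding`, `check_tone`, `run_eval`) and the word-overlap heuristic are illustrative, not from any real evals library:

```python
# A minimal sketch of a task-specific, multidimensional eval for a
# hypothetical customer-support assistant. All function names and the
# crude checks are illustrative only.

def check_grounding(answer: str, source: str) -> bool:
    # Crude faithfulness proxy: every sentence should share words with
    # the source document. Real evals use much stronger metrics.
    words = set(source.lower().split())
    return all(
        words & set(sentence.lower().split())
        for sentence in answer.split(".") if sentence.strip()
    )

def check_tone(answer: str) -> bool:
    # Toy policy check: no blame-shifting phrases in support replies.
    return "your fault" not in answer.lower()

def run_eval(answer: str, source: str) -> dict:
    """Score one model answer on multiple dimensions at once."""
    return {
        "grounded": check_grounding(answer, source),
        "polite": check_tone(answer),
    }

source = "Refunds are processed within 5 business days of approval."
answer = "Your refund is processed within 5 business days."
print(run_eval(answer, source))  # {'grounded': True, 'polite': True}
```

Run after every prompt or model change, this kind of check becomes the ongoing feedback loop that a one-off benchmark score can't provide.
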
As the ecosystem matures, the distinction becomes sharper:

- **Benchmarks measure capability.**
- **Evals measure suitability.**

## A Shift in Thinking

The industry is slowly moving from **“Is this the best model?”** to **“Is this the right model for the job?”**

This is part of a broader realisation: LLMs are not finished products — they are infrastructure. Their behaviour depends on how they’re used, what prompts they receive, how retrieval is configured, and what oversight is in place. That’s why evaluation is evolving — from static tests to dynamic, task-driven assessment.

**Bias in QA: The BBQ Benchmark**

While reading the [OpenAI o3 System Card](https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf), I came across [BBQ](https://arxiv.org/abs/2110.08193), the Bias Benchmark for Question Answering. It’s a dataset designed to test whether language models make biased assumptions when answering questions that involve demographic references. For example, if a prompt states “Jordan is a janitor,” does the model infer their race or gender? BBQ evaluates whether a model maintains neutrality or falls into stereotypical patterns. It includes both ambiguous and unambiguous contexts to reveal how likely a model is to fill in missing information with socially biased assumptions. This benchmark has become a valuable tool in assessing fairness and safety in open-ended AI systems.

## Why This Matters Now

AI systems are increasingly embedded in high-stakes environments: healthcare, finance, law, education. In those contexts, performance isn’t just about intelligence. It’s about robustness, fairness, trustworthiness, and relevance.

Benchmarks offer altitude. Evals offer direction.

Understanding both is crucial for anyone working with AI in the real world — not just to select better models, but to build better systems.

## Closing

Benchmarks and evals may look like details on a model card, but they reflect a deeper shift in how we define competence, trust, and progress in AI.

As I explore these topics further with companies like Langwatch, I’ll continue sharing insights into how we measure what matters, and how we make these models not just powerful, but responsible.

---

### Most-Used AI Benchmarks

| Benchmark | What It Tests | Created By | Link |
| --- | --- | --- | --- |
| **MMLU** | General knowledge across 57 subjects | Hendrycks et al. (UC Berkeley) | [Paper](https://arxiv.org/abs/2009.03300) |
| **GSM8K** | Step-by-step math reasoning | OpenAI | [GitHub](https://github.com/openai/grade-school-math) |
| **HumanEval** | Python code generation | OpenAI | [GitHub](https://github.com/openai/human-eval) |
| **TruthfulQA** | Resistance to falsehoods and myths | Oxford & OpenAI | [Paper](https://arxiv.org/abs/2109.07958) |
| **MATH** | High-level competition mathematics | Hendrycks et al. (UC Berkeley) | [Paper](https://arxiv.org/abs/2103.03874) |
| **ARC** | Commonsense science reasoning | Allen Institute for AI (AI2) | [Website](https://allenai.org/data/arc) |
| **HellaSwag** | Sentence-level commonsense inference | UW & AI2 | [Website](https://rowanzellers.com/hellaswag/) |
| **DROP** | Discrete reasoning over paragraphs | AI2 | [Leaderboard](https://leaderboard.allenai.org/drop) |
| **EQ-Bench** | Emotional intelligence in dialogue | Samuel Paech | [Website](https://eqbench.com/) |
| **Natural Questions (NQ)** | Real-world question answering | Google AI | [Website](https://ai.google.com/research/NaturalQuestions/) |

---

**Related**
- [LangWatch: Power of Evals for LLM-Based Systems](https://hoeijmakers.net/discovering-the-power-of-evals-for-llm-based-systems/)
- [Where does the word “benchmark” come from?](https://hoeijmakers.net/where-does-the-word-benchmark-come-from/)
- [Model Cards, System Cards and What They’re Quietly Becoming](https://hoeijmakers.net/model-cards-system-cards/)