Model Cards, System Cards and What They’re Quietly Becoming

What are AI model cards, and why are they becoming the documents regulators will turn to first? I read a few, and they taught me more than I expected.

"I think everyone interested in AI should read the model cards for the frontier models, especially the safety sections, which give you a sense of immediate concerns."
— Ethan Mollick, 4 August 2025

A few days ago, I came across a post by Ethan Mollick on X that stopped me in my scroll. He simply said that anyone interested in AI should read the model cards of the big frontier models — and especially the safety sections. He even linked to a few: OpenAI’s o3, Google’s Gemini Deep Think, Anthropic’s Claude 4. One was missing: Grok. Just question marks.

Mollick’s post wasn’t technical, but it hit a nerve. It made me realise how little I had actually looked at these cards myself. I work with AI models, talk about AI policy, and build things with them — but I hadn’t taken the time to really read the source documents that describe what these systems are capable of, where they might go wrong, and how (or if) they’re being kept in check.

At the same time, I had been reading up on AI benchmarks and evaluations — trying to make sense of what all these tests (MMLU, ARC, GSM8K, TruthfulQA) actually measure. That helped me understand what I was seeing in the cards. Benchmarks are not just numbers. They’re signals too: of what the labs choose to measure, and what they might prefer to avoid.

So I started digging.

What Are Model Cards — and Why Do They Matter?

In short: model cards (and their broader cousins, system cards) are documents published by the makers of large AI models to describe what those models can do, how they’ve been tested, and what their limitations are.

They usually include:

  • What kind of data the model was trained on (or a vague description of it)
  • What benchmarks it has been tested against (like MMLU, ARC, TruthfulQA)
  • Where it tends to fail (hallucinations, bias, factual errors)
  • What the developers did to reduce harm (red teaming, refusal behaviour)
  • And how they see the model in a broader societal context (misuse risks, alignment strategies)

Some are technical. Some are polished. All of them are political in the sense that they reveal (or conceal) the design choices and values behind these increasingly powerful systems.

System Cards and Safety: The Frontier Disclosure

OpenAI calls theirs system cards. So does Anthropic. Google went with model cards. The terms overlap, but the purpose is similar: they’re a kind of disclosure. A way to say: "Here’s what we’ve built, and here’s what you should know about it."

When you read through them, as I did this week, you start to see patterns:

  • Benchmarks are everywhere. Some models outperform humans on complex tests, but still hallucinate simple facts.
  • Safety claims are layered: technical fixes, refusal behaviour, internal red teaming, sometimes external audits.
  • Many disclaim any true autonomy. Even the most capable models are described as "not agentic," "not self-aware," or "not capable of long-term planning."
  • And yet, the cards often reveal that these same models can reason, plan, and follow goals in scaffolded environments.

It’s like looking at a machine that almost drives itself and seeing all the little disclaimers stuck to the dashboard.

How This Relates to Policy

In the EU AI Act, which I wrote about recently here, foundation models like GPT, Claude and Gemini get their own regulatory category: general-purpose AI models. The Act introduces specific documentation and transparency obligations for their providers, including the need to describe:

  • Training data sources,
  • Evaluation results,
  • Risks and mitigations.

If you’ve read a good system card, you’ll recognise the format.

So even though model/system cards are not legally required today, they are functioning like proto-regulatory documents. They’re the documents regulators will ask for first. And they show where the lines are being drawn — not just between companies, but between what’s considered safe, responsible, or questionable.

In the US, there’s no law like the EU AI Act — yet. But the recent Executive Order on AI and NIST’s AI Risk Management Framework both encourage documentation, testing, and disclosure. Again, model/system cards are the natural container for that.

And Then There’s Grok

In Mollick’s post, Grok by xAI was the only one listed without a link. "????", he wrote. That’s telling.

For all the emphasis on "truth-seeking" and "maximum transparency," xAI has not published a system card for Grok. No clear documentation of risks. No benchmarks. No list of safety mitigations or evaluations. As someone who cares about both freedom and accountability, I find that silence notable. Absence is a kind of signal, too.

Why I’ll Keep Reading Them

This week, I’ve learned more from reading these cards than from any number of AI blog posts or press releases. They’re not perfect. They’re not always fully honest. But they are informative.

More importantly, they give you a vocabulary for thinking about capabilities, failures, and responsibility, and for asking better questions about the tools we’re integrating into our work, schools, and societies.

If you haven’t read one yet, start with the Claude 4 system card[¹]. Or OpenAI’s recent o4-mini system card[²]. They’re not just technical documents. They’re early signals of how AI governance is taking shape from the inside out.


This article is part of a broader series I’m working on, exploring benchmarks, evaluations, and AI policy. If you’re interested, please subscribe. I’m happy to share upcoming drafts or connect the dots across these themes.


¹ Claude 4 System Card, Anthropic: https://www.anthropic.com/index/claude-4

² o4-mini System Card, OpenAI: https://cdn.openai.com/papers/o4-mini-system-card.pdf

³ Gemini 2.5 Deep Think Model Card, Google DeepMind (PDF).