---
title: "The Detective and the Swarm"
description: "A traffic spike with no trace in the human analytics: 400 requests in a minute, 25 countries. The clues were all there. So was the twist: I built the bait myself."
url: "https://hoeijmakers.net/the-detective-and-the-swarm/"
date: 2026-05-09
author: "Rob Hoeijmakers"
site: "hoeijmakers.net"
language: "en"
tags: ["AI in Practice"]
---

# The Detective and the Swarm

At 07:43 UTC this morning, something hit my site. Not a flood in the security sense: no alarms, no WAF triggers, nothing in the human analytics. Just a number that shouldn't have been that high, on a chart I check more out of habit than worry.

Four hundred requests in sixty seconds. Then back to normal.

My human-facing analytics saw nothing. That absence is itself a clue.

## Two layers

Most publishers have one view of their traffic: the analytics dashboard. It shows pageviews, sessions, referrers, the countries their readers come from. It tells them what humans did.

What it doesn't show is everything else. And everything else, it turns out, is interesting.

I run a Cloudflare Worker that logs every request to a D1 database before passing it along. Every request: [human, bot, crawler, scraper, agent](https://hoeijmakers.net/bot-stats). The Worker tries to classify each one, matching user-agent strings against a database of known bot signatures. What it can't classify, it logs as unknown.
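The shape of it, if you want to build something similar, is small. What follows is not my production Worker, just a minimal sketch: a D1 binding (here called `DB`), a `requests` table, and a classifier that checks the user-agent against a short signature list before falling back to human-or-unknown. The binding name, the schema, and the signatures are all illustrative.

```ts
// Minimal sketch of the logging layer, not the production Worker.
// Assumes a D1 binding named DB and a `requests` table with
// (ts, path, user_agent, country, classification) columns.

export interface Env {
  DB: D1Database;
}

// Stand-in for the bot-signature database; the real list is much longer.
const KNOWN_BOTS: Record<string, RegExp> = {
  googlebot: /Googlebot/i,
  bingbot: /bingbot/i,
  "chatgpt-user": /ChatGPT-User/i,
  "oai-searchbot": /OAI-SearchBot/i,
};

function classify(ua: string): string {
  for (const [name, pattern] of Object.entries(KNOWN_BOTS)) {
    if (pattern.test(ua)) return name;
  }
  // Browser-shaped strings fall through to "human"; anything else is "unknown".
  // This is exactly the gap a spoofed user-agent walks through.
  return ua.includes("Mozilla/5.0") ? "human" : "unknown";
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const url = new URL(request.url);
    const ua = request.headers.get("user-agent") ?? "";
    const country = (request.cf?.country as string) ?? "unknown";

    // Log asynchronously so the visitor never waits on the write.
    ctx.waitUntil(
      env.DB.prepare(
        "INSERT INTO requests (ts, path, user_agent, country, classification) VALUES (?, ?, ?, ?, ?)"
      )
        .bind(new Date().toISOString(), url.pathname, ua, country, classify(ua))
        .run()
    );

    // Pass the request through to the origin untouched.
    return fetch(request);
  },
};
```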

That second layer is where the detective work happens.

## Reading the evidence

The 07:43 spike broke down like this: 202 of those 400 requests were for `.md` paths. Paths like `/when-bots-become-readers.md`, `/web-traffic-and-the-rise-of-llms.md`, `/measuring-traffic-machines-bots.md`. The Worker classified 192 of them as human, because the user-agent strings looked like browsers: Chrome 138, Firefox 115, Edge 114. Perfectly formed, perfectly plausible.

But one user-agent string hit 47 different `.md` paths in sixty seconds. Another hit 37. Two browser fingerprints, their requests spread across 25 countries.

Then the robots.txt requests: 116 of them in the same minute. That's a preflight pattern, a swarm checking the rules before it reads the content.
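Pulling those numbers out of the log is a couple of aggregations over the spike minute. A sketch of the kind of query involved, reusing the hypothetical `requests` table and `Env` from the earlier sketch: group the minute by user-agent, count distinct `.md` paths and countries, and count the robots.txt preflights separately.

```ts
// Spike breakdown against the same hypothetical `requests` table as above.
// Which user-agents hit the most distinct .md paths in a given minute,
// from how many countries, and how many robots.txt preflights came with them?
async function spikeBreakdown(env: Env, from: string, to: string) {
  const byUserAgent = await env.DB.prepare(
    `SELECT user_agent,
            COUNT(DISTINCT path)    AS md_paths,
            COUNT(DISTINCT country) AS countries
       FROM requests
      WHERE ts >= ?1 AND ts < ?2
        AND path LIKE '%.md'
      GROUP BY user_agent
      ORDER BY md_paths DESC`
  ).bind(from, to).all();

  const preflights = await env.DB.prepare(
    `SELECT COUNT(*) AS robots_hits
       FROM requests
      WHERE ts >= ?1 AND ts < ?2
        AND path = '/robots.txt'`
  ).bind(from, to).first();

  return { byUserAgent: byUserAgent.results, preflights };
}

// The 07:43 minute:
// await spikeBreakdown(env, "2026-05-09T07:43:00Z", "2026-05-09T07:44:00Z");
```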

Edge 114 hasn't been a current browser for a long time. Neither has Firefox 115. These are frozen strings, a signature of bot infrastructure that picks a browser version and pins it, never updating. The user-agent (UA) looks human. The behaviour doesn't.

The conclusion assembled itself: a distributed scraper, running across a proxy network or botnet, using spoofed browser identities to avoid classification. Evasive, coordinated, and genuinely clever.

## The twist

Here's where the detective story gets uncomfortable.

Those `.md` endpoints don't exist by default in Ghost. I added them a few months ago as an experiment: serve each post as clean Markdown alongside the HTML version, reference them in `llms.txt`, and see what happens. The idea was to make the content easier for AI systems to consume. Structured, clean, no JavaScript noise.
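For anyone unfamiliar with it, `llms.txt` is nothing exotic: a Markdown index at the site root that points at the machine-readable copies. A hypothetical fragment, using the paths from this morning's spike (the summary line and titles are guessed from the slugs; the real file differs):

```markdown
# hoeijmakers.net

> Essays on web traffic, bots, and machine-readable publishing. Each post
> is also available as clean Markdown at the .md paths listed below.

## Posts

- [When Bots Become Readers](https://hoeijmakers.net/when-bots-become-readers.md)
- [Web Traffic and the Rise of LLMs](https://hoeijmakers.net/web-traffic-and-the-rise-of-llms.md)
- [Measuring Traffic: Machines and Bots](https://hoeijmakers.net/measuring-traffic-machines-bots.md)
```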

The scrapers found them almost immediately. So did legitimate AI users, ChatGPT-User and OAI-SearchBot among them, reading the same paths through the same door.

I set the bait. They smelled it.

## The idée fixe

The reflex response to a traffic spike like this is defensive. Block the IPs, rate-limit the endpoint, add a CAPTCHA, harden the WAF. There is an entire industry built around that reflex.

It rests on a premise worth examining: that keeping machines out is possible, and that it is worth the effort.

Neither is quite true. A scraper that can distribute across 25 countries and rotate frozen browser UAs is not stopped by a robots.txt entry or a Cloudflare rule. It routes around friction the way water routes around a stone. And the content, once published on the public web, is going to be consumed by machine pipelines whether or not you make it easy.

The more interesting question is what you can learn by watching it happen.

The spike told me that my [Markdown](https://hoeijmakers.net/markdown-the-wd-40-of-digital-information/) experiment is working, in the sense that it is attracting exactly the traffic it was designed for. It told me that the machine layer of the web is active, distributed, and more sophisticated than most people assume. It told me that the gap between what human analytics show and what the full request log shows is where the real picture lives.

Blocking that traffic would have closed the window. Watching it left it open.

## Signal in the noise

The thing about running a second logging layer is that most of what it catches is unremarkable. Googlebot, Bingbot, ChatGPT-User ticking through recent posts, the usual crawl of SEO tools and RSS readers. Noise.

But the noise is the baseline. Without it, the 07:43 spike is invisible. With it, you can ask: what's different about this minute? Why these paths? Why this many countries? Why frozen UAs?
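Concretely, that comparison can be one query: bucket the log by minute, compare each bucket against the recent average, and only look at the minutes that stand out. A rough sketch against the same hypothetical table, with an arbitrary threshold:

```ts
// Baseline check against the same hypothetical `requests` table:
// bucket the last 24 hours by minute and flag anything far above the
// average minute. The 10x threshold is arbitrary; tune it to your own noise.
async function noisyMinutes(env: Env) {
  const cutoff = new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString();
  const { results } = await env.DB.prepare(
    `WITH per_minute AS (
       SELECT substr(ts, 1, 16) AS minute, COUNT(*) AS hits
         FROM requests
        WHERE ts >= ?1
        GROUP BY minute
     )
     SELECT minute, hits
       FROM per_minute
      WHERE hits > 10 * (SELECT AVG(hits) FROM per_minute)
      ORDER BY hits DESC`
  ).bind(cutoff).all();
  return results; // e.g. [{ minute: "2026-05-09T07:43", hits: 400 }]
}
```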

The detective work is in the filtering, not the blocking.

The `.md` endpoints at hoeijmakers.net are intentional. Each post is available as clean Markdown alongside the HTML version. The `llms.txt` file indexes them. This is an ongoing experiment in machine-readable publishing.

---

**Related**
- [I Thought I Was Optimising for Speed](https://hoeijmakers.net/i-thought-i-was-optimising-for-speed/)
- [Thirty Years of Caching, Sorted in an Afternoon](https://hoeijmakers.net/thirty-years-of-caching-sorted-in-an-afternoon/)
- [My Visitors Are Not All Human. That Is Fine.](https://hoeijmakers.net/my-visitors-are-not-all-human-that-is-fine/)
- [Guests That Should Behave](https://hoeijmakers.net/guests-that-should-behave/)
- [Markdown, the WD-40 of Digital Information](https://hoeijmakers.net/markdown-the-wd-40-of-digital-information/)