Back in the Search Arena, This Time With Vectors

Three years after writing about site search, I rebuilt mine from scratch: Algolia for full-text speed, a vector database for semantic retrieval, and a RAG pipeline that made both layers useful.

Three years ago I wrote about site search as if it were a solved problem. Collect, index, serve. It made sense at the time. Ghost had a search widget, it worked well enough, and I moved on.

What I did not register then was the ceiling. Ghost's built-in search indexes titles and excerpts only. The body of every post, where the actual thinking lives, stays invisible. For a site with a handful of posts, that is fine. For a site with six hundred, it means your own archive is mostly unsearchable.

I knew this. I just never had the right combination of tools, time, and nerve to do something about it.

The return

This spring, that changed. I came back to the problem with Claude Code, which turned out to be the missing piece. Not because it writes perfect code, but because it removes the activation energy. The setup that used to require a developer and a week of back-and-forth now takes an afternoon and a working session.

I integrated Algolia. Fast, full-text, generous free tier. It indexes everything: title, excerpt, and the complete body of every post. The difference is immediate. Searching my own archive now works the way I think, not the way I once phrased a headline.
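
To make the ceiling concrete, here is a minimal Python sketch of a keyword index built over title and excerpt only, next to one that also covers the body. It is a toy inverted index, not Algolia's API or Ghost's actual implementation, and the sample posts and function names are invented for illustration.

```python
import re

# Hypothetical sample posts; real records would come from the site's own content.
POSTS = [
    {
        "title": "Search is a solved problem",
        "excerpt": "Collect, index, serve.",
        "body": "Most teams underestimate governance when they roll out retrieval tools.",
    },
    {
        "title": "Notes from a conference",
        "excerpt": "Three days of talks.",
        "body": "The best session covered embeddings and vector databases.",
    },
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(posts, fields):
    """Map each token to the set of post positions whose chosen fields contain it."""
    index = {}
    for position, post in enumerate(posts):
        for field in fields:
            for token in tokenize(post[field]):
                index.setdefault(token, set()).add(position)
    return index


def search(index, posts, query):
    """Return titles of posts that contain every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return [posts[position]["title"] for position in sorted(matches)]


# Title and excerpt only, as in the built-in search described above.
shallow_index = build_index(POSTS, ["title", "excerpt"])
# Title, excerpt, and complete body, as in the full-text setup.
full_index = build_index(POSTS, ["title", "excerpt", "body"])

print(search(shallow_index, POSTS, "embeddings"))  # []
print(search(full_index, POSTS, "embeddings"))     # ['Notes from a conference']
```

The word "embeddings" appears only in a post body, so the shallow index never sees it. A hosted service adds ranking, typo tolerance, and speed on top, but the coverage question is the same.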

That alone would have been enough. But I kept going.

The semantic layer

Keyword search finds what you named. It does not find what you meant, what you explored without naming it, or what connects two posts you wrote two years apart without realising they shared a thread.

For that, I built a vector database on Cloudflare. Each post is converted into an embedding, a numerical representation of its meaning. When you search, the system does not match terms. It converts your query into an embedding as well and ranks posts by how close their embeddings sit to it. You ask in plain language, and it surfaces what is semantically related, even when the words never overlap.
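
Conceptual proximity is usually measured with cosine similarity between vectors. The sketch below assumes that metric and uses hand-made three-dimensional vectors in place of real embeddings; the post titles, the numbers, and the function names are all invented for illustration and are not taken from my setup.

```python
import math

# Hand-made vectors standing in for real embeddings (an assumption for this sketch).
# A real embedding model produces hundreds of dimensions per post.
POST_VECTORS = {
    "Why audit trails matter": [0.9, 0.1, 0.0],
    "Remote teams after 2020": [0.1, 0.9, 0.1],
    "Rules for automated decisions": [0.8, 0.2, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction, so treat it as unrelated.
    return dot / (norm_a * norm_b)


def nearest(query_vector, vectors, top_k=2):
    """Return the titles of the top_k posts most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in vectors.items()
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]


# Stand-in for the embedding of a question such as "how should AI be governed?"
query = [1.0, 0.0, 0.0]
print(nearest(query, POST_VECTORS))
# ['Why audit trails matter', 'Rules for automated decisions']
```

Neither returned title contains the word "governance". The match comes from the direction of the vectors, which is the whole point of the semantic layer.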

This is the shift that search in the age of AI points toward: from retrieval to understanding. I have been writing about that shift for a while. Now I have a small working version of it on my own site.

The pipeline

Together, the two layers form the retrieval side of a RAG pipeline. RAG stands for retrieval-augmented generation: a way of grounding AI output in a specific corpus rather than in general training data. In enterprise contexts, this is how knowledge bots and AI assistants are built on top of internal documentation.

On the scale of one writer with six hundred posts, it works differently but toward the same end. When I draft something new, I can query the pipeline: what have I already written here? What connects? What internal links belong in this piece? The AI does not guess based on what it knows about the world. It searches what I have actually written and brings it back.
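
The retrieve-then-ground step can be sketched in a few lines of Python. This is an illustration under loud assumptions: retrieval here is plain word overlap standing in for the real search layers, the archive entries and URLs are invented, and the prompt wording is hypothetical. No model is called; the sketch stops at the prompt that would be sent.

```python
import re

# Hypothetical archive entries; titles, URLs, and text are invented for this sketch.
ARCHIVE = [
    {
        "title": "Site search, three years on",
        "url": "/site-search/",
        "text": "Keyword search indexes titles but misses the body of a post.",
    },
    {
        "title": "What embeddings capture",
        "url": "/embeddings/",
        "text": "Embeddings place posts with similar meaning close together.",
    },
    {
        "title": "Hiring in Europe",
        "url": "/hiring-europe/",
        "text": "Labour markets differ across member states.",
    },
]


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, archive, top_k=2):
    """Rank posts by word overlap with the question (a stand-in for real retrieval)."""
    question_tokens = tokenize(question)
    scored = []
    for post in archive:
        overlap = len(question_tokens & tokenize(post["text"]))
        if overlap:
            scored.append((overlap, post["title"], post))
    # Highest overlap first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [post for _, _, post in scored[:top_k]]


def build_prompt(question, passages):
    """Assemble a prompt that restricts the model to the retrieved passages."""
    sources = "\n".join(
        f"[{number}] {post['title']} ({post['url']}): {post['text']}"
        for number, post in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below. "
        "Cite sources by number and suggest internal links where relevant.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


question = "What have I written about keyword search and embeddings?"
passages = retrieve(question, ARCHIVE)
print(build_prompt(question, passages))
```

Only the two relevant posts reach the prompt; the unrelated one never does. That filter is what keeps the answer tied to the archive instead of to whatever the model happens to know.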

Every step of building this taught me something I could not have learned by reading about it. The difference between keyword and semantic retrieval is obvious in theory. It becomes concrete the first time a search returns a post you had forgotten, on a topic you never explicitly named, because the meaning was close enough. That is not a feature. That is a different relationship with your own archive.

What it synthesises

I am familiar with data. I have worked with it long enough to know that structure matters more than volume. What this pipeline does is give structure to six hundred pieces of thinking that were previously just chronological.

The archive did not grow. It became navigable in a way it was not before, for visitors and for me and for the AI I work with. That last part is the one I keep returning to. Not as a novelty, but as a genuine shift in how I can build on what I have already made.

The semantic layer runs on Cloudflare Workers AI, with D1 for storage. Retrieval uses BGE embeddings indexed in Cloudflare Vectorize. To try it, search the site for: sitesearch Cloudflare Algolia.