Running Gemma 4 on Your iPhone
Google's Gemma 4 runs offline on your iPhone. A follow-up to my local LLM experiment, now with a sharper app, a better model, and a clearer sense of what this category is becoming.
Last summer I ran a local language model on my iPhone using Haplo AI. Gemma was the only model that actually produced a useful result. Now Google has shipped Gemma 4, and the story has changed enough to revisit.
What Gemma is
Gemma is Google's family of open-weight models, designed to be downloaded and run locally rather than accessed through a cloud API. The distinction matters. Where Gemini lives on Google's servers and requires a connection, Gemma's weights are published: you can download them, run them on your own hardware, quantize them for smaller devices, or fine-tune them for specific tasks.
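Quantization is what makes the phone-sized variants possible: the published weights are stored at lower precision to shrink the download and the memory footprint. A minimal sketch of the idea, using a simple absmax int8 scheme (an illustration only; real toolchains like MLX quantize per-block with more sophisticated formats):

```python
# Illustrative absmax quantization: map floats to int8 plus one scale
# factor, so each weight costs 1 byte instead of 2 or 4.

def quantize_int8(weights):
    """Scale floats into the int8 range [-127, 127]; keep the scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized values."""
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)   # close to the originals, small error
```

The recovered weights are slightly off, which is the whole bargain: a small, mostly tolerable loss in precision in exchange for a model that fits on a phone.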
This is also why Gemini Nano, which ships inside Chrome and on Pixel devices, is not the same thing. Nano is part of the closed Gemini family: Google controls it, distributes it as a black box, and you call it through an API without ever seeing or owning the model. Gemma is a separate release, designed from the start for local deployment. Same research lineage, completely different distribution model.
Why multiple sizes? A 2B model fits on a phone and responds in seconds. A 27B model needs a laptop with a capable GPU. Gemma 4 comes in four variants; the smallest runs on a current iPhone. The choice is always between what the hardware can carry and what the task actually needs.
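The size trade-off is ultimately memory arithmetic: parameters times bits per weight, plus runtime overhead. A back-of-envelope estimate, where the 20% overhead factor is my assumption rather than a measured figure:

```python
# Rough memory footprint of a quantized model; the overhead factor
# (KV cache, runtime buffers) is an assumption, not a measurement.

def model_memory_gb(params_billion, bits_per_weight, overhead=1.2):
    """Weights footprint in GB, padded ~20% for cache and buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

phone = model_memory_gb(4, 4)     # 4B model at 4-bit: ~2.4 GB
laptop = model_memory_gb(27, 4)   # 27B model at 4-bit: ~16 GB
```

The numbers explain the lineup: a 4-bit 4B model leaves headroom on a recent iPhone, while a 27B model at the same precision needs a machine with far more unified memory.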
Under the hood
Two apps let you run Gemma on iPhone, and they are built differently in ways that show.
AI Locally uses MLX, Apple's own machine learning framework, built around the unified memory architecture of Apple Silicon. The model runs on the GPU via Metal, tight and native. Google AI Edge Gallery uses LiteRT, which Google built for Android NPUs, then translated into Metal on iOS. It works, but it is an extra step that Apple's own framework does not need.
Neither uses the Neural Engine for LLM inference, despite what the marketing around "on-device AI" often implies. Language models need the GPU's flexibility. Face ID and camera processing run on the Neural Engine because they are fixed, predictable operations. Running the E4B model, the larger of the two downloadable Gemma 4 variants, my iPhone 16 Pro got noticeably warm. That is not a complaint. It is the GPU working: the phone is doing real compute, not calling a server.

Demo versus tool
Edge Gallery is polished and honest about what it is. Conversations are ephemeral, nothing is saved between sessions, and the feature set reads like a capabilities showcase: image questions, audio transcription, a tool-calling demo. Google built it to show what Gemma 4 can do on a phone. For that purpose it works well.
AI Locally is building toward something different. It has Shortcuts integration, a voice mode (English only for now), and a model browser that flags which models will run well on your specific device. I have been using it more than Edge Gallery, and after a few sessions the preference was clear.
The limitations are also clear. The context window is small, so long documents hit the ceiling quickly; PDF reading crashed outright; and transcription is not a realistic use case at this scale. These are not edge cases; they are the main things you would reach for a cloud model to do.
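Why long documents fail is simple budget math: the prompt and the expected reply must both fit inside the context window. A crude sketch, where the window size, output reservation, and four-characters-per-token heuristic are all assumptions rather than the apps' real figures:

```python
# Token budget check for a small on-device context window. The numbers
# and the chars-per-token heuristic are assumptions for illustration;
# real apps use the model's actual tokenizer and limits.

CONTEXT_WINDOW = 4096    # assumed on-device window size
RESERVED_OUTPUT = 512    # room kept for the model's reply

def fits(prompt: str) -> bool:
    """Estimate whether a prompt leaves room for a reply."""
    est_tokens = len(prompt) / 4   # rough 4-chars-per-token heuristic
    return est_tokens + RESERVED_OUTPUT <= CONTEXT_WINDOW

short_note = "Summarize: " + "x" * 8_000    # ~2,000 tokens: fits
long_pdf = "Summarize: " + "x" * 60_000     # ~15,000 tokens: does not
```

A few pages of notes clear the budget easily; a typical PDF blows past it several times over, which is exactly the ceiling described above.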
Progress, with caveats
What has actually changed since my Haplo AI experiment is the baseline. The hardware has caught up: on a recent iPhone, a 3B to 4B model runs fast enough to feel responsive. Gemma 4 handles Dutch well, which matters more than it might seem for a model running entirely on device. And the apps have matured from rough experiments into something you could actually use.
What has not changed is the gap with cloud models. Local AI on a phone is useful for bounded tasks where the input is sensitive: a financial document you have not decided to share, notes that should not leave the device, a question you would rather not route through a server. For that use case, the combination of privacy and reasonable quality is now genuinely good enough.
For everything else, the cloud model is still the right tool. The context window is larger, the reasoning is deeper, and nothing crashes when you open a PDF.