The Neural Engine Does Not Run Your LLM
My earlier post on AI hardware implied the Neural Engine handles on-device AI. It doesn't, not for LLMs. Here is what it actually does, and why the distinction matters.
In November I wrote that Apple's Neural Engine powers on-device AI features such as transcription, photo recognition, and real-time translation. That was correct. What I left unexamined was the assumption underneath: that the Neural Engine is also where local language models run. It is not.
This is not practical knowledge in the sense that it changes what you do tomorrow. It is foundational: understanding what the iPhone actually is as an AI platform, and where its architecture is heading.

Fixed versus flexible
The Neural Engine is a fixed-function accelerator. Apple designed it for operations that are known in advance, where the shape of the computation does not change between runs. Face ID. The camera pipeline. Siri's wake word. For these tasks, Apple can design dedicated circuits, run them at around 2 watts, and finish faster than a general-purpose processor could. The chip is fast, efficient, and inflexible by design.
Language model inference works differently. The attention mechanism in a transformer shifts with every token. Memory access patterns are irregular. The compute graph changes with the length of the input. A fixed-function accelerator cannot adapt to this. What you need is a programmable processor, and on iPhone that is the GPU.
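To make "the shape shifts with every token" concrete, here is a minimal sketch in plain Python (illustrative only, not how any framework implements attention): during decoding, each new token's query attends over every cached key so far, so the score matrix grows with each step and never has a fixed size that dedicated circuits could bake in.

```python
def attention_score_shapes(num_tokens):
    """Shape of the query-key score matrix at each decode step.

    At step t, one new query row attends over all t cached keys,
    so the matrix is (1, t) -- a different shape every single step.
    A fixed-function pipeline needs these shapes known in advance.
    """
    return [(1, t) for t in range(1, num_tokens + 1)]

print(attention_score_shapes(4))  # [(1, 1), (1, 2), (1, 3), (1, 4)]
```

Contrast this with Face ID, where the input is always the same fixed-resolution depth map and the computation is identical on every run.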
What makes local LLMs work
Apple's MLX framework, the foundation for serious on-device model work, was designed for Apple silicon's unified memory architecture. CPU, GPU, and Neural Engine share a single memory pool, so no data has to be copied between separate memory regions. For LLM inference, MLX routes the work to the GPU: a small quantised model sits in the shared pool alongside everything else, and the GPU works through it without transfer overhead.
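"Fits in the shared pool" is back-of-envelope arithmetic. A sketch in plain Python, where the 3-billion-parameter figure is an illustrative assumption rather than any specific Apple model: at 4-bit quantisation such a model's weights occupy roughly 1.5 GB, small enough to share an 8 GB phone's memory with everything else running.

```python
def model_footprint_gb(params_billions, bits_per_weight):
    """Approximate weight storage for a quantised model, in gigabytes.

    Ignores activations, KV cache, and framework overhead -- this is
    only the weights, which dominate for small models.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# A hypothetical 3B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {model_footprint_gb(3, bits):.1f} GB")
# 16-bit: 6.0 GB, 8-bit: 3.0 GB, 4-bit: 1.5 GB
```

The same model that is hopeless at 16-bit becomes plausible at 4-bit, which is why quantisation is the enabling trick for phones.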
The unified memory is the real architectural story. The Neural Engine TOPS figure Apple leads with in chip announcements describes something genuine, just not LLM inference. It describes how fast Face ID runs.
The energy gap
Your phone battery lasts a day because most of its processors sleep most of the time. The Neural Engine, at around 2 watts, can run continuously without meaningfully changing that. The GPU, at around 20 watts under load, cannot. Leave it running and your phone is warm in your pocket and dead by lunch.
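The arithmetic behind that gap is blunt. A sketch in plain Python, assuming a roughly 15 Wh iPhone-class battery (an illustrative figure) and ignoring duty cycling, which in practice keeps the Neural Engine's real draw far below its peak: even constant 2 W draw covers most of a waking day, while 20 W exhausts the battery in well under an hour.

```python
def hours_on_battery(battery_wh, draw_watts):
    """How long a battery lasts under constant draw, nothing else running."""
    return battery_wh / draw_watts

BATTERY_WH = 15  # assumed iPhone-class battery capacity

print(f"Neural Engine at 2 W:  {hours_on_battery(BATTERY_WH, 2):.2f} h")   # 7.50 h
print(f"GPU under load at 20 W: {hours_on_battery(BATTERY_WH, 20):.2f} h")  # 0.75 h
```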
That gap is why the Neural Engine exists at all. Apple uses it for tasks that never fully switch off: the microphone listening for "Hey Siri," the camera recognising scenes before you tap the shutter. Small models, running constantly, at a power cost you never notice.
Now imagine that kind of always-on behaviour applied to a language model. An assistant that reads your next meeting and prepares context. A translation layer active in the background. A model that processes before you ask. That is where the Neural Engine becomes interesting again for AI, because the GPU cannot sustain that role without draining the battery.
The catch is model size. Today's LLMs are too large and too hungry to live on the Neural Engine. But models are shrinking fast through quantisation and distillation, and the threshold keeps moving. At some point, a capable enough model will fit within what the Neural Engine can handle continuously. When that happens, always-on AI on iPhone becomes practical in a way it currently is not.
We are not there yet. For now, when you run a local LLM on iPhone, the GPU does the work. I have been experimenting with exactly this, and the results are more capable than I expected.


