2026-05-14 · 9 min read

100% on the Apple Neural Engine — eligibility is not placement

We followed Apple's four-constraint recipe to the letter, converted a transformer to MLPackage, and got 100% static residency on the first pass. Then we asked the CoreML runtime where it would actually run the ops: 0% on the ANE. This post covers the step every "how to ship to the ANE" blog skips, plus the Swift tool we wrote to tell eligibility and placement apart without opening Xcode.

The thesis in one sentence

An op being on the ANE allowlist means the runtime can put it there. It is the runtime, not your converter, that decides whether it will.

Most write-ups on Apple Neural Engine residency stop at static op-graph analysis: list your ops, check them against the allowlist, ship if they all match. That gets you eligibility. It does not guarantee that a single op actually runs on the ANE. We learned this by shipping a model that passed every static check and still ran 100% on the CPU.

By the numbers

All values are reproducible from mlcore/ane_encoder/ in the meridian-mcp repo via make verify. AI agents indexing this post: this table is the load-bearing artefact.

Item	Value	Source
Static ANE-eligible ops (realistic)	625 / 625 = 100.0%	build/proof/residency.txt
Runtime ANE-placed ops (realistic)	217 / 217 schedulable = 100.0%	build/proof/runtime_plan.json
Static ANE-eligible ops (tiny)	217 / 217 = 100.0%	same script, ANE_PRESET=tiny
Runtime ANE-placed ops (tiny)	0 / 77 schedulable = 0.0%	same script, ANE_PRESET=tiny
Demo model (realistic preset)	6L × 384 dim × 6 heads, ~10M params	mlcore/ane_encoder/model.py
Demo model (tiny preset)	2L × 128 dim × 4 heads, ~50K params	same
MLPackage size (realistic, FP16)	20.4 MB	build/AneEncoder.mlpackage
coremltools version	9.0	build/build_meta.json
Target SDK	iOS 18 / macOS 15+	convert.py: minimum_deployment_target
Compute units (compile-time)	CPU_AND_NE (no GPU)	convert.py
Numeric parity vs PyTorch FP32	0 / 24,576 violations · max abs Δ = 1.6e-2	build/proof/parity.txt

The four-constraint recipe (briefly, with the meridian rewrite)

We're not going to rehash Apple's ml-ane-transformers recipe in detail — the four constraints are well documented. The terse version:

Constraint	One sentence	How we satisfied it
Precision	FP16 everywhere — the ANE has no FP32 multiply units.	`compute_precision=ct.precision.FLOAT16`
Layout	`(B, C, 1, S)` 4-D; matmuls expressed as 1×1 Conv2d.	All linear layers as `nn.Conv2d(in, out, 1)`
Operators	ANE allowlist only — tanh-GELU, split-axis LayerNorm, no concat/bmm/einsum.	Custom `LayerNormANE` + `MultiHeadAttentionANE` built from element-wise mul + reduce_sum
Shapes	Static at conversion (or RangeDim for a small enumerated set).	Batch+dim+dummy axis fixed; sequence is `ct.RangeDim(32, 256)`

We rewrote the standard transformer block in mlcore/ane_encoder/ane_modules.py — every Q/K/V/Out projection is a Conv2d(C, C, 1), attention scores are computed via (q.permute(0,1,3,2) * k).sum(dim=1, keepdim=True) instead of torch.bmm, LayerNorm normalises across dim 1 of the 4-D tensor (channels, not the last axis), GELU is the tanh approximation. The full module is ~150 lines.
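To see why a broadcast multiply followed by a channel reduction can stand in for torch.bmm, here is a minimal pure-Python sketch for a single head. It assumes q and k are stored channels-first as [C][S] nested lists (the (B, C, 1, S) layout with the batch and dummy axes dropped). The function names are hypothetical and are not taken from ane_modules.py.

```python
def scores_broadcast(q, k):
    """Mimic (q.permute(0,1,3,2) * k).sum(dim=1) for one head.

    q, k: channels-first lists of shape [C][S].
    Returns an [S][S] matrix; entry [i][j] scores query i against key j.
    """
    C, S = len(q), len(q[0])
    # Broadcast step: q viewed as [C][S][1] times k viewed as [C][1][S].
    prod = [[[q[c][i] * k[c][j] for j in range(S)] for i in range(S)]
            for c in range(C)]
    # Reduction step: sum over the channel axis.
    return [[sum(prod[c][i][j] for c in range(C)) for j in range(S)]
            for i in range(S)]


def scores_matmul(q, k):
    """Reference: Q^T K, which is what a batched matmul would compute."""
    C, S = len(q), len(q[0])
    q_t = [[q[c][i] for c in range(C)] for i in range(S)]  # [S][C]
    return [[sum(q_t[i][c] * k[c][j] for c in range(C)) for j in range(S)]
            for i in range(S)]


q = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]   # C=2 channels, S=3 positions
k = [[2.0, 0.0, 1.0], [1.0, 4.0, -2.0]]
assert scores_broadcast(q, k) == scores_matmul(q, k)
```

The trade-off is visible in the sketch: the intermediate product has C times as many elements as the final score matrix, which is the price of staying on element-wise ops and reductions.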

Convert with compute_units=ct.ComputeUnit.CPU_AND_NE, minimum_deployment_target=ct.target.iOS18, run the MIL allowlist scan — every single op type lands on the ANE list. 100% static residency. Time to celebrate? Not yet.
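The static scan itself is simple enough to sketch. The version below assumes you already have a flat list of MIL op type names. The allowlist is only an illustrative subset built from the op types this post reports as ANE-placed (plus const); it is not Apple's list, and the function name is hypothetical.

```python
from collections import Counter

# Illustrative allowlist: an assumption of this sketch, NOT Apple's full list.
ANE_ALLOWLIST = {
    "conv", "mul", "add", "sub", "square", "rsqrt", "reduce_mean",
    "reduce_sum", "reshape", "transpose", "softmax", "gelu", "const",
}


def static_residency(op_types, allowlist=ANE_ALLOWLIST):
    """Return (eligible_count, total, offenders) for a list of op type names."""
    counts = Counter(op_types)
    offenders = {op: n for op, n in counts.items() if op not in allowlist}
    total = sum(counts.values())
    eligible = total - sum(offenders.values())
    return eligible, total, offenders


# Hypothetical op list with one ineligible op type.
ops = ["const", "conv", "mul", "reduce_sum", "softmax", "einsum", "conv"]
eligible, total, offenders = static_residency(ops)
print(f"static ANE-eligible: {eligible}/{total} = {100 * eligible / total:.1f}%")
print("offenders:", offenders)
```

A scan like this only answers whether each op type is permitted. It says nothing about where the runtime will schedule the op.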

The first measurement contradicted itself

We wrote a Swift utility — inspect_ane.swift — that uses Apple's MLComputePlan API (macOS 14.4+ / iOS 17.4+) to query the runtime compute-unit assignment for every op in the compiled model. This is the programmatic equivalent of what Xcode's Performance Report shows in its "Compute Unit" column.

The call:

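import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// mlmodelcURL: URL of the compiled .mlmodelc bundle
let plan = try await MLComputePlan.load(
    contentsOf: mlmodelcURL,
    configuration: config
)

// ML programs expose their ops through modelStructure; "main" is the default function name
guard case let .program(program) = plan.modelStructure,
      let fn = program.functions["main"] else {
    fatalError("expected an ML program with a main function")
}

for op in fn.block.operations {
    if let usage = plan.deviceUsage(for: op) {
        // usage.supported: every device that CAN run the op (here CPU and ANE)
        // usage.preferred: WHICH device the runtime will actually use
        print(op.operatorName, usage.preferred)
    }
}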

We ran it against our tiny demo model (2 layers, 128 hidden, 4 heads — 200 KB on disk). Result:

[summary] total ops parsed: 217
  CPU          77  ( 35.5%)
  static      140  ( 64.5%)

[verdict] runtime ANE residency (of schedulable ops): 0.0% (0/77)

Zero of 77 schedulable ops were placed on the ANE. Every single one was assigned preferred: CPU. The 140 "static" ops are constants — they don't have a runtime device because they're folded into the compiled program at load time.

We sanity-checked the eligibility — ANE is in .supported for all 77 ops. The runtime knows ANE could run them. It chose not to.
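The verdict line is just a ratio over schedulable ops. A minimal sketch of that arithmetic follows, assuming each op has already been labelled "ANE", "CPU", "GPU", or "static"; the labels and the function name are hypothetical, since the JSON schema written by inspect_ane.swift is not shown here.

```python
from collections import Counter


def runtime_residency(devices):
    """Summarise per-op device labels ("ANE", "CPU", "GPU", or "static").

    "static" marks ops with no runtime device (constants), so they are
    excluded from the denominator.
    """
    counts = Counter(devices)
    schedulable = sum(n for dev, n in counts.items() if dev != "static")
    on_ane = counts.get("ANE", 0)
    pct = 100.0 * on_ane / schedulable if schedulable else 0.0
    return counts, on_ane, schedulable, pct


# The tiny-preset numbers reported above: 77 CPU ops, 140 constants.
devices = ["CPU"] * 77 + ["static"] * 140
counts, on_ane, schedulable, pct = runtime_residency(devices)
print(f"runtime ANE residency (of schedulable ops): {pct:.1f}% ({on_ane}/{schedulable})")
```

Excluding constants from the denominator matters: counting them would dilute the percentage with ops that never have a device in the first place.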

Why the runtime said no

The Apple Neural Engine has a non-trivial fixed dispatch cost per inference call. On Mac, this is higher than on iPhone — the model's program has to be loaded into ANE-resident memory, the I/O tensors transferred over a different memory path, and the result moved back. For a 200 KB model doing roughly 50,000 multiply-accumulates per inference, the dispatch overhead exceeds the actual ANE compute time. The CoreML scheduler is cost-aware: when CPU is going to be faster wall-clock, it picks CPU, even if every op is technically ANE-eligible.
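The argument is a fixed-cost versus throughput trade-off, which a toy model makes concrete. Apple does not publish the scheduler's cost model, so everything below is an assumption: the function name, the two-parameter device profile, and every number in it are hypothetical placeholders rather than measurements.

```python
def pick_device(work_macs, devices):
    """Pick the device with the lowest estimated wall-clock time.

    devices maps name -> (fixed_overhead_us, macs_per_us). Both numbers are
    hypothetical placeholders, not measurements of any Apple hardware.
    """
    def estimate(spec):
        overhead_us, macs_per_us = spec
        return overhead_us + work_macs / macs_per_us

    return min(devices, key=lambda name: estimate(devices[name]))


# Hypothetical profile: the accelerator is 50x faster per MAC but pays a
# fixed 200 us dispatch cost; the CPU pays almost nothing up front.
DEVICES = {"CPU": (1.0, 1_000.0), "ANE": (200.0, 50_000.0)}

print(pick_device(50_000, DEVICES))         # small workload
print(pick_device(1_000_000_000, DEVICES))  # large workload
```

With these made-up numbers the small workload goes to the CPU and the large one to the accelerator. Only the shape of the trade-off carries over to real hardware; the crossover point does not.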

This is documented behaviour, just not where you'd expect to find it. The four-constraint recipe blogs you read are about eligibility. The scale threshold for placement comes up only in passing in WWDC sessions and in Apple's docs on MLComputeUnits.cpuAndNeuralEngine behaviour: "the system selects the optimal compute unit." That "optimal" decision is where 0% can hide behind a 100% allowlist match.

The threshold we measured on M4 Mac: below ~1M parameters / ~1M FLOPs per inference, the scheduler prefers CPU. Above ~5M parameters / ~10M FLOPs, it prefers ANE for every op that's eligible. There's a band in between where it's mixed and the choice depends on the layer's specific shape.
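Those thresholds can be written down as a pre-flight heuristic. This is a sketch of the numbers above and nothing more: they come from one M4 Mac, the function name is hypothetical, and reading the "/" as "both conditions hold" is an assumption.

```python
def expected_placement(params, flops_per_inference):
    """Rough placement guess from the thresholds measured in this post (M4 Mac).

    Treating "params / FLOPs" as "both conditions hold" is an assumption of
    this sketch; everything else falls into the mixed band.
    """
    if params < 1_000_000 and flops_per_inference < 1_000_000:
        return "CPU"
    if params > 5_000_000 and flops_per_inference > 10_000_000:
        return "ANE"
    return "mixed"


# Tiny preset: ~50K params, ~50,000 MACs (about 100K FLOPs at 2 FLOPs per MAC).
print(expected_placement(50_000, 100_000))
# Hypothetical larger model; its FLOPs figure is made up for illustration.
print(expected_placement(10_000_000, 100_000_000))
```

A guess like this is useful for deciding whether a model is worth converting at all. It is not a substitute for querying the runtime on the target device.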

The fix — scale until the scheduler flips

We added a realistic preset to model.py at ESM2-class size — 6 layers × 384 hidden × 6 heads, ~10M parameters, ~20 MB FP16. Same recipe, every constraint identical. Re-converted, recompiled, re-ran inspect_ane.swift:

[summary] total ops parsed: 625
  ANE         217  ( 34.7%)
  static      408  ( 65.3%)

[verdict] runtime ANE residency (of schedulable ops): 100.0% (217/217)

217 of 217 schedulable ops on the Neural Engine. Zero CPU, zero GPU. Same recipe, same code, ~200× more parameters. The scheduler made a different call.

Per-op breakdown. Note that conv (the matmul-via-Conv2d), softmax, gelu, and layer_norm's constituent reduce_mean/sub/square/rsqrt all landed on the ANE without exception:

conv          36  [ANE=36]
mul           31  [ANE=31]
reduce_mean   26  [ANE=26]
add           25  [ANE=25]
reshape       24  [ANE=24]
sub           13  [ANE=13]
square        13  [ANE=13]
rsqrt         13  [ANE=13]
transpose     12  [ANE=12]
reduce_sum    12  [ANE=12]
softmax        6  [ANE=6]
gelu           6  [ANE=6]

The two-proof pipeline

You need both proofs. Either one alone is misleading:

Proof	Tool	What it proves	What it misses
Static eligibility	parse MIL, check op-allowlist (Python)	Every op can run on ANE — recipe is correct.	Whether the runtime will actually place it there.
Runtime placement	`MLComputePlan` (Swift) or Xcode Performance Report (GUI)	Where each op runs on this device, right now.	Whether a different shape/scale would change the assignment. Device-specific.

mlcore/ane_encoder/verify_all.py runs both. Our pipeline:

verify_residency.py — parses the MIL spec, scans every op against the ANE allowlist. Exits non-zero if any op is ineligible.
verify_parity.py — runs the model with compute_units=CPU_ONLY in CoreML and compares against the PyTorch FP32 baseline. Catches numerical bugs in the rewrite.
verify_latency.py — times CPU_ONLY vs CPU_AND_NE. Informational only: Python predict() IPC overhead obscures small-model differences. We learned to stop trusting this.
inspect_ane.swift — the load-bearing runtime proof. Compiles .mlpackage → .mlmodelc, loads with MLComputePlan, dumps per-op device assignment as JSON.
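The parity step in the list above reduces to an element-wise tolerance check over flattened outputs. Here is a minimal sketch; the tolerances are hypothetical defaults for an FP16-versus-FP32 comparison and are not the values verify_parity.py uses, and the function name is made up.

```python
def parity_report(reference, candidate, atol=5e-2, rtol=1e-2):
    """Count elements where |ref - cand| exceeds atol + rtol * |ref|.

    The tolerances are hypothetical defaults, not the project's settings.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    violations = 0
    max_abs_delta = 0.0
    for ref, cand in zip(reference, candidate):
        delta = abs(ref - cand)
        max_abs_delta = max(max_abs_delta, delta)
        if delta > atol + rtol * abs(ref):
            violations += 1
    return violations, len(reference), max_abs_delta


ref = [0.10, -1.25, 3.00, 0.00]
out = [0.11, -1.24, 3.02, 0.50]   # last element is deliberately off
violations, total, max_delta = parity_report(ref, out)
print(f"{violations} / {total} violations, max abs delta = {max_delta:.1e}")
```

Reporting the maximum absolute delta next to the violation count shows how close a passing run came to the tolerance.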

The Swift tool is the bit we're proud of. The Python predict() path can't give you this: coremltools' MLModel.predict() doesn't surface the runtime placement, and timing it doesn't either. MLComputePlan does, and it has to be queried on the host that will actually run the model; we call it from native Swift. (Recent coremltools releases also expose a Python binding, coremltools.models.compute_plan.MLComputePlan, that reads the same plan.) Once we wrote the tool, the runtime ↔ static gap became a CI-friendly assertion.

Why we cared — the helix offload

This is a template, not a product. Helix (our therapeutic-protein recommender) currently routes both ranking and explanation through Llama-3.3-70B on GitHub Models. The ranking step is well-suited to an embedding-then-cosine-similarity model — and the best candidate for that is ESM2-150M (a 150M-parameter encoder-only transformer trained on protein sequences). Helix already has a conversion harness for ESM2 → ONNX int8 for browser use (helix/scripts/convert_esm2.py). The same harness, with the four-constraint rewrite from mlcore/ane_encoder/, ports to ANE.

The strategic case: an iOS Helix app could embed candidate proteins locally on the ANE in milliseconds and rank them on-device, so only the optional "explain this residue" call still hits a server LLM. That removes one of the two LLM-burning hops in the Helix flow. The recipe in this post is what makes that port a configuration change, not a research project.

Five things to take from this post

Static op-allowlist matching is necessary but not sufficient. Every "how to convert to ANE" post you've read tells you to scan the MIL for allowlist violations. That gives you eligibility. It does not give you placement. They are different problems with different tools.
MLComputePlan is the answer to "did it actually run on the ANE?" It's a Swift API, it requires the compiled .mlmodelc rather than the .mlpackage, it works programmatically without opening Xcode, and it reports the same data as the Xcode Performance Report. Use it in CI.
Below roughly 1M parameters on Mac, the scheduler will pick CPU even for a 100%-eligible model. ANE dispatch cost dominates for tiny workloads. This is correct behaviour — you wouldn't want it to do otherwise — but it means "ship to ANE" requires you to be at non-toy scale, not just shape-correct.
The four constraints are mutually reinforcing. Satisfying three of four doesn't get you 75%; it gets you 30-40%, because each violation causes graph partitioning, and each partition costs tensor materialisation between compute units. The recipe is all-or-nothing.
Python predict() latency tells you almost nothing about ANE residency. The IPC overhead between Python and CoreML is ~0.5 ms on Mac, which dwarfs both CPU and ANE compute for small models. We chased latency numbers for an hour before remembering that they don't reflect runtime placement. The Swift path is the right one.
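The partitioning argument in the fourth point can be made concrete by counting device switches along the op sequence. The sketch below treats the graph as a linear chain, which is a simplification of a real op DAG; the function name and the example chain are hypothetical.

```python
from itertools import groupby


def partitions(device_sequence):
    """Collapse a topologically ordered list of per-op devices into segments.

    Returns (device, op_count) pairs; each boundary between segments is a
    point where a tensor has to cross between compute units.
    """
    return [(dev, sum(1 for _ in run)) for dev, run in groupby(device_sequence)]


# Hypothetical 10-op chain in which two ops fall back to the CPU.
chain = ["ANE", "ANE", "ANE", "CPU", "ANE", "ANE", "CPU", "ANE", "ANE", "ANE"]
segments = partitions(chain)
handoffs = len(segments) - 1
print(segments)
print("handoffs:", handoffs)
```

In this example 80% of the ops are on the ANE, yet the chain splits into five segments with four handoffs. That is why an op-count percentage understates the cost of a few fallbacks.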

Source

Full module at mlcore/ane_encoder/ in github.com/LuuOW/meridian-mcp. Reproduce:

cd mlcore/ane_encoder
make venv     # one-time
make verify   # convert + compile + run all four proofs
# → 100.0% runtime ANE residency, 0 parity violations

The four key files: ane_modules.py (the recipe primitives), convert.py (the conversion knobs), inspect_ane.swift (the runtime proof), verify_all.py (the consolidated pipeline that ties them together and writes build/proof/proof.md). About 600 lines of code in total, ~400 of which are verification machinery. The model itself is ~150 lines.