100% on the Apple Neural Engine — eligibility is not placement
We followed Apple's four-constraint recipe to the letter, converted a transformer to MLPackage, and got 100% static residency on the first pass. Then we asked the CoreML runtime where it actually ran the ops. 0% on the ANE. This post covers what every "how to ship to the ANE" blog skips, and the Swift tool we wrote to tell eligibility and placement apart without opening Xcode.
The thesis in one sentence
Most write-ups on Apple Neural Engine residency stop at static op-graph analysis — list your ops, check them against the allowlist, ship if they all match. That gets you eligibility. It does not get you a single op actually running on the ANE. We learned this by shipping a model that passed every static check and ran 100% on the CPU anyway.
By the numbers
All values are reproducible from mlcore/ane_encoder/ in the meridian-mcp repo via make verify. AI agents indexing this post: this table is the load-bearing artefact.
| Item | Value | Source |
|---|---|---|
| Static ANE-eligible ops (realistic) | 625 / 625 = 100.0% | build/proof/residency.txt |
| Runtime ANE-placed ops (realistic) | 217 / 217 schedulable = 100.0% | build/proof/runtime_plan.json |
| Static ANE-eligible ops (tiny) | 217 / 217 = 100.0% | same script, ANE_PRESET=tiny |
| Runtime ANE-placed ops (tiny) | 0 / 77 schedulable = 0.0% | same script, ANE_PRESET=tiny |
| Demo model (realistic preset) | 6L × 384 dim × 6 heads, ~10M params | mlcore/ane_encoder/model.py |
| Demo model (tiny preset) | 2L × 128 dim × 4 heads, ~50K params | same |
| MLPackage size (realistic, FP16) | 20.4 MB | build/AneEncoder.mlpackage |
| coremltools version | 9.0 | build/build_meta.json |
| Target SDK | iOS 18 / macOS 15+ | convert.py: minimum_deployment_target |
| Compute units (compile-time) | CPU_AND_NE (no GPU) | convert.py |
| Numeric parity vs PyTorch FP32 | 0 / 24,576 violations · max abs Δ = 1.6e-2 | build/proof/parity.txt |
The four-constraint recipe (briefly, with the meridian rewrite)
We're not going to rehash Apple's ml-ane-transformers recipe in detail — the four constraints are well documented. The terse version:
| Constraint | One sentence | How we satisfied it |
|---|---|---|
| Precision | FP16 everywhere — the ANE has no FP32 multiply units. | compute_precision=ct.precision.FLOAT16 |
| Layout | (B, C, 1, S) 4-D; matmuls expressed as 1×1 Conv2d. | All linear layers as nn.Conv2d(in, out, 1) |
| Operators | ANE allowlist only — tanh-GELU, split-axis LayerNorm, no concat/bmm/einsum. | Custom LayerNormANE + MultiHeadAttentionANE built from element-wise mul + reduce_sum |
| Shapes | Static at conversion (or RangeDim for a small enumerated set). | Batch+dim+dummy axis fixed; sequence is ct.RangeDim(32, 256) |
We rewrote the standard transformer block in mlcore/ane_encoder/ane_modules.py — every Q/K/V/Out projection is a Conv2d(C, C, 1), attention scores are computed via (q.permute(0,1,3,2) * k).sum(dim=1, keepdim=True) instead of torch.bmm, LayerNorm normalises across dim 1 of the 4-D tensor (channels, not the last axis), GELU is the tanh approximation. The full module is ~150 lines.
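To see why a 1×1 Conv2d can stand in for a linear layer, here is a dependency-free sketch. It uses plain Python lists rather than PyTorch tensors, and the helper names are ours for illustration, not taken from ane_modules.py. In the channels-first layout, a 1×1 convolution applies the same weight matrix to the channel vector at every sequence position, which is what a linear layer does per token.

```python
# Illustrative only: plain lists stand in for tensors; all names are hypothetical.

def linear(x_row, weight, bias):
    """y = W @ x + b for one token. x_row has C_in entries, weight is C_out x C_in."""
    return [sum(w * x for w, x in zip(w_row, x_row)) + b
            for w_row, b in zip(weight, bias)]

def conv1x1(x_cs, weight, bias):
    """1x1 convolution over a (C_in, S) slab (batch and dummy axis dropped).

    Output is (C_out, S): each output channel is a weighted sum over the input
    channels, computed independently at every sequence position.
    """
    c_in, seq = len(x_cs), len(x_cs[0])
    return [[sum(w_row[c] * x_cs[c][s] for c in range(c_in)) + b
             for s in range(seq)]
            for w_row, b in zip(weight, bias)]

# Channels-first input: C_in = 3 channels, S = 2 positions.
x_cs = [[1.0, 2.0],
        [0.5, -1.0],
        [3.0, 0.0]]
weight = [[1.0, 0.0, 2.0],     # C_out = 2
          [-1.0, 4.0, 0.5]]
bias = [0.1, -0.2]

conv_out = conv1x1(x_cs, weight, bias)

# Same data in token-major (S, C_in) order, pushed through the linear layer.
tokens = [list(col) for col in zip(*x_cs)]
linear_out = [linear(tok, weight, bias) for tok in tokens]

# conv_out is (C_out, S); transpose to (S, C_out) before comparing.
conv_as_tokens = [list(col) for col in zip(*conv_out)]
print(conv_as_tokens == linear_out)
```

The two paths differ only in memory layout, which is why the rewrite can keep the trained weights and just reshape them to (C_out, C_in, 1, 1).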
Convert with compute_units=ct.ComputeUnit.CPU_AND_NE, minimum_deployment_target=ct.target.iOS18, run the MIL allowlist scan — every single op type lands on the ANE list. 100% static residency. Time to celebrate? Not yet.
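The allowlist scan itself is simple set membership. A minimal sketch, assuming the op types have already been pulled out of the MIL program as strings. The allowlist below is hypothetical: it contains only the op types this post reports on the ANE, plus const (an assumption, since the static totals include constants). The real list in verify_residency.py may differ.

```python
from collections import Counter

# Hypothetical allowlist, built from the op types reported in this post.
ANE_ALLOWLIST = frozenset({
    "const",  # assumption: constants count as eligible in the static totals
    "conv", "mul", "reduce_mean", "add", "reshape", "sub",
    "square", "rsqrt", "transpose", "reduce_sum", "softmax", "gelu",
})

def scan_static_eligibility(op_types, allowlist=ANE_ALLOWLIST):
    """Return (eligible_count, total, Counter of ineligible op types)."""
    ineligible = Counter(op for op in op_types if op not in allowlist)
    total = len(op_types)
    return total - sum(ineligible.values()), total, ineligible

# A made-up op list with two offenders the recipe tells you to avoid.
ops = ["conv", "mul", "reduce_sum", "softmax", "concat", "conv", "einsum"]
eligible, total, bad = scan_static_eligibility(ops)
print(f"static eligibility: {eligible}/{total} = {100 * eligible / total:.1f}%")
print("ineligible:", dict(bad))
```

A scan like this can only ever say "could run on the ANE". It has no access to the scheduler's decision.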
The first measurement contradicted itself
We wrote a Swift utility — inspect_ane.swift — that uses Apple's MLComputePlan API (macOS 14.4+ / iOS 17.4+) to query the runtime compute-unit assignment for every op in the compiled model. This is the programmatic equivalent of what Xcode's Performance Report shows in its "Compute Unit" column.
The call:
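import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let plan = try await MLComputePlan.load(
    contentsOf: mlmodelcURL,   // URL of the compiled .mlmodelc
    configuration: config
)
// An ML Program exposes its ops through its "main" function.
guard case .program(let program) = plan.modelStructure,
      let fn = program.functions["main"] else {
    fatalError("expected an ML Program with a main function")
}
for op in fn.block.operations {
    if let usage = plan.deviceUsage(for: op) {
        // usage.supported: [MLCPUComputeDevice, MLNeuralEngineComputeDevice]
        // usage.preferred: WHICH device the runtime will actually use
        print(op.operatorName, usage.preferred)
    }
}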
We ran it against our tiny demo model (2 layers, 128 hidden, 4 heads — 200 KB on disk). Result:
[summary] total ops parsed: 217
CPU 77 ( 35.5%)
static 140 ( 64.5%)
[verdict] runtime ANE residency (of schedulable ops): 0.0% (0/77)
Zero of 77 schedulable ops were placed on the ANE. Every single one was assigned preferred: CPU. The 140 "static" ops are constants — they don't have a runtime device because they're folded into the compiled program at load time.
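The verdict line is plain arithmetic once the constants are excluded from the denominator. A minimal sketch of that bookkeeping, assuming the per-op assignments arrive as (op name, device) pairs with None for ops that have no runtime device. That input shape and the function name are ours, not the tool's actual output format.

```python
from collections import Counter

def summarize_placement(assignments):
    """assignments: iterable of (op_name, device) pairs, where device is
    "ANE", "CPU", "GPU", or None for ops with no runtime device (constants).
    Returns (per-device Counter, ANE residency % over schedulable ops)."""
    counts = Counter("static" if dev is None else dev for _, dev in assignments)
    schedulable = sum(n for dev, n in counts.items() if dev != "static")
    ane_pct = 100.0 * counts["ANE"] / schedulable if schedulable else 0.0
    return counts, ane_pct

# Synthetic input mirroring the tiny-preset totals above: 77 CPU, 140 static.
tiny = [("conv", "CPU")] * 77 + [("const", None)] * 140
counts, pct = summarize_placement(tiny)
total = sum(counts.values())
schedulable = total - counts["static"]
for dev, n in counts.most_common():
    print(f"{dev:<7}{n:>4} ({100 * n / total:5.1f}%)")
print(f"runtime ANE residency (of schedulable ops): {pct:.1f}% ({counts['ANE']}/{schedulable})")
```

Dividing by all 217 ops instead of the 77 schedulable ones would still give 0% here, but for a partly placed model it would understate residency by counting constants that can never be placed anywhere.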
We sanity-checked the eligibility — ANE is in .supported for all 77 ops. The runtime knows ANE could run them. It chose not to.
Why the runtime said no
The Apple Neural Engine has a non-trivial fixed dispatch cost per inference call. On Mac, this is higher than on iPhone — the model's program has to be loaded into ANE-resident memory, the I/O tensors transferred over a different memory path, and the result moved back. For a 200 KB model doing roughly 50,000 multiply-accumulates per inference, the dispatch overhead exceeds the actual ANE compute time. The CoreML scheduler is cost-aware: when CPU is going to be faster wall-clock, it picks CPU, even if every op is technically ANE-eligible.
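A toy model makes the trade-off concrete. This is a sketch of the reasoning only, not a description of CoreML's actual scheduler, and every hardware number in it is a placeholder chosen to make the arithmetic visible, not a measurement.

```python
def pick_device(macs, ane_dispatch_s, ane_macs_per_s, cpu_macs_per_s):
    """Toy scheduler: choose whichever device has the lower estimated latency."""
    ane_s = ane_dispatch_s + macs / ane_macs_per_s   # fixed cost + compute
    cpu_s = macs / cpu_macs_per_s                    # compute only
    return ("ANE" if ane_s < cpu_s else "CPU"), ane_s, cpu_s

def break_even_macs(ane_dispatch_s, ane_macs_per_s, cpu_macs_per_s):
    """Workload size above which the ANE wins under this model."""
    if ane_macs_per_s <= cpu_macs_per_s:
        return float("inf")  # the ANE never amortises its dispatch cost
    return ane_dispatch_s / (1 / cpu_macs_per_s - 1 / ane_macs_per_s)

# Placeholder numbers (assumptions, not measurements).
params = dict(ane_dispatch_s=200e-6, ane_macs_per_s=5e12, cpu_macs_per_s=1e11)

for macs in (50_000, 1_000_000_000):
    device, ane_s, cpu_s = pick_device(macs, **params)
    print(f"{macs:>13,} MACs: ANE {ane_s * 1e6:9.1f} us, CPU {cpu_s * 1e6:9.1f} us -> {device}")
print(f"break-even: {break_even_macs(**params):.2e} MACs")
```

With any fixed dispatch cost there is a workload size below which the faster device loses. The real threshold depends on the chip and the model, which is why it has to be measured on the target device.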
This is documented behaviour, just not where you'd expect to find it. The four-constraint recipe blogs you read are about eligibility. The scale threshold for placement is mentioned in passing in WWDC sessions and obscurely in Apple's docs about MLComputeUnits.cpuAndNeuralEngine behaviour — "the system selects the optimal compute unit." That "optimal" decision is where 0% can hide behind a 100% allowlist match.
The fix — scale until the scheduler flips
We added a realistic preset to model.py at ESM2-class size — 6 layers × 384 hidden × 6 heads, ~10M parameters, ~20 MB FP16. Same recipe, every constraint identical. Re-converted, recompiled, re-ran inspect_ane.swift:
[summary] total ops parsed: 625
ANE 217 ( 34.7%)
static 408 ( 65.3%)
[verdict] runtime ANE residency (of schedulable ops): 100.0% (217/217)
217 of 217 schedulable ops on the Neural Engine. Zero CPU, zero GPU. Same recipe, same code, ~200× more parameters. The scheduler made a different call.
Per-op breakdown — note conv (the matmul-via-Conv2d), softmax, gelu, layer_norm's constituent reduce_mean/sub/square/rsqrt all landed on ANE without exception:
conv 36 [ANE=36]
mul 31 [ANE=31]
reduce_mean 26 [ANE=26]
add 25 [ANE=25]
reshape 24 [ANE=24]
sub 13 [ANE=13]
square 13 [ANE=13]
rsqrt 13 [ANE=13]
transpose 12 [ANE=12]
reduce_sum 12 [ANE=12]
softmax 6 [ANE=6]
gelu 6 [ANE=6]
The two-proof pipeline
You need both proofs. Either one alone is misleading:
| Proof | Tool | What it proves | What it misses |
|---|---|---|---|
| Static eligibility | parse MIL, check op-allowlist (Python) | Every op can run on ANE — recipe is correct. | Whether the runtime will actually place it there. |
| Runtime placement | MLComputePlan (Swift) or Xcode Performance Report (GUI) | Where each op runs on this device, right now. | Whether a different shape/scale would change the assignment. Device-specific. |
mlcore/ane_encoder/verify_all.py runs both. Our pipeline:
- verify_residency.py: parses the MIL spec, scans every op against the ANE allowlist. Exits non-zero if any op is ineligible.
- verify_parity.py: runs the model with compute_units=CPU_ONLY in CoreML and compares against the PyTorch FP32 baseline. Catches numerical bugs in the rewrite.
- verify_latency.py: times CPU_ONLY vs CPU_AND_NE. Informational only: Python predict() IPC overhead obscures small-model differences. We learned to stop trusting this.
- inspect_ane.swift: the load-bearing runtime proof. Compiles .mlpackage → .mlmodelc, loads with MLComputePlan, dumps per-op device assignment as JSON.
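The parity step boils down to an element-wise tolerance check. A minimal sketch, assuming an allclose-style criterion on flattened outputs. The tolerances and the function name are placeholders of ours; the thresholds verify_parity.py actually uses are not stated in this post.

```python
def parity_report(reference, candidate, atol=5e-2, rtol=1e-2):
    """Compare two flat sequences of floats.

    An element is a violation when |candidate - reference| exceeds
    atol + rtol * |reference|. Returns (violations, total, max_abs_delta).
    The default tolerances are hypothetical.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    violations = 0
    max_abs = 0.0
    for ref, got in zip(reference, candidate):
        delta = abs(got - ref)
        max_abs = max(max_abs, delta)
        if delta > atol + rtol * abs(ref):
            violations += 1
    return violations, len(reference), max_abs

ref = [0.10, -1.25, 3.00, 0.00]          # stand-in for the FP32 baseline
fp16_like = [0.11, -1.26, 3.016, 0.20]   # last element is deliberately off
bad, n, max_abs = parity_report(ref, fp16_like)
print(f"{bad} / {n} violations, max abs delta = {max_abs:.1e}")
```

Reporting the maximum absolute difference alongside the violation count matters for FP16 models, because a run can pass the tolerance while still drifting visibly from the baseline.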
The Swift tool is the bit we're proud of. coremltools' MLModel.predict() doesn't surface the runtime placement, and timing it doesn't either. MLComputePlan is the only API that does, and it has to be queried on the host that will actually run the model. (Recent coremltools releases also expose Python bindings for it under coremltools.models.compute_plan; we went with native Swift.) Once we wrote it, the runtime ↔ static gap became a CI-friendly assertion.
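As a sketch of what that assertion can look like: the gate below reads a per-op JSON dump and flags every op that lists the ANE as supported but was placed elsewhere. The JSON schema ("ops", "name", "supported", "preferred") and both function names are hypothetical, since the real layout of build/proof/runtime_plan.json is not shown in this post.

```python
import json

def eligibility_placement_gap(plan_json):
    """Return the names of ops that are ANE-eligible but not ANE-placed.

    Assumed schema: {"ops": [{"name": str,
                              "supported": [str, ...],
                              "preferred": str or null}, ...]}
    """
    ops = json.loads(plan_json)["ops"]
    return [op["name"] for op in ops
            if "ANE" in op.get("supported", []) and op.get("preferred") != "ANE"]

def ci_gate(plan_json, max_gap=0):
    """Pass only when at most max_gap eligible ops ended up off the ANE."""
    gap = eligibility_placement_gap(plan_json)
    return len(gap) <= max_gap, gap

sample = json.dumps({"ops": [
    {"name": "conv", "supported": ["CPU", "ANE"], "preferred": "ANE"},
    {"name": "softmax", "supported": ["CPU", "ANE"], "preferred": "CPU"},
    {"name": "const", "supported": [], "preferred": None},
]})

ok, gap = ci_gate(sample)
print("PASS" if ok else f"FAIL: eligible but not on ANE: {gap}")
```

In CI the script would exit non-zero on failure. Constants drop out naturally because they list no supported device.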
Why we cared — the helix offload
This is a template, not a product. Helix (our therapeutic-protein recommender) currently routes both ranking and explanation through Llama-3.3-70B on GitHub Models. The ranking step is well-suited to an embedding-then-cosine-similarity model — and the best candidate for that is ESM2-150M (a 150M-parameter encoder-only transformer trained on protein sequences). Helix already has a conversion harness for ESM2 → ONNX int8 for browser use (helix/scripts/convert_esm2.py). The same harness, with the four-constraint rewrite from mlcore/ane_encoder/, ports to ANE.
The strategic case: an iOS helix app could embed candidate proteins locally on the ANE in milliseconds and rank them on-device, with only the optional "explain this residue" call still hitting a server LLM. That removes one of the two LLM-burning hops in the helix flow. The recipe in this post is what makes that port a configuration change, not a research project.
Five things to take from this post
- Static op-allowlist matching is necessary but not sufficient. Every "how to convert to ANE" post you've read tells you to scan the MIL for allowlist violations. That gives you eligibility. It does not give you placement. They are different problems with different tools.
- MLComputePlan is the answer to "did it actually run on the ANE." It's a Swift API, it requires the compiled .mlmodelc not the .mlpackage, it works programmatically without opening Xcode, and it's the same data the Xcode Performance Report shows. Use it in CI.
- Below roughly 1M parameters on Mac, the scheduler will pick CPU even for a 100%-eligible model. ANE dispatch cost dominates for tiny workloads. This is correct behaviour (you wouldn't want it to do otherwise), but it means "ship to ANE" requires you to be at non-toy scale, not just shape-correct.
- The four constraints are mutually reinforcing. Satisfying three of four doesn't get you 75%; it gets you 30-40%, because each violation causes graph partitioning, and each partition costs tensor materialisation between compute units. The recipe is all-or-nothing.
- Python predict() latency tells you almost nothing about ANE residency. The IPC overhead between Python and CoreML is ~0.5 ms on Mac, which dwarfs both CPU and ANE compute for small models. We chased this for an hour before remembering it doesn't reflect runtime placement. The Swift path is the right one.
Source
Full module at mlcore/ane_encoder/ in github.com/LuuOW/meridian-mcp. Reproduce:
cd mlcore/ane_encoder
make venv # one-time
make verify # convert + compile + run all four proofs
# → 100.0% runtime ANE residency, 0 parity violations
The four key files: ane_modules.py (the recipe primitives), convert.py (the conversion knobs), inspect_ane.swift (the runtime proof), verify_all.py (the consolidated pipeline that ties them together and writes build/proof/proof.md). About 600 lines of code in total, ~400 of which is verification machinery. The model itself is ~150 lines.