2026-05-07 · 10 min read · engineering

From 17% to 81%: calibrating an orbital skill classifier with simulations and online SGD

A skill-routing classifier had a planet bias: 17 of 18 panel skills were classified as planet. We tried two textbook physics frameworks (Vallado's CRTBP and Sears & Zemansky's spectral classification), and neither beat heuristics. We retuned the heuristics and hit 50% on the panel. Then we ran the same classifier against real labelled data and hit 81% recall@1. Finally, we wired an online-SGD loop on top so it keeps improving from every user click. Here's the full journey, with the numbers.

starting: 17% class accuracy on panel
after v2 retune: 50% class accuracy on panel
on real data: 81% recall@1, public benchmark (95% Wilson CI [60%, 92%], n=21)
online learning: ∞ SGD steps per user click

The bug we caught with a calibration panel

Meridian's MCP routes a free-form task to compatible skills using a deterministic orbital classifier — each candidate gets a physics signature (mass · scope · independence · cross_domain · fragmentation · drag · dep_ratio) and a celestial class (planet · moon · trojan · asteroid · comet · irregular). It's been in production at mcp.ask-meridian.uk since v2.0. We added a calibration script (scripts/calibrate-classifier.mjs) — a fixed 18-skill panel, three skills per class — and pointed it at the production code.

The result was bad. 17 of 18 skills classified as planet.

orbital.mjs — production v1, 18 panel skills, single routing task
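The shape of that calibration run can be sketched in a few lines. This is a minimal illustration, not the contents of scripts/calibrate-classifier.mjs: the panel slugs, the `calibrate` helper, and the `biasedClassify` stand-in (which simply reproduces the reported planet bias) are all hypothetical.

```javascript
// Hypothetical panel: six classes, three skills each (18 total), as described above.
const CLASSES = ["planet", "moon", "trojan", "asteroid", "comet", "irregular"];
const panel = CLASSES.flatMap((expected) =>
  [1, 2, 3].map((i) => ({ slug: `${expected}-${i}`, expected }))
);

// Stand-in for the production classifier (assumption): everything is "planet"
// except one arbitrary prediction, which is also wrong.
function biasedClassify(skill) {
  return skill.slug === "moon-1" ? "asteroid" : "planet";
}

// Score a classifier against the panel: overall accuracy plus predicted-class counts.
function calibrate(skills, classify) {
  const distribution = {};
  let correct = 0;
  for (const skill of skills) {
    const predicted = classify(skill);
    distribution[predicted] = (distribution[predicted] ?? 0) + 1;
    if (predicted === skill.expected) correct += 1;
  }
  return { accuracy: correct / skills.length, distribution };
}

const report = calibrate(panel, biasedClassify);
console.log(report.accuracy.toFixed(3), report.distribution);
```

With 17 of 18 predictions landing on planet, only the three genuine planets are correct, which is where the 0.167 panel accuracy comes from.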

Investigation took about an hour and pointed straight at the mass formula:

\text{mass} \;=\; \bigl(\log_{10}(\text{bodyLen}) - 1.7\bigr) \cdot 0.35 \;+\; \min\!\left(0.4,\; \frac{|\text{kws}|}{15}\right)

The formula was tuned for short SKILL.md bodies (~300 chars, 4 keywords). For the 1500–2500-char, 6–8 keyword bodies that Llama-3.3-70B emits today, mass saturates at ~0.95 for almost everything. Scope (0.25 floor + saturating keyword term) and independence have the same shape. The planet score was \(\text{mass} \cdot \text{scope} \cdot \text{independence}\): three saturated axes multiplied together. Result: planet always wins.
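Plugging numbers into the mass formula makes the saturation concrete. The sketch below evaluates the formula exactly as written above; the function name `massV1` is ours, and any clamping the production code may apply is left out because the post does not describe it.

```javascript
// v1 mass formula as written above; no clamping is applied in this sketch.
function massV1(bodyLen, keywordCount) {
  const lengthTerm = (Math.log10(bodyLen) - 1.7) * 0.35;
  const keywordTerm = Math.min(0.4, keywordCount / 15);
  return lengthTerm + keywordTerm;
}

const samples = [
  { label: "short body the formula was tuned for", bodyLen: 300, keywords: 4 },
  { label: "current LLM output, low end", bodyLen: 1500, keywords: 6 },
  { label: "current LLM output, high end", bodyLen: 2500, keywords: 8 },
];
for (const s of samples) {
  console.log(s.label, massV1(s.bodyLen, s.keywords).toFixed(2));
}
```

The short body scores about 0.54, while the two current-output cases score about 0.92 and 0.99. The keyword term hits its 0.4 cap at 6 keywords, so only the slowly growing log term is left to tell skills apart.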

A real LLM-distribution shift, not a code regression — the formula was right when it shipped, and the model behind route_task evolved out from under it. A failure mode worth a name: silent calibration drift.

The temptation to physics our way out

The first instinct was to ground the classifier in real physics. The orbital metaphor has a ready-made textbook: David Vallado's Fundamentals of Astrodynamics and Applications, §12.7 — the Circular Restricted Three-Body Problem (CRTBP). Two primaries, a test particle, a Jacobi constant. Lagrange points (L1–L5) have closed-form coordinates. Trojan asteroids live at the triangular points (L4/L5). The Hill sphere is one cube root away. Every celestial class has a textbook physical definition.

We built a CRTBP simulator (scripts/simulate-classifier-crtbp.mjs) that maps each skill's physics signature to a state vector in the synodic frame, computes Jacobi C via Vallado Eq. 12-15, and assigns class by physics test:

// Vallado §12.7
const M_STAR = 0.1                              // mass ratio
const L4 = { x: 0.5 - M_STAR, y:  Math.sqrt(3)/2 }    // Eq. 12-18 triangular
const L5 = { x: 0.5 - M_STAR, y: -Math.sqrt(3)/2 }
const HILL_R = Math.cbrt(M_STAR / 3)            // Hill sphere
// Class = first match: irregular → trojan → moon → comet → planet → asteroid

The math works: the Jacobi constants were computed correctly, L4/L5 landed at the textbook positions, and the Hill radius came out right. The classification result was 22%, barely better than the 17% bug we started with. 17 of 18 panel skills became comet because every test particle's Jacobi constant \(C\) was below \(C(L_1)\), and the comet rule fires on \(C < C(L_1)\). The physics is sound; the SKILL.md → synodic-frame mapping is the bottleneck. With no labelled ground truth to fit it, every skill ended up in the same Jacobi band, and physics couldn't separate them.
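To see why the comet rule is so easy to trigger, here is a minimal sketch of the quantities involved. It assumes the common convention \(C = 2\Omega - v^2\) with primaries at \((-\mu, 0)\) and \((1-\mu, 0)\); whether that matches Vallado's Eq. 12-15 term for term is an assumption, and the helper names (`omega`, `jacobi`, `findL1`, `isComet`) are ours rather than the simulator's.

```javascript
const MU = 0.1; // mass ratio, same value as M_STAR in the snippet above

// Effective potential in the rotating (synodic) frame.
function omega(x, y, mu = MU) {
  const r1 = Math.hypot(x + mu, y);     // distance to the larger primary
  const r2 = Math.hypot(x - 1 + mu, y); // distance to the smaller primary
  return 0.5 * (x * x + y * y) + (1 - mu) / r1 + mu / r2;
}

// Jacobi constant C = 2*Omega - v^2 for a state {x, y, vx, vy}.
function jacobi({ x, y, vx = 0, vy = 0 }, mu = MU) {
  return 2 * omega(x, y, mu) - (vx * vx + vy * vy);
}

// L1 sits between the primaries where the x-gradient of Omega vanishes.
// The gradient is monotonic on that interval, so plain bisection is enough.
function findL1(mu = MU) {
  const grad = (x) => x - (1 - mu) / (x + mu) ** 2 + mu / (1 - mu - x) ** 2;
  let lo = -mu + 1e-9;
  let hi = 1 - mu - 1e-9;
  for (let i = 0; i < 200; i++) {
    const mid = 0.5 * (lo + hi);
    if (grad(mid) < 0) lo = mid; else hi = mid;
  }
  return 0.5 * (lo + hi);
}

const xL1 = findL1();
const cL1 = jacobi({ x: xL1, y: 0 });
const cL4 = jacobi({ x: 0.5 - MU, y: Math.sqrt(3) / 2 });

// The comet rule described above: fire when C < C(L1).
const isComet = (state) => jacobi(state) < cL1;

console.log({ xL1, cL1, cL4, restingAtL4IsComet: isComet({ x: 0.5 - MU, y: Math.sqrt(3) / 2 }) });
```

Under this convention even a particle resting at L4 has \(C = 3 - \mu(1-\mu) = 2.91\), well below \(C(L_1)\). The threshold test alone therefore passes for a wide range of states, and the rule ordering and the text-to-state mapping end up deciding the class.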

Sears & Zemansky Vol. 2 gave us a richer framing. Light has many more simultaneously-meaningful axes than orbits: wavelength, polarization, amplitude, phase, coherence, full spectrum, Doppler shift. Astronomy classifies stars by spectral type (O / B / A / F / G / K / M), not by orbits. We built a spectral classifier that encodes each skill as a 32-bin visible-band spectrum (Stefan-Boltzmann continuum + Doppler-broadened emission lines per domain hit) and assigns class by best-fit cosine similarity to six prototype spectra.

Same outcome: 33%, also worse than the v2 heuristic. The text-to-spectrum encoding produced near-identical spectra for skills sharing a single dominant domain; sibling correlation went to 1.00 for 11 of 18 skills, so the moon test fired everywhere. The right framework with the wrong encoder produces noise.
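The failure is easy to reproduce with toy data. The sketch below is illustrative only: the 32-bin spectra, the bin positions, the amplitudes, and the two prototypes are made up, and `cosine` and `classifyBySpectrum` are hypothetical helpers rather than the code in simulate-classifier-spectral.mjs.

```javascript
// Cosine similarity between two equal-length spectra.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / Math.sqrt(na * nb);
}

// Assign the class whose prototype spectrum is most similar.
function classifyBySpectrum(spectrum, prototypes) {
  let best = { cls: null, similarity: -Infinity };
  for (const [cls, proto] of Object.entries(prototypes)) {
    const similarity = cosine(spectrum, proto);
    if (similarity > best.similarity) best = { cls, similarity };
  }
  return best;
}

// Toy 32-bin spectrum: flat continuum plus emission lines at the given bins.
function spectrum(lines) {
  const bins = new Array(32).fill(0.1);
  for (const [bin, amplitude] of lines) bins[bin] += amplitude;
  return bins;
}

// Two different skills that share one dominant domain line (bin 3).
const skillA = spectrum([[3, 5], [1, 0.2]]);
const skillB = spectrum([[3, 5], [6, 0.2]]);
const prototypes = { moon: spectrum([[3, 5]]), comet: spectrum([[20, 5]]) };

console.log(cosine(skillA, skillB).toFixed(4));
console.log(classifyBySpectrum(skillA, prototypes).cls, classifyBySpectrum(skillB, prototypes).cls);
```

A single strong line dominates both the dot product and the norms, so the two skills come out almost perfectly correlated and land on the same prototype, whatever their secondary features say.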

Class accuracy on the synthetic 18-skill calibration panel.

The boring fix that worked

With both physics frameworks underperforming, the pragmatic answer was rebalancing the heuristic constants against current LLM output statistics. Three changes in mcp/_lib/orbital.mjs:

mass renormalised across realistic body lengths (200–3000 chars) and keyword counts (3–12). Median LLM output now hits mass ≈ 0.5 instead of saturating at ~0.95.
Planet score switched from product to \(\min(\text{mass},\, \text{scope},\, \text{independence})^{1.5}\). The product let two strong axes drown out a missing one; \(\min\) penalises the deficit, which is what "anchor skill" actually means semantically — strong on every dimension, not strong on average.
Asteroid threshold raised from 0.4 to 0.55 to match the new (lower) mass distribution.
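The second change is the easiest to illustrate. A minimal sketch comparing the two planet scores on made-up signatures (the axis values below are illustrative, not taken from the panel):

```javascript
// v1: product of the three axes. v2: min of the three axes, raised to 1.5.
const planetScoreV1 = ({ mass, scope, independence }) => mass * scope * independence;
const planetScoreV2 = ({ mass, scope, independence }) =>
  Math.min(mass, scope, independence) ** 1.5;

const saturated = { mass: 0.95, scope: 0.95, independence: 0.95 }; // strong everywhere
const lopsided = { mass: 0.95, scope: 0.95, independence: 0.3 };   // one weak axis

for (const [name, sig] of Object.entries({ saturated, lopsided })) {
  console.log(name, planetScoreV1(sig).toFixed(3), planetScoreV2(sig).toFixed(3));
}
```

The saturated signature scores about 0.857 under v1 and 0.926 under v2. The lopsided one scores about 0.271 under v1 and 0.164 under v2. Under v2, raising mass and scope further does nothing for the lopsided skill: only the weakest axis counts.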

Class accuracy on the panel: 0.167 → 0.500. Class distribution rebalanced to four classes used (was two). Mass and scope discrimination both recovered above 0.5. Shipped at d1051fb; the simulation script that validated it before any production change is in the repo as scripts/simulate-classifier-v2.mjs.

The plot twist: real labelled data

The synthetic panel is a stress test — three skills per class, designed to break things. What does the classifier do on real task→tool data? We pulled 60 rows from shawhin/tool-use-finetuning via the Hugging Face datasets API (no auth, free), filtered to the 21 rows with single-correct-tool labels, and ran v2 against them.

Real labelled tool-routing data — first positive result on external evaluation. Bootstrap CIs (v2.2.1) replaced with Wilson score interval in v2.2.2 after a coverage simulation showed bootstrap under-covers at high p (66% actual coverage at p=0.95 vs nominal 95%).

81% recall@1 [95% Wilson CI 60%, 92%], 95% recall@5 [77%, 99%]. 8.5× above the random baseline (~9.5% expected on ~20 candidates) and 10 percentage points above a 5-line token-overlap baseline on point estimate. This is a much better number than the synthetic-panel score, which means the panel was harder than real-world tool routing — exactly what a stress test is supposed to do.
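For readers unfamiliar with the metric: recall@k counts a row as a hit when the labelled tool appears among the top k ranked candidates. A minimal sketch, assuming each evaluation row carries an `expected` label and a `ranked` list (both field names are hypothetical, as is the toy data):

```javascript
// rows: [{ expected: "tool-a", ranked: ["tool-b", "tool-a", ...] }, ...]
function recallAtK(rows, k) {
  const hits = rows.filter((row) => row.ranked.slice(0, k).includes(row.expected)).length;
  return hits / rows.length;
}

// Toy rows: two top-1 hits, one hit at rank 3, one miss.
const rows = [
  { expected: "search", ranked: ["search", "calc", "weather"] },
  { expected: "calc", ranked: ["calc", "search", "weather"] },
  { expected: "weather", ranked: ["search", "calc", "weather"] },
  { expected: "email", ranked: ["search", "calc", "weather"] },
];

console.log(recallAtK(rows, 1), recallAtK(rows, 5));
```

On these toy rows the function returns 0.5 for k=1 and 0.75 for k=5.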

The panel had been overstating the bug. The classifier was always doing real work on real inputs; it just lost its margin of error on intentionally-adversarial fixtures.

Honest caveat: with n=21, v2's r@1 CI is [0.60, 0.92] and the trivial token-overlap baseline's is [0.50, 0.86]. They overlap. v2 unambiguously beats the random floor (upper bound 0.35) but the 10-percentage-point lead over trivial is not statistically separable at 95% from this sample. More labelled rows would confirm whether the lead is real.
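Both intervals in that caveat can be reproduced with the standard Wilson score formula. In the sketch below the hit counts are inferred rather than quoted: 81% of 21 rows corresponds to 17 hits, and a baseline about 10 points lower corresponds to 15 hits. The function name is ours.

```javascript
// Wilson score interval for a binomial proportion (z = 1.959964 for 95%).
function wilsonInterval(successes, n, z = 1.959964) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const centre = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lower: centre - half, upper: centre + half };
}

const v2 = wilsonInterval(17, 21);       // inferred count for 81% recall@1
const baseline = wilsonInterval(15, 21); // inferred count for the token-overlap baseline

console.log(v2.lower.toFixed(2), v2.upper.toFixed(2));
console.log(baseline.lower.toFixed(2), baseline.upper.toFixed(2));
```

This gives roughly [0.60, 0.92] and [0.50, 0.86], matching the figures above. The baseline's upper bound sits well above v2's lower bound, which is the overlap the caveat refers to.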

Quick aside on the CI itself: v2.2.1 first shipped a bootstrap CI (Bell & Glasstone §1.6e Monte Carlo). A follow-up coverage simulation (5,000 trials × 5 true rates × 5 methods) revealed that the bootstrap under-covers at extreme p: at the true rate p=0.95, the nominal-95% bootstrap CI contained the truth only 66.6% of the time. Wilson and Clopper-Pearson stayed at nominal coverage across the whole range. So v2.2.2 swapped to the Wilson score interval, which is closed-form, well-calibrated for n>10, and costs zero compute. The lesson: the universal numerical method (MC) isn't always the right tool. For binomial proportions, an 18-line closed-form expression has been waiting for us since 1927.
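One plausible mechanism behind that under-coverage is easy to demonstrate. The sketch below assumes a percentile bootstrap and a sample size of n=21. The post does not say which bootstrap variant or sample size the coverage simulation used, so treat both as assumptions. When every observation is a hit, every resample is all hits too, and the interval collapses to a single point that cannot contain p=0.95.

```javascript
// Probability that n Bernoulli(p) trials are all successes.
const allSuccessProbability = (p, n) => p ** n;

// Percentile bootstrap on a 0/1 sample: resample with replacement,
// then take the 2.5% and 97.5% quantiles of the resampled means.
function percentileBootstrap(sample, resamples = 2000, rng = Math.random) {
  const means = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < sample.length; i++) {
      sum += sample[Math.floor(rng() * sample.length)];
    }
    means.push(sum / sample.length);
  }
  means.sort((a, b) => a - b);
  return {
    lower: means[Math.floor(0.025 * resamples)],
    upper: means[Math.ceil(0.975 * resamples) - 1],
  };
}

const allHits = new Array(21).fill(1);
const collapsed = percentileBootstrap(allHits);    // every resample mean is exactly 1
const missRate = allSuccessProbability(0.95, 21);  // how often the all-hits sample occurs

console.log(collapsed, missRate.toFixed(3));
```

Under these assumptions the all-hits sample turns up about 34% of the time, which on its own caps coverage near 66%, in the same range as the simulated figure. The Wilson interval stays non-degenerate in that case because its centre is pulled away from the boundary.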

Closing the loop: online SGD without a training run

81% leaves 19 percentage points to ceiling. Closing that gap normally means labelling data and fitting a classifier — but we don't have a training pipeline, and the operator machine is "a notepad with network and git". So we built the loop fully cloud-native: every component runs on free GitHub Actions, GitHub Models, or Cloudflare Workers KV. The local machine never compiles or trains anything.

user click on a planet in lens / miniapp / Photon
↓
POST mcp.ask-meridian.uk/v1/feedback { query, candidates[], chosen_slug }
↓
Worker pulls fitted-weights from KV (~24 floats per class)
↓
One pairwise-ranking SGD step against (chosen, every-other) — ~1 ms
↓
Worker writes weights back to KV
↓
Next /v1/route call applies the fitted correction (formula below)
↓
Better orbits → better clicks → ...

\text{score}_{\text{final}} \;=\; \text{score}_{\text{heuristic}} \;\cdot\; \bigl(1 \,+\, \tanh(K \cdot \mathbf{w} \cdot \mathbf{x})\bigr)

The fitted layer multiplies the heuristic v2 ranking by \(1 + \tanh(K \cdot \mathbf{w} \cdot \mathbf{x}) \in [0,\, 2]\), so no candidate can be silently boosted beyond \(2\times\) heuristic. Cold start: \(\mathbf{w} = \mathbf{0}\), multiplier \(= 1\), pure heuristic. As feedback accumulates, weights drift. At launch the worker reported cold_start: true; within minutes of the first integration test it had learned 21 pairs, then 405 after a single bootstrap-cron run. By the time you read this the production model is at:

GET https://mcp.ask-meridian.uk/v1/model-info

{
  "version": "v1",
  "n_updates": 64,
  "n_pairs":  1213,
  "cold_start": false,
  "updated_at": "2026-05-07T17:47:14.815Z"
}
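To make the update rule and the bounded correction above concrete, here is a minimal sketch of one feedback step. Several details are assumptions: the pairwise logistic loss, the learning rate, the value of K, the three-element feature vectors, and the use of one shared weight vector instead of per-class weights. The production code in cf-worker/online_learning.mjs may differ on any of them.

```javascript
const K = 1.0;              // squash gain (assumed value)
const LEARNING_RATE = 0.05; // assumed value

const dot = (w, x) => w.reduce((acc, wi, i) => acc + wi * x[i], 0);
const sigmoid = (t) => 1 / (1 + Math.exp(-t));

// score_final = score_heuristic * (1 + tanh(K * w.x))
function finalScore(heuristicScore, w, x) {
  return heuristicScore * (1 + Math.tanh(K * dot(w, x)));
}

// One SGD step on a pairwise logistic ranking loss, log(1 + exp(-(w.xChosen - w.xOther))),
// summed over every (chosen, other) pair. Returns the new weights and the pair count.
function sgdStep(w, candidates, chosenSlug, lr = LEARNING_RATE) {
  const chosen = candidates.find((c) => c.slug === chosenSlug);
  if (!chosen) return { w, pairs: 0 };
  const next = w.slice();
  let pairs = 0;
  for (const other of candidates) {
    if (other.slug === chosenSlug) continue;
    const diff = chosen.x.map((v, i) => v - other.x[i]);
    const scale = lr * sigmoid(-dot(w, diff)); // gradient magnitude for this pair
    diff.forEach((d, i) => { next[i] += scale * d; });
    pairs += 1;
  }
  return { w: next, pairs };
}

// Toy feedback event: three candidates, the user clicked "a".
const candidates = [
  { slug: "a", x: [1, 0, 0.5] },
  { slug: "b", x: [0, 1, 0.5] },
  { slug: "c", x: [0.2, 0.2, 0.9] },
];
const cold = [0, 0, 0];
const { w: warm, pairs } = sgdStep(cold, candidates, "a");

console.log(finalScore(0.7, cold, candidates[0].x)); // cold start: pure heuristic
console.log(pairs, finalScore(0.7, warm, candidates[0].x), finalScore(0.7, warm, candidates[1].x));
```

With zero weights the multiplier is exactly 1, so the heuristic score passes through unchanged. After one step the clicked candidate's multiplier moves above 1 and the others drift below it, and the tanh keeps the multiplier at or below 2 however large the weights grow.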

Two GitHub Actions cron jobs keep the loop alive without organic traffic:

classifier-bootstrap.yml (every 3 days) fetches labelled examples from the same HF dataset, classifies them locally with the orbital classifier in the runner, POSTs each to /v1/feedback. Seeds the fitted weights with real human-validated data — 21 pairwise updates per cron tick.
classifier-health.yml (Mondays 06:00 UTC) re-runs the public-data eval and writes landing/healthz.json with current recall@1 / @5 and model state. The drift check catches regressions automatically: if next week's number drops by more than the tolerance, the workflow fails and the operator sees it.
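The regression gate in the health workflow amounts to a comparison against last week's numbers. A minimal sketch, assuming metric names like `recall_at_1` and a tolerance of 0.05 (both hypothetical; the post does not give the healthz.json schema or the tolerance value):

```javascript
// Compare this week's metrics against last week's and list any that dropped
// by more than `tolerance` (absolute difference).
function checkHealth(previous, current, tolerance = 0.05) {
  const regressions = Object.keys(previous).filter(
    (metric) => previous[metric] - (current[metric] ?? 0) > tolerance
  );
  return { ok: regressions.length === 0, regressions };
}

const lastWeek = { recall_at_1: 0.81, recall_at_5: 0.95 };
const thisWeek = { recall_at_1: 0.71, recall_at_5: 0.95 };
const result = checkHealth(lastWeek, thisWeek);

console.log(result);
// In a workflow script, a failing check would typically end with:
//   if (!result.ok) process.exitCode = 1;
```

A non-zero exit code is enough to fail a GitHub Actions step, so the gate needs no extra tooling.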

What's gained, what's not

The architecture has a few specific properties worth naming:

Heuristic stays as the cold-start. Day 1 deployments and brand-new domains use the v2 heuristic with multiplier = 1. No "training data required" onboarding cliff.
Fitted correction is bounded. The \(\tanh\) squash means no single skill can be silently boosted beyond \(2\times\) heuristic, even if a feedback flood tried.
Per-request training cost is constant. One KV read, ~24 multiplies per candidate, one KV write. ~1 ms per /v1/feedback POST. The Worker stays well inside Cloudflare's free tier even at organic-feedback scale.
Connectors stay deterministic. The OAuth-gated /mcp endpoint (Grok / ChatGPT / Claude.ai) returns pure heuristic ranking — no per-call drift means the connector behaviour is reproducible. Only the browser-facing /v1/route endpoint applies the fitted correction.
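The constant per-request cost comes from the read-modify-write shape of the feedback handler. The sketch below shows that shape against anything with the Workers KV `get`/`put` interface. The key name, the `applyFeedback` helper, the in-memory stand-in, and the toy update function are all hypothetical; only the state field names are borrowed from the /v1/model-info output.

```javascript
const WEIGHTS_KEY = "fitted-weights:v1"; // hypothetical key name

// `kv` is a Workers KV binding (or anything with the same get/put shape).
// `update` is a pure function that turns (state, feedback) into the next state.
async function applyFeedback(kv, feedback, update) {
  const state =
    (await kv.get(WEIGHTS_KEY, "json")) ?? { w: [], n_updates: 0, n_pairs: 0 };
  const next = update(state, feedback);
  next.updated_at = new Date().toISOString();
  await kv.put(WEIGHTS_KEY, JSON.stringify(next));
  return next;
}

// In-memory stand-in for KV so the sketch runs outside a Worker.
function memoryKv() {
  const store = new Map();
  return {
    async get(key, type) {
      const raw = store.get(key);
      if (raw === undefined) return null; // KV also returns null for a missing key
      return type === "json" ? JSON.parse(raw) : raw;
    },
    async put(key, value) { store.set(key, value); },
  };
}

// Toy update: only bumps the counters; a real one would also run the SGD step.
const countPairs = (state, fb) => ({
  ...state,
  n_updates: state.n_updates + 1,
  n_pairs: state.n_pairs + fb.candidates.length - 1,
});

applyFeedback(memoryKv(), { candidates: ["a", "b", "c"], chosen_slug: "a" }, countPairs)
  .then((state) => console.log(state.n_updates, state.n_pairs));
```

One design caveat: KV offers no atomic read-modify-write, so two feedback POSTs that arrive at nearly the same time can overwrite each other's update. For a slowly drifting correction layer that loss is probably tolerable, but it is worth knowing about.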

What this doesn't get us: the remaining 19 percentage points to recall@1 = 1.0. For that, we need labelled off-distribution data — tasks the synthetic and public benchmarks don't cover. That's what the Manifund / LTFF / Bluedot grant work targets: a labelled task→skill dataset purpose-built for tool-routing failure modes.

The receipts

Five simulation scripts live in scripts/ as durable receipts of what we tried and what worked. Each one runs without local training, just Node + network:

script	approach	panel acc	verdict
`simulate-classifier-v2.mjs`	heuristic retune (mass / scope / planet / asteroid)	0.500	shipped
`simulate-classifier-crtbp.mjs`	Vallado §12.7 — CRTBP + Jacobi + Hill	0.222	diagnostic only
`simulate-classifier-spectral.mjs`	Sears & Zemansky 32–38 — spectral classification	0.333	diagnostic only
`simulate-classifier-v2-with-crtbp.mjs`	v2 + Hill-sphere moon + L4/L5 trojan	0.389	diagnostic only
`eval-against-public-data.mjs`	v2 vs HF benchmark (real labels)	0.810 r@1	weekly cron

Source: github.com/LuuOW/meridian-mcp — the worker (cf-worker/online_learning.mjs), the cron workflows (.github/workflows/classifier-*.yml), and the front-end feedback wiring (lens + landing/miniapp/api.js) are all in tree.

Two takeaways for anyone calibrating a heuristic classifier

Build a calibration panel before you need one. Without scripts/calibrate-classifier.mjs, the planet-bias bug would have stayed silent for weeks. The panel itself is 18 hand-crafted SKILL.md objects in a single JS file — cheaper than not catching the drift.
Don't physics your way out of an empirical labelling problem. The CRTBP and spectral simulations were satisfying to write but neither beat heuristics, because the bottleneck was the SKILL-text → physics mapping, which has no closed-form solution. Online SGD on top of heuristics solved more in 50 lines than either textbook framework did in 300.

Try the live classifier

The browser miniapp at /miniapp/ calls mcp.ask-meridian.uk/v1/route directly. Every skill you click trains the online layer. Current model state: GET /v1/model-info.