From 17% to 81%: calibrating an orbital skill classifier with simulations and online SGD
A skill-routing classifier had a planet bias — 17 of 18 panel skills got classified as
planet. We tried two textbook physics frameworks (Vallado's CRTBP, Sears &
Zemansky's spectral classification); neither beat heuristics. We retuned heuristics and
hit 50% on the panel. Then we ran the same classifier against real labelled data and
hit 81% recall@1. Then we wired an online-SGD loop on top so it keeps improving from
every user click. Here's the full journey, with the numbers.
The bug we caught with a calibration panel
Meridian's MCP routes a free-form task to compatible skills using a deterministic
orbital classifier — each candidate gets a physics signature
(mass · scope · independence · cross_domain · fragmentation · drag · dep_ratio)
and a celestial class (planet · moon · trojan · asteroid · comet · irregular).
It's been in production at mcp.ask-meridian.uk since v2.0. We added a
calibration script (scripts/calibrate-classifier.mjs) — a fixed 18-skill panel,
three skills per class — and pointed it at the production code.
The result was bad. 17 of 18 skills classified as planet.
orbital.mjs — production v1, 18 panel skills, single routing task
Investigation took about an hour and pointed straight at the mass formula.
It was tuned for short SKILL.md bodies (~300 chars, 4 keywords). For the 1500–2500-char,
6–8 keyword bodies that Llama-3.3-70B emits today, mass saturates at ~0.95
for almost everything. Same shape with scope (0.25 floor + saturating
keyword term) and independence. The planet score was
\(\text{mass} \cdot \text{scope} \cdot \text{independence}\), three saturated axes
multiplied. Result: planet always wins.
A real LLM-distribution shift, not a code regression — the formula was right when it
shipped, and the model behind route_task evolved out from under it.
A failure mode worth a name: silent calibration drift.
The temptation to physics our way out
The first instinct was to ground the classifier in real physics. The orbital metaphor has a ready-made textbook: David Vallado's Fundamentals of Astrodynamics and Applications, §12.7 — the Circular Restricted Three-Body Problem (CRTBP). Two primaries, a test particle, a Jacobi constant. Lagrange points (L1–L5) have closed-form coordinates. Trojan asteroids live at the triangular points (L4/L5). The Hill sphere is one cube root away. Every celestial class has a textbook physical definition.
We built a CRTBP simulator (scripts/simulate-classifier-crtbp.mjs) that maps
each skill's physics signature to a state vector in the synodic frame, computes Jacobi C
via Vallado Eq. 12-15, and assigns class by physics test:
// Vallado §12.7
const M_STAR = 0.1 // mass ratio
const L4 = { x: 0.5 - M_STAR, y: Math.sqrt(3)/2 } // Eq. 12-18 triangular
const L5 = { x: 0.5 - M_STAR, y: -Math.sqrt(3)/2 }
const HILL_R = Math.cbrt(M_STAR / 3) // Hill sphere
// Class = first match: irregular → trojan → moon → comet → planet → asteroid
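To make the comet rule concrete, here is a minimal sketch of the Jacobi constant in the standard normalised CRTBP (total mass 1, primary separation 1, barycentre at the origin). Whether this matches the exact form of Vallado Eq. 12-15 used in the script is an assumption, and `jacobi` and `findL1` are illustrative names, not functions from the repo.

```javascript
// Sketch only: standard normalised CRTBP in the synodic (rotating) frame.
const MU = 0.1; // mass ratio, same value as M_STAR above

// Jacobi constant: C = x^2 + y^2 + 2(1 - mu)/r1 + 2 mu/r2 - v^2
function jacobi({ x, y, vx = 0, vy = 0 }, mu = MU) {
  const r1 = Math.hypot(x + mu, y);     // distance to the larger primary at (-mu, 0)
  const r2 = Math.hypot(x - 1 + mu, y); // distance to the smaller primary at (1 - mu, 0)
  return x * x + y * y + (2 * (1 - mu)) / r1 + (2 * mu) / r2 - (vx * vx + vy * vy);
}

// L1 sits on the x-axis between the primaries, where the net x-force vanishes.
// The force function is monotonically increasing on that interval, so bisection is safe.
function findL1(mu = MU) {
  const f = (x) => x - (1 - mu) / (x + mu) ** 2 + mu / (x - 1 + mu) ** 2;
  let lo = -mu + 1e-9;
  let hi = 1 - mu - 1e-9;
  for (let i = 0; i < 200; i++) {
    const mid = (lo + hi) / 2;
    if (f(mid) < 0) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

const cL4 = jacobi({ x: 0.5 - MU, y: Math.sqrt(3) / 2 }); // particle at rest at L4
const cL1 = jacobi({ x: findL1(), y: 0 });                // particle at rest at L1
console.log({ cL4, cL1, cometRuleFires: cL4 < cL1 });
```

A particle at rest at L4 has \(C = 3 - \mu(1 - \mu) = 2.91\) for \(\mu = 0.1\), which is below \(C(L_1)\). Any text-to-state mapping that parks every skill in that band will trip a \(C < C(L_1)\) rule for all of them.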
The math works. Jacobi constants computed correctly. L4/L5 landed at the textbook positions.
Hill radius came out right. The classification result was 22%, barely better than the
17% we started with. In this run, 17 of 18 panel skills became comet because
every test particle's Jacobi constant \(C\) was below \(C(L_1)\), and the comet rule fires
on \(C < C(L_1)\). The physics is sound; the
SKILL.md → synodic-frame mapping is the bottleneck. With no labelled ground
truth to fit it, every skill ended up at the same Jacobi-band, and physics couldn't
separate them.
Sears & Zemansky Vol. 2 gave us a richer framing. Light has many more simultaneously-meaningful axes than orbits: wavelength, polarization, amplitude, phase, coherence, full spectrum, Doppler shift. Astronomy classifies stars by spectral type (O / B / A / F / G / K / M), not by orbits. We built a spectral classifier that encodes each skill as a 32-bin visible-band spectrum (Stefan-Boltzmann continuum + Doppler-broadened emission lines per domain hit) and assigns class by best-fit cosine similarity to six prototype spectra.
Same outcome: 33%, also worse than v2 heuristic. The encoding from text to spectrum produced too-similar spectra for skills sharing a single dominant domain; sibling correlation went to 1.00 for 11 of 18 skills, so the moon test fired everywhere. The right framework with the wrong encoder produces noise.
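The best-fit step can be sketched as a nearest-prototype lookup under cosine similarity. This is a toy illustration, not the code in simulate-classifier-spectral.mjs: the 4-bin spectra and the two prototypes are made-up numbers (the real encoder used 32 bins and six prototypes), and `cosine` and `classifyBySpectrum` are hypothetical names.

```javascript
// Cosine similarity between two equal-length spectra.
function cosine(a, b) {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Return the prototype label with the highest cosine similarity.
function classifyBySpectrum(spectrum, prototypes) {
  let best = { label: null, score: -Infinity };
  for (const [label, proto] of Object.entries(prototypes)) {
    const score = cosine(spectrum, proto);
    if (score > best.score) best = { label, score };
  }
  return best;
}

// Toy data: two skills dominated by the same bin (same "domain").
const prototypes = { planet: [1, 0.2, 0.1, 0], comet: [0, 0.1, 0.3, 1] };
const skillA = [0.9, 0.25, 0.1, 0.05];
const skillB = [0.95, 0.2, 0.12, 0.02];

console.log(classifyBySpectrum(skillA, prototypes).label);
console.log(cosine(skillA, skillB)); // close to 1: the two skills are nearly indistinguishable
```

When one bin dominates, every spectrum built on it points in almost the same direction, so cosine similarity cannot separate the skills regardless of how good the prototypes are.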
Class accuracy on the synthetic 18-skill calibration panel.
The boring fix that worked
With both physics frameworks underperforming, the pragmatic answer was rebalancing the
heuristic constants against current LLM output statistics. Three changes in
mcp/_lib/orbital.mjs:
- mass renormalised across realistic body lengths (200–3000 chars) and keyword counts (3–12). Median LLM output now hits mass ≈ 0.5 instead of saturating at ~0.95.
- Planet score switched from product to \(\min(\text{mass},\, \text{scope},\, \text{independence})^{1.5}\). The product let two strong axes drown out a missing one; \(\min\) penalises the deficit, which is what "anchor skill" actually means semantically: strong on every dimension, not strong on average.
- Asteroid threshold raised from 0.4 → 0.55 to match the new (lower) mass distribution.
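The difference between the product and the \(\min^{1.5}\) planet score is easy to see on two toy signatures. The axis values below are invented for illustration and the function names are hypothetical; only the two formulas come from the text above.

```javascript
// Old planet score: product of the three axes.
const productScore = ({ mass, scope, independence }) => mass * scope * independence;

// New planet score: weakest axis raised to the power 1.5.
const minScore = ({ mass, scope, independence }) =>
  Math.pow(Math.min(mass, scope, independence), 1.5);

// Made-up signatures: one skill strong on two axes but weak on the third,
// one skill moderately strong on all three.
const lopsided = { mass: 0.95, scope: 0.95, independence: 0.3 };
const balanced = { mass: 0.6, scope: 0.6, independence: 0.6 };

console.log(productScore(lopsided), productScore(balanced)); // product prefers the lopsided skill
console.log(minScore(lopsided), minScore(balanced));         // min^1.5 prefers the balanced skill
```

Under the product, the lopsided skill scores about 0.27 against 0.22 for the balanced one. Under \(\min^{1.5}\) the order flips, to roughly 0.16 against 0.46.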
Class accuracy on the panel: 0.167 → 0.500. Class distribution
rebalanced to four classes used (was two). Mass and scope discrimination both recovered
above 0.5. Shipped at
d1051fb; the simulation script that validated it before any production
change is in the repo as scripts/simulate-classifier-v2.mjs.
The plot twist: real labelled data
The synthetic panel is a stress test — three skills per class, designed to break things.
What does the classifier do on real task→tool data? We pulled 60 rows from
shawhin/tool-use-finetuning
via the Hugging Face datasets API (no auth, free), filtered to the 21 rows with
single-correct-tool labels, and ran v2 against them.
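For reference, recall@k on single-label rows is the fraction of rows whose correct tool appears in the top k ranked candidates. The sketch below assumes a hypothetical row shape (`ranked`, `correct`); it is not the data structure used in eval-against-public-data.mjs.

```javascript
// rows: [{ ranked: ["tool_a", "tool_b", ...], correct: "tool_b" }, ...]
// ranked is the classifier's output, best candidate first.
function recallAtK(rows, k) {
  if (rows.length === 0) return 0;
  const hits = rows.filter((row) => row.ranked.slice(0, k).includes(row.correct)).length;
  return hits / rows.length;
}

// Toy example with three rows (tool names are placeholders).
const rows = [
  { ranked: ["search", "calc", "mail"], correct: "search" }, // hit at rank 1
  { ranked: ["calc", "search", "mail"], correct: "search" }, // hit at rank 2
  { ranked: ["calc", "mail", "search"], correct: "weather" }, // miss
];

console.log(recallAtK(rows, 1));
console.log(recallAtK(rows, 2));
```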
Real labelled tool-routing data — first positive result on external evaluation. Bootstrap CIs (v2.2.1) replaced with Wilson score interval in v2.2.2 after a coverage simulation showed bootstrap under-covers at high p (66% actual coverage at p=0.95 vs nominal 95%).
81% recall@1 [95% Wilson CI 60%, 92%], 95% recall@5 [77%, 99%]. That is about 8.5× the random baseline (~9.5% expected on ~20 candidates) and 10 percentage points above a 5-line token-overlap baseline on point estimate. This is a much better number than the synthetic-panel score, which means the panel was harder than real-world tool routing, exactly what a stress test is supposed to do.
The panel had been overstating the bug. The classifier was always doing real work on real inputs; it just lost its margin of error on intentionally-adversarial fixtures.
Honest caveat: with n=21, v2's r@1 CI is [0.60, 0.92] and the trivial token-overlap baseline's is [0.50, 0.86]. They overlap. v2 unambiguously beats the random floor (upper bound 0.35) but the 10-percentage-point lead over trivial is not statistically separable at 95% from this sample. More labelled rows would confirm whether the lead is real.
Quick aside on the CI itself: 2.2.1 first shipped a bootstrap CI (Bell & Glasstone §1.6e Monte Carlo). A follow-up coverage simulation (5,000 trials × 5 true rates × 5 methods) revealed bootstrap under-covers at extreme p — at the true rate p=0.95, the nominal-95% bootstrap CI only contained the truth 66.6% of the time. Wilson and Clopper-Pearson stayed at nominal coverage across the whole range. So 2.2.2 swapped to the Wilson score interval — closed-form, well-calibrated for n>10, zero compute. The lesson: the universal numerical method (MC) isn't always the right tool. For binomial proportions, an 18-line closed-form expression has been waiting for us since 1927.
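The Wilson score interval is short enough to show in full. This is a generic sketch of the textbook formula, not the repo's implementation. The counts used below (17 of 21 and 20 of 21) are inferred from the reported 81% and 95% on n=21, and z = 1.96 is the usual 95% normal quantile.

```javascript
// Wilson score interval for a binomial proportion.
function wilsonInterval(successes, n, z = 1.96) {
  if (n === 0) return { lo: 0, hi: 1 };
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const centre = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lo: Math.max(0, centre - half), hi: Math.min(1, centre + half) };
}

console.log(wilsonInterval(17, 21)); // recall@1
console.log(wilsonInterval(20, 21)); // recall@5
```

With those inferred counts the function reproduces the intervals quoted above to two decimals: roughly [0.60, 0.92] for recall@1 and [0.77, 0.99] for recall@5.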
Closing the loop: online SGD without a training run
81% leaves 19 percentage points to ceiling. Closing that gap normally means labelling data and fitting a classifier — but we don't have a training pipeline, and the operator machine is "a notepad with network and git". So we built the loop fully cloud-native: every component runs on free GitHub Actions, GitHub Models, or Cloudflare Workers KV. The local machine never compiles or trains anything.
↓
POST mcp.ask-meridian.uk/v1/feedback { query, candidates[], chosen_slug }
↓
Worker pulls fitted-weights from KV (~24 floats per class)
↓
One pairwise-ranking SGD step against (chosen, every-other) — ~1 ms
↓
Worker writes weights back to KV
↓
Next /v1/route call applies the fitted correction (formula below)
↓
Better orbits → better clicks → ...
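One way to read "one pairwise-ranking SGD step" is a logistic pairwise loss on a linear score \(\mathbf{w} \cdot \mathbf{x}\). The sketch below makes that assumption explicitly; the actual loss, learning rate, and per-class weight layout in cf-worker/online_learning.mjs may differ, and every name here is hypothetical.

```javascript
const dot = (a, b) => a.reduce((sum, ai, i) => sum + ai * b[i], 0);
const sigmoid = (t) => 1 / (1 + Math.exp(-t));

// One SGD pass over the pairs (chosen, other) for every non-chosen candidate.
// Loss per pair (assumed): log(1 + exp(-(w.x_chosen - w.x_other))).
// w: weight vector; chosen, others[i]: feature vectors of the same length.
function sgdStep(w, chosen, others, learningRate = 0.05) {
  const next = w.slice();
  for (const other of others) {
    const diff = chosen.map((v, i) => v - other[i]);
    // Gradient magnitude: large when the pair is ranked wrongly, near 0 when it is ranked well.
    const g = sigmoid(-dot(next, diff));
    for (let i = 0; i < next.length; i++) next[i] += learningRate * g * diff[i];
  }
  return next;
}

// Cold start: w = 0, the user picked the candidate with features [1, 0].
const updated = sgdStep([0, 0], [1, 0], [[0, 1]]);
console.log(updated); // the chosen candidate now scores higher than the other
```

The update only moves weights along the difference between the chosen and rejected feature vectors, so a click on a candidate that already ranks comfortably first changes almost nothing.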
The fitted layer multiplies the heuristic v2 ranking by
\(1 + \tanh(K \cdot \mathbf{w} \cdot \mathbf{x}) \in [0,\, 2]\), so no candidate can be
silently boosted beyond \(2\times\) heuristic. Cold start: \(\mathbf{w} = \mathbf{0}\),
multiplier \(= 1\), pure heuristic. As feedback accumulates, weights drift. At launch the
worker reported
cold_start: true; within minutes of the first integration test it had
learned 21 pairs, then 405 after a single bootstrap-cron run. By the time you read this
the production model is at:
GET https://mcp.ask-meridian.uk/v1/model-info
{
"version": "v1",
"n_updates": 64,
"n_pairs": 1213,
"cold_start": false,
"updated_at": "2026-05-07T17:47:14.815Z"
}
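The bounded correction \(1 + \tanh(K \cdot \mathbf{w} \cdot \mathbf{x})\) can be sketched in a few lines. The value of K and the feature vector layout are not given in this post, so both are placeholders here, and the function names are hypothetical.

```javascript
const dot = (a, b) => a.reduce((sum, ai, i) => sum + ai * b[i], 0);

// Multiplier in [0, 2]: 1 at cold start, never more than 2x the heuristic.
function fittedMultiplier(w, x, K = 1) {
  return 1 + Math.tanh(K * dot(w, x));
}

// Apply the fitted correction on top of a heuristic score.
function correctedScore(heuristicScore, w, x, K = 1) {
  return heuristicScore * fittedMultiplier(w, x, K);
}

const x = [0.5, 0.2, 0.9];                        // made-up feature vector
console.log(correctedScore(0.7, [0, 0, 0], x));   // cold start: heuristic unchanged
console.log(fittedMultiplier([100, 100, 100], x)); // saturates at the 2x cap
console.log(fittedMultiplier([-100, -100, -100], x)); // saturates at the floor
```

Because \(\tanh\) saturates, extreme weights push the multiplier towards 0 or 2 but never past either bound.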
Two GitHub Actions cron jobs keep the loop alive without organic traffic:
- classifier-bootstrap.yml (every 3 days) fetches labelled examples from the same HF dataset, classifies them locally with the orbital classifier in the runner, and POSTs each to /v1/feedback. This seeds the fitted weights with real human-validated data: 21 pairwise updates per cron tick.
- classifier-health.yml (Mondays 06:00 UTC) re-runs the public-data eval and writes landing/healthz.json with current recall@1 / @5 and model state. This catches drift automatically: if next week's number drops by more than the tolerance, the workflow fails and the operator sees it.
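The health check's pass/fail decision amounts to a tolerance comparison between two weekly numbers. The sketch below is an assumption about its shape: the tolerance value and the function name are hypothetical, and the real workflow may compare more than recall@1.

```javascript
// Compare this week's recall@1 against last week's and flag large drops.
// tolerance is a placeholder; the post does not state the real value.
function checkDrift(previousRecall, currentRecall, tolerance = 0.05) {
  const drop = previousRecall - currentRecall;
  return { ok: drop <= tolerance, drop };
}

console.log(checkDrift(0.81, 0.80)); // small dip, within tolerance
console.log(checkDrift(0.81, 0.70)); // large drop, should fail the run
```

In a Node script run by GitHub Actions, setting `process.exitCode = 1` when `ok` is false is enough to mark the step as failed.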
What's gained, what's not
The architecture has a few specific properties worth naming:
- Heuristic stays as the cold-start. Day 1 deployments and brand-new domains use the v2 heuristic with multiplier = 1. No "training data required" onboarding cliff.
- Fitted correction is bounded. The \(\tanh\) squash means no single skill can be silently boosted beyond \(2\times\) heuristic, even if a feedback flood tried.
- Per-request training cost is constant. One KV read, ~24 multiplies per candidate, one KV write. ~1 ms per /v1/feedback POST. The Worker stays well inside Cloudflare's free tier even at organic-feedback scale.
- Connectors stay deterministic. The OAuth-gated /mcp endpoint (Grok / ChatGPT / Claude.ai) returns pure heuristic ranking; no per-call drift means the connector behaviour is reproducible. Only the browser-facing /v1/route endpoint applies the fitted correction.
What this doesn't get us: the remaining 19 percentage points to recall@1 = 1.0. For that, we need labelled off-distribution data — tasks the synthetic and public benchmarks don't cover. That's what the Manifund / LTFF / Bluedot grant work targets: a labelled task→skill dataset purpose-built for tool-routing failure modes.
The receipts
Five simulation scripts live in scripts/ as durable receipts of what we
tried and what worked. Each one runs without local training, just Node + network:
| script | approach | panel acc | verdict |
|---|---|---|---|
| simulate-classifier-v2.mjs | heuristic retune (mass / scope / planet / asteroid) | 0.500 | shipped |
| simulate-classifier-crtbp.mjs | Vallado §12.7: CRTBP + Jacobi + Hill | 0.222 | diagnostic only |
| simulate-classifier-spectral.mjs | Sears & Zemansky 32–38: spectral classification | 0.333 | diagnostic only |
| simulate-classifier-v2-with-crtbp.mjs | v2 + Hill-sphere moon + L4/L5 trojan | 0.389 | diagnostic only |
| eval-against-public-data.mjs | v2 vs HF benchmark (real labels) | 0.810 r@1 | weekly cron |
Source: github.com/LuuOW/meridian-mcp
— the worker (cf-worker/online_learning.mjs),
the cron workflows (.github/workflows/classifier-*.yml), and the front-end
feedback wiring (lens +
landing/miniapp/api.js) are all in tree.
Two takeaways for anyone calibrating a heuristic classifier
- Build a calibration panel before you need one. Without scripts/calibrate-classifier.mjs, the planet-bias bug would have stayed silent for weeks. The panel itself is 18 hand-crafted SKILL.md objects in a single JS file, which is cheaper than not catching the drift.
- Don't physics your way out of an empirical labelling problem. The CRTBP and spectral simulations were satisfying to write but neither beat heuristics, because the bottleneck was the SKILL-text → physics mapping, which has no closed-form solution. Online SGD on top of heuristics solved more in 50 lines than either textbook framework did in 300.
Try the live classifier
The browser miniapp at /miniapp/ calls
mcp.ask-meridian.uk/v1/route directly. Every skill you click trains the
online layer. Current model state:
GET /v1/model-info.
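For anyone scripting against the feedback endpoint rather than clicking, the request can be sketched as below. The field names (query, candidates, chosen_slug) come from the flow described earlier; treating candidates as a plain array of slug strings is an assumption, as are the helper name and the example slugs.

```javascript
const FEEDBACK_URL = "https://mcp.ask-meridian.uk/v1/feedback";

// Build fetch() options for one feedback event.
// Assumes candidates is an array of slug strings; the real payload may be richer.
function buildFeedbackRequest(query, candidates, chosenSlug) {
  if (!candidates.includes(chosenSlug)) {
    throw new Error("chosen_slug must be one of the candidates");
  }
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, candidates, chosen_slug: chosenSlug }),
  };
}

const request = buildFeedbackRequest(
  "summarise a pull request",       // placeholder task
  ["code-review", "git-summary"],   // placeholder slugs
  "git-summary"
);
console.log(request.body);
// To send it: await fetch(FEEDBACK_URL, request)
```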