From 17% to 81%: calibrating an orbital skill classifier with simulations and online SGD
A skill-routing classifier had a planet bias: 17 of 18 panel skills got classified as
planet. We tried two textbook physics frameworks (Vallado's CRTBP, Sears &
Zemansky's spectral classification); neither beat heuristics. We retuned the heuristics and
hit 50% on the panel. Then we ran the same classifier against real labelled data and
hit 81% recall@1. Then we wired an online-SGD loop on top so it keeps improving from
every user click. Here's the full journey, with the numbers.
The bug we caught with a calibration panel
Meridian's MCP routes a free-form task to compatible skills using a deterministic
orbital classifier: each candidate gets a physics signature
(mass · scope · independence · cross_domain · fragmentation · drag · dep_ratio)
and a celestial class (planet · moon · trojan · asteroid · comet · irregular).
It's been in production at mcp.ask-meridian.uk since v2.0. We added a
calibration script (scripts/calibrate-classifier.mjs), a fixed 18-skill panel
with three skills per class, and pointed it at the production code.
The result was bad. 17 of 18 skills classified as planet.
orbital.mjs, production v1: 18 panel skills, single routing task.
Investigation took about an hour and pointed straight at the mass formula:
mass = (log10(bodyLen) − 1.7) × 0.35 + min(0.4, kws.length / 15)
Tuned for short SKILL.md bodies (~300 chars, 4 keywords). For the 1500–2500-char,
6–8-keyword bodies that Llama-3.3-70B emits today, mass saturates at ~0.95
for almost everything. Same shape with scope (0.25 floor + saturating
keyword term) and independence. The planet score was
mass × scope × independence, three saturated axes multiplied.
Result: planet always wins.
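To make the saturation concrete, here is a minimal sketch of the quoted v1 formula (the helper name is ours, not the production code's):

```js
// v1 mass formula (as quoted above), evaluated at two body-length regimes.
const massV1 = (bodyLen, kwCount) =>
  (Math.log10(bodyLen) - 1.7) * 0.35 + Math.min(0.4, kwCount / 15)

console.log(massV1(300, 4).toFixed(2))   // ~0.54: the regime v1 was tuned for
console.log(massV1(2000, 7).toFixed(2))  // ~0.96: today's LLM output, saturated
```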
A real LLM-distribution shift, not a code regression: the formula was right when it
shipped, and the model behind route_task evolved out from under it.
A failure mode worth a name: silent calibration drift.
The temptation to physics our way out
The first instinct was to ground the classifier in real physics. The orbital metaphor has a ready-made textbook: David Vallado's Fundamentals of Astrodynamics and Applications, §12.7, the Circular Restricted Three-Body Problem (CRTBP). Two primaries, a test particle, a Jacobi constant. Lagrange points (L1–L5) have closed-form coordinates. Trojan asteroids live at the triangular points (L4/L5). The Hill sphere is one cube root away. Every celestial class has a textbook physical definition.
We built a CRTBP simulator (scripts/simulate-classifier-crtbp.mjs) that maps
each skill's physics signature to a state vector in the synodic frame, computes Jacobi C
via Vallado Eq. 12-15, and assigns class by physics test:
// Vallado §12.7
const M_STAR = 0.1 // mass ratio
const L4 = { x: 0.5 - M_STAR, y: Math.sqrt(3)/2 } // Eq. 12-18, triangular point
const L5 = { x: 0.5 - M_STAR, y: -Math.sqrt(3)/2 }
const HILL_R = Math.cbrt(M_STAR / 3) // Hill sphere radius (normalised units)
// Class = first match: irregular → trojan → moon → comet → planet → asteroid
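For reference, the Jacobi constant used in the class tests is the standard synodic-frame CRTBP expression; a minimal sketch (our own helper, not the simulator's exact code):

```js
// Jacobi constant C for a state (x, y, vx, vy) in the rotating (synodic) frame.
// mu is the secondary's mass ratio; the primaries sit at (-mu, 0) and (1 - mu, 0).
function jacobiC({ x, y, vx, vy }, mu = 0.1) {
  const r1 = Math.hypot(x + mu, y)        // distance to the primary
  const r2 = Math.hypot(x - (1 - mu), y)  // distance to the secondary
  return x * x + y * y + 2 * (1 - mu) / r1 + 2 * mu / r2 - (vx * vx + vy * vy)
}
// The comet rule described below fires when jacobiC(state) < C(L1).
```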
The math works. Jacobi constants computed correctly. L4/L5 landed at the textbook positions.
The Hill radius came out right. The classification result was 22%, worse than the
bug we started with: 17 of 18 panel skills became comet, because
every test particle's Jacobi C was below C(L1) and the comet rule fires on
C < C(L1). The physics is sound; the
SKILL.md → synodic-frame mapping is the bottleneck. With no labelled ground
truth to fit it, every skill ended up in the same Jacobi band, and the physics couldn't
separate them.
Sears & Zemansky Vol. 2 gave us a richer framing. Light has many more simultaneously-meaningful axes than orbits: wavelength, polarization, amplitude, phase, coherence, full spectrum, Doppler shift. Astronomy classifies stars by spectral type (O / B / A / F / G / K / M), not by orbits. We built a spectral classifier that encodes each skill as a 32-bin visible-band spectrum (Stefan-Boltzmann continuum + Doppler-broadened emission lines per domain hit) and assigns class by best-fit cosine similarity to six prototype spectra.
Same outcome: 33%, also worse than the v2 heuristic. The encoding from text to spectrum produced too-similar spectra for skills sharing a single dominant domain; sibling correlation went to 1.00 for 11 of 18 skills, and the moon test fired everywhere. The right framework with the wrong encoder produces noise.
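The prototype-matching step itself is just cosine similarity over the 32-bin spectra; a minimal sketch (the prototype shapes are illustrative, not the script's actual vectors):

```js
// Assign a class by best-fit cosine similarity between a skill's 32-bin spectrum
// and six prototype spectra, one per celestial class.
const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0)
const cosine = (a, b) =>
  dot(a, b) / ((Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))) || 1)

function classifyBySpectrum(spectrum, prototypes) {
  // prototypes: { planet: number[32], moon: number[32], ... }
  let best = { cls: null, sim: -Infinity }
  for (const [cls, proto] of Object.entries(prototypes)) {
    const sim = cosine(spectrum, proto)
    if (sim > best.sim) best = { cls, sim }
  }
  return best
}
```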
Class accuracy on the synthetic 18-skill calibration panel.
The boring fix that worked
With both physics frameworks underperforming, the pragmatic answer was rebalancing the
heuristic constants against current LLM output statistics. Three changes in
mcp/_lib/orbital.mjs:
- mass renormalised across realistic body lengths (200–3000 chars) and keyword counts (3–12). Median LLM output now hits mass ≈ 0.5 instead of saturating at ~0.95.
- Planet score switched from product to min(mass, scope, independence)^1.5. The product let two strong axes drown out a missing one; min penalises the deficit, which is what "anchor skill" actually means semantically: strong on every dimension, not strong on average. (A sketch of both scores follows this list.)
- Asteroid threshold raised from 0.4 → 0.55 to match the new (lower) mass distribution.
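A minimal sketch of the planet-score change; the exponent and names follow the description above, and the production constants live in mcp/_lib/orbital.mjs:

```js
// v1: product of three axes. Two saturated axes can mask a weak third.
const planetScoreV1 = ({ mass, scope, independence }) =>
  mass * scope * independence

// v2: the weakest axis dominates, so an "anchor skill" must be strong everywhere.
const planetScoreV2 = ({ mass, scope, independence }) =>
  Math.min(mass, scope, independence) ** 1.5

const lopsided = { mass: 0.95, scope: 0.9, independence: 0.3 }
const balanced = { mass: 0.6, scope: 0.6, independence: 0.6 }
// v1: the lopsided skill outranks the balanced one (0.26 vs 0.22)
// v2: the balanced skill wins (0.46 vs 0.16), matching the "anchor skill" intent
```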
Class accuracy on the panel: 0.167 → 0.500. Class distribution
rebalanced from two classes used to four. Mass and scope discrimination both recovered
above 0.5. Shipped at
d1051fb; the simulation script that validated it before any production
change is in the repo as scripts/simulate-classifier-v2.mjs.
The plot twist: real labelled data
The synthetic panel is a stress test: three skills per class, designed to break things.
What does the classifier do on real task→tool data? We pulled 60 rows from
shawhin/tool-use-finetuning
via the Hugging Face datasets API (no auth, free), filtered to the 21 rows with
single-correct-tool labels, and ran v2 against them.
Real labelled tool-routing data: first positive result on external evaluation.
81% recall@1, 95% recall@5. That's 8.5× the random baseline (~9.5% expected on ~20 candidates) and 10 percentage points above a 5-line token-overlap baseline. It's a much better number than the synthetic-panel score, which means the panel was harder than real-world tool routing: exactly what a stress test is supposed to do.
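For context on what the eval actually computes, here is a sketch of the fetch-and-score loop. The datasets-server endpoint is public; the config/split names and row fields (query, tools, correct_tool) are assumptions, and the real script is scripts/eval-against-public-data.mjs:

```js
// Pull labelled rows from the free Hugging Face datasets-server API (Node 18+)
// and score recall@k for a ranker.
const url = 'https://datasets-server.huggingface.co/rows' +
  '?dataset=shawhin/tool-use-finetuning&config=default&split=train&offset=0&length=60'
const { rows } = await (await fetch(url)).json()   // rows: [{ row_idx, row: {...} }, ...]

const examples = rows
  .map(r => r.row)
  .filter(r => r.correct_tool)                     // keep single-correct-tool rows

function recallAtK(examples, rank, k) {
  // rank(query, candidates) -> candidate names ordered best-first
  const hit = ex => rank(ex.query, ex.tools).slice(0, k).includes(ex.correct_tool)
  return examples.filter(hit).length / examples.length
}
```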
The panel had been overstating the bug. The classifier was always doing real work on real inputs; it just lost its margin of error on intentionally adversarial fixtures.
Closing the loop: online SGD without a training run
81% leaves 19 percentage points to the ceiling. Closing that gap normally means labelling data and fitting a classifier, but we don't have a training pipeline, and the operator machine is "a notepad with network and git". So we built the loop fully cloud-native: every component runs on free GitHub Actions, GitHub Models, or Cloudflare Workers KV. The local machine never compiles or trains anything.
POST mcp.ask-meridian.uk/v1/feedback { query, candidates[], chosen_slug }
    ↓
Worker pulls fitted weights from KV (~24 floats per class)
    ↓
One pairwise-ranking SGD step against each (chosen, every-other) pair (~1 ms)
    ↓
Worker writes weights back to KV
    ↓
Next /v1/route call:
final_score = heuristic × (1 + tanh(K · w·x))
    ↓
Better orbits → better clicks → ...
The fitted layer multiplies the heuristic v2 score by 1 + tanh(K · w·x),
bounded to [0, 2]. Cold start: w = 0, multiplier = 1, pure heuristic.
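A minimal sketch of the update and scoring, assuming a logistic pairwise-ranking loss and illustrative constants; the Worker's exact loss and learning rate live in cf-worker/online_learning.mjs:

```js
const K = 0.5, LR = 0.05            // squash gain and learning rate (illustrative)
const dot = (w, x) => w.reduce((s, wi, i) => s + wi * x[i], 0)

// Final score served by /v1/route: heuristic × bounded fitted correction in [0, 2].
const finalScore = (heuristic, w, x) => heuristic * (1 + Math.tanh(K * dot(w, x)))

// One online step on a (chosen, other) pair: logistic pairwise-ranking loss,
// whose gradient pushes the chosen candidate's fitted score above the other's.
function sgdPairStep(w, xChosen, xOther) {
  const margin = dot(w, xChosen) - dot(w, xOther)
  const g = -1 / (1 + Math.exp(margin))           // d(loss)/d(margin) for log-loss
  return w.map((wi, i) => wi - LR * g * (xChosen[i] - xOther[i]))
}

// On each /v1/feedback POST: one step against every non-chosen candidate.
function feedbackUpdate(w, featuresBySlug, chosenSlug) {
  const xChosen = featuresBySlug[chosenSlug]
  for (const [slug, x] of Object.entries(featuresBySlug)) {
    if (slug !== chosenSlug) w = sgdPairStep(w, xChosen, x)
  }
  return w
}
```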
As feedback accumulates, the weights move away from zero. At launch the worker reported
cold_start: true; within minutes of the first integration test it had
learned 21 pairs, then 405 after a single bootstrap-cron run. By the time you read this,
the production model is at:
GET https://mcp.ask-meridian.uk/v1/model-info
{
"version": "v1",
"n_updates": 64,
"n_pairs": 1213,
"cold_start": false,
"updated_at": "2026-05-07T17:47:14.815Z"
}
Two GitHub Actions cron jobs keep the loop alive without organic traffic:
- classifier-bootstrap.yml (every 3 days) fetches labelled examples from the same HF dataset, classifies them with the orbital classifier in the runner, and POSTs each to /v1/feedback. This seeds the fitted weights with real human-validated data: 21 pairwise updates per cron tick.
- classifier-health.yml (Mondays 06:00 UTC) re-runs the public-data eval and writes landing/healthz.json with current recall@1 / recall@5 and model state. It catches drift automatically: if next week's number drops by more than the tolerance, the workflow fails and the operator sees it (a sketch of the gate follows below).
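The gate in the health workflow boils down to one comparison; a sketch assuming a flat healthz.json shape, a 0.05 tolerance, and a hypothetical runPublicDataEval() helper (all three are assumptions, not the workflow's exact code):

```js
// Fail the Actions job if recall@1 regressed by more than the tolerance since
// the last published healthz.json.
import { readFile } from 'node:fs/promises'

const TOLERANCE = 0.05
const previous = JSON.parse(await readFile('landing/healthz.json', 'utf8'))
const current = await runPublicDataEval()   // hypothetical: returns { recall_at_1, recall_at_5 }

if (current.recall_at_1 < previous.recall_at_1 - TOLERANCE) {
  console.error(`recall@1 dropped ${previous.recall_at_1} -> ${current.recall_at_1}`)
  process.exit(1)                           // non-zero exit fails the workflow
}
```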
What's gained, what's not
The architecture has a few specific properties worth naming:
- Heuristic stays as the cold start. Day 1 deployments and brand-new domains use the v2 heuristic with multiplier = 1. No "training data required" onboarding cliff.
- Fitted correction is bounded. The tanh squash means no single skill can be silently boosted beyond 2× its heuristic score, even if a feedback flood tried.
- Per-request training cost is constant. One KV read, ~24 multiplies per candidate, one KV write: ~1 ms per /v1/feedback POST. The Worker stays well inside Cloudflare's free tier even at organic-feedback scale.
- Connectors stay deterministic. The OAuth-gated /mcp endpoint (Grok / ChatGPT / Claude.ai) returns the pure heuristic ranking; no per-call drift means the connector behaviour is reproducible. Only the browser-facing /v1/route endpoint applies the fitted correction.
What this doesn't get us: the remaining 19 percentage points to recall@1 = 1.0. For that, we need labelled off-distribution data, tasks the synthetic and public benchmarks don't cover. That's what the Manifund / LTFF / Bluedot grant work targets: a labelled task→skill dataset purpose-built for tool-routing failure modes.
The receipts
Five simulation scripts live in scripts/ as durable receipts of what we
tried and what worked. Each one runs without local training, just Node + network:
| script | approach | panel acc | verdict |
|---|---|---|---|
| simulate-classifier-v2.mjs | heuristic retune (mass / scope / planet / asteroid) | 0.500 | shipped |
| simulate-classifier-crtbp.mjs | Vallado §12.7: CRTBP + Jacobi + Hill | 0.222 | diagnostic only |
| simulate-classifier-spectral.mjs | Sears & Zemansky 32–38: spectral classification | 0.333 | diagnostic only |
| simulate-classifier-v2-with-crtbp.mjs | v2 + Hill-sphere moon + L4/L5 trojan | 0.389 | diagnostic only |
| eval-against-public-data.mjs | v2 vs HF benchmark (real labels) | 0.810 r@1 | weekly cron |
Source: github.com/LuuOW/meridian-mcp.
The worker (cf-worker/online_learning.mjs),
the cron workflows (.github/workflows/classifier-*.yml), and the front-end
feedback wiring (lens +
landing/miniapp/api.js) are all in tree.
Two takeaways for anyone calibrating a heuristic classifier
- Build a calibration panel before you need one. Without scripts/calibrate-classifier.mjs, the planet-bias bug would have stayed silent for weeks. The panel itself is 18 hand-crafted SKILL.md objects in a single JS file, far cheaper than not catching the drift.
- Don't physics your way out of an empirical labelling problem. The CRTBP and spectral simulations were satisfying to write, but neither beat heuristics, because the bottleneck was the SKILL-text → physics mapping, which has no closed-form solution. Online SGD on top of heuristics solved more in 50 lines than either textbook framework did in 300.
Try the live classifier
The browser miniapp at /miniapp/ calls
mcp.ask-meridian.uk/v1/route directly. Every skill you click trains the
online layer. Current model state:
GET /v1/model-info.
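If you'd rather hit the API directly, something like the following should work. The request body shapes are assumptions (the feedback fields match the diagram above; the /v1/route body and response fields are guesses), so check the repo for the exact schema:

```js
// Sketch: route a task, then send feedback for the skill you actually used.
const BASE = 'https://mcp.ask-meridian.uk'
const query = 'summarise a PDF and email the summary'

const route = await fetch(`${BASE}/v1/route`, {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ query }),                 // assumed request shape
}).then(r => r.json())

await fetch(`${BASE}/v1/feedback`, {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({
    query,
    candidates: route.candidates,                  // assumed response field
    chosen_slug: route.candidates?.[0]?.slug,      // pretend we clicked the top hit
  }),
})
```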