From 17% to 81%: calibrating an orbital skill classifier with simulations and online SGD

A skill-routing classifier had a planet bias: 17 of 18 panel skills got classified as planet. We tried two textbook physics frameworks (Vallado's CRTBP, Sears & Zemansky's spectral classification); neither beat heuristics. We retuned the heuristics and hit 50% on the panel. Then we ran the same classifier against real labelled data and hit 81% recall@1. Then we wired an online-SGD loop on top so it keeps improving from every user click. Here's the full journey, with the numbers.

starting            17%   class accuracy on panel
after v2 retune     50%   class accuracy on panel
on real data        81%   recall@1, public benchmark
online learning     ∞     SGD steps per user click

The bug we caught with a calibration panel

Meridian's MCP routes a free-form task to compatible skills using a deterministic orbital classifier: each candidate gets a physics signature (mass · scope · independence · cross_domain · fragmentation · drag · dep_ratio) and a celestial class (planet · moon · trojan · asteroid · comet · irregular). It's been in production at mcp.ask-meridian.uk since v2.0. We added a calibration script (scripts/calibrate-classifier.mjs), a fixed 18-skill panel with three skills per class, and pointed it at the production code.
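
For orientation, a physics signature is just a small per-skill numeric record, and the class is chosen from it. A hypothetical example (field values illustrative; the real construction lives in mcp/_lib/orbital.mjs):

// hypothetical signature for one candidate skill (values illustrative)
const signature = {
  mass: 0.62,            // driven by body length and keyword count
  scope: 0.48,
  independence: 0.55,
  cross_domain: 0.21,
  fragmentation: 0.10,
  drag: 0.05,
  dep_ratio: 0.18,
}
// classify(signature) -> 'planet' | 'moon' | 'trojan' | 'asteroid' | 'comet' | 'irregular'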

The result was bad. 17 of 18 skills classified as planet.

CLASS DISTRIBUTION, before retune (17/18 planet)
  planet 17 · moon 0 · trojan 0 · asteroid 0 · comet 0 · irregular 1
  class accuracy = 0.167 · 2 of 6 classes used

orbital.mjs: production v1, 18 panel skills, single routing task

Investigation took about an hour and pointed straight at the mass formula:

mass = (log10(bodyLen) − 1.7) × 0.35 + min(0.4, kws.length / 15)

Tuned for short SKILL.md bodies (~300 chars, 4 keywords). For the 1500–2500-char, 6–8-keyword bodies that Llama-3.3-70B emits today, mass saturates at ~0.95 for almost everything. Same shape with scope (0.25 floor + saturating keyword term) and independence. The planet score was mass × scope × independence, three saturated axes multiplied. Result: planet always wins.
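
You can see the saturation directly by plugging typical numbers from each regime into the v1 formula:

// v1 mass formula, evaluated at the regime it was tuned for vs. today's LLM output
const mass = (bodyLen, kws) =>
  (Math.log10(bodyLen) - 1.7) * 0.35 + Math.min(0.4, kws / 15)

mass(300, 4)    // ≈ 0.54: the ~300-char, 4-keyword bodies the constants were tuned for
mass(2000, 7)   // ≈ 0.96: a typical current SKILL.md body, pinned near the ceiling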

A real LLM-distribution shift, not a code regression. The formula was right when it shipped, and the model behind route_task evolved out from under it. A failure mode worth a name: silent calibration drift.

The temptation to physics our way out

The first instinct was to ground the classifier in real physics. The orbital metaphor has a ready-made textbook: David Vallado's Fundamentals of Astrodynamics and Applications, §12.7, the Circular Restricted Three-Body Problem (CRTBP). Two primaries, a test particle, a Jacobi constant. Lagrange points (L1–L5) have closed-form coordinates. Trojan asteroids live at the triangular points (L4/L5). The Hill sphere is one cube root away. Every celestial class has a textbook physical definition.

We built a CRTBP simulator (scripts/simulate-classifier-crtbp.mjs) that maps each skill's physics signature to a state vector in the synodic frame, computes Jacobi C via Vallado Eq. 12-15, and assigns class by physics test:

// Vallado §12.7
const M_STAR = 0.1                              // mass ratio
const L4 = { x: 0.5 - M_STAR, y:  Math.sqrt(3)/2 }    // Eq. 12-18 triangular
const L5 = { x: 0.5 - M_STAR, y: -Math.sqrt(3)/2 }
const HILL_R = Math.cbrt(M_STAR / 3)            // Hill sphere
// Class = first match: irregular → trojan → moon → comet → planet → asteroid

The math works. Jacobi constants computed correctly. L4/L5 landed at the textbook positions. Hill radius came out right. The classification result was 22%, scarcely better than the bug we started with: 17 of 18 panel skills became comet, because every test particle's Jacobi C was below C(L1) and the comet rule fires on C < C(L1). The physics is sound; the SKILL.md → synodic-frame mapping is the bottleneck. With no labelled ground truth to fit it, every skill ended up in the same Jacobi band, and the physics couldn't separate them.
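
The quantity being thresholded is the standard nondimensional Jacobi constant of the synodic frame. A minimal 2D sketch for reference (the weak point is the SKILL.md → state-vector mapping, not this formula):

// Jacobi constant in the rotating (synodic) frame, nondimensional units
function jacobiC(x, y, vx, vy, mu = 0.1) {      // mu: the M_STAR mass ratio from the snippet above
  const r1 = Math.hypot(x + mu, y)              // distance to the larger primary
  const r2 = Math.hypot(x - 1 + mu, y)          // distance to the smaller primary
  return (x * x + y * y) + 2 * (1 - mu) / r1 + 2 * mu / r2 - (vx * vx + vy * vy)
}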

Sears & Zemansky Vol. 2 gave us a richer framing. Light has many more simultaneously-meaningful axes than orbits: wavelength, polarization, amplitude, phase, coherence, full spectrum, Doppler shift. Astronomy classifies stars by spectral type (O / B / A / F / G / K / M), not by orbits. We built a spectral classifier that encodes each skill as a 32-bin visible-band spectrum (Stefan-Boltzmann continuum + Doppler-broadened emission lines per domain hit) and assigns class by best-fit cosine similarity to six prototype spectra.
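
Mechanically the assignment is nearest-prototype by cosine similarity over the 32 bins. A minimal sketch (the spectrum encoder and prototype construction are omitted):

// cosine similarity between two equal-length spectra
const cosine = (a, b) => {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
  return dot / Math.sqrt(na * nb)
}

// prototypes: { planet: [...32 bins], moon: [...], trojan: [...], ... }
const classifyBySpectrum = (spectrum, prototypes) =>
  Object.entries(prototypes)
    .map(([cls, proto]) => [cls, cosine(spectrum, proto)])
    .sort((a, b) => b[1] - a[1])[0][0]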

Same outcome: 33%, still worse than the v2 heuristic. The text-to-spectrum encoding produced near-identical spectra for skills sharing a single dominant domain; sibling correlation hit 1.00 for 11 of 18 skills, so the moon test fired everywhere. The right framework with the wrong encoder produces noise.

CLASS ACCURACY ON PANEL, five approaches tried
  v1 (production bug)   0.167
  CRTBP physics         0.222
  spectral physics      0.333
  v2 + CRTBP hybrid     0.389
  v2 retune (shipped)   0.500 ✓
two textbook physics frameworks · neither beat heuristics

Class accuracy on the synthetic 18-skill calibration panel.

The boring fix that worked

With both physics frameworks underperforming, the pragmatic answer was rebalancing the heuristic constants against current LLM output statistics: three changes in mcp/_lib/orbital.mjs, touching the mass and scope formulas and the planet and asteroid score rules.

Class accuracy on the panel: 0.167 → 0.500. Class distribution rebalanced to four classes used (was two). Mass and scope discrimination both recovered above 0.5. Shipped at d1051fb; the simulation script that validated it before any production change is in the repo as scripts/simulate-classifier-v2.mjs.
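
The shape of the change, with illustrative constants only (the shipped values live in mcp/_lib/orbital.mjs and were chosen by running simulate-classifier-v2.mjs, not picked here): recentre the log term on current body lengths and widen the keyword denominator so the axis stops pinning near 1.

// illustrative only, not the shipped constants: mass recentred for 1500-2500-char bodies
const massV2 = (bodyLen, kws) =>
  Math.max(0, Math.min(0.7, (Math.log10(bodyLen) - 3.0) * 1.2)) + Math.min(0.3, kws / 30)

massV2(1500, 6)   // ≈ 0.41
massV2(2500, 8)   // ≈ 0.74: discrimination instead of saturation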

The plot twist: real labelled data

The synthetic panel is a stress test — three skills per class, designed to break things. What does the classifier do on real task→tool data? We pulled 60 rows from shawhin/tool-use-finetuning via the Hugging Face datasets API (no auth, free), filtered to the 21 rows with single-correct-tool labels, and ran v2 against them.
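
Pulling the rows needs nothing beyond fetch against the public datasets-server endpoint. A sketch (config and split names assumed; the real script is eval-against-public-data.mjs):

// 60 rows from the Hugging Face datasets server, no auth; config/split assumed default/train
const url = 'https://datasets-server.huggingface.co/rows'
  + '?dataset=shawhin%2Ftool-use-finetuning&config=default&split=train&offset=0&length=60'
const { rows } = await (await fetch(url)).json()   // each entry: { row_idx, row, truncated_cells }
// from here: keep the rows whose label names exactly one correct tool, then score v2 against them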

RECALL ON shawhin/tool-use-finetuning (21 single-tool rows)
  recall@1: v2 0.810 · trivial 0.714 · random 0.095
  recall@5: v2 0.952 · trivial 0.857
v2 beats random by 8.5×, trivial overlap by 10pp on @1

Real labelled tool-routing data: the first positive result on an external evaluation.

81% recall@1, 95% recall@5. That's 8.5× above the random baseline (~9.5% expected on ~20 candidates) and 10 percentage points above a 5-line token-overlap baseline. This is a much better number than the synthetic-panel score, which means the panel was harder than real-world tool routing, exactly what a stress test is supposed to do.
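
The metric itself is the plain one. A sketch, assuming each evaluation row carries the classifier's ranked slugs and the single labelled tool:

// recall@k over rows of { ranked: ['slug', ...], label: 'slug' }
const recallAtK = (rows, k) =>
  rows.filter(r => r.ranked.slice(0, k).includes(r.label)).length / rows.length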

The panel had been overstating the bug's real-world impact. The classifier was always doing real work on real inputs; it just lost its margin on intentionally adversarial fixtures.

Closing the loop: online SGD without a training run

81% leaves 19 percentage points to ceiling. Closing that gap normally means labelling data and fitting a classifier β€” but we don't have a training pipeline, and the operator machine is "a notepad with network and git". So we built the loop fully cloud-native: every component runs on free GitHub Actions, GitHub Models, or Cloudflare Workers KV. The local machine never compiles or trains anything.

user click on a planet in lens / miniapp / Photon
↓
POST mcp.ask-meridian.uk/v1/feedback   { query, candidates[], chosen_slug }
↓
Worker pulls the fitted weights from KV (~24 floats per class)
↓
One pairwise-ranking SGD step against (chosen, every-other), ~1 ms
↓
Worker writes weights back to KV
↓
Next /v1/route call: final_score = heuristic × (1 + tanh(K · w·x))
↓
Better orbits → better clicks → ...
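
The feedback call itself is an ordinary JSON POST. A hypothetical example (slugs illustrative; payload fields from the diagram above):

// hypothetical click feedback from the miniapp
await fetch('https://mcp.ask-meridian.uk/v1/feedback', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({
    query: 'summarise this PDF and email me the highlights',
    candidates: ['pdf-summarizer', 'email-sender', 'web-scraper'],
    chosen_slug: 'pdf-summarizer',
  }),
})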

The fitted layer multiplies the heuristic v2 ranking by 1 + tanh(K · w·x), bounded to [0, 2]. Cold start: w = 0, multiplier = 1, pure heuristic. As feedback accumulates, weights drift. At launch the worker reported cold_start: true; within minutes of the first integration test it had learned 21 pairs, then 405 after a single bootstrap-cron run. By the time you read this the production model is at:

GET https://mcp.ask-meridian.uk/v1/model-info

{
  "version": "v1",
  "n_updates": 64,
  "n_pairs":  1213,
  "cold_start": false,
  "updated_at": "2026-05-07T17:47:14.815Z"
}
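
The update is small enough to sketch. Assumptions flagged in the comments: the feature-vector construction, the learning rate, the hinge-style pairwise loss, and K are all illustrative, and the sketch collapses the per-class weight vectors into one w; the real code is cf-worker/online_learning.mjs.

const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0)

// one pairwise-ranking step: push the chosen candidate above every other candidate shown
// w: weight vector from KV; x(c): feature vector for candidate c given the query (not shown)
function feedbackStep(w, chosen, others, x, lr = 0.05) {
  for (const other of others) {
    const margin = dot(w, x(chosen)) - dot(w, x(other))
    if (margin < 1) {                       // hinge-style: only learn from violated pairs
      for (let i = 0; i < w.length; i++) w[i] += lr * (x(chosen)[i] - x(other)[i])
    }
  }
  return w
}

// scoring: the fitted layer can scale the heuristic but never flip it (multiplier stays in (0, 2))
const finalScore = (heuristic, w, xc, K = 1) =>
  heuristic * (1 + Math.tanh(K * dot(w, xc)))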

Two GitHub Actions cron jobs (the classifier-*.yml workflows) keep the loop alive without organic traffic.

What's gained, what's not

The architecture has a few specific properties worth naming: every update is a single KV read-modify-write on free infrastructure (~1 ms per click), cold start degrades gracefully to the pure v2 heuristic (w = 0 means multiplier = 1), and the tanh bound keeps the fitted layer from ever overriding the heuristic outright (the multiplier stays in [0, 2]).

What this doesn't get us: the remaining 19 percentage points to recall@1 = 1.0. For that, we need labelled off-distribution data — tasks the synthetic and public benchmarks don't cover. That's what the Manifund / LTFF / Bluedot grant work targets: a labelled task→skill dataset purpose-built for tool-routing failure modes.

The receipts

Five simulation scripts live in scripts/ as durable receipts of what we tried and what worked. Each one runs without local training, just Node + network:

script                                  approach                                              panel acc   verdict
simulate-classifier-v2.mjs              heuristic retune (mass / scope / planet / asteroid)   0.500       shipped
simulate-classifier-crtbp.mjs           Vallado §12.7: CRTBP + Jacobi + Hill                  0.222       diagnostic only
simulate-classifier-spectral.mjs        Sears & Zemansky 32–38: spectral classification       0.333       diagnostic only
simulate-classifier-v2-with-crtbp.mjs   v2 + Hill-sphere moon + L4/L5 trojan                  0.389       diagnostic only
eval-against-public-data.mjs            v2 vs HF benchmark (real labels)                      0.810 r@1   weekly cron

Source: github.com/LuuOW/meridian-mcp. The worker (cf-worker/online_learning.mjs), the cron workflows (.github/workflows/classifier-*.yml), and the front-end feedback wiring (lens + landing/miniapp/api.js) are all in tree.

Two takeaways for anyone calibrating a heuristic classifier

  1. Build a calibration panel before you need one. Without scripts/calibrate-classifier.mjs, the planet-bias bug would have stayed silent for weeks. The panel itself is 18 hand-crafted SKILL.md objects in a single JS file, cheaper than not catching the drift.
  2. Don't physics your way out of an empirical labelling problem. The CRTBP and spectral simulations were satisfying to write but neither beat heuristics, because the bottleneck was the SKILL-text → physics mapping, which has no closed-form solution. Online SGD on top of heuristics solved more in 50 lines than either textbook framework did in 300.

Try the live classifier

The browser miniapp at /miniapp/ calls mcp.ask-meridian.uk/v1/route directly. Every skill you click trains the online layer. Current model state: GET /v1/model-info.