tootsies

a discord bot for the tootsies server. ask, recap, discuss, ship features by typing.

Project maintained by mejasonmejason

Optimizing prompts for Opus: cut and reword, don’t add

The thesis of Epic #849. Read this before you touch any text Opus sees.

Every piece of text the model reads is both a cost and a lever: the constitution, the persona, each per-surface prompt, every tool description, every injected context-block label, even the eval judges. The default instinct when a model misbehaves is to add a rule to correct it. For Opus that instinct is almost always wrong. The right move is to find the existing text that is causing the behavior and cut or reword it.

Why “add a rule” fails on Opus specifically

The shared prompt was tuned, fix by fix, against Sonnet’s failure modes (sycophancy, hedging, stiffness, fence-sitting). Opus does not share those failure modes, and it differs in two ways that make accretion actively harmful:

Opus follows instructions literally. Hand it a reductive verb (“Compact”, “Synthesize”, “Drop the noise”) and it obeys: it compacts. Sonnet largely ignores those words, so a phrase that is inert on Sonnet is a live instruction on Opus.
Opus over-commits. Hand it a maximal instruction (“be exhaustive”, “the complete record”, “breadth so any question has a hit”) and it maximizes — it pads. A counter-rule stacked on top doesn’t cancel the first; now it’s following both, harder.

So when Opus drifts, adding a counter-instruction compounds the problem. It also bloats the prompt, and a longer prompt is itself a cost (tokens, latency, and more surface for cross-rule contradiction).

The method (what to do instead)

Find the cause in existing text; remove or reword it. Don’t write a new rule. Ask: what am I already saying that produces this?
Show before/after verbatim, every edit. No silent rewrites. The diff is the unit of review.
Dry-run on REAL data, not synthetic. Pull production logs/notes (Railway, /debug/query) and run the actual pipeline. A fixture can’t show you what the live distribution does.
Measure the axis you actually care about, with the real machinery. For recall that means max(prose_vector, tags_vector) cosine over the real query set — not a vibe read of the text. For length, the production answer_length telemetry. Name the metric before you edit.
Ablate: cut vs add vs reword, and measure each. (See #858’s ablation table — cutting the regulars block made the symptom worse; an explicit clause fixed it; the real lever was a layer up.) The first thing you blame is usually not the cause.
Length is not the enemy — bloat-without-payoff is. Don’t trim for its own sake. Measure whether the extra text buys the thing (searchability, depth, accuracy). If it doesn’t, it’s pure context cost and should go.
Key everything per model; share only the safety floor. A rule earned against Sonnet’s failures mis-fires on Opus. The constitution HARD rules, the memory fence, and protected-path/preflight guarantees are the one universal base and are never loosened per-model. Everything else is per-model.
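The recall metric named in the method can be sketched as follows. This is a minimal illustration, not the production code: `cosine` and `note_recall` are hypothetical names, and averaging the per-query maxima is an assumption (the text specifies only the max of the prose and tags cosines). Vectors are plain lists of floats here; a real run would use the real embedding model's output over the real query set.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def note_recall(
    query_vectors: Sequence[Vector],
    prose_vector: Vector,
    tags_vector: Vector,
) -> float:
    """Score one note: per query, take the better of the prose and tags
    cosines, then average across queries (the averaging is an assumption)."""
    if not query_vectors:
        raise ValueError("need at least one query vector")
    scores = [
        max(cosine(q, prose_vector), cosine(q, tags_vector))
        for q in query_vectors
    ]
    return sum(scores) / len(scores)
```

Computing this for each prompt variant on the same notes and the same queries gives a number to compare, rather than a read of the prose.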

Worked examples (this epic)

Memory over-compaction (#854/#855). Opus wrote half-length notes. The cause was reductive verbs (“Compact”/“Synthesize”/“Drop”) it followed literally, not a missing “be thorough” rule. Fix: remove the verbs and reframe, no additions. Result: Opus note length rose from ≈6k to ≈13k chars, matching Sonnet, with no rule added. The earlier attempt, adding “be exhaustive”, was scar tissue that treated the symptom.
Recall reframe, then over-correction (#855). Telling the writer the note is searched (“write it to be FOUND, matched by meaning + keyword”) was a good, kept change. But it arrived bundled with maximalism — “the complete record”, “the full searchable record”, “breadth so any question has a hit, depth so the hit is complete”, “the small as much as the big”. The instructive part is what the rigorous dry run found (prod daily #1444, the aggressive vs softened prose, n=5 each, fixed 75-moment checklist): the two versions are a dead wash on every metric — length (13.2k vs 14.1k chars, overlapping ranges), and searchability (recall 0.343 vs 0.344, identical sd). A first n=2 pass had looked like the softened version was ~30% shorter and deeper; n=5 erased that — it was sampling noise. The real lessons: (a) Opus normalizes length regardless of “be exhaustive” wording (it wrote ~13–14k chars either way), so that maximalism wasn’t even buying the length it cost on paper; (b) the maximalism didn’t improve recall at all; (c) trust n≈5 over n=2 — a clean two-sample story is often noise. So cutting the maximalism is justified on prose hygiene (it was thrice-repeated, redundant — “complete record” and “full searchable record” and “the small as much as the big”), not on a measured length/search win. Don’t oversell a metric that isn’t there.
Ask over-length roast (#858 / #856 — the canonical example, see docs/opus_length_drift.md). Opus ran long on thread-continuation roasts via a defend-then-soften reflex. Two things make this the textbook case:
- The obvious cut is usually not the lever — ablate to find the real one. The instinct was to blame _OPUS_REGULARS (the don’t-punch-down block) and cut it. The ablation said otherwise: cutting it made the symptom worse (bf_hater 3.0 → 3.3) and wasn’t free (it’s the lane/villain guard). The real driver was a length-sanctioning roast rule a layer up (“a full takedown … is the fun”).
- Fix the cause’s LAYER; don’t stack a counter-rule in a different layer. The roast sanction lives in Layer 1 (constitution.py CALIBRATION, Sonnet-era, never Opus-optimized); the length rule lives in Layer 3 (_OPUS_LENGTH). Adding a length clause to Layer 3 would just contradict Layer 1 one layer down — two rules fighting. The fix is to reword the Layer 1 line so a roast lands sharp in a line or two and drops the defend-then-soften reflex. (The same investigation surfaced Layer-1↔Layer-3 duplication — STAY IN YOUR LANE in both, DATA INTEGRITY’s 9 bullets vs the lean _OPUS_GROUNDING — i.e. more text to consolidate, not add.)
Discourse: measure the model on the REBUILT prompt, not the legacy one (#876 / #895, full arc in docs/opus_discourse_eval.md). The first measurement compared Opus on the Sonnet-tuned 43k prompt against Sonnet, found Opus worse (a same-topic head-to-head ~6-2), and nearly concluded “keep Sonnet, don’t build.” That measured Opus on the wrong prompt — the exact confound this epic exists to fix. Given a lean discourse core, Opus flips to the strongest of the three (open-question 1/8 → 8/8, flag-planted 10/16 → 15/16, open-loop Opus 3/4 vs Sonnet 0/4 on the real surface). Four sub-traps it hit, each generalizable:
- A pairwise “which is better” judge is high-variance — prefer per-item metrics. The same-topic head-to-head swung 6-2 → 4-4 → 2-6 on identical conditions; the stable per-post metrics (quality / open-question / flag-planted) were the trustworthy signal. Distinct from the judge-ERROR lesson below: this is judge VARIANCE — a comparative “A vs B” judge is noisier than an absolute per-item one, so never hang a conclusion on one head-to-head.
- A reused component carries its HOME surface’s framing. Pulling the ask blocks _OPUS_CITE / _OPUS_TOOLDISC into discourse leaked meta-commentary into a room post — their “the user” / “your answer” 1:1 framing reads wrong in a broadcast post (A/B META-leak 2/8 → 0/8 once swapped for a discourse-framed line). Reuse only the surface-NEUTRAL component (_OPUS_REGULARS); reframe the rest for the surface.
- Dry-run the REAL wired path, including its fallbacks. An offline A/B (direct _call) looked clean, but the live discourse() path has a forced-search retry (thinking OFF) the offline harness never exercised — it degraded the lean output (~4.0 sentences + selection-narration meta-leak vs the primary’s ~2.3). Skipping that retry on the lean path was the fix. Grade the actual pipeline, not the prompt in isolation.
- Vivid prompt metaphors get parroted (instruction-echo). “plant YOUR flag first” surfaced verbatim in ~1-2/12 posts (“I’m planting my flag: …”). Reworded to “lead with where YOU land” — same instruction, echo 1/12 → 0/12, keystone metrics held (re-validated because the TASK block is load-bearing). The cause was the prompt’s own memorable noun, not a missing “don’t narrate yourself” rule.
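The instruction-echo check from the last sub-trap can be approximated with a simple counter. This is a rough sketch under stated assumptions: `echo_rate` is a hypothetical helper, the marker terms are distinctive words you pick by hand from the prompt (for example the memorable noun), and the sample posts are invented for illustration. A whole-word match will also count legitimate uses of the word, so treat the result as a prompt for reading the flagged posts, not as a verdict.

```python
import re
from typing import Sequence, Tuple


def echo_rate(posts: Sequence[str], marker_terms: Sequence[str]) -> Tuple[int, int]:
    """Return (echoed, total): how many posts contain any marker term
    as a whole word, case-insensitively."""
    patterns = [
        re.compile(r"\b" + re.escape(term.lower()) + r"\b")
        for term in marker_terms
        if term.strip()
    ]
    echoed = sum(
        1
        for post in posts
        if any(p.search(post.lower()) for p in patterns)
    )
    return echoed, len(posts)


# Invented sample posts, for illustration only.
sample_posts = [
    "I'm planting my flag: the defense decides this series.",
    "The defense decides this series. Who slows their guards?",
]
echoed, total = echo_rate(sample_posts, ["flag"])
```

Running the same counter before and after a reword, on the same number of posts, gives the kind of 1/12 versus 0/12 comparison described above.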

Evaluating model effectiveness (how to measure, not just what to change)

Half of this epic’s lessons were about evaluation method, because a prompt change is only as trustworthy as the measurement that justified it. What we learned:

Pick the metric and the axis before you edit. “Better” is meaningless. Name the axis — searchability, length, depth-of-record, fabrication rate, adherence — and define how you’ll measure each before touching text, so you can’t rationalize a change after the fact.
Measure with the REAL pipeline, not a proxy. Recall means the production path: max(prose_vector, tags_vector) cosine over the real query set against the real embedding model — because that’s what actually retrieves the note. Eyeballing the prose for “searchability” measures nothing.
n=2 is noise; trust n≈5. A two-sample dry run told a clean story (softened was “~30% shorter and deeper”); n=5 erased it (length a wash, depth inconclusive). Model output is high-variance — a clean small-n result is more likely sampling luck than signal. Run enough samples to see the spread, and report the spread (min–max + sd), not just the mean. A mean without a range hides whether the difference is real.
Validate the metric itself: a broken metric is worse than none. The n=5 depth judge returned coverage counts above the fixed checklist size (depth > 1.0), i.e. it was nonsense, and a careless read would have “concluded” from it. Sanity-bound every derived metric (covered ≤ total) and discard it loudly when it violates the bound, rather than quoting a number that can’t be real.
Isolate the parameter you’re testing. Hold everything else fixed (same source data, same queries, same fixed checklist as the depth denominator) so the only moving part is the one prompt change. A floating denominator (the judge re-deriving the moment count per call) injects variance that swamps the signal.
Ablate cut vs add vs reword separately. Don’t test a bundle. #858’s table (baseline / cut-the-block / add-a-clause) is the model: it showed the obvious suspect (cut the regulars block) made it worse and the real lever was elsewhere. You cannot tell which token did the work unless you move one at a time.
A wash is a result. If the metric says no difference, say so and decide on other grounds (prose hygiene, owner taste) — don’t manufacture a metric win to justify a change you already made. Honest “no measurable effect” is more useful to the next session than an oversold number.
Dry-run on real production data. Pull live notes/logs (/debug/query, Railway EVENT logs); synthetic fixtures don’t carry the live distribution that produces the failure.
A judge is itself a model — validate a FAIL before acting on it. The eval judge can be wrong, including failing a correct answer. Worked example (#893): the hardened recap eval failed an Opus recap for saying “SGA won the 2025-26 MVP” — and I first read that as an Opus weakness (“over-asserts an outside fact”). The statement is true; the recap had correctly grounded a real recent fact about the exact topic the room was debating, which is what recap is supposed to do. The judge erred — it couldn’t web-verify a recent result and treated “couldn’t verify” as fabrication, against its own rubric. Two compounding traps: (a) acting on a single judge verdict as ground truth, and (b) it nearly produced scar tissue — a “lean recap core” rebuild for a weakness that does not exist (the #861 wash STANDS). The bar a fabrication judge must enforce is “contradicted by search → FAIL,” never “couldn’t verify → FAIL.” Before you act on a judge FAIL, check the judge was right.
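Two of the points above, reporting the spread and sanity-bounding a derived metric, can be sketched in a few lines. All names here (`Spread`, `summarize`, `ranges_overlap`, `depth_ratio`) are hypothetical, and the overlap check is only a coarse first look, not a significance test.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Spread:
    n: int
    mean: float
    sd: float
    low: float
    high: float


def summarize(samples: Sequence[float]) -> Spread:
    """Report mean together with sample sd and min/max, never the mean alone."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to report a spread")
    return Spread(
        n=len(samples),
        mean=statistics.fmean(samples),
        sd=statistics.stdev(samples),
        low=min(samples),
        high=max(samples),
    )


def ranges_overlap(a: Spread, b: Spread) -> bool:
    """True when the two min-max ranges intersect (a hint that the
    difference in means may be noise)."""
    return a.low <= b.high and b.low <= a.high


def depth_ratio(covered: int, total: int) -> float:
    """Coverage against a fixed checklist; refuses impossible counts
    instead of returning a ratio above 1.0."""
    if total <= 0:
        raise ValueError("checklist size must be positive")
    if not 0 <= covered <= total:
        raise ValueError(
            f"judge returned covered={covered} for a checklist of {total}; "
            "discard this metric"
        )
    return covered / total
```

Raising on an out-of-bounds count is a deliberate choice: a failed run is harder to misquote than a silently clamped number.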

Eval fidelity: model production, or you measure nothing

An eval is only worth its verdict if it feeds the model what production feeds it and exercises the path production runs. Three failure modes this epic closed (#886, the ask/memory eval lift):

Feed the production-shaped context, not a barer frame. Every real /ask supplies channel_descriptor (“where am I”) + asker_name (“who’s asking”) + the rich message buffer (reactions / reply-graph / author-tags). The synthetic cases passed none of it, so the model saw a thinner frame than any live call and the verdict measured a different distribution. Fix: feed the same context the cog feeds (cogs/ask.py), verbatim shape.
Wire the REAL tools/pipeline — never a reimplementation. eval_ask_tools now builds the real SportsDataHub (ApiSportsProvider + SgoProvider) and calls the real format_scoreboard; the thin handlers just call hub.live_games()/upcoming_games()/recent_finals() exactly as the cog does. A handler that reimplements cog logic in the eval drifts from prod silently — then it grades a fiction. Reuse the production function; the eval’s job is routing + answer quality, not a parallel implementation.
Grade the model actually SHIPPED per surface — read it from the DB (cost discipline). Don’t grade every model every run. The scheduled suite reads each surface’s enabled model from the per-guild /menu Models override (settings KV) via resolve_eval_models/merge_enabled: flagless → the single enabled model (the surface default UNION any guild override), so it pays 2x ONLY when a guild genuinely split the surface across models. Wiring detail worth remembering: GitHub’s runners cannot reach Railway’s internal DATABASE_URL host, so the read uses DATABASE_PUBLIC_URL (the public proxy) and is fully fail-open — no URL / unreachable → fall back to the surface default, never block the suite on a DB hiccup.
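The fail-open read described in the last item could look roughly like this. It is a sketch, not the real `resolve_eval_models` or `merge_enabled`: the function name, its signature, and the injected `fetch_overrides` callable are all hypothetical, and the union of the surface default with any guild override is one reading of the text above. Only the `DATABASE_PUBLIC_URL` variable name comes from the source.

```python
import os
from typing import Callable, Iterable, List, Mapping, Optional


def resolve_models_fail_open(
    surface: str,
    surface_default: str,
    fetch_overrides: Callable[[str, str], Iterable[str]],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the models to grade for a surface.

    Falls back to the surface default when the URL is missing or the
    read fails, so a DB hiccup never blocks the suite.
    """
    env = os.environ if env is None else env
    url = env.get("DATABASE_PUBLIC_URL")
    if not url:
        return [surface_default]
    try:
        # fetch_overrides is a hypothetical stand-in for the settings read.
        overrides = list(fetch_overrides(url, surface))
    except Exception:
        # Fail open: any read error means "use the default".
        return [surface_default]
    # Ordered union: more than one model only when an override differs.
    return list(dict.fromkeys([surface_default, *overrides]))
```

Injecting the fetch function keeps the fallback logic testable without a database, which is the part most likely to break quietly.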

Validate the gap is real before you build the fix (proportionality)

The fix is usually smaller than the issue makes it sound — measure the gap before building machinery for it. Worked example (#892): the issue asked for a live-embedding behavioral eval of chunked recall, and I started building it. On inspection the gap had mostly already been closed: the chunk mechanism (the two-stage rank_by_similarity rerank, base64-f32 encode/decode) was already deterministically unit-tested in tests/test_embeddings.py and gating CI, and a live-embedding eval would have been dormant anyway (evals.yml has no OPENAI_API_KEY, so embedding cases self-skip). The one genuinely untested seam was the cog wiring — that cogs/ask.py:_vector_recall passes chunk_rerank_top through. The proportional fix was a ~10-line deterministic wiring test, not a live eval harness. Pattern: enumerate what already covers the risk (unit tests, an existing eval, a dormant-in-CI gate) before adding more — sunk-cost on a half-built heavier fix is not a reason to ship it. (Pairs with the CLAUDE.md “proportional to the problem” rule and the #567 resume-spine → 10-line cog_unload example.)
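For a sense of scale, a deterministic wiring test of the kind described can be this small. The code below is a generic illustration, not the actual test or the actual `cogs/ask.py` code: `vector_recall` is a stand-in whose signature is assumed, and the ranker is replaced by a mock so no embeddings or API keys are involved.

```python
from unittest.mock import Mock


def vector_recall(query, notes, rank, chunk_rerank_top=3):
    """Stand-in for the cog seam: forwards chunk_rerank_top to the ranker."""
    return rank(query, notes, chunk_rerank_top=chunk_rerank_top)


def test_chunk_rerank_top_is_forwarded():
    rank = Mock(return_value=["note-1"])
    result = vector_recall("who won", ["note-1", "note-2"], rank, chunk_rerank_top=7)
    # The only claim under test: the parameter reaches the ranker unchanged.
    rank.assert_called_once_with("who won", ["note-1", "note-2"], chunk_rerank_top=7)
    assert result == ["note-1"]
```

The test asserts nothing about ranking quality; that is already covered by the unit tests on the mechanism, so the wiring is the only seam left to pin down.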

The standing directive

Apply this to every piece of text Opus sees — constitution, persona, per-surface prompts, tool descriptions, context-block labels, eval judges. Each is a lever to be optimized by cutting and rewording for Opus, not a place to accrete rules. When you’re tempted to add: stop, find the cause, cut it instead.