tootsies

a discord bot for the tootsies server. ask, recap, discuss, ship features by typing.


Project maintained by mejasonmejason

Optimizing prompts for Opus: cut and reword, don’t add

The thesis of Epic #849. Read this before you touch any text Opus sees.

Every piece of text the model reads is both a cost and a lever: the constitution, the persona, each per-surface prompt, every tool description, every injected context-block label, even the eval judges. The default instinct when a model misbehaves is to add a rule to correct it. For Opus that instinct is almost always wrong. The right move is to find the existing text that is causing the behavior and cut or reword it.

Why “add a rule” fails on Opus specifically

The shared prompt was tuned, fix by fix, against Sonnet’s failure modes (sycophancy, hedging, stiffness, fence-sitting). Opus does not share those failure modes, and it differs in two ways that make accretion actively harmful:

  1. Opus follows instructions literally. Hand it a reductive verb (“Compact”, “Synthesize”, “Drop the noise”) and it obeys: it compacts. Sonnet largely ignores those words, so a phrase that is inert on Sonnet is a live instruction on Opus.
  2. Opus over-commits. Hand it a maximal instruction (“be exhaustive”, “the complete record”, “breadth so any question has a hit”) and it maximizes — it pads. A counter-rule stacked on top doesn’t cancel the first; now it’s following both, harder.

So when Opus drifts, adding a counter-instruction compounds the problem. It also bloats the prompt, and a longer prompt is itself a cost (tokens, latency, and more surface for cross-rule contradiction).

The method (what to do instead)

  1. Find the cause in existing text; remove or reword it. Don’t write a new rule. Ask: what am I already saying that produces this?
  2. Show before/after verbatim, every edit. No silent rewrites. The diff is the unit of review.
  3. Dry-run on REAL data, not synthetic. Pull production logs/notes (Railway, /debug/query) and run the actual pipeline. A fixture can’t show you what the live distribution does.
  4. Measure the axis you actually care about, with the real machinery. For recall that means max(prose_vector, tags_vector) cosine over the real query set — not a vibe read of the text. For length, the production answer_length telemetry. Name the metric before you edit.
  5. Ablate: cut vs add vs reword, and measure each. (See #858’s ablation table — cutting the regulars block made the symptom worse; an explicit clause fixed it; the real lever was a layer up.) The first thing you blame is usually not the cause.
  6. Length is not the enemy — bloat-without-payoff is. Don’t trim for its own sake. Measure whether the extra text buys the thing (searchability, depth, accuracy). If it doesn’t, it’s pure context cost and should go.
  7. Key everything per model; share only the safety floor. A rule earned against Sonnet’s failures mis-fires on Opus. The constitution HARD rules, the memory fence, and protected-path/preflight guarantees are the one universal base and are never loosened per-model. Everything else is per-model.
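To make step 4 concrete, here is a minimal sketch of the recall axis, max(prose_vector, tags_vector) cosine, in plain Python. The function names (cosine, note_score, mean_note_score) are hypothetical and are not the production code; the sketch assumes vectors are plain lists of floats and that the per-query scores are averaged over the query set, which is one reasonable aggregation and not necessarily the one production uses.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def note_score(
    query_vec: Sequence[float],
    prose_vec: Sequence[float],
    tags_vec: Sequence[float],
) -> float:
    """Score one note for one query: the better of its prose and tags vectors."""
    return max(cosine(query_vec, prose_vec), cosine(query_vec, tags_vec))


def mean_note_score(
    query_vecs: Sequence[Sequence[float]],
    prose_vec: Sequence[float],
    tags_vec: Sequence[float],
) -> float:
    """Average note_score over the query set (hypothetical aggregation)."""
    if not query_vecs:
        raise ValueError("query set is empty")
    scores = [note_score(q, prose_vec, tags_vec) for q in query_vecs]
    return sum(scores) / len(scores)
```

Run this over the before and after versions of a note with the same query vectors and the same embedding model, so the only thing that changed is the prompt edit.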

Worked examples (this epic)

Evaluating model effectiveness (how to measure, not just what to change)

Half of this epic’s lessons were about evaluation method, because a prompt change is only as trustworthy as the measurement that justified it. What we learned:

  1. Pick the metric and the axis before you edit. “Better” is meaningless. Name the axis — searchability, length, depth-of-record, fabrication rate, adherence — and define how you’ll measure each before touching text, so you can’t rationalize a change after the fact.
  2. Measure with the REAL pipeline, not a proxy. Recall means the production path: max(prose_vector, tags_vector) cosine over the real query set against the real embedding model — because that’s what actually retrieves the note. Eyeballing the prose for “searchability” measures nothing.
  3. n=2 is noise; trust n≈5. A two-sample dry run told a clean story (softened was “~30% shorter and deeper”); n=5 erased it (length a wash, depth inconclusive). Model output is high-variance — a clean small-n result is more likely sampling luck than signal. Run enough samples to see the spread, and report the spread (min–max + sd), not just the mean. A mean without a range hides whether the difference is real.
  4. Validate the metric itself: a broken metric is worse than none. The n=5 depth judge returned coverage counts above the fixed checklist size (depth > 1.0), i.e. it was nonsense, and a careless read would have “concluded” from it. Sanity-bound every derived metric (covered ≤ total) and discard it loudly when it violates the bound, rather than quoting a number that can’t be real.

  5. Isolate the parameter you’re testing. Hold everything else fixed (same source data, same queries, same fixed checklist as the depth denominator) so the only moving part is the one prompt change. A floating denominator (the judge re-deriving the moment count per call) injects variance that swamps the signal.
  6. Ablate cut vs add vs reword separately. Don’t test a bundle. #858’s table (baseline / cut-the-block / add-a-clause) is the model: it showed the obvious suspect (cut the regulars block) made it worse and the real lever was elsewhere. You cannot tell which token did the work unless you move one at a time.
  7. A wash is a result. If the metric says no difference, say so and decide on other grounds (prose hygiene, owner taste) — don’t manufacture a metric win to justify a change you already made. Honest “no measurable effect” is more useful to the next session than an oversold number.
  8. Dry-run on real production data. Pull live notes/logs (/debug/query, Railway EVENT logs); synthetic fixtures don’t carry the live distribution that produces the failure.
  9. A judge is itself a model — validate a FAIL before acting on it. The eval judge can be wrong, including failing a correct answer. Worked example (#893): the hardened recap eval failed an Opus recap for saying “SGA won the 2025-26 MVP” — and I first read that as an Opus weakness (“over-asserts an outside fact”). The statement is true; the recap had correctly grounded a real recent fact about the exact topic the room was debating, which is what recap is supposed to do. The judge erred — it couldn’t web-verify a recent result and treated “couldn’t verify” as fabrication, against its own rubric. Two compounding traps: (a) acting on a single judge verdict as ground truth, and (b) it nearly produced scar tissue — a “lean recap core” rebuild for a weakness that does not exist (the #861 wash STANDS). The bar a fabrication judge must enforce is “contradicted by search → FAIL,” never “couldn’t verify → FAIL.” Before you act on a judge FAIL, check the judge was right.
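Three of the rules above are mechanical enough to sketch: report the spread (item 3), sanity-bound a derived metric (item 4), and fail a fabrication check only on contradiction (item 9). This is an illustrative sketch, not the project's eval code; Spread, summarize, bounded_depth, SearchStatus and is_fabrication_fail are all hypothetical names, and the three-way search status is an assumption about how a judge's search outcome could be represented.

```python
import statistics
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


@dataclass(frozen=True)
class Spread:
    n: int
    mean: float
    sd: float
    lo: float
    hi: float


def summarize(samples: Sequence[float]) -> Spread:
    """Report mean together with sd and min/max, never the mean alone."""
    if len(samples) < 2:
        raise ValueError("need at least 2 samples to report a spread")
    return Spread(
        n=len(samples),
        mean=statistics.mean(samples),
        sd=statistics.stdev(samples),  # sample standard deviation
        lo=min(samples),
        hi=max(samples),
    )


def bounded_depth(covered: int, total: int) -> float:
    """Depth = covered / total, rejected loudly if covered is outside [0, total]."""
    if total <= 0 or not 0 <= covered <= total:
        raise ValueError(
            f"depth metric out of bounds: covered={covered}, total={total}"
        )
    return covered / total


class SearchStatus(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNVERIFIABLE = "unverifiable"  # search found nothing either way


def is_fabrication_fail(status: SearchStatus) -> bool:
    """Only a contradiction is a FAIL; 'couldn't verify' is not."""
    return status is SearchStatus.CONTRADICTED
```

Raising on a bound violation, instead of clamping to 1.0, is deliberate: a clamped value would still look like a real number in a report.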

Eval fidelity: model production, or you measure nothing

An eval is only worth its verdict if it feeds the model what production feeds it and exercises the path production runs. Three failure modes this epic closed (#886, the ask/memory eval lift):

  1. Feed the production-shaped context, not a barer frame. Every real /ask supplies channel_descriptor (“where am I”) + asker_name (“who’s asking”) + the rich message buffer (reactions / reply-graph / author-tags). The synthetic cases passed none of it, so the model saw a thinner frame than any live call and the verdict measured a different distribution. Fix: feed the same context the cog feeds (cogs/ask.py), verbatim shape.
  2. Wire the REAL tools/pipeline — never a reimplementation. eval_ask_tools now builds the real SportsDataHub (ApiSportsProvider + SgoProvider) and calls the real format_scoreboard; the thin handlers just call hub.live_games()/upcoming_games()/recent_finals() exactly as the cog does. A handler that reimplements cog logic in the eval drifts from prod silently — then it grades a fiction. Reuse the production function; the eval’s job is routing + answer quality, not a parallel implementation.
  3. Grade the model actually SHIPPED per surface — read it from the DB (cost discipline). Don’t grade every model every run. The scheduled suite reads each surface’s enabled model from the per-guild /menu Models override (settings KV) via resolve_eval_models/merge_enabled: flagless → the single enabled model (the surface default UNION any guild override), so it pays 2x ONLY when a guild genuinely split the surface across models. Wiring detail worth remembering: GitHub’s runners cannot reach Railway’s internal DATABASE_URL host, so the read uses DATABASE_PUBLIC_URL (the public proxy) and is fully fail-open — no URL / unreachable → fall back to the surface default, never block the suite on a DB hiccup.
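The fail-open read in item 3 can be sketched as below. This is not the real resolve_eval_models/merge_enabled code, whose signatures are not shown here; resolve_models_fail_open and the injected fetch_overrides callable are hypothetical, and treating the merge as "surface default plus any distinct override" is one reading of the UNION described above.

```python
import os
from typing import Callable, Iterable, List, Mapping, Optional


def resolve_models_fail_open(
    surface: str,
    default_model: str,
    fetch_overrides: Callable[[str, str], Iterable[str]],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the models to grade for a surface, never raising on DB trouble.

    fetch_overrides(url, surface) is a hypothetical callable that reads the
    per-guild override(s) from the settings store.
    """
    env = os.environ if env is None else env
    url = env.get("DATABASE_PUBLIC_URL")
    if not url:
        # No URL configured: fall back to the surface default.
        return [default_model]
    try:
        overrides = list(fetch_overrides(url, surface))
    except Exception:
        # Fail open on purpose: an unreachable DB must not block the suite.
        return [default_model]
    models = [default_model]
    for model in overrides:
        if model not in models:
            models.append(model)
    return models
```

With this shape the suite grades one model when the override matches the default (or is missing) and two only when a guild has actually split the surface.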

Validate the gap is real before you build the fix (proportionality)

The fix is usually smaller than the issue makes it sound — measure the gap before building machinery for it. Worked example (#892): the issue asked for a live-embedding behavioral eval of chunked recall, and I started building it. On inspection the gap had mostly already been closed: the chunk mechanism (the two-stage rank_by_similarity rerank, base64-f32 encode/decode) was already deterministically unit-tested in tests/test_embeddings.py and gating CI, and a live-embedding eval would have been dormant anyway (evals.yml has no OPENAI_API_KEY, so embedding cases self-skip). The one genuinely untested seam was the cog wiring — that cogs/ask.py:_vector_recall passes chunk_rerank_top through. The proportional fix was a ~10-line deterministic wiring test, not a live eval harness. Pattern: enumerate what already covers the risk (unit tests, an existing eval, a dormant-in-CI gate) before adding more — sunk-cost on a half-built heavier fix is not a reason to ship it. (Pairs with the CLAUDE.md “proportional to the problem” rule and the #567 resume-spine → 10-line cog_unload example.)
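For scale, a deterministic wiring test of this kind can look like the sketch below. It does not reproduce the real test or the real cogs/ask.py code: vector_recall_standin is a stand-in for _vector_recall, and its signature, the injected rank parameter, and the keyword form of chunk_rerank_top are all assumptions made so the example runs on its own.

```python
from unittest import mock


def vector_recall_standin(query_vec, candidates, chunk_rerank_top, rank):
    """Stand-in for the cog method: it must forward chunk_rerank_top to rank."""
    return rank(query_vec, candidates, chunk_rerank_top=chunk_rerank_top)


def test_chunk_rerank_top_is_passed_through():
    # A mock replaces the ranking function, so no embeddings or API key are needed.
    fake_rank = mock.Mock(return_value=["note-1"])

    result = vector_recall_standin(
        [0.1, 0.2], ["note-1", "note-2"], chunk_rerank_top=3, rank=fake_rank
    )

    fake_rank.assert_called_once()
    assert fake_rank.call_args.kwargs["chunk_rerank_top"] == 3
    assert result == ["note-1"]
```

The test checks only the seam (the argument arrives), because the rerank behavior itself is already covered by the unit tests.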

The standing directive

Apply this to every piece of text Opus sees — constitution, persona, per-surface prompts, tool descriptions, context-block labels, eval judges. Each is a lever to be optimized by cutting and rewording for Opus, not a place to accrete rules. When you’re tempted to add: stop, find the cause, cut it instead.

See also