tootsies
a discord bot for the tootsies server. ask, recap, discuss, ship features by typing.
Project maintained by mejasonmejason
Optimizing prompts for Opus: cut and reword, don’t add
The thesis of Epic #849. Read this before you touch any text Opus sees.
Every piece of text the model reads is both a cost and a lever: the
constitution, the persona, each per-surface prompt, every tool description, every
injected context-block label, even the eval judges. The default instinct when a
model misbehaves is to add a rule to correct it. For Opus that instinct is
almost always wrong. The right move is to find the existing text that is causing
the behavior and cut or reword it.
Why “add a rule” fails on Opus specifically
The shared prompt was tuned, fix by fix, against Sonnet’s failure modes
(sycophancy, hedging, stiffness, fence-sitting). Opus does not share those failure
modes, and it differs in two ways that make accretion actively harmful:
- Opus follows instructions literally. Hand it a reductive verb (“Compact”,
“Synthesize”, “Drop the noise”) and it obeys: it compacts. Sonnet
largely ignores those words, so a phrase that is inert on Sonnet is a live
instruction on Opus.
- Opus over-commits. Hand it a maximal instruction (“be exhaustive”, “the
complete record”, “breadth so any question has a hit”) and it maximizes — it
pads. A counter-rule stacked on top doesn’t cancel the first; now it’s
following both, harder.
So when Opus drifts, adding a counter-instruction compounds the problem. It
also bloats the prompt, and a longer prompt is itself a cost (tokens, latency, and
more surface for cross-rule contradiction).
The method (what to do instead)
- Find the cause in existing text; remove or reword it. Don’t write a new
rule. Ask: what am I already saying that produces this?
- Show before/after verbatim, every edit. No silent rewrites. The diff is the
unit of review.
- Dry-run on REAL data, not synthetic. Pull production logs/notes (Railway,
/debug/query) and run the actual pipeline. A fixture can’t show you what the
live distribution does.
- Measure the axis you actually care about, with the real machinery. For
recall that means max(prose_vector, tags_vector) cosine over the real query
set — not a vibe read of the text. For length, the production answer_length
telemetry. Name the metric before you edit.
- Ablate: cut vs add vs reword, and measure each. (See #858’s ablation table —
cutting the regulars block made the symptom worse; an explicit clause fixed
it; the real lever was a layer up.) The first thing you blame is usually not the
cause.
- Length is not the enemy — bloat-without-payoff is. Don’t trim for its own
sake. Measure whether the extra text buys the thing (searchability, depth,
accuracy). If it doesn’t, it’s pure context cost and should go.
- Key everything per model; share only the safety floor. A rule earned against
Sonnet’s failures mis-fires on Opus. The constitution HARD rules, the memory
fence, and protected-path/preflight guarantees are the one universal base and
are never loosened per-model. Everything else is per-model.
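As a minimal sketch of the recall metric named above, assuming that "max(prose_vector, tags_vector) cosine" means scoring a note by the better of its two cosine similarities against each query vector. The function names (`cosine`, `note_score`, `mean_recall`) are hypothetical, not the project's own, and the real pipeline would use vectors from the production embedding model rather than hand-written lists.

```python
import math
from statistics import mean


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def note_score(query_vec, prose_vec, tags_vec):
    """Assumed reading of the metric: the better of the prose match and the tags match."""
    return max(cosine(query_vec, prose_vec), cosine(query_vec, tags_vec))


def mean_recall(query_vecs, prose_vec, tags_vec):
    """Average note_score over the whole query set for one note."""
    return mean(note_score(q, prose_vec, tags_vec) for q in query_vecs)
```

Run it once per prompt variant over the same query set, so the only thing that moves between runs is the note text the prompt produced.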
Worked examples (this epic)
- Memory over-compaction (#854/#855). Opus wrote half-length notes. The cause
was reductive verbs (“Compact”/“Synthesize”/“Drop”) that it followed literally, not a
missing “be thorough” rule. Fix: remove the verbs and reframe, no additions.
Result: Opus length matched Sonnet (≈6k→13k chars) with no rule added. Reaching
for “be exhaustive” first was scar tissue that treated the symptom.
- Recall reframe, then over-correction (#855). Telling the writer the note is
searched (“write it to be FOUND, matched by meaning + keyword”) was a good,
kept change. But it arrived bundled with maximalism — “the complete record”, “the
full searchable record”, “breadth so any question has a hit, depth so the hit is
complete”, “the small as much as the big”. The instructive part is what the
rigorous dry run found (prod daily #1444, the aggressive vs softened prose,
n=5 each, fixed 75-moment checklist): the two versions are a dead wash on
every metric — length (13.2k vs 14.1k chars, overlapping ranges), and
searchability (recall 0.343 vs 0.344, identical sd). A first n=2 pass had
looked like the softened version was ~30% shorter and deeper; n=5 erased that
— it was sampling noise. The real lessons: (a) Opus normalizes length
regardless of “be exhaustive” wording (it wrote ~13–14k chars either way), so
that maximalism wasn’t even buying the length it cost on paper; (b) the
maximalism didn’t improve recall at all; (c) trust n≈5 over n=2 — a clean
two-sample story is often noise. So cutting the maximalism is justified on
prose hygiene (it was thrice-repeated, redundant — “complete record” and
“full searchable record” and “the small as much as the big”), not on a
measured length/search win. Don’t oversell a metric that isn’t there.
- Ask over-length roast (#858 / #856 — the canonical example, see
docs/opus_length_drift.md). Opus ran long on thread-continuation roasts via a
defend-then-soften reflex. Two things make this the textbook case:
- The obvious cut is usually not the lever — ablate to find the real one. The
instinct was to blame _OPUS_REGULARS (the don’t-punch-down block) and cut it.
The ablation said otherwise: cutting it made the symptom worse (bf_hater
3.0 → 3.3) and wasn’t free (it’s the lane/villain guard). The real driver was
a length-sanctioning roast rule a layer up (“a full takedown … is the fun”).
- Fix the cause’s LAYER; don’t stack a counter-rule in a different layer. The
roast sanction lives in Layer 1 (constitution.py CALIBRATION, Sonnet-era,
never Opus-optimized); the length rule lives in Layer 3 (_OPUS_LENGTH). Adding
a length clause to Layer 3 would just contradict Layer 1 one layer down —
two rules fighting. The fix is to reword the Layer 1 line so a roast lands
sharp in a line or two and drops the defend-then-soften reflex. (The same
investigation surfaced Layer-1↔Layer-3 duplication — STAY IN YOUR LANE in
both, DATA INTEGRITY’s 9 bullets vs the lean _OPUS_GROUNDING — i.e. more text
to consolidate, not add.)
- Discourse: measure the model on the REBUILT prompt, not the legacy one (#876 /
#895, full arc in docs/opus_discourse_eval.md). The first measurement compared
Opus on the Sonnet-tuned 43k prompt against Sonnet, found Opus worse (a same-topic
head-to-head ~6-2), and nearly concluded “keep Sonnet, don’t build.” That measured
Opus on the wrong prompt — the exact confound this epic exists to fix. Given a
lean discourse core, Opus flips to the strongest of the three (open-question
1/8 → 8/8, flag-planted 10/16 → 15/16, open-loop Opus 3/4 vs Sonnet 0/4 on the real
surface). Four sub-traps it hit, each generalizable:
- A pairwise “which is better” judge is high-variance — prefer per-item metrics.
The same-topic head-to-head swung 6-2 → 4-4 → 2-6 on identical conditions; the
stable per-post metrics (quality / open-question / flag-planted) were the
trustworthy signal. Distinct from the judge-ERROR lesson below: this is judge
VARIANCE — a comparative “A vs B” judge is noisier than an absolute per-item one,
so never hang a conclusion on one head-to-head.
- A reused component carries its HOME surface’s framing. Pulling the ask blocks
_OPUS_CITE / _OPUS_TOOLDISC into discourse leaked meta-commentary into a room
post — their “the user” / “your answer” 1:1 framing reads wrong in a broadcast post
(A/B META-leak 2/8 → 0/8 once swapped for a discourse-framed line). Reuse only the
surface-NEUTRAL component (_OPUS_REGULARS); reframe the rest for the surface.
- Dry-run the REAL wired path, including its fallbacks. An offline A/B (direct
_call) looked clean, but the live discourse() path has a forced-search retry
(thinking OFF) the offline harness never exercised — it degraded the lean output
(~4.0 sentences + selection-narration meta-leak vs the primary’s ~2.3). Skipping
that retry on the lean path was the fix. Grade the actual pipeline, not the prompt
in isolation.
- Vivid prompt metaphors get parroted (instruction-echo). “plant YOUR flag first”
surfaced verbatim in ~1-2/12 posts (“I’m planting my flag: …”). Reworded to “lead
with where YOU land” — same instruction, echo 1/12 → 0/12, keystone metrics held
(re-validated because the TASK block is load-bearing). The cause was the prompt’s
own memorable noun, not a missing “don’t narrate yourself” rule.
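The instruction-echo check in the last sub-trap can be approximated with a plain substring count. This is a sketch only: `echo_rate` is a hypothetical helper, the phrase list is whatever memorable wording the prompt under test contains, and a substring match will miss paraphrased echoes.

```python
def echo_rate(posts, phrases):
    """Count posts that repeat any prompt phrase; returns (hits, total)."""
    lowered = [phrase.lower() for phrase in phrases]
    hits = sum(
        1 for post in posts
        if any(phrase in post.lower() for phrase in lowered)
    )
    return hits, len(posts)


# Illustrative posts, not production output.
sample_posts = [
    "I'm planting my flag: the bench was the problem.",
    "The bench was the problem, full stop.",
]
hits, total = echo_rate(sample_posts, ["planting my flag", "plant your flag"])
print(f"echo {hits}/{total}")
```

Reporting the result as hits over total keeps it in the same per-item form as the other metrics in this section.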
Evaluating model effectiveness (how to measure, not just what to change)
Half of this epic’s lessons were about evaluation method, because a prompt
change is only as trustworthy as the measurement that justified it. What we learned:
- Pick the metric and the axis before you edit. “Better” is meaningless. Name
the axis — searchability, length, depth-of-record, fabrication rate, adherence —
and define how you’ll measure each before touching text, so you can’t
rationalize a change after the fact.
- Measure with the REAL pipeline, not a proxy. Recall means the production
path: max(prose_vector, tags_vector) cosine over the real query set against the
real embedding model — because that’s what actually retrieves the note. Eyeballing
the prose for “searchability” measures nothing.
- n=2 is noise; trust n≈5. A two-sample dry run told a clean story (softened was
“~30% shorter and deeper”); n=5 erased it (length a wash, depth inconclusive).
Model output is high-variance — a clean small-n result is more likely sampling
luck than signal. Run enough samples to see the spread, and report the spread
(min–max + sd), not just the mean. A mean without a range hides whether the
difference is real.
- Validate the metric itself — a broken metric is worse than none. The n=5
depth judge returned coverage counts above the fixed checklist size (depth
1.0), i.e. it was nonsense, and a careless read would have “concluded” from it.
Sanity-bound every derived metric (covered ≤ total) and discard it loudly when it
violates the bound, rather than quoting a number that can’t be real.
- Isolate the parameter you’re testing. Hold everything else fixed (same source
data, same queries, same fixed checklist as the depth denominator) so the only
moving part is the one prompt change. A floating denominator (the judge
re-deriving the moment count per call) injects variance that swamps the signal.
- Ablate cut vs add vs reword separately. Don’t test a bundle. #858’s table
(baseline / cut-the-block / add-a-clause) is the model: it showed the obvious
suspect (cut the regulars block) made it worse and the real lever was elsewhere.
You cannot tell which token did the work unless you move one at a time.
- A wash is a result. If the metric says no difference, say so and decide on
other grounds (prose hygiene, owner taste) — don’t manufacture a metric win to
justify a change you already made. Honest “no measurable effect” is more useful
to the next session than an oversold number.
- Dry-run on real production data. Pull live notes/logs (/debug/query,
Railway EVENT logs); synthetic fixtures don’t carry the live distribution that
produces the failure.
- A judge is itself a model — validate a FAIL before acting on it. The eval
judge can be wrong, including failing a correct answer. Worked example
(#893): the hardened recap eval failed an Opus recap for saying “SGA won the
2025-26 MVP” — and I first read that as an Opus weakness (“over-asserts an
outside fact”). The statement is true; the recap had correctly grounded a
real recent fact about the exact topic the room was debating, which is what
recap is supposed to do. The judge erred — it couldn’t web-verify a recent
result and treated “couldn’t verify” as fabrication, against its own rubric. Two
compounding traps: (a) acting on a single judge verdict as ground truth, and (b)
it nearly produced scar tissue — a “lean recap core” rebuild for a weakness
that does not exist (the #861 wash STANDS). The bar a fabrication judge must
enforce is “contradicted by search → FAIL,” never “couldn’t verify → FAIL.”
Before you act on a judge FAIL, check the judge was right.
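Two of the rules above, "report the spread" and "sanity-bound every derived metric", are small enough to sketch directly. The helper names (`summarize`, `bounded_coverage`) are hypothetical and the sample values are illustrative, not measurements from this epic. The only assumption about the project is that depth is a covered/total ratio against a fixed checklist.

```python
from statistics import mean, stdev


def summarize(samples):
    """Report n, mean, min, max and sample sd so the spread is visible, not just the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to report a spread")
    return {
        "n": len(samples),
        "mean": mean(samples),
        "min": min(samples),
        "max": max(samples),
        "sd": stdev(samples),
    }


def bounded_coverage(covered, total):
    """Return covered/total, or fail loudly when the judge's count cannot be real."""
    if total <= 0 or not 0 <= covered <= total:
        raise ValueError(f"impossible coverage: {covered}/{total}")
    return covered / total


# Illustrative values only.
print(summarize([13.0, 14.0, 12.5, 13.5, 14.5]))
print(bounded_coverage(60, 75))
```

Raising instead of clamping is deliberate: a count above the checklist size means the judge is broken, and a clamped 1.0 would hide that.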
Eval fidelity: model production, or you measure nothing
An eval is only worth its verdict if it feeds the model what production feeds it
and exercises the path production runs. Three failure modes this epic closed
(#886, the ask/memory eval lift):
- Feed the production-shaped context, not a barer frame. Every real /ask
supplies channel_descriptor (“where am I”) + asker_name (“who’s asking”) +
the rich message buffer (reactions / reply-graph / author-tags). The synthetic
cases passed none of it, so the model saw a thinner frame than any live call and
the verdict measured a different distribution. Fix: feed the same context the
cog feeds (cogs/ask.py), verbatim shape.
- Wire the REAL tools/pipeline, never a reimplementation. eval_ask_tools
now builds the real SportsDataHub (ApiSportsProvider + SgoProvider) and
calls the real format_scoreboard; the thin handlers just call
hub.live_games()/upcoming_games()/recent_finals() exactly as the cog does.
A handler that reimplements cog logic in the eval drifts from prod silently —
then it grades a fiction. Reuse the production function; the eval’s job is
routing + answer quality, not a parallel implementation.
- Grade the model actually SHIPPED per surface — read it from the DB (cost
discipline). Don’t grade every model every run. The scheduled suite reads each
surface’s enabled model from the per-guild /menu Models override
(settings KV) via resolve_eval_models/merge_enabled: flagless → the single
enabled model (the surface default UNION any guild override), so it pays 2x ONLY
when a guild genuinely split the surface across models. Wiring detail worth
remembering: GitHub’s runners cannot reach Railway’s internal DATABASE_URL
host, so the read uses DATABASE_PUBLIC_URL (the public proxy) and is fully
fail-open — no URL / unreachable → fall back to the surface default, never block
the suite on a DB hiccup.
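The fail-open read in the last bullet might look roughly like this. It is a sketch under stated assumptions, not the real `resolve_eval_models`/`merge_enabled` code: `resolve_models_fail_open` and the injected `fetch_override` callable are hypothetical, and the "default UNION override" semantics are my reading of the description above.

```python
import os


def resolve_models_fail_open(surface, surface_default, fetch_override, env=None):
    """Return the models to grade for a surface; any DB problem falls back to the default.

    fetch_override(url, surface) is a hypothetical callable returning the guild's
    override list (possibly empty) and may raise on connection errors.
    """
    env = os.environ if env is None else env
    url = env.get("DATABASE_PUBLIC_URL")
    if not url:
        # No URL configured: grade the surface default only.
        return [surface_default]
    try:
        override = fetch_override(url, surface)
    except Exception:
        # Fail open on purpose: a DB hiccup must never block the suite.
        return [surface_default]
    # Assumed semantics: surface default UNION guild override, order preserved.
    models = [surface_default]
    for model in override or []:
        if model not in models:
            models.append(model)
    return models
```

With this shape the suite grades two models only when the override names a model other than the default, which is the cost behavior the bullet describes.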
Validate the gap is real before you build the fix (proportionality)
The fix is usually smaller than the issue makes it sound — measure the gap before
building machinery for it. Worked example (#892): the issue asked for a
live-embedding behavioral eval of chunked recall, and I started building it. On
inspection the gap had mostly already been closed: the chunk mechanism (the
two-stage rank_by_similarity rerank, base64-f32 encode/decode) was already
deterministically unit-tested in tests/test_embeddings.py and gating CI, and a
live-embedding eval would have been dormant anyway (evals.yml has no
OPENAI_API_KEY, so embedding cases self-skip). The one genuinely untested seam
was the cog wiring — that cogs/ask.py:_vector_recall passes
chunk_rerank_top through. The proportional fix was a ~10-line deterministic
wiring test, not a live eval harness. Pattern: enumerate what already covers the
risk (unit tests, an existing eval, a dormant-in-CI gate) before adding more —
sunk-cost on a half-built heavier fix is not a reason to ship it. (Pairs with the
CLAUDE.md “proportional to the problem” rule and the #567 resume-spine → 10-line
cog_unload example.)
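For a sense of scale, a deterministic wiring test of the kind described can be about this small. This is a sketch: `vector_recall_stand_in` is a hypothetical stand-in, not the real `cogs/ask.py:_vector_recall`, whose signature is not shown here. The real test would patch the ranker the cog actually calls.

```python
from unittest.mock import Mock


def vector_recall_stand_in(query, ranker, chunk_rerank_top):
    """Hypothetical stand-in for the cog method: it only forwards the setting to the ranker."""
    return ranker(query, chunk_rerank_top=chunk_rerank_top)


def test_chunk_rerank_top_is_passed_through():
    # The mock records how it was called; no embeddings or network are involved.
    ranker = Mock(return_value=["note-1"])
    result = vector_recall_stand_in("who won", ranker, chunk_rerank_top=3)
    ranker.assert_called_once_with("who won", chunk_rerank_top=3)
    assert result == ["note-1"]
```

The test asserts only the seam (the argument reaches the ranker) and leaves ranking quality to the existing unit tests.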
The standing directive
Apply this to every piece of text Opus sees — constitution, persona,
per-surface prompts, tool descriptions, context-block labels, eval judges. Each is
a lever to be optimized by cutting and rewording for Opus, not a place to accrete
rules. When you’re tempted to add: stop, find the cause, cut it instead.
See also
- docs/opus_length_drift.md (#856/#858): the worked investigation behind the
ask-roast example: the full ablation table, the 4-layer prompt assembly, the
cross-layer tension, and verbatim before/after rewords of the Layer-1 roast rule.
- Epic #849 (per-model prompt + eval rebuild) and its milestones: ask #859,
memory #860, recap #861, discourse #862, other tiers #863, eval coverage #886.
- The eval-fidelity arc: #886 (coverage lift), #887 (DB-read model axis), #890
(production context + real sports tools), #892 (the proportionality correction),
#893 (the judge-calibration correction).