tootsies

a discord bot for the tootsies server. ask, recap, discuss, ship features by typing.

Project maintained by mejasonmejason Hosted on GitHub Pages — Theme by mattgraham

Opus on discourse: the lean rebuild (Epic #849)

Question (owner): “how do we feel about using opus for discourse i’m not even sure”

Answer: Opus on a LEAN discourse prompt is the strongest of the three conditions measured (Sonnet-legacy, Opus-legacy, Opus-lean). Shipped a per-model lean discourse rebuild (the ask_core pattern from ask, applied to discourse).

This doc corrects an earlier conclusion in its own history. The short version: the first measurement compared Opus on the Sonnet-tuned prompt against Sonnet, found Opus worse, and (wrongly) concluded “keep Sonnet, don’t build.” That was the exact confound the epic exists to fix — Opus on un-rebuilt text drifts. On a lean prompt the result flips hard.

The arc (what actually happened, kept honest)

Legacy head-to-head looked like Sonnet won. scripts/dryrun_opus_discourse.py (8 grounded topics, same prompt to both): Sonnet edged the same-topic head-to-head ~6-2 at equal quality. Conclusion at the time: keep Sonnet. But both models ran the legacy 43k Sonnet-tuned prompt — so this measured “Opus on the wrong prompt,” not “Opus is worse at discourse.”
The head-to-head judge is noisy. Re-run under the same conditions, it swung 6-2 -> 4-4 -> 2-6. It had been over-weighted in both directions; the stable per-post metrics (below) are the trustworthy signal.
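A quick way to see how little one 8-topic head-to-head can say is an exact sign test against a coin flip. This is an illustrative sketch, not something from the repo; the function name is made up here.

```python
from math import comb

def two_sided_sign_test(wins: int, n: int) -> float:
    """Exact two-sided binomial p-value against a fair coin (p = 0.5)."""
    k = max(wins, n - wins)  # size of the more extreme side
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# The three head-to-head splits seen across re-runs (Sonnet wins out of 8).
for wins in (6, 4, 2):
    print(f"{wins}-{8 - wins}: p = {two_sided_sign_test(wins, 8):.3f}")
```

At n=8, a 6-2 split in either direction has p of about 0.29, so it is well within what a fair coin produces. That is consistent with the judge flipping between re-runs.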

Give Opus a lean prompt and it flips. A lean discourse core (the open-loop reframe + reused/rebuilt blocks, run via skip_persona/lean_persona), measured BEFORE (prod Opus-legacy) vs AFTER (lean), 16 topics:

metric	BEFORE (Opus-legacy)	AFTER (Opus-lean)
quality (harsh `must_post` judge)	1/16	6/16
open-question (opens a question vs closes a verdict)	5/16	14/16
flag-planted (her OWN take vs a naked poll)	10/16	15/16
mean sentences	2.3	2.4

Sonnet-legacy sat at quality ~2/8, open ~3-4/8 across runs — so lean Opus clears both legacy conditions on every stable axis.
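To compare the 16-topic counts against the 8-topic Sonnet-legacy figures, it helps to read them as rates. A small sketch using only the numbers in the table; the dictionary layout and function name are illustrative.

```python
# Counts out of 16 topics, copied from the BEFORE/AFTER table.
TOPICS = 16
BEFORE = {"quality": 1, "open-question": 5, "flag-planted": 10}
AFTER = {"quality": 6, "open-question": 14, "flag-planted": 15}

def rate_shift(before, after, n):
    """Return {metric: (before_rate, after_rate, change)} as fractions of n."""
    return {
        metric: (
            before[metric] / n,
            after[metric] / n,
            (after[metric] - before[metric]) / n,
        )
        for metric in before
    }

for metric, (b, a, d) in rate_shift(BEFORE, AFTER, TOPICS).items():
    print(f"{metric:14s} {b:6.1%} -> {a:6.1%}  ({d:+.1%})")
```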

What’s in the lean core (and why), all evidence-backed

The keystone is the TASK reframe: discourse is an OPENING, not an answer — plant your flag first, then hand the room a question it argues about. That one change drove open-question 1/8 -> 8/8 and (with “never a naked poll”) flag-planted 10/16 -> 15/16.

The walk-through, block by block:

10 inline legacy blocks -> 4: open-loop TASK, LENGTH, STATE-THE-FACT, link line. (LINK/REPASTE/ACTIVE/QUIET/CALIBRATION-triplet collapsed or cut.)
5 shared room blocks (_POST_GROUNDING+4 lessons, _ROOM_DIRECTED, _VOICE_REMINDER, _LENGTH_RULES, _TOOL_DISCIPLINE) -> cut for Opus. Voice/length live in the lean persona; the room-specific ones (_ROOM_DIRECTED + the lessons) were rebuilt as lean Opus versions AND ablated — they didn’t help (substance flat) and made drift worse (naming a pattern to forbid primes Opus to produce it), so cut on evidence.
REGULARS reuses _OPUS_REGULARS (surface-neutral). The link rule is discourse-framed inline, NOT the reused _OPUS_CITE/_OPUS_TOOLDISC: those are ask-1:1-framed (“the user”, “your answer”), and the wired dry run + an 8-topic A/B showed that framing leaks meta-commentary into a room post (META-leak 2/8 with them vs 0/8 with the discourse-framed line, flag 7/8 -> 8/8).

Result: 43,403 -> ~5,400 chars (~87% smaller).
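A minimal sketch of the shape this leaves: four named blocks joined into one core, plus the size arithmetic. The block texts are placeholders (TASK and LINK paraphrase this doc; the other two are stubs), and compose_core / shrink are hypothetical helper names, not the repo's.

```python
# Placeholder texts: the real block wording lives in the repo and is not reproduced here.
LEAN_BLOCKS = [
    ("TASK", "Discourse is an opening, not an answer: plant your flag first, "
             "then hand the room a question it argues about."),
    ("LENGTH", "<length rule>"),
    ("STATE-THE-FACT", "<state-the-fact rule>"),
    ("LINK", "Never invent a URL; linkless beats skipping."),
]

def compose_core(blocks):
    """Join named blocks into one prompt string, preserving order."""
    return "\n\n".join(f"{name}: {text}" for name, text in blocks)

def shrink(old_chars, new_chars):
    """Fraction by which the prompt got smaller."""
    return 1 - new_chars / old_chars

core = compose_core(LEAN_BLOCKS)
print(len(LEAN_BLOCKS), "blocks,", len(core), "chars in this placeholder core")
# 5,400 is the approximate figure from the text, so the percentage is approximate too.
print(f"legacy 43,403 -> lean ~5,400 chars: {shrink(43_403, 5_400):.1%} smaller")
```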

One more real-path fix the dry run caught: the discourse forced-search retry (a second _call with thinking OFF when the first surfaces no URLs) degrades the lean Opus output. In an n=10 diagnostic, forced posts ran ~4.0 sentences with selection-narration meta-leak, vs ~2.3 sentences, tight and open, for the thinking-ON primary. So the forced retry is skipped on the lean Opus path: the lean core already says “never invent a URL / linkless beats skipping”, and enforce_source_links strips any hallucinated URL. Legacy/Sonnet keeps the retry. Post-fix wired dry run: open 10/10, flag 10/10, mean 2.4, forced 0/10.
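The retry gate can be sketched roughly as below. This is an assumption-heavy illustration: call(thinking) is a hypothetical stand-in for the bot's _call, the URL check is a simple regex, and discourse_post is not the real function name.

```python
import re
from typing import Callable, List, Optional

URL_RE = re.compile(r"https?://\S+")

def discourse_post(call: Callable[[bool], str], discourse_core: Optional[str]) -> str:
    """Return the post text, running the forced retry only on the legacy path.

    `call(thinking)` is a hypothetical stand-in for the bot's `_call`.
    """
    primary = call(True)  # primary attempt, thinking ON
    if URL_RE.search(primary):
        return primary  # links surfaced, no retry needed
    if discourse_core is not None:
        return primary  # lean Opus path: a linkless post beats a forced retry
    return call(False)  # legacy/Sonnet path: forced-search retry, thinking OFF

# Fake model call that records how it was invoked.
calls: List[bool] = []

def fake_call(thinking: bool) -> str:
    calls.append(thinking)
    return "hot take, no link" if thinking else "forced post https://example.com/a"

print(discourse_post(fake_call, discourse_core="<lean core>"))  # one call only
print(discourse_post(fake_call, discourse_core=None))           # primary, then retry
```

Keying the gate on whether a lean core is set, rather than on the model name, keeps it on the same seam as the rest of the per-model wiring.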

Wiring (segmented by model, exactly like ask)

ModelRules.discourse_core: None for Sonnet/Haiku (legacy path byte-identical); set for Opus (lean core via skip_persona/lean_persona). Same rules_for seam as ask_core. test_prompt_lessons still passes (the legacy path keeps composing the shared blocks).

Tests: test_discourse_system_extra_opus_uses_lean_rebuild + test_discourse_system_extra_legacy_unchanged.

Eval: scripts/eval_opus_discourse.py (registered, trend-only). It is behavioral, mirroring eval_recap (#881): it exercises the REAL discourse() surface on BOTH Sonnet and Opus, with link-rich scenarios (feeds + enriched links + Perplexity + recent chatter + live web_search) AND bare ones, plus a deterministic no-repaste check and two judge-teeth goldens (a closed-loop verdict + a naked poll, both must fail). On the real surface the open-loop gap is stark: Opus 3/4 vs Sonnet 0/4 (Sonnet runs the legacy prompt, which doesn’t enforce open-loop).
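A minimal sketch of the discourse_core seam, assuming ModelRules is a dataclass-like record looked up through rules_for. Field defaults, the registry dict, the model keys, and the discourse_system_extra signature are all assumptions for illustration; the real class presumably carries more fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class ModelRules:
    # Only the discourse seam is sketched; other fields (e.g. ask_core) are omitted.
    discourse_core: Optional[str] = None

# Hypothetical registry: only Opus carries a lean discourse core.
_RULES: Dict[str, ModelRules] = {
    "opus": ModelRules(discourse_core="<lean discourse core>"),
}

def rules_for(model: str) -> ModelRules:
    """Models without an entry fall back to the legacy defaults."""
    return _RULES.get(model, ModelRules())

def discourse_system_extra(model: str, legacy_blocks: List[str]) -> Tuple[str, bool]:
    """Return (prompt text, lean_persona flag) for the given model."""
    core = rules_for(model).discourse_core
    if core is None:
        return "\n\n".join(legacy_blocks), False  # legacy path, composed as before
    return core, True  # lean core replaces the shared blocks

legacy = ["_POST_GROUNDING", "_ROOM_DIRECTED", "_VOICE_REMINDER"]
print(discourse_system_extra("sonnet", legacy))
print(discourse_system_extra("opus", legacy))
```

Because the legacy branch never reads the new field, Sonnet/Haiku output stays byte-identical, which is what the legacy_unchanged test pins down.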

The default stays Sonnet (owner’s call to flip)

discourse still DEFAULTS to Sonnet; this lights up only for a guild that opts into Opus discourse on the /menu Models page. The lean rebuild makes that opt-in genuinely strong. Flipping the default to Opus is a separate decision left to the owner — the per-post metrics support it, but it’s a default change, not a code one.
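In code terms, the opt-in amounts to a per-guild override over a fixed default. A sketch with made-up names (the settings store and key layout are assumptions, not the bot's actual /menu storage):

```python
from typing import Dict

DEFAULT_DISCOURSE_MODEL = "sonnet"

def discourse_model_for(guild_id: int, opt_ins: Dict[int, str]) -> str:
    """Return the guild's chosen discourse model, or the default if it never opted in."""
    return opt_ins.get(guild_id, DEFAULT_DISCOURSE_MODEL)

# Hypothetical guild IDs: one opted into Opus, one left untouched.
opt_ins = {111: "opus"}
print(discourse_model_for(111, opt_ins))
print(discourse_model_for(222, opt_ins))
```

Flipping the default would then be a one-constant change, which is why it is treated as a decision rather than a build.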