tootsies

a discord bot for the tootsies server. ask, recap, discuss, ship features by typing.


Project maintained by mejasonmejason Hosted on GitHub Pages — Theme by mattgraham

Opus on discourse: the lean rebuild (Epic #849)

Question (owner): “how do we feel about using opus for discourse i’m not even sure”

Answer: Opus on a LEAN discourse prompt is the strongest of the three conditions measured (Opus-legacy, Sonnet-legacy, Opus-lean). Shipped: a per-model lean discourse rebuild, i.e. the ask_core pattern from ask, applied to discourse.

This doc corrects an earlier conclusion in its own history. The short version: the first measurement compared Opus on the Sonnet-tuned prompt against Sonnet, found Opus worse, and (wrongly) concluded “keep Sonnet, don’t build.” That was the exact confound the epic exists to fix — Opus on un-rebuilt text drifts. On a lean prompt the result flips hard.

The arc (what actually happened, kept honest)

  1. Legacy head-to-head looked like Sonnet won. scripts/dryrun_opus_discourse.py (8 grounded topics, same prompt to both): Sonnet edged the same-topic head-to-head ~6-2 at equal quality. Conclusion at the time: keep Sonnet. But both models ran the legacy 43k Sonnet-tuned prompt — so this measured “Opus on the wrong prompt,” not “Opus is worse at discourse.”

  2. The head-to-head judge is noisy. Re-run under the same conditions, it swung 6-2 -> 4-4 -> 2-6. It was over-weighted in both directions; the stable per-post metrics (below) are the trustworthy signal.

  3. Give Opus a lean prompt and it flips. A lean discourse core (the open-loop reframe + reused/rebuilt blocks, run via skip_persona/lean_persona), measured BEFORE (prod Opus-legacy) vs AFTER (lean), 16 topics:

    | metric | BEFORE (Opus-legacy) | AFTER (Opus-lean) |
    |---|---|---|
    | quality (harsh must_post judge) | 1/16 | 6/16 |
    | open-question (opens a question vs closes a verdict) | 5/16 | 14/16 |
    | flag-planted (her OWN take vs a naked poll) | 10/16 | 15/16 |
    | mean sentences | 2.3 | 2.4 |

    Sonnet-legacy sat at quality ~2/8, open ~3-4/8 across runs — so lean Opus clears both legacy conditions on every stable axis.
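
To put the judge swing from step 2 in perspective, here is a minimal sketch of how often a judge with no real preference would still produce a split at least as lopsided as 6-2 over 8 topics. The coin-flip null model and the function name `prob_split_at_least` are assumptions made for illustration; nothing here comes from scripts/dryrun_opus_discourse.py.

```python
from math import comb


def prob_split_at_least(n_topics: int, wins: int) -> float:
    """Chance that a 50/50 judge hands EITHER side at least `wins` of `n_topics`.

    Assumption (not from the repo): each topic is an independent coin flip.
    """
    if wins * 2 <= n_topics:
        # Every possible outcome gives some side at least half the topics.
        return 1.0
    # For wins > n/2 the two "side X wins big" events cannot overlap,
    # so the two-sided probability is twice the one-sided tail.
    one_side = sum(comb(n_topics, k) for k in range(wins, n_topics + 1))
    return 2 * one_side / 2 ** n_topics


p = prob_split_at_least(8, 6)
print(f"P(6-2 or more lopsided | coin-flip judge) = {p:.3f}")
```

Under that assumption roughly 29% of 8-topic runs would land on 6-2 or worse in one direction or the other, which fits treating a single head-to-head result as weak evidence next to the per-post metrics.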

What’s in the lean core (and why), all evidence-backed

The keystone is the TASK reframe: discourse is an OPENING, not an answer — plant your flag first, then hand the room a question it argues about. That one change drove open-question 1/8 -> 8/8 and (with “never a naked poll”) flag-planted 10/16 -> 15/16.

The walk-through, block by block:

Result: 43,403 -> ~5,400 chars (~87% smaller).
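
The size reduction can be checked with a few lines of arithmetic. This is only a sketch: the lean figure is given as "~5,400", so treating it as exactly 5,400 is an assumption.

```python
legacy_chars = 43_403
lean_chars = 5_400  # assumption: the doc only gives "~5,400"

# Fraction of the legacy prompt that was removed.
reduction = 1 - lean_chars / legacy_chars
# How many times longer the legacy prompt is than the lean one.
ratio = legacy_chars / lean_chars

print(f"{reduction:.1%} smaller, legacy is {ratio:.1f}x the lean size")
```

With those inputs the reduction comes out near 87.6%, in line with the rounded figure above.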

One more real-path fix the dry run caught: the discourse forced-search retry (a second _call with thinking OFF when the first surfaces no URLs) degrades the lean Opus output. In an n=10 diagnostic, forced posts ran ~4.0 sentences with selection-narration meta-leak, versus ~2.3 sentences (tight + open) for the thinking-ON primary. So the forced retry is skipped on the lean Opus path; legacy/Sonnet keeps it. Skipping is safe because the lean core already says “never invent a URL / linkless beats skipping”, and enforce_source_links strips any hallucinated URL. Post-fix wired dry run: open 10/10, flag 10/10, mean 2.4, forced 0/10.
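
The retry decision described above can be sketched roughly as follows. Everything in this block is hypothetical: `Draft`, `URL_RE`, `discourse_with_optional_retry`, and the `call(thinking_on)` shape are stand-ins, since the real `_call` signature is not shown here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical URL detector; the real check may differ.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class Draft:
    text: str


def discourse_with_optional_retry(
    call: Callable[[bool], Draft],  # hypothetical: call(thinking_on) -> Draft
    lean_core: Optional[str],
) -> Draft:
    """Run the primary pass, then decide whether to force a search retry."""
    first = call(True)  # primary pass, thinking ON
    if URL_RE.search(first.text):
        return first  # a URL surfaced, no retry needed on any path
    if lean_core is not None:
        # Lean Opus path: skip the forced retry and keep the linkless post.
        return first
    # Legacy/Sonnet path: forced-search retry with thinking OFF.
    return call(False)


if __name__ == "__main__":
    seen = []

    def fake_call(thinking_on: bool) -> Draft:
        seen.append(thinking_on)
        return Draft("a linkless opening question")

    discourse_with_optional_retry(fake_call, lean_core="lean core text")
    print(seen)
```

The point of the sketch is that the branch keys off whether a lean core is set, so the legacy path keeps its existing behavior untouched.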

Wiring (segmented by model, exactly like ask)

ModelRules.discourse_core is None for Sonnet/Haiku (legacy path byte-identical) and set for Opus (lean core via skip_persona/lean_persona). It uses the same rules_for seam as ask_core. test_prompt_lessons still passes, because the legacy path keeps composing the shared blocks.

Tests: test_discourse_system_extra_opus_uses_lean_rebuild + test_discourse_system_extra_legacy_unchanged.

Eval: scripts/eval_opus_discourse.py (registered, trend-only). It is behavioral, mirroring eval_recap (#881): it exercises the REAL discourse() surface on BOTH Sonnet and Opus, with link-rich scenarios (feeds + enriched links + Perplexity + recent chatter + live web_search) AND bare ones, plus a deterministic no-repaste check and two judge-teeth goldens (a closed-loop verdict + a naked poll, both must fail). On the real surface the open-loop gap is stark: Opus 3/4 vs Sonnet 0/4 (Sonnet runs the legacy prompt, which doesn’t enforce open-loop).
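
A minimal sketch of the per-model seam, assuming a shape the source does not spell out: the model keys, the placeholder core text, and the `discourse_system_extra(model, legacy_blocks)` signature are all guesses for illustration, and only the names `ModelRules`, `discourse_core`, and `rules_for` come from the doc.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Placeholder standing in for the real lean core text (not reproduced here).
LEAN_DISCOURSE_CORE = "<lean discourse core text>"


@dataclass(frozen=True)
class ModelRules:
    # None means "use the legacy composed prompt unchanged".
    discourse_core: Optional[str] = None


# Hypothetical registry; the real keys and lookup may differ.
_RULES: Dict[str, ModelRules] = {
    "opus": ModelRules(discourse_core=LEAN_DISCOURSE_CORE),
    "sonnet": ModelRules(),
    "haiku": ModelRules(),
}


def rules_for(model: str) -> ModelRules:
    """Return the rules for a model, falling back to legacy behavior."""
    return _RULES.get(model, ModelRules())


def discourse_system_extra(model: str, legacy_blocks: str) -> str:
    """Pick the lean core when one is set, else pass the legacy text through."""
    core = rules_for(model).discourse_core
    return legacy_blocks if core is None else core


if __name__ == "__main__":
    print(discourse_system_extra("sonnet", "LEGACY BLOCKS"))
    print(discourse_system_extra("opus", "LEGACY BLOCKS"))
```

Keeping `None` as the legacy signal is what lets the Sonnet/Haiku output stay byte-identical: the legacy string is returned as-is and never rebuilt.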

The default stays Sonnet (owner’s call to flip)

discourse still DEFAULTS to Sonnet; this lights up only for a guild that opts into Opus discourse on the /menu Models page. The lean rebuild makes that opt-in genuinely strong. Flipping the default to Opus is a separate decision left to the owner — the per-post metrics support it, but it’s a default change, not a code one.
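
As a rough illustration of "default change, not a code one", the per-guild opt-in could look like the sketch below. The names `DEFAULT_DISCOURSE_MODEL` and `discourse_model_for`, and the dict of overrides, are hypothetical; how the /menu Models page actually stores the choice is not shown in this doc.

```python
from typing import Dict

# Assumption: a single constant holds the default discourse model.
DEFAULT_DISCOURSE_MODEL = "sonnet"


def discourse_model_for(guild_id: int, overrides: Dict[int, str]) -> str:
    """Return the guild's opted-in model, or the default when none is set."""
    return overrides.get(guild_id, DEFAULT_DISCOURSE_MODEL)


if __name__ == "__main__":
    opted_in = {1001: "opus"}  # hypothetical guild that opted into Opus
    print(discourse_model_for(1001, opted_in))
    print(discourse_model_for(2002, opted_in))
```

In a layout like this, flipping the default would mean changing one constant, while guilds that already opted in keep their explicit choice.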