Send her a voice note and she answers out loud. Share a video and she actually knows what was said in it — not just that one was posted.
The one idea: Tootsies isn’t text-only. Talk to her with a Discord voice note and she talks back; drop a YouTube or X clip and she follows along like she watched it.
A voice note in, a voice note out. No files, no file cards — a real native Discord voice note.
Voice in → voice out
the full flow
For voice in, the only trigger is a reply to one of her messages using a Discord voice note. A voice note has no text body, so she can’t read an @mention in it, and a direct reply is the one sure signal it’s meant for her.
She never sends a file attachment — the reply shows up as a real Discord voice note, same as one you’d record yourself.
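The trigger rule above can be sketched as a small check. This is a rough illustration, not Tootsies’ actual code: it assumes the message arrives as a dict shaped like Discord’s message object (`flags`, `referenced_message`), and the function name and `bot_user_id` parameter are made up for the example.

```python
from typing import Any, Mapping, Optional

# Discord's documented message flag for voice messages (1 << 13).
IS_VOICE_MESSAGE = 1 << 13


def is_voice_reply_to_bot(message: Mapping[str, Any], bot_user_id: str) -> bool:
    """Return True only for a voice note sent as a reply to one of the bot's messages."""
    # A voice note carries the IS_VOICE_MESSAGE bit in its flags.
    is_voice = bool(message.get("flags", 0) & IS_VOICE_MESSAGE)

    # A reply carries the message it answers in `referenced_message`.
    replied_to: Optional[Mapping[str, Any]] = message.get("referenced_message")
    if not is_voice or not replied_to:
        return False

    # Hypothetical check: the replied-to message must be authored by the bot.
    author = replied_to.get("author") or {}
    return author.get("id") == bot_user_id
```

A voice note that isn’t a reply, or a reply to someone else, returns False, which matches the idea that a direct reply is the one sure signal.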
Video “listening”
what happens when you drop a link
Just share a YouTube or X/Twitter link normally; no special command is needed. She works it out in the background so the channel doesn’t wait around. She knows a freshly posted clip a moment later, and one she has seen before instantly.
Any language, any length — captions are the fast path, audio transcription is the fallback. She also grabs the title, channel, and run time for every clip.
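Since there’s no command, spotting a clip comes down to noticing links in ordinary messages. Here is a minimal sketch of that step. The patterns and the `find_clip_links` name are illustrative assumptions, and real links come in more shapes than these two patterns cover.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; they cover common link shapes, not all of them.
YOUTUBE_RE = re.compile(
    r"https?://(?:www\.|m\.)?(?:youtube\.com/(?:watch\?v=|shorts/)|youtu\.be/)([\w-]{11})"
)
X_RE = re.compile(
    r"https?://(?:www\.)?(?:twitter|x)\.com/(\w+)/status/(\d+)"
)


def find_clip_links(text: str) -> List[Tuple[str, str]]:
    """Return (platform, id) pairs for every supported link in a message."""
    # YouTube hits are listed first, then X/Twitter hits.
    found = [("youtube", m.group(1)) for m in YOUTUBE_RE.finditer(text)]
    found += [("x", m.group(2)) for m in X_RE.finditer(text)]
    return found
```

A message with no supported link yields an empty list, so normal chat passes through untouched.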
Captions first, then listen
cheapest route that works
Reading a clip doesn’t always mean downloading the audio. She climbs down a cost ladder and stops as soon as she has what she needs.
She stops at rung 1 when she can — captions are free and don’t cost STT time. Only on a miss does she fall to the next rung.
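The ladder idea can be sketched as “try each rung in order, stop at the first hit.” Only two rungs are shown here, captions then audio transcription, because those are the ones named above. `fetch_captions` and `transcribe_audio` are hypothetical stand-ins, not real fetchers.

```python
from typing import Callable, Optional, Sequence, Tuple

# A rung takes a clip URL and returns a transcript, or None on a miss.
Rung = Callable[[str], Optional[str]]


def read_clip(
    url: str, ladder: Sequence[Tuple[str, Rung]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each rung in order, cheapest first, and stop at the first hit."""
    for name, rung in ladder:
        transcript = rung(url)
        if transcript:  # a hit: skip the more expensive rungs
            return name, transcript
    return None, None


# Hypothetical stand-ins for the real fetchers.
def fetch_captions(url: str) -> Optional[str]:
    return "caption text" if "captioned" in url else None


def transcribe_audio(url: str) -> Optional[str]:
    return "transcribed audio"


LADDER = [("captions", fetch_captions), ("audio", transcribe_audio)]
```

Keeping the rungs in a plain ordered list means the cost order lives in one place, and the transcription rung never runs when captions are found.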
She reads it in the background
the channel never waits
A video can be minutes long. She doesn’t hold up the chat while she fetches and transcribes it — she kicks that work off in the background the moment a link lands, and by the time someone asks her about it she already knows.
The fetch is async. If a clip was just posted she’ll know it a moment later; if it’s been shared before she knows it immediately.
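One way to get “a moment later the first time, immediately after that” is to start a background task per link and keep it around. The sketch below assumes an asyncio-based bot. `ClipMemory` and its method names are invented for illustration, and the `fetch` callable stands in for whatever actually pulls and transcribes the clip.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class ClipMemory:
    """Starts one background task per link and remembers the result."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        self._fetch = fetch
        self._tasks: Dict[str, "asyncio.Task[str]"] = {}

    def on_link(self, url: str) -> None:
        # Called when a link lands; returns at once so chat is not blocked.
        # Must be called from inside a running event loop.
        if url not in self._tasks:
            self._tasks[url] = asyncio.create_task(self._fetch(url))

    async def recall(self, url: str) -> str:
        # Called when someone asks about the clip. A finished task answers
        # right away; an in-flight one is awaited until it completes.
        self.on_link(url)
        return await self._tasks[url]
```

Because the task itself is cached, a repeat link never triggers a second fetch. A real version would also want to drop failed tasks so a bad fetch can be retried.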
Even X/Twitter clips
not just YouTube
X/Twitter links have their own path: she hits the public FixTweet API to pull the tweet text and, when there’s a video attached, downloads that too. If the CDN isn’t reachable from her server the tweet text is still the floor — she’s never blind to an X post.
X/Twitter bypasses the usual video fetcher and goes through FixTweet instead — no guest tokens, no auth. Tweet text is always the floor even if the video CDN is unreachable.
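The “text is the floor” behavior might look roughly like this. The endpoint shape and the response fields (`tweet.text`, `tweet.media.videos[].url`) are assumptions about the public FixTweet API, so check them against its documentation. `read_x_post` and the `download` callable are hypothetical names for the example.

```python
from typing import Any, Callable, Dict, Mapping, Optional


def fixtweet_api_url(screen_name: str, status_id: str) -> str:
    # Assumed endpoint shape for the public FixTweet API.
    return f"https://api.fxtwitter.com/{screen_name}/status/{status_id}"


def read_x_post(
    payload: Mapping[str, Any],
    download: Callable[[str], Optional[bytes]],
) -> Dict[str, Any]:
    """Always keep the tweet text; add the video only if the download works."""
    tweet = payload.get("tweet") or {}
    result: Dict[str, Any] = {"text": tweet.get("text", ""), "video": None}

    videos = (tweet.get("media") or {}).get("videos") or []
    if videos:
        try:
            result["video"] = download(videos[0]["url"])
        except OSError:
            # CDN unreachable: keep the text and carry on without the video.
            pass
    return result
```

The text is stored before any download is attempted, so a network failure on the video can only ever cost the video.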
What powers it
the providers, named simply
Three kinds of tool behind the scenes, each doing one job.
Voice is ElevenLabs end-to-end. Video is a clip fetcher per platform plus a cheap summarizer pass for anything long or not in English.
How to use it
You don’t need any special command for either feature.
to talk to her out loud
reply to one of her messages
record a Discord voice note
send it — she answers back spoken
to share a video she’ll follow
paste a YouTube or X/Twitter link
she picks it up in the background
ask her about it a moment later
Both features are always on when voice is switched on for the server. A typed @mention can also come back spoken or sung when the moment calls for it — text is still the default.
under the hood: ElevenLabs (speech + transcription) · yt-dlp (YouTube) · fxtwitter (X / Twitter) · Claude Haiku (clip summaries)
talk to her: reply to one of her messages with a voice note