voice & video · how it works

Tootsies can hear and speak

Send her a voice note and she answers out loud. Share a video and she actually knows what was said in it — not just that one was posted.

The one idea: Tootsies isn’t text-only. Talk to her with a Discord voice note and she talks back; drop a YouTube or X clip and she follows along like she watched it.
🎤 voice note (reply to Tootsies) → 🎵 Tootsies hears → thinks → speaks → 🎤 voice note (back to you)
A voice note in, a voice note out. No files, no file cards — a real native Discord voice note.

Voice in → voice out

the full flow

The only trigger is a reply to one of her messages using a Discord voice note. That’s because a voice note has no text body — she can’t read an @mention in it — so a direct reply is the one sure signal it’s meant for her.

  • 🎤 you speak: reply to Tootsies with a voice note
  • 👒 she transcribes: ElevenLabs Scribe turns audio → text
  • 🧠 she answers: same brain as a typed @mention
  • 🎤 she speaks back: native voice note (or sings, if the mood calls for it)
  • a typed @mention can also come back spoken or sung; text is still the default
She never sends a file attachment — the reply shows up as a real Discord voice note, same as one you’d record yourself.
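
For the curious, the trigger rule above can be sketched in a few lines. This is a minimal illustration, not her actual code: the `Message` class and `is_voice_reply_to_bot` are hypothetical names made up for the example. The one real detail is the flag value, which is Discord's documented `IS_VOICE_MESSAGE` message flag.

```python
from dataclasses import dataclass
from typing import Optional

# Discord's documented IS_VOICE_MESSAGE message flag (bit 13).
IS_VOICE_MESSAGE = 1 << 13


@dataclass
class Message:
    """Hypothetical, trimmed-down stand-in for a Discord message."""
    author_id: int
    flags: int = 0
    content: str = ""  # empty for a voice note: there is no text body
    reply_to_author_id: Optional[int] = None  # author of the replied-to message


def is_voice_reply_to_bot(msg: Message, bot_id: int) -> bool:
    """True only for a voice note sent as a direct reply to the bot."""
    is_voice_note = bool(msg.flags & IS_VOICE_MESSAGE)
    is_reply_to_bot = msg.reply_to_author_id == bot_id
    return is_voice_note and is_reply_to_bot
```

A voice note that is not a reply, or a typed reply with no voice flag, both come back `False`, which matches the "one sure signal" rule.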

Video “listening”

what happens when you drop a link

Just share a YouTube or X/Twitter link normally — no special command. She works it out in the background so the channel doesn’t wait around. A clip she’s just seen she’ll know about a moment later; one she’s seen before she knows instantly.

  • 📺 clip shared: YouTube or X/Twitter
  • how she gets the words: captions exist → free & fast, she grabs them; no captions → pulls audio + transcribes
  • 📄 long or foreign? summarized down to a short English gist
  • 🎵 she knows it: can reference it in conversation
  • happens in the background: a just-posted clip she’ll know about a moment later; a clip she’s seen before she knows instantly from her cache
Any language. Captions are the fast path; audio transcription is the fallback, for clips up to 30 min. She also grabs the title, channel, and run time for every clip.
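
Since there is no command, the first step is simply spotting a link in an ordinary message. A rough sketch of that step, under the assumption that a couple of regular expressions are enough: the patterns and the `find_clip_links` name are illustrative guesses, and real link formats vary more than this.

```python
import re
from typing import List, Tuple

# Hypothetical patterns: common YouTube and X/Twitter link shapes only.
YOUTUBE_RE = re.compile(
    r"https?://(?:www\.|m\.)?"
    r"(?:youtube\.com/(?:watch\?v=|shorts/)|youtu\.be/)[\w-]{11}\S*"
)
X_RE = re.compile(r"https?://(?:www\.)?(?:x|twitter)\.com/\w+/status/\d+\S*")


def find_clip_links(text: str) -> List[Tuple[str, str]]:
    """Return (platform, url) pairs in the order they appear in the message."""
    hits = []
    for platform, pattern in (("youtube", YOUTUBE_RE), ("x", X_RE)):
        for match in pattern.finditer(text):
            hits.append((match.start(), platform, match.group(0)))
    hits.sort()  # order by position in the message
    return [(platform, url) for _, platform, url in hits]
```

A message with no matching link returns an empty list, so normal chat passes through untouched.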

Captions first, then listen

cheapest route that works

Reading a clip doesn’t always mean downloading the audio. She climbs down a cost ladder and stops as soon as she has what she needs.

  1. Caption track: free · instant · any language · she grabs it and stops here if it exists
  2. Audio download + transcription (no captions): pulls the audio · ElevenLabs Scribe turns speech → text · clips up to 30 min
  3. English summary (🎵 long or foreign): a long clip or foreign-language transcript gets distilled to a compact English gist she can work with
She stops at rung 1 when she can: captions are free and cost no speech-to-text (STT) time. Only on a miss does she fall to the next rung.
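
The ladder reads naturally as a short function. This is a sketch under stated assumptions, not her real pipeline: every callable is passed in as a stand-in for a real provider, `read_clip` is a hypothetical name, and the character threshold for "long" is invented for the example. Only the 30-minute ceiling comes from the ladder above.

```python
from typing import Callable, Optional

MAX_AUDIO_SECONDS = 30 * 60   # the 30-minute ceiling for audio transcription
LONG_TRANSCRIPT_CHARS = 4000  # hypothetical cut-off for what counts as "long"


def read_clip(
    url: str,
    duration_s: int,
    get_captions: Callable[[str], Optional[str]],
    transcribe_audio: Callable[[str], str],
    summarize: Callable[[str], str],
    is_english: Callable[[str], bool],
) -> Optional[str]:
    """Walk the ladder and stop at the cheapest rung that yields text."""
    # Rung 1: captions are free, so try them first.
    text = get_captions(url)

    # Rung 2: only on a caption miss, pay for audio + speech-to-text.
    if text is None:
        if duration_s > MAX_AUDIO_SECONDS:
            return None  # too long to transcribe, nothing to work with
        text = transcribe_audio(url)

    # Rung 3: long or non-English text gets boiled down to an English gist.
    if len(text) > LONG_TRANSCRIPT_CHARS or not is_english(text):
        text = summarize(text)
    return text
```

Passing the providers in as arguments keeps the order of the rungs visible and makes the function easy to exercise with fakes.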

She reads it in the background

the channel never waits

A video can be minutes long. She doesn’t hold up the chat while she fetches and transcribes it — she kicks that work off in the background the moment a link lands, and by the time someone asks her about it she already knows.

  • t = 0: clip posted in the channel
  • fetch starts: background, silent; chat keeps going
  • done: cached, instant next time
  • a moment later, you ask “what did he say?”: she knows
  • no command, no waiting: the channel keeps moving while she works it out
  • a clip she’s seen before is instant from cache: no re-fetch, no re-transcription
The fetch is async. If a clip was just posted she’ll know it a moment later; if it’s been shared before she knows it immediately.
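
One way to picture "async fetch plus cache" is a single background task per clip, kept around and reused. The sketch below uses Python's standard `asyncio`. `ClipReader` and its method names are hypothetical, and the `fetch` coroutine stands in for whatever actually reads the clip.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class ClipReader:
    """Hypothetical helper: one background task per clip URL, reused as a cache."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        self._fetch = fetch
        self._tasks: Dict[str, "asyncio.Task[str]"] = {}

    def on_link_posted(self, url: str) -> None:
        """Start the fetch without awaiting it, so the chat handler returns at once.

        Must be called from code running inside the event loop.
        """
        if url not in self._tasks:
            self._tasks[url] = asyncio.ensure_future(self._fetch(url))

    async def transcript(self, url: str) -> str:
        """Wait for the result; returns immediately if the task already finished."""
        self.on_link_posted(url)  # no-op when the clip has been seen before
        return await self._tasks[url]
```

A clip posted twice starts only one fetch, and a later question awaits the same task. This sketch also remembers failed fetches, so a real version would likely evict those and retry.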

Even X/Twitter clips

not just YouTube

X/Twitter links have their own path: she hits the public FixTweet API to pull the tweet text and, when there’s a video attached, downloads that too. If the CDN isn’t reachable from her server the tweet text is still the floor — she’s never blind to an X post.

  • 🔗 x.com link: posted in chat
  • 📋 FixTweet API: tweet text + video URL
  • what she gets: 💬 tweet text, always (the floor) · 🎬 + video, when attached
X/Twitter bypasses the usual video fetcher and goes through FixTweet instead — no guest tokens, no auth. Tweet text is always the floor even if the video CDN is unreachable.
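
The "text is the floor" idea can be sketched like this. It is an illustration only: `read_x_post`, `XPost`, and both callables are hypothetical, and the FixTweet request itself is left behind the `fetch_tweet` stand-in rather than guessed at here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical pattern for x.com / twitter.com status links.
STATUS_RE = re.compile(r"https?://(?:www\.)?(?:x|twitter)\.com/(\w+)/status/(\d+)")


@dataclass
class XPost:
    text: str                               # always present: the floor
    video_transcript: Optional[str] = None  # filled in only when the video is reachable


def read_x_post(
    url: str,
    fetch_tweet: Callable[[str, str], Tuple[str, Optional[str]]],
    transcribe_video: Callable[[str], str],
) -> Optional[XPost]:
    """Return the tweet text, plus the video transcript when one can be fetched."""
    match = STATUS_RE.match(url)
    if match is None:
        return None  # not an X/Twitter status link
    user, status_id = match.groups()

    # fetch_tweet stands in for the FixTweet lookup: (text, video URL or None).
    text, video_url = fetch_tweet(user, status_id)
    post = XPost(text=text)

    if video_url:
        try:
            post.video_transcript = transcribe_video(video_url)
        except OSError:
            pass  # CDN unreachable: keep the text and carry on
    return post
```

Network errors such as `ConnectionError` are subclasses of `OSError`, so an unreachable CDN leaves the post with its text and no transcript.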

What powers it

the providers, named simply

Three different tools behind the scenes, each doing one job.

  • 🎤 ElevenLabs: hears your voice note, speaks & sings back
  • 📺 Video fetchers: YouTube · X/Twitter; captions + audio + metadata
  • Quick summarizer: long or foreign clips get a short English gist
Voice is ElevenLabs end-to-end. Video is a clip fetcher per platform plus a cheap summarizer pass for anything long or not in English.

How to use it

You don’t need any special command for either feature.

to talk to her out loud

  • reply to one of her messages
  • record a Discord voice note
  • send it — she answers back spoken

to share a video she’ll follow

  • paste a YouTube or X/Twitter link
  • she picks it up in the background
  • ask her about it a moment later

Both features are always on when voice is switched on for the server. A typed @mention can also come back spoken or sung when the moment calls for it — text is still the default.

under the hood: ElevenLabs (speech + transcription) · yt-dlp (YouTube) · fxtwitter (X / Twitter) · Claude Haiku (clip summaries)

talk to her: reply to one of her messages with a voice note