You tokenize the image (split it into patches) and pass it through a vision encoder that is generally trained separately from large-scale pretraining (using, say, a contrastive objective like CLIP's) and then attached to the model later in training. I wouldn't be surprised if the vision encoder is now used during pretraining too; that would be a different objective than next-token prediction, of course (unless they use something like next-token prediction for images, which I don't think is the case).
Different models have different encoders; they are not shared, since the datasets vary across models and even across model sizes. So performance will vary between models.
What you seem to be thinking is that text models simply call an API to a vision model, similar to tool use. That is not what's happening; it is much more built in: the forward pass goes through the vision architecture into the language architecture. Robotics research has been doing this for a while.
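A minimal sketch of that wiring, with made-up toy dimensions and random stand-in weights (real models use a trained transformer encoder and a learned projection): the image is split into patches, encoded, projected into the language model's embedding space, and the resulting image tokens sit in the same sequence as text tokens, so one forward pass covers both.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIS, D_LM, PATCH = 64, 32, 16  # illustrative sizes, not from any real model

# Stand-ins for trained weights: a vision encoder (e.g. contrastively
# pretrained) and a projection layer mapping into the LM's embedding space.
W_vis = rng.normal(size=(PATCH * PATCH * 3, D_VIS)) * 0.02
W_proj = rng.normal(size=(D_VIS, D_LM)) * 0.02

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an HxWx3 image into flattened patches -- the image 'tokens'."""
    h, w, _ = image.shape
    patches = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            patches.append(image[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(patches)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Vision encoder + projection: patches -> LM-space embeddings."""
    feats = np.tanh(patchify(image) @ W_vis)  # toy one-layer 'encoder'
    return feats @ W_proj                     # project into LM embed space

# One joint forward pass: image embeddings are prepended to text embeddings,
# so the language model attends over both -- no API call to a separate model.
image = rng.random((64, 64, 3))
text_embeds = rng.normal(size=(5, D_LM))   # 5 text-token embeddings (toy)
image_embeds = encode_image(image)         # 16 patch tokens for a 64x64 image
sequence = np.concatenate([image_embeds, text_embeds], axis=0)
print(sequence.shape)                      # (21, 32): 16 image + 5 text tokens
```

From the language model's point of view, the 16 image embeddings are just more positions in the input sequence, which is why capability depends on how the encoder and LM were trained together rather than on any tool-use plumbing.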
They might use YouTube: it offers next-frame prediction as an objective, plus multimodal grounding via subtitles and audio.
IIUC the native voice-to-voice models were trained on YouTube-sourced audio.
Skipping any intermediate text form is really helpful for fuzzy speech, such as people slurring or mumbling words. Having access to a full world model while deciphering the audio also obviously helps in very context-heavy situations. Spoken (Kana/phonetic) Japanese is an example: it relies on the listener's understanding of context to disambiguate homophones, and the written language uses non-phonetic Han characters (Kanji) to make up for the reader's inability to interject a clarifying question.