You tokenize the image (split it into patches) and pass it through a vision encoder that is generally trained separately from large-scale pretraining (using, say, a contrastive objective like CLIP's) and then attached to the model later in training. I wouldn't be surprised if the vision encoder is now used during pretraining too; that would be a different objective than next-token prediction, of course (unless they use something like next-token prediction for images, which I don't think is the case).
Different models have different encoders; they are not shared, since the datasets vary across models and even across model sizes. So performance will vary between models.
What you seem to be thinking is that text models simply call an API to a vision model, similar to tool use. That is not what's happening; it is much more built in: the forward pass goes through the vision architecture into the language architecture. Robotics research has been doing this for a while.
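A minimal sketch of that wiring, with made-up toy dimensions and random stand-in weights (real models use a trained transformer encoder and a learned projection): the image is split into patches, encoded, projected into the language model's embedding space, and the resulting image tokens sit in the same sequence as text tokens, so one forward pass covers both.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIS, D_LM, PATCH = 64, 32, 16  # illustrative sizes, not from any real model

# Stand-ins for trained weights: a vision encoder (e.g. contrastively
# pretrained) and a projection layer mapping into the LM's embedding space.
W_vis = rng.normal(size=(PATCH * PATCH * 3, D_VIS)) * 0.02
W_proj = rng.normal(size=(D_VIS, D_LM)) * 0.02

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an HxWx3 image into flattened patches -- the image 'tokens'."""
    h, w, _ = image.shape
    patches = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            patches.append(image[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(patches)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Vision encoder + projection: patches -> LM-space embeddings."""
    feats = np.tanh(patchify(image) @ W_vis)  # toy one-layer 'encoder'
    return feats @ W_proj                     # project into LM embed space

# One joint forward pass: image embeddings are prepended to text embeddings,
# so the language model attends over both -- no API call to a separate model.
image = rng.random((64, 64, 3))
text_embeds = rng.normal(size=(5, D_LM))   # 5 text-token embeddings (toy)
image_embeds = encode_image(image)         # 16 patch tokens for a 64x64 image
sequence = np.concatenate([image_embeds, text_embeds], axis=0)
print(sequence.shape)                      # (21, 32): 16 image + 5 text tokens
```

From the language model's point of view, the 16 image embeddings are just more positions in the input sequence, which is why capability depends on how the encoder and LM were trained together rather than on any tool-use plumbing.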
They might use YouTube: it offers next-frame prediction as an objective, plus multimodal grounding via subtitles and audio.
IIUC the native voice-to-voice models were trained on YouTube-sourced audio.
Skipping any intermediate text form is really helpful for fuzzy speech, such as people slurring or mumbling words. Having access to a full world model while deciphering the audio also obviously helps in very context-heavy situations. Spoken (Kana/phonetic) Japanese is an example: it relies on the listener's understanding of context to disambiguate homophones, and the written language uses non-phonetic Han characters (Kanji) to make up for the reader's inability to interject a clarifying question.