OpenAI's livestream of GPT-4o Image Generation shows that it is slowwwwwwwwww (maybe 30 seconds per image, which Sam Altman had to spin as "it's slow but the generated images are worth it"). Instead of using a diffusion approach, it appears to be generating image tokens and decoding them, akin to the original DALL-E (https://openai.com/index/dall-e/), which allows for streaming partial generations from top to bottom. In contrast, Google's Gemini can generate images and make edits in seconds.
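To make the "streaming from top to bottom" point concrete, here is a toy sketch, assuming image tokens are emitted in raster order (left to right, top to bottom). The sampler is a random stand-in, not a real model, and `sample_next_token`, `stream_rows`, and the vocabulary size are hypothetical names and values chosen for illustration. The only thing it shows is that each row is complete, and could be decoded and displayed, before the rows below it exist.

```python
import random


def sample_next_token(prefix, vocab_size, rng):
    """Stand-in for the model. A real autoregressive model would
    condition on `prefix`; this toy version just draws a random id."""
    return rng.randrange(vocab_size)


def stream_rows(height, width, vocab_size=8192, seed=0):
    """Generate image tokens one at a time in raster order and yield
    each row as soon as its last token has been sampled."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(height * width):
        tokens.append(sample_next_token(tokens, vocab_size, rng))
        if len(tokens) % width == 0:
            # The row is finished before any token of the next row exists,
            # so a client could already decode and show it.
            yield tokens[-width:]


if __name__ == "__main__":
    for i, row in enumerate(stream_rows(height=4, width=6)):
        print(f"row {i} ready: {row}")
```

A diffusion sampler, by contrast, refines every pixel at each step, so its partial outputs tend to look like a blurry whole image rather than a sharp top portion.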
No API yet, and given the slowness I imagine it will cost much more than the $0.03+/image of competitors.
As a user, images feel slightly slower but comparable to the previous generation. Given the significant quality improvement, it's a fair trade-off. Overall, it feels snappy, and the value justifies a higher price.
LLMs are autoregressive, so they can't be (multi-modality) integrated with diffusion image models, only with autoregressive image models (which generate an image via image tokens). Historically those had lower image fidelity than diffusion models. OpenAI now seems to have solved this problem somehow. More than that, they appear far ahead of any available diffusion model, including Midjourney and Imagen 3.
Gemini "integrates" Imagen 3 (a diffusion model) only via a tool that Gemini calls internally with the relevant prompt. So it's not a true multimodal integration, as it doesn't benefit from the advanced prompt understanding of the LLM.
Edit: Apparently Gemini also has an experimental native image generation ability.
Gemini added their multimodal Flash model to Google AI Studio some time ago. It does not use Imagen via a tool; it uses native capabilities to manipulate images, and it's free to try.
No, that indeed seems to be a native part of the multimodal Gemini model. I didn't know this existed; it's not available in the normal Gemini interface.
This is a pretty good example of the current state of Google LLMs:
The (no longer, I guess) industry-leading features people actually want are hidden away in some obscure “AI studio” with horrible usability, while the headline Gemini app still often refuses to do anything useful for me. (Disclaimer: I last checked a couple of months ago, after several more months of mild amusement/great frustration.)
That's pretty disappointing. It has been out for a while, and we still get top comments like this one (https://news.ycombinator.com/item?id=43475043) where people clearly think native image generation capability is new. Where do you usually get your updates from for this kind of thing?
Meta has experimented with a hybrid mode, where the LLM uses autoregressive mode for text, but within a set of delimiters will switch to diffusion mode to generate images. In principle it's the best of both worlds.
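A rough sketch of what such a delimiter-switched loop could look like, assuming hypothetical `<boi>`/`<eoi>` markers and stand-in callables for the text and diffusion parts (none of these names come from Meta's work): text is produced token by token, and when the begin-of-image marker appears, the whole image is produced in one call conditioned on everything so far, then appended to the context so later text can attend to it.

```python
from typing import Callable, List

# Hypothetical special tokens, chosen for illustration only.
BOI, EOI, EOS = "<boi>", "<eoi>", "<eos>"


def generate(next_token: Callable[[List[str]], str],
             denoise_image: Callable[[List[str]], str],
             prompt: List[str],
             max_steps: int = 64) -> List[str]:
    """Autoregressive text decoding that hands off to a diffusion-style
    call whenever the begin-of-image delimiter is emitted."""
    context = list(prompt)
    for _ in range(max_steps):
        token = next_token(context)
        if token == EOS:
            break
        context.append(token)
        if token == BOI:
            # Diffusion mode: the image is produced as a whole,
            # conditioned on the full context up to the delimiter.
            context.append(denoise_image(context))
            context.append(EOI)
    return context


def scripted(tokens: List[str]) -> Callable[[List[str]], str]:
    """Toy 'language model' that replays a fixed script, then stops."""
    it = iter(tokens)
    return lambda context: next(it, EOS)


if __name__ == "__main__":
    out = generate(
        next_token=scripted(["Here", "it", "is:", BOI, "Done."]),
        denoise_image=lambda ctx: f"[image conditioned on {len(ctx)} tokens]",
        prompt=["draw", "a", "cat"],
    )
    print(out)
```

In a real system the image would enter the context as patch embeddings or latents rather than a string; the string here only marks where it sits in the sequence.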
ByteDance has been working on autoregressive image generation for a while (see VAR, NeurIPS 2024 best paper). Traditionally they weren't in the open-source gang though.
The VAR paper is very impressive. I wonder if OpenAI did something similar. But the main contribution in the new GPT-4o feature doesn't seem to be just image quality (which VAR seems to focus on), but also massively enhanced prompt understanding.
That's overly pessimistic. Diffusion models take an input and produce an output. It's perfectly possible to auto-regressively analyze everything up to the image, use that context to produce a diffusion image, and incorporate the image into subsequent auto-regressive shenanigans. You'll preserve all the conditional probability factorizations the LLM needs while dropping a diffusion model in the middle.
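To spell out the factorization argument, here is one way to write it, with notation assumed purely for illustration: x_{1:n} is the text before the image, I is the image, and y_{1:m} is the text after it. The chain rule gives

```latex
p(x_{1:n},\, I,\, y_{1:m})
  = \underbrace{\prod_{i=1}^{n} p(x_i \mid x_{<i})}_{\text{autoregressive text}}
    \;\cdot\;
    \underbrace{p(I \mid x_{1:n})}_{\text{sampled by a diffusion model}}
    \;\cdot\;
    \underbrace{\prod_{j=1}^{m} p(y_j \mid x_{1:n},\, I,\, y_{<j})}_{\text{autoregressive text}}
```

Nothing in this requires the middle factor to be decomposed token by token; it only has to be something you can sample from given the preceding context, which a conditional diffusion model can do.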
If you look at the examples given, this is the first time I've felt like AI generated images have passed the uncanny valley.
The results are groundbreaking in my opinion. How much longer until an AI can generate 30 successive images together and make an ultra realistic movie?
i find this “slow” complaint (/observation— i dont view this comment as a complaint, to be clear) to be quite confusing. slow… compared to what, exactly? you know what is slow? having to prompt and reprompt 15 times to get the stupid model to spell a word correctly and it not only refuses, but is also insistent that it has corrected the error this time. and afaict this is the exact kind of issue this change should address substantially.
im not going to get super hyperbolic and histrionic about “entitlement” and stuff like that, but… literally this technology did not exist until like two years ago, and yet i hear this all the time. “oh this codegen is pretty accurate but it’s slow”, “oh this model is faster and cheaper (oh yeah by the way the results are bad, but hey it’s the cheapest so it’s better)”. like, are we collectively forgetting that the whole point of any of this is correctness and accuracy? am i off-base here?
the value to me of a demonstrably wrong chat completion is essentially zero, and the value of a correct one that anticipates things i hadn’t considered myself is nearly infinite. or, at least, worth much, much more than they are charging, and even _could_ reasonably charge. it’s like people collectively grouse about low quality ai-generated junk out of one side of their mouths, and then complain about how expensive the slop is out of the other side.
hand this tech to someone from 2020 and i guarantee you the last thing you’d hear is that it’s too slow. and how could it be? yeah, everyone should find the best deals / price-value frontier tradeoff for their use case, but, like… what? we are all collectively devaluing that which we lament is being devalued by ai by setting such low standards: ourselves. the crazy thing is that the quickly-generated slop is so bad as to be practically useless, and yet it serves as the basis of comparison for… anything at all. it feels like that “web-scale /dev/null” meme all over again, but for all of human cognition.
> it appears to be generating the image tokens and decoding them akin to the original DALL-E
The animation is a lie. The new 4o with "native" image generating capabilities is a multi-modal model that is connected to a diffusion model. It's not generating images one token at a time, it's calling out to a multi-stage diffusion model that has upscalers.
You can ask 4o about this yourself, it seems to have a strong understanding of how the process works.
There are many clues to indicate that the animation is a lie. For example, it clearly upscales the image using an external tool after the first image renders. As another example, if you ask the model about the tokens inside of its own context, it can't see any pixel tokens.
A model may not have many facts about itself, but it can definitely see what is inside of its own context, and what it sees is a call to an image generation tool.
Finally, and most convincingly, I can't find a single official source where OpenAI claims that the image is being generated pixel-by-pixel inside of the context window.
Sorry but I think you may be mistaken if your only source is ChatGPT. It's not aware of its own creation processes beyond what is included in its system prompt.