OpenAI's livestream of GPT-4o Image Generation shows that it is slowwwwwwwwww (m...

infecto · 2025-03-25T19:23:44 1742930624

As a user, images feel slightly slower but comparable to the previous generation. Given the significant quality improvement, it's a fair trade-off. Overall, it feels snappy, and the value justifies a higher price.

t0lo · 2025-03-26T01:25:25 1742952325

[flagged]

infecto · 2025-03-26T02:22:48 1742955768

I just gave quick feedback on the new release. How should I be writing it?

If anything, your feedback is of low value.

kevmo314 · 2025-03-25T18:46:16 1742928376

Maybe this is the dialup of the era.

ijidak · 2025-03-25T19:24:20 1742930660

Ha. That's a good analogy.

When I first read the parent comment, I thought, maybe this is a long-term architecture concern...

But your message reminded me that we've been here before.

asadm · 2025-03-25T21:08:19 1742936899

Especially with the slow loading effect it has.

cubefox · 2025-03-25T18:56:01 1742928961

LLMs are autoregressive, so they can't be (multi-modality) integrated with diffusion image models, only with autoregressive image models (which generate an image via image tokens). Historically those had lower image fidelity than diffusion models. OpenAI now seems to have solved this problem somehow. More than that, they appear far ahead of any available diffusion model, including Midjourney and Imagen 3.
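To make "generates an image via image tokens" concrete, here is a toy sketch of the autoregressive idea, not any vendor's actual method. The names `sample_next_token`, `generate_image_tokens`, and `decode_tokens` are made up for illustration, and a seeded random rule stands in for the model's next-token distribution.

```python
import random


def sample_next_token(prefix, vocab_size, rng):
    """Hypothetical stand-in for a model's next-token distribution.

    Half the time it repeats the previous token, so each token visibly
    depends on the tokens generated before it.
    """
    if prefix and rng.random() < 0.5:
        return prefix[-1]
    return rng.randrange(vocab_size)


def generate_image_tokens(height, width, vocab_size=16, seed=0):
    """Generate a height x width grid of image tokens, one token at a time."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(height * width):
        # Raster order: every new token is conditioned on the full prefix.
        tokens.append(sample_next_token(tokens, vocab_size, rng))
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def decode_tokens(grid, codebook):
    """Map each token to a pixel value through a (toy) codebook lookup."""
    return [[codebook[t] for t in row] for row in grid]


if __name__ == "__main__":
    grid = generate_image_tokens(4, 4)
    codebook = [i * 16 for i in range(16)]  # assumed grayscale codebook
    for row in decode_tokens(grid, codebook):
        print(row)
```

In a real system the codebook would come from a learned image tokenizer and the sampler would be the LLM itself; only the token-by-token loop is the point here.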

Gemini "integrates" Imagen 3 (a diffusion model) only via a tool that Gemini calls internally with the relevant prompt. So it's not a true multimodal integration, as it doesn't benefit from the advanced prompt understanding of the LLM.

Edit: Apparently Gemini also has an experimental native image generation ability.

SweetSoftPillow · 2025-03-25T19:26:20 1742930780

Gemini added their multimodal Flash model to Google AI Studio some time ago. It does not use Imagen via a tool; it uses native capabilities to manipulate images, and it's free to try.

summerlight · 2025-03-25T19:24:40 1742930680

Your understanding seems outdated. I think people are referring to Gemini's native image generation.

argsnd · 2025-03-25T19:10:51 1742929851

Is this the same for their gemini-2.0-flash-exp-image-generation model?

cubefox · 2025-03-25T20:32:17 1742934737

No, that does indeed seem to be a native part of the multimodal Gemini model. I didn't know this existed; it's not available in the normal Gemini interface.

lxgr · 2025-03-25T20:42:52 1742935372

This is a pretty good example of the current state of Google LLMs:

The (no longer, I guess) industry-leading features people actually want are hidden away in some obscure “AI studio” with horrible usability, while the headline Gemini app still often refuses to do anything useful for me. (Disclaimer: I last checked a couple of months ago, after several more months of mild amusement/great frustration.)

tough · 2025-03-25T20:55:23 1742936123

hey at least now they bought ai.dev and redirected it to their bad ux

vladf · 2025-03-25T23:20:38 1742944838

That's pretty disappointing, it has been out for a while, and we still get top comments like (https://news.ycombinator.com/item?id=43475043) where people clearly think native image generation capability is new. Where do you usually get your updates from for this kind of thing?

johntb86 · 2025-03-25T19:35:52 1742931352

Meta has experimented with a hybrid mode, where the LLM uses autoregressive mode for text, but within a set of delimiters will switch to diffusion mode to generate images. In principle it's the best of both worlds.
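A rough sketch of what such a delimiter-switched loop could look like, purely as an illustration of the idea described above. The delimiter strings, `next_text_token`, and `toy_diffusion` are all hypothetical stand-ins, not Meta's implementation: the "LLM" is a scripted token rule and the "diffusion model" is a few fixed refinement steps.

```python
BOI, EOI, EOS = "<image>", "</image>", "<eos>"  # hypothetical delimiter tokens

# Scripted stand-in for what a real LLM would sample token by token.
SCRIPT = ["Here", "is", "a", "cat", BOI, "Hope", "you", "like", "it", EOS]


def next_text_token(context):
    """Autoregressive mode: pick the next text token given the context so far."""
    n_text = sum(1 for item in context if isinstance(item, str) and item != EOI)
    return SCRIPT[n_text]


def toy_diffusion(context, steps=4):
    """Diffusion mode: refine a starting value over a fixed number of steps,
    conditioned on the context (here, only on its length)."""
    target = float(len(context))
    x = 0.0  # fixed starting "noise" to keep the example deterministic
    for _ in range(steps):
        x += (target - x) / 2
    return {"kind": "image", "value": x, "steps": steps}


def generate(max_items=32):
    context = []
    while len(context) < max_items:
        token = next_text_token(context)
        context.append(token)
        if token == EOS:
            break
        if token == BOI:
            # Switch modes: the image is produced as a whole, then placed
            # in the context so later text tokens can condition on it.
            context.append(toy_diffusion(context))
            context.append(EOI)
    return context


if __name__ == "__main__":
    print(generate())
```

The only design point being shown is the mode switch: text is produced one token at a time, while everything between the delimiters comes from a separate iterative sampler.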

echelon · 2025-03-25T19:23:36 1742930616

I expect the Chinese to have an open source answer for this soon.

They haven't been focusing attention on images because the most used image models have been open source. Now they might have a target to beat.

rfoo · 2025-03-25T20:29:34 1742934574

ByteDance has been working on autoregressive image generation for a while (see VAR, NeurIPS 2024 best paper). Traditionally they weren't in the open-source gang though.

cubefox · 2025-03-25T20:37:39 1742935059

The VAR paper is very impressive. I wonder if OpenAI did something similar. But the main contribution in the new GPT-4o feature doesn't seem to be just image quality (which VAR seems to focus on), but also massively enhanced prompt understanding.

hansvm · 2025-03-26T06:13:45 1742969625

> so they can't be integrated

That's overly pessimistic. Diffusion models take an input and produce an output. It's perfectly possible to auto-regressively analyze everything up to the image, use that context to produce a diffusion image, and incorporate the image into subsequent auto-regressive shenanigans. You'll preserve all the conditional probability factorizations the LLM needs while dropping a diffusion model in the middle.
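One way to write out the factorization being described, using notation that is assumed here and not taken from the comment: $x_{1:n}$ are the text tokens before the image, $I$ is the image produced by the diffusion model, and $y_{1:m}$ are the tokens after it.

```latex
p(x_{1:n}, I, y_{1:m})
  = \prod_{i=1}^{n} p(x_i \mid x_{<i})
    \;\cdot\; p_{\text{diff}}(I \mid x_{1:n})
    \;\cdot\; \prod_{j=1}^{m} p(y_j \mid x_{1:n}, I, y_{<j})
```

This is just the chain rule with the image treated as a single variable: the first and last products are ordinary next-token terms, and the middle term is whatever conditional distribution the diffusion sampler defines.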

aurareturn · 2025-03-26T08:14:48 1742976888

If you look at the examples given, this is the first time I've felt like AI-generated images have passed the uncanny valley.

The results are groundbreaking in my opinion. How much longer until an AI can generate 30 successive images together and make an ultra-realistic movie?

thehappypm · 2025-03-26T13:26:42 1742995602

One day you’ll just give it a script and get a movie out

brrrrrm · 2025-03-26T15:12:37 1743001957

a premise*

keeganpoppen · 2025-03-26T01:19:46 1742951986

i find this “slow” complaint (/observation— i dont view this comment as a complaint, to be clear) to be quite confusing. slow… compared to what, exactly? you know what is slow? having to prompt and reprompt 15 times to get the stupid model to spell a word correctly and it not only refuses, but is also insistent that it has corrected the error this time. and afaict this is the exact kind of issue this change should address substantially.

im not going to get super hyperbolic and histrionic about “entitlement” and stuff like that, but… literally this technology did not exist until like two years ago, and yet i hear this all the time. “oh this codegen is pretty accurate but it’s slow”, “oh this model is faster and cheaper (oh yeah by the way the results are bad, but hey it’s the cheapest so it’s better)”. like, are we collectively forgetting that the whole point of any of this is correctness and accuracy? am i off-base here?

the value to me of a demonstrably wrong chat completion is essentially zero, and the value of a correct one that anticipates things i hadn’t considered myself is nearly infinite. or, at least, worth much, much more than they are charging, and even _could_ reasonably charge. it’s like people collectively grouse about low quality ai-generated junk out of one side of their mouths, and then complain about how expensive the slop is out of the other side.

hand this tech to someone from 2020 and i guarantee you the last thing you’d hear is that it’s too slow. and how could it be? yeah, everyone should find the best deals / price-value frontier tradeoff for their use case, but, like… what? we are all collectively devaluing that which we lament is being devalued by ai by setting such low standards: ourselves. the crazy thing is that the quickly-generated slop is so bad as to be practically useless, and yet it serves as the basis of comparison for… anything at all. it feels like that “web-scale /dev/null” meme all over again, but for all of human cognition.

Taek · 2025-03-25T23:02:20 1742943740

> it appears to be generating the image tokens and decoding them akin to the original DALL-E

The animation is a lie. The new 4o with "native" image generating capabilities is a multi-modal model that is connected to a diffusion model. It's not generating images one token at a time, it's calling out to a multi-stage diffusion model that has upscalers.

You can ask 4o about this yourself, it seems to have a strong understanding of how the process works.

low_tech_love · 2025-03-26T05:55:06 1742968506

Would it seem otherwise if it was a lie?

Taek · 2025-03-26T12:38:17 1742992697

There are many clues to indicate that the animation is a lie. For example, it clearly upscales the image using an external tool after the first image renders. As another example, if you ask the model about the tokens inside of its own context, it can't see any pixel tokens.

A model may not have many facts about itself, but it can definitely see what is inside of its own context, and what it sees is a call to an image generation tool.

Finally, and most convincingly, I can't find a single official source where OpenAI claims that the image is being generated pixel-by-pixel inside of the context window.

throwaway314155 · 2025-03-26T08:57:23 1742979443

Sorry but I think you may be mistaken if your only source is ChatGPT. It's not aware of its own creation processes beyond what is included in its system prompt.

cchance · 2025-03-26T02:35:47 1742956547

i mean on free chat an image took maybe 2 seconds?