
AI images just got dangerously good (RIP diffusion??)
AI Generated Summary
Airdroplet AI v0.2

OpenAI just dropped a major update to its image generation capabilities within ChatGPT, powered by the new GPT-4o model. This feels like a significant leap forward, especially compared to their previous efforts with DALL-E, and potentially challenges other image generation tools like Midjourney, particularly in areas like text rendering and object accuracy. The most surprising thing is that it seems to be integrated directly into the core 4o language model, unlike previous diffusion-based methods.
Here's a breakdown of what's new and how it works:
- A huge upgrade from DALL-E: OpenAI's previous image model, DALL-E, wasn't great. The presenter showed an example image of a programmer with weird text and blurry elements, noting that smaller companies like Midjourney had been far ahead in image quality. The new 4o image generation seems vastly improved in realism and detail.
- Impressive initial performance: Trying the new model with the same prompt (JavaScript programmer), the output is much better. Skin textures are realistic, the mustache looks good, and the overall image is solid.
- Better Text Handling: This is a major point. Previous image AI often struggled with generating legible or correctly placed text. The new model can generate text correctly on things like t-shirts (even getting the JS logo right) and diagrams. While it didn't perfectly code the Fibonacci sequence on a laptop screen, the attempt was much better than expected. It also handled changing text in existing meme images, even intelligently preserving parts of the original image underneath the new text, which is fascinating.
- Reflection and Object Accuracy: A really surprising capability is the model's ability to handle reflections accurately, like a photographer's reflection in a mirror. It can also place images of people in front of surfaces like whiteboards and add text to them with decent results. The model is also much better at handling multiple objects and their relationships within an image: OpenAI claims it can bind 10-20 distinct concepts, where other systems start struggling around 5-8. An example showed it correctly placing 16 specific objects in a grid based on a list.
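That grid test is easy to reproduce yourself by building the prompt programmatically, so every object gets an explicit position instead of relying on the model to infer layout. A minimal sketch (the object list here is made up for illustration; any 16 items work):

```python
# Build a structured prompt that pins each object to a grid cell,
# mirroring the 16-object binding test mentioned above.
objects = [
    "red apple", "brass compass", "blue mug", "rubber duck",
    "pocket watch", "green die", "paper crane", "old key",
    "seashell", "pine cone", "toy robot", "glass marble",
    "wooden spoon", "safety pin", "chess knight", "yo-yo",
]

def grid_prompt(items, cols=4):
    """Turn a flat list into 'row N, column M: item' instructions."""
    lines = ["A photo of a 4x4 grid of objects on a white table:"]
    for i, item in enumerate(items):
        row, col = divmod(i, cols)
        lines.append(f"- row {row + 1}, column {col + 1}: {item}")
    return "\n".join(lines)

prompt = grid_prompt(objects)
```

Checking the output against this list (object by object, cell by cell) gives a concrete score for how many distinct concepts the model actually binds.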
- Different Technical Approach: This is a big reveal. Unlike traditional diffusion models (like DALL-E or Midjourney) which start with noise and refine it over multiple passes (like refining a blurry image), GPT-4o's image generation is described as an "autoregressive model natively embedded within ChatGPT." It uses the same core architecture as the language model, allowing it to leverage its knowledge base and context in generating images. This is potentially why it's much better at things like text and following precise instructions. The presenter found this technical shift fascinating and wished more details were available.
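The architectural contrast can be sketched in toy code: an autoregressive generator commits to image tokens one at a time in sequence order, while a diffusion generator holds the entire image from step zero and refines it globally on every pass. This is purely illustrative; OpenAI has not published the details of GPT-4o's image architecture.

```python
import numpy as np

def autoregressive_generate(n_tokens, vocab_size, seed=0):
    """Emit image tokens one at a time, each conditioned on the prefix."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(n_tokens):
        # A real model would run a transformer over `tokens` here;
        # we just sample logits to show the sequential structure.
        logits = rng.normal(size=vocab_size)
        tokens.append(int(np.argmax(logits)))
    return tokens

def diffusion_generate(shape, n_steps, seed=0):
    """Start from pure noise and refine the whole image every pass."""
    rng = np.random.default_rng(seed)
    image = rng.normal(size=shape)  # all pixels exist from step 0
    for _ in range(n_steps):
        # A real model predicts the noise to remove; we just damp it.
        image = image * 0.5
    return image
```

The sequential structure of the first function is one plausible reason an autoregressive model handles text and precise instruction-following better: each token is chosen with the full prompt and all prior tokens in context.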
- Speed and Process: The generation process wasn't lightning fast during the demo, taking a noticeable amount of time. The image also renders top-to-bottom, unlike the whole-image refinement of typical diffusion models; the presenter found this striking, though it is consistent with the autoregressive approach described above.
- Character Consistency: The model seems pretty good at maintaining character consistency across multiple prompts, though it wasn't perfect (like changing a character's hair color in one test). This is useful for creating visual storyboards or consistent elements for projects.
- Safety and Limitations: OpenAI is incorporating safety measures, including C2PA metadata to identify images from GPT-4o. They also have an internal tool to check if an image originated from their model (though this isn't public, which makes sense). They have heightened restrictions on generating images of real people, particularly for potentially harmful content, but seem more flexible than previous models about simply including real figures in non-offensive contexts (like Donald Trump eating ice cream).
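Because C2PA manifests are embedded in the image file itself, their presence (though not their validity) can be spotted locally. A crude sketch, assuming only that C2PA stores its manifest in JUMBF boxes labeled with the string "c2pa"; real verification requires a proper C2PA library and signature checks:

```python
def looks_c2pa_tagged(data: bytes) -> bool:
    """Crude heuristic: does the raw file contain a C2PA label?

    C2PA provenance data lives in JUMBF boxes whose label includes
    the string 'c2pa'. This does NOT verify the manifest signature,
    and stripped or re-encoded images will lose the marker entirely.
    """
    return b"c2pa" in data

# Usage: looks_c2pa_tagged(open("image.png", "rb").read())
```

This is a first-pass filter at best; OpenAI's internal tool presumably does full cryptographic verification, which is why metadata-stripping does not defeat it as easily.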
- Useful Features: Key practical additions include the ability to specify aspect ratios, use hex codes for colors, and generate images with transparent backgrounds. The transparent background feature worked really well in the demo, producing clean edges, which is something many other tools lack. Upscaling didn't seem to work correctly during the test, generating the same low resolution as the original.
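Those controls are the kind of thing you would embed in the prompt or pass as request parameters. A small sketch with a helper to sanity-check hex colors before using them; the payload field names are assumptions modeled loosely on image-generation APIs, not confirmed parameters for GPT-4o:

```python
import re

def is_hex_color(s: str) -> bool:
    """Validate a CSS-style hex color like '#1E90FF' before embedding it."""
    return re.fullmatch(r"#[0-9A-Fa-f]{6}", s) is not None

brand_color = "#1E90FF"
assert is_hex_color(brand_color)

# Hypothetical request payload; field names are illustrative only.
request = {
    "prompt": f"A flat logo using {brand_color} on a transparent background",
    "size": "1536x1024",          # landscape aspect ratio (assumed format)
    "background": "transparent",  # assumed flag for alpha-channel output
}
```

Validating the hex code up front matters because a malformed color in the prompt fails silently: the model just picks a color for you.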
- Integration with Sora: The new 4o image generation capabilities are also integrated into Sora, OpenAI's video model. This means you can use 4o to generate a better starting image to then turn into a video. Sora can also generate multiple images at once, a feature missing from ChatGPT's one-at-a-time generation.
- Implications and Takeaways: The improved text generation is a major win, making these tools much more practical for tasks where text is needed (diagrams, memes, etc.), potentially reducing the need for external editing software. The ability to handle complex prompts and multiple objects opens up new possibilities for using AI images beyond just decorative art. However, the presenter also expressed concern about the potential for spreading misinformation given the increased realism and ease of use.
Overall, this feels like a significant step for OpenAI in the image generation space, catching up to and potentially surpassing competitors like Midjourney in specific capabilities, driven by a potentially different technical approach. The precision, text handling, and integrated nature within 4o make it a powerful tool with exciting, and slightly scary, implications.