Thumbnail for OpenAI's "supermassive black hole" AI model (4.1)

OpenAI's "supermassive black hole" AI model (4.1)

Channel: Wes Roth · Published: April 14th, 2025 · AI Score: 100

AI Generated Summary

Airdroplet AI v0.2

OpenAI has just dropped a new family of models, headlined by GPT-4.1, and we're finally getting the scoop on what that mysterious "Quasar" model was all along! It turns out Quasar, a name referencing supermassive black holes (a nod many, including the presenter, initially missed), was likely a sneak peek at the new, more efficient GPT-4.1 mini. This release focuses on improved coding, better instruction following, and a massive 1 million token context window, all while tweaking speeds and costs.

Here’s the breakdown of what’s new and what it means:

OpenAI's New Model Lineup: GPT-4.1, Mini, and Nano

  • OpenAI has rolled out three new models: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.
  • This release might not be the most mind-blowing one ever, but it sets the stage for even more powerful models coming soon.
  • The naming can be a bit confusing! If you're wondering why GPT-4.1 came out after a "GPT-4.5" was seen in previews, the presenter jokingly suggests it's because 4.10 (four-point-one-zero) is obviously greater than 4.5.

The Mystery of "Quasar" Solved

  • Remember "Quasar," that stealth model that popped up in the LMSYS Chatbot Arena? Sam Altman had hinted it was an OpenAI creation.
  • A Quasar, in astronomy, is an incredibly energetic galactic nucleus powered by a supermassive black hole – hence the OpenAI live stream title about "supermassive black holes," which the presenter admits went over their head at first!
  • It seems Quasar was likely a GPT-4.1 mini that OpenAI let slip during their live stream.
  • Good news: Quasar (or its equivalent) is now live and you can access it through the API or test it out in the OpenAI Playground.

Performance Boosts and Benchmarks

  • Supercharged Coding:
    • GPT-4.1 scores an impressive 54.6% on SWE-bench Verified (a software engineering benchmark), a 21.4-percentage-point absolute improvement over GPT-4o. This makes it a top contender for coding tasks.
    • While it's better than previous OpenAI models, it's important to note that Google's Gemini 2.5 Pro (a reasoning model) still leads this benchmark with 63.8%. GPT-4.1 is a non-reasoning model.
    • Users in the LMSYS arena have also reported that it's excellent for coding.
  • Massive Context Window:
    • All the new models boast a 1 million token context window, which is huge! This matches what Gemini 2.5 Pro offers.
    • This larger context is a big deal, especially for complex tasks like coding, as it allows the model to "remember" and process much more information. The presenter feels this makes a noticeable difference in the model's power.
    • It can even handle multimodal inputs, like videos, within this long context.
  • Better at Following Instructions:
    • GPT-4.1 is more reliable when it comes to following specific instructions.
    • This ties in nicely with OpenAI's newly released prompting guide, which the presenter thinks many people might have missed. This guide offers tips to get the best out of these models.
  • Needle in the Haystack (with a grain of salt):
    • The new models achieve 100% accuracy on the "needle in the haystack" benchmark (finding a specific piece of info in a large text).
    • However, there's criticism that this isn't a very realistic test. Better tests focus on maintaining context and understanding across large documents.
  • OpenAI's New Benchmark: MRCR (Multi-Round Coreference):
    • To address the limitations of "needle in the haystack," OpenAI developed and open-sourced its own benchmark called MRCR.
    • It tests how well a model can track and recall specific instances from multiple, similar requests (e.g., "write a poem about tapirs" repeated, then asking for "the third poem about tapirs").
    • GPT-4.1 performs much better on this benchmark than GPT-4o, especially with larger context windows.
    • The 4.1 model generally outperforms the 4.1 mini and nano on MRCR, though the mini shows some strength at extreme context lengths, while nano's performance drops off.
    • (And yes, a tapir is a large, herbivorous mammal, a fun fact shared during the discussion!)
  • Speed vs. Smarts (Latency & MMLU):
    • GPT-4.1 and GPT-4o have similar speeds (latency).
    • GPT-4.1 mini and GPT-4o mini also have similar speeds.
    • The tiny GPT-4.1 nano is much faster but not as "intelligent" (based on Multilingual MMLU scores). It's likely designed for edge devices where speed is critical.
    • There's a significant intelligence jump from GPT-4o mini to GPT-4.1 mini, and a decent one from GPT-4o to GPT-4.1.
  • Aider Polyglot Benchmark:
    • On this multilingual coding benchmark, GPT-4.1 outperforms GPT-4o.
    • It's not quite as good as OpenAI's o1 (high) and o3-mini (high) models but scores better than GPT-4.5 (this refers to a preview model, not a generally available one).
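The MRCR setup described above is easy to picture in code. The sketch below illustrates the shape of a multi-round coreference probe (many near-identical requests, then a question about one specific instance); it is this summary's own illustration, not OpenAI's released benchmark harness.

```python
# Sketch of an MRCR-style (multi-round coreference) probe. The model must
# recall one specific instance among many near-duplicate requests.
# Illustrative only -- not OpenAI's open-sourced harness.

def build_mrcr_conversation(topic: str, n_rounds: int, target_index: int) -> list:
    """Build a chat transcript of n_rounds near-identical requests, then
    ask the model to reproduce the target_index-th response verbatim."""
    messages = []
    for i in range(1, n_rounds + 1):
        messages.append({"role": "user", "content": f"Write a poem about {topic}."})
        # In a real run these assistant turns would be actual model outputs;
        # placeholders keep the sketch self-contained.
        messages.append({"role": "assistant", "content": f"(poem #{i} about {topic})"})
    ordinal = {1: "first", 2: "second", 3: "third"}.get(target_index, f"{target_index}th")
    messages.append({
        "role": "user",
        "content": f"Repeat the {ordinal} poem about {topic}, word for word.",
    })
    return messages

convo = build_mrcr_conversation("tapirs", n_rounds=8, target_index=3)
print(len(convo))               # 8 user/assistant pairs + 1 final question = 17
print(convo[-1]["content"])
```

Scoring then checks whether the model's answer matches the third poem exactly, which is what makes the test harder than a plain needle-in-the-haystack lookup: every "needle" looks almost identical.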

How GPT-4.1 Stacks Up

  • GPT-4.1 vs. GPT-4o:
    • GPT-4.1 is considered smarter than GPT-4o.
    • Both are non-reasoning models (they don't "think through" problems step-by-step in the same way reasoning models do).
    • Both handle text and image input, with text output.
    • They operate at roughly the same speed.
    • Crucially, GPT-4.1 is about 20% cheaper per input and output token than GPT-4o.
  • GPT-4.1 vs. o3-mini (Reasoning Model):
    • o3-mini is a reasoning model.
    • Interestingly, the non-reasoning GPT-4.1 is now rated as smart as o3-mini.
    • GPT-4.1 maintains good speed (comparable to o3-mini) and adds image input capabilities, which o3-mini lacks (it's text-only).
    • The takeaway is that OpenAI is making its non-reasoning models nearly as capable as its reasoning ones, which is a significant step.
    • GPT-4.1 is, however, more expensive than o3-mini.
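To make the "about 20% cheaper" claim concrete, here is a small cost-comparison sketch. The per-million-token prices are assumptions based on figures circulating at launch (GPT-4o: $2.50 in / $10.00 out; GPT-4.1: $2.00 in / $8.00 out); check OpenAI's pricing page for current numbers.

```python
# Back-of-the-envelope API cost comparison. Prices are assumed launch-era
# figures (USD per 1M tokens) -- verify against OpenAI's pricing page.

PRICES = {  # model: (input price, output price) per 1M tokens
    "gpt-4o":  (2.50, 10.00),
    "gpt-4.1": (2.00, 8.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the assumed prices."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Example: a 100k-token prompt (long-context use case) with a 2k-token answer.
old = request_cost("gpt-4o", 100_000, 2_000)   # $0.270
new = request_cost("gpt-4.1", 100_000, 2_000)  # $0.216
print(f"gpt-4o: ${old:.3f}  gpt-4.1: ${new:.3f}  saving: {1 - new/old:.0%}")
```

At these assumed prices the saving works out to 20% on both input and output tokens, matching the summary's figure.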

Access, Pricing, and Freebies

  • API Access: You can use GPT-4.1, 4.1 mini, and 4.1 nano via the OpenAI API.
  • OpenAI Playground: A great place to test them out without writing code.
  • Windsurf IDE - Free Trial:
    • The Windsurf IDE (an integrated development environment for coding) is offering free, unlimited access to GPT-4.1 from April 14th to April 21st.
    • This isn't a sponsored mention; one of Windsurf's founders announced it during an OpenAI live stream. The presenter plans to try it.
  • Get Free OpenAI Tokens:
    • Developers can earn free daily tokens by sharing feedback, prompts, completions, and traces with OpenAI. This offer runs through April 30th.
    • You can get up to 1 million tokens per day for models like GPT-4.1 and GPT-4o.
    • You can get up to 10 million tokens per day for mini and nano models.
    • Look for a pop-up on your OpenAI API key generation page to opt-in.
  • Cost Reduction: As mentioned, GPT-4.1 is about 20% cheaper to use than GPT-4o.
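For developers, trying the models via the API looks roughly like the sketch below. The model ids (`gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`) follow the announcement, and the SDK call shape assumes the current `openai` Python package; confirm both against the API docs. The request is built separately from the network call so the sketch runs even without a key.

```python
# Minimal sketch of calling the new models through the OpenAI Python SDK.
# Model ids follow the launch announcement -- verify against the API docs.
import os

def build_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completions payload. Kept separate from the network
    call so it can be inspected or tested offline."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("gpt-4.1-mini", "Summarize SWE-bench Verified in one sentence.")
print(req["model"])

# Only attempt the real call if a key is configured.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires the `openai` package
    client = OpenAI()
    resp = client.chat.completions.create(**req)
    print(resp.choices[0].message.content)
```

The Playground offers the same models with no code at all, which is the easier route for a quick feel test.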

Don't Miss the Prompting Guide!

  • OpenAI quietly released a new prompting guide, primarily for GPT-4.1 but useful for other models too.
  • The presenter emphasizes that small tweaks in how you prompt can lead to big improvements (e.g., a 20% boost in task performance).
  • This guide is key to unlocking the improved instruction-following capabilities of GPT-4.1.
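In the spirit of the guide's general advice (explicit instructions, clear rules, a stated output format), here is one way to structure a prompt programmatically. The section layout below is this sketch's own convention, not quoted from OpenAI's document.

```python
# Illustrative structured-prompt builder. The Task/Rules/Output-format
# sections are this sketch's convention, not OpenAI's exact template.

def build_prompt(task: str, rules: list, output_format: str) -> str:
    """Compose a prompt with explicit instructions and an explicit
    output format, so an instruction-following model has less to guess."""
    rule_lines = "\n".join(f"- {r}" for r in rules)
    return (
        f"# Task\n{task}\n\n"
        f"# Rules\n{rule_lines}\n\n"
        f"# Output format\n{output_format}\n"
    )

prompt = build_prompt(
    task="Extract every function name from the code snippet.",
    rules=[
        "Only report names that are defined, not merely called.",
        "If none are found, answer exactly: NONE",
    ],
    output_format="A JSON array of strings.",
)
print(prompt)
```

The point is less the specific sections than the habit: models that follow instructions more literally reward prompts that leave nothing implicit.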

What's Next? The Really Big Guns

  • Rumors suggest upcoming PhD-level researcher models (potentially the o3 and o4-mini series) that might cost around $20,000 a month!
  • Early previews of these advanced models are reportedly already being used for scientific discovery and generating novel ideas. They are "coming up very soon."
  • The presenter is particularly looking forward to the full o3 and o4 family of models, which are expected to be the real "big hitters."
  • According to The Information (a news outlet), Google's Gemini 2.5 Pro (released March 2025) is currently number one in user testing, but OpenAI's o3/o4 family is on the horizon.

AI for Science: A New Era

  • These advanced AI models, especially the upcoming ones, are showing incredible potential to accelerate scientific research.
  • Users of even slightly older advanced models (like o3-mini high) reported significant boosts in their ability to conduct experiments and speed up their work.
  • The next generation of models, like the full o3/o4 series, could revolutionize how new materials are discovered and scientific breakthroughs are made.

The presenter plans to put these new GPT-4.1 models through a "gauntlet of tests" soon, so stay tuned for more hands-on reviews!