
ARC AGI 2: The $1,000,000 AGI Prize
AI Generated Summary
Airdroplet AI v0.2

Okay, let's dive into what's happening with the new ARC AGI 2 prize and benchmark!
Here's a quick rundown: the ARC AGI benchmark is back with version 2, featuring a completely new set of problems designed to expose the limitations of even the best current AI models, particularly their inability to reason efficiently or pick up new skills on the fly the way humans do; humans find these tasks easy. There's a big prize pool, including a $700,000 grand prize, for anyone who can achieve 85% accuracy efficiently, pushing AI development towards true, human-like intelligence rather than just scaling up compute.
Here are the key points from the video:
- The ARC AGI benchmark has been updated to ARC AGI 2 with a new set of questions that are hard for AI but easy for humans.
- The new test is designed to be "unsaturated," meaning current AI models score very low, highlighting that we are still far from AGI on this specific measure.
- A major change is that the new test is designed to resist brute-force "test-time compute": you can no longer just throw more money or processing power at a problem to get a higher score.
- In the previous version, increasing the cost of running a model per task often led to higher scores, but that won't work for the grand prize in ARC AGI 2.
- The grand prize requires achieving 85% accuracy on the new questions at a cost of roughly $0.42 per task (see the first code sketch after this list).
- Current base LLMs score zero on the new test, while existing reasoning systems score less than 4%. That feels pretty low compared to human capability.
- Intriguingly, every single ARC AGI 2 task was solved quickly and easily by at least two humans tested live, suggesting the tasks isolate something specific that humans are good at and AI isn't.
- The goal of ARC AGI 2 is not to showcase AI's superhuman abilities but to identify what's fundamentally missing for AGI, focusing on the efficient acquisition of new skills.
- This is not about data memorization or just pattern recognition; it's about understanding and applying rules in novel situations.
- The test challenges capabilities like symbolic interpretation (understanding that shapes represent concepts beyond their appearance), compositional reasoning (applying multiple rules that interact simultaneously), and contextual rule application (changing how rules are applied based on the specific situation).
- A demonstration of solving a few puzzles from the test showed that, while they look complex at first, humans can work out the underlying patterns and rules through reasoning, even if they make mistakes along the way. It feels like solving logic puzzles.
- Efficiency is now a tracked metric on the leaderboard alongside accuracy, meaning performance is no longer reported as a single score but also considers the cost per task.
- This change aims to incentivize finding more computationally efficient methods for achieving intelligence. It raises the question of whether limiting compute is a "fair" way to measure intelligence, but it clearly pushes the field towards efficiency.
- The scoring structure has been adjusted to incentivize conceptual breakthroughs more than just climbing the leaderboard with incremental improvements.
- There's a $75,000 prize for the most significant conceptual contribution, $50,000 for the highest score, and a $700,000 grand prize.
- The current leaderboard shows humans at 100% (since every included task was solved by at least two people), while the best AI models score very low (e.g., o3 (low) at 4% but costing around $200/task, the ARChitects at 2.5% at far lower cost, and DeepSeek R1 at 1.3% for just $0.08/task).
- The best models are still far below the 85% accuracy needed for the grand prize, showing lots of room for improvement.
- There are betting markets on whether the previous ARC AGI grand prize (on the 2024 benchmark) will be claimed by the end of 2025 (27% chance) and whether anyone will score 70%+ on the new ARC AGI 2 within three months (8% chance). These odds suggest bettors are skeptical of rapid progress on this specific benchmark.
- New approaches are being explored, such as one project (by Isaac Liao) attempting to solve 20% of the evaluation set without any pre-training or datasets, using only inference-time gradient descent on the puzzle itself (see the second code sketch after this list). This highlights that researchers are trying fundamentally different ways to tackle these problems.
- A key requirement for the ARC prize submissions is that they must be open source. This is great because any successful approaches or discoveries will be shared with the community, accelerating collective progress towards AGI.
- You're encouraged to try the daily puzzles yourself on the ARC Prize website (arcprize.org) to get a feel for the types of problems and see how you stack up against current AI (spoiler: you're likely smarter at these specific tasks!).
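
To make the grand-prize bar concrete, here's a minimal sketch of the two-axis check described above: accuracy and cost per task are judged together, not accuracy alone. The function name, thresholds, and figures below are illustrative assumptions based on the video, not official contest code; the authoritative rules live at arcprize.org.

```python
# Hypothetical helper illustrating the grand-prize bar described in the video:
# at least 85% accuracy at roughly $0.42 per task. Names and thresholds are
# illustrative, not official contest code.

GRAND_PRIZE_ACCURACY = 0.85
GRAND_PRIZE_COST_PER_TASK = 0.42  # USD, approximate figure from the video

def meets_grand_prize_bar(num_correct: int, num_tasks: int, total_cost_usd: float) -> bool:
    """Check both axes: a submission must be accurate AND cheap per task."""
    accuracy = num_correct / num_tasks
    cost_per_task = total_cost_usd / num_tasks
    return accuracy >= GRAND_PRIZE_ACCURACY and cost_per_task <= GRAND_PRIZE_COST_PER_TASK

# Using the leaderboard figures above: 4% accuracy at ~$200/task fails on both axes.
print(meets_grand_prize_bar(num_correct=4, num_tasks=100, total_cost_usd=20_000))  # False
```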
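
And here's a loose sketch of the "no pre-training, inference-time gradient descent" idea mentioned above. To be clear, this is not Isaac Liao's actual method, just an illustration of the general recipe under simplifying assumptions: fit a tiny, randomly initialized model to a single puzzle's demonstration pairs at test time, then apply it to the test input. It assumes same-shaped input/output grids and the 10 ARC colors; every name and hyperparameter is hypothetical.

```python
import torch
import torch.nn.functional as F

NUM_COLORS = 10  # ARC grids use colors 0-9

def solve_single_task(train_pairs, test_input, steps=500, lr=0.05):
    """Fit a tiny model to ONE puzzle's demo pairs at test time (no pre-training)."""
    # One conv layer mapping one-hot color planes to per-cell color logits.
    model = torch.nn.Conv2d(NUM_COLORS, NUM_COLORS, kernel_size=3, padding=1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    def encode(grid):
        # int grid (H, W) -> one-hot float tensor (1, NUM_COLORS, H, W)
        t = torch.tensor(grid, dtype=torch.long)
        return F.one_hot(t, NUM_COLORS).permute(2, 0, 1).float().unsqueeze(0)

    for _ in range(steps):  # gradient descent on this one puzzle alone
        opt.zero_grad()
        loss = sum(
            F.cross_entropy(model(encode(x)),
                            torch.tensor(y, dtype=torch.long).unsqueeze(0))
            for x, y in train_pairs
        )
        loss.backward()
        opt.step()

    with torch.no_grad():  # predict the test output grid
        return model(encode(test_input)).argmax(dim=1).squeeze(0).tolist()

# Toy usage: two demonstrations of a "recolor 1 -> 2" rule, then a new input.
pairs = [([[1, 0], [0, 1]], [[2, 0], [0, 2]]),
         ([[0, 1], [1, 0]], [[0, 2], [2, 0]])]
print(solve_single_task(pairs, [[1, 1], [0, 0]]))
```

Real ARC tasks break most of these assumptions (grids change shape, rules compose and interact), which is exactly why this direction is hard and interesting.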
Overall, the ARC AGI 2 benchmark is a fascinating attempt to define and measure a specific type of flexible, efficient intelligence that current AI models lack, pushing the field beyond brute-force scaling towards genuine reasoning and skill acquisition.