Apple: “AI Can’t Think” — Are They Wrong?

Channel: Matthew Berman · Published: June 9th, 2025 · AI Score: 98

AI Generated Summary

Airdroplet AI v0.2

Apple has stirred up quite a buzz with its recent paper, "The Illusion of Thinking," which claims that today's powerful large language models (LLMs) aren't truly "thinking" and aren't much better than simpler models. The paper suggests that these advanced models, known as Large Reasoning Models (LRMs), such as OpenAI's o3/o4 and the thinking versions of Claude 3.7/4, may be "cheating" by having been trained on existing benchmarks (data contamination) and may lack true generalization ability. The timing is notable given that Apple is widely perceived as lagging in the AI race. This video dives into Apple's arguments, critiques its methodology, and ultimately poses a compelling counter-argument based on the models' ability to generate code.

Here's a breakdown of the key points and insights:

  • Apple's Core Claim: The Illusion of Thinking:

    • Apple asserts we're vastly overestimating the capabilities of LRMs, which are designed for reasoning tasks and churn through tokens during a "thinking" process.
    • It questions if these models possess generalizable reasoning or if they're just leveraging different forms of pattern matching.
    • The paper also asks how their performance scales with increasing problem complexity and how they compare to non-thinking LLMs with the same computational budget.
  • Critique of Current Benchmarks:

    • Apple argues that existing evaluations, primarily focused on mathematical and coding benchmarks, suffer from "data contamination" – meaning the models were likely trained on the very data they're being tested on. This is effectively "cheating."
    • These benchmarks also fail to provide insights into the structure and quality of the models' reasoning processes, or "thinking traces."
    • The feeling is that models are essentially doing well on these tests because they've seen them before, not because they're truly intelligent.
  • Apple's Proposed New Benchmarks: Puzzles:

    • Instead of traditional math problems, Apple proposes using "controllable puzzle environments": Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.
    • These puzzles allow for systematic variation of complexity by adding more elements (e.g., more disks in Tower of Hanoi), which increases the number of moves needed to solve them.
    • The advantage of puzzles is that they are harder to "memorize" and offer fine-grained control over complexity, emphasizing algorithmic reasoning and allowing precise, mechanical evaluation (see the move-checker sketch after this list). In that respect they resemble the ARC-AGI benchmark, which many models struggle with.
  • Comparing Thinking vs. Non-Thinking Models:

    • A fascinating insight from the paper is that if non-thinking LLMs are given an equivalent "inference token budget" (the same total number of tokens for computation), they can often match the performance of thinking models on benchmarks like MATH-500 and AIME 2024.
    • Non-thinking models achieve this with a pass@k approach: they generate many candidate solutions and are credited if any of them is correct, rather than "thinking" through a single chain of thought (a sketch of the pass@k estimator appears after this list).
    • However, the performance gap between thinking and non-thinking models widens on newer benchmarks like AIME 2025. This presents an interpretive challenge: does it mean thinking models are genuinely better at complex problems, or is AIME 2025 simply less contaminated? Interestingly, human performance on AIME 2025 was higher than on AIME 2024, suggesting AIME 2025 might actually be less difficult, which lends credence to the contamination theory for the older benchmark.
  • Puzzle Experiment Results:

    • Tests using Claude 3.7 and DeepSeek R1 (models that expose their chain of thought) showed consistent patterns across all four puzzles.
    • Low Complexity: Both thinking and non-thinking models performed equally well.
    • Medium Complexity: Thinking models performed significantly better.
    • High Complexity: Both models failed, with accuracy dropping to zero. This suggests a "scaling limit" where reasoning capabilities collapse beyond a certain threshold.
    • Another interesting finding was how models use tokens as complexity increases. Initially, they use more tokens, but beyond a critical point, token usage actually decreases as performance sharply drops – almost as if they're "giving up."
  • Overthinking and Incorrect Paths:

    • For simpler problems, thinking models often find the correct solution early in their thought process but then "overthink" by continuing to explore incorrect solutions, which wastes computation.
    • As problems become moderately more complex, this trend reverses; models first explore incorrect solutions and only later arrive at the correct ones. That is closer to what we'd expect from a genuine search process, but it still highlights how inefficient the models' reasoning can be.
  • The Algorithm Problem:

    • Perhaps the most surprising finding: even when the models were explicitly provided with the algorithm to solve a puzzle (meaning they only needed to execute the steps, not devise them), performance did not improve, and the "collapse" still occurred at roughly the same complexity. This seriously calls into question the models' ability to follow logical steps and perform exact computation.
  • A Counterpoint from Ilya Sutskever:

    • The video contrasts Apple's skepticism with Ilya Sutskever's (OpenAI co-founder) optimistic view that AI will achieve Artificial General Intelligence (AGI). Sutskever argues that since the brain is a biological computer, a digital computer should be sufficient to simulate and surpass human intelligence. This directly challenges Apple's assertion that current models aren't truly "thinking."
  • The Video's Main Critique and Code Generation:

    • The video's biggest issue with Apple's paper is its narrow focus. Puzzles are just one "slice" of intelligence; current models excel at many other tasks like image generation, video creation, and most importantly, code generation.
    • Crucially, the Apple paper did not test the models' ability to write code to solve these puzzles. The closest they came was providing a pseudo-algorithm, but the models still had to solve it in natural language.
    • To demonstrate this oversight, the presenter ran an experiment with Claude 3.7, asking it to write HTML/JavaScript code to simulate and solve the Tower of Hanoi. Claude generated functional code that solved the puzzle for 10 and even 20 disks (over a million moves); an illustrative reconstruction of such a solver appears after this list.
    • This raises a critical question: If an AI can write code that correctly applies the mathematical formula to solve a complex puzzle, isn't that a form of problem-solving and "thinking"? This capability seems to bypass the limitations Apple highlighted in natural language reasoning.
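
The "precise evaluation" advantage mentioned above is easy to make concrete: a puzzle answer is just a move list, and a tiny checker can grade it exactly. Below is a minimal sketch for Tower of Hanoi; the function name and peg conventions are illustrative assumptions, not taken from Apple's paper.

```typescript
// Sketch: grading a Tower of Hanoi answer mechanically. Assumes pegs "A", "B",
// "C", disks numbered 1 (smallest) to n (largest), all starting on peg A and
// required to end on peg C. A model's answer is a list of [from, to] moves.
function isValidHanoiSolution(n: number, moves: [string, string][]): boolean {
  const pegs: Record<string, number[]> = { A: [], B: [], C: [] };
  for (let d = n; d >= 1; d--) pegs.A.push(d); // largest disk at the bottom
  for (const [from, to] of moves) {
    const source = pegs[from];
    const target = pegs[to];
    if (!source || !target) return false;               // unknown peg label
    const disk = source.pop();
    if (disk === undefined) return false;                // moved from an empty peg
    const top = target[target.length - 1];
    if (top !== undefined && top < disk) return false;   // larger disk on a smaller one
    target.push(disk);
  }
  return pegs.C.length === n; // solved only if every disk ends up on peg C
}
```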
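
On the pass@k point: the non-thinking model gets k independent attempts and is credited if any one of them is correct. A common way to estimate this from n samples per problem is the unbiased estimator used in code-generation benchmarks, sketched below; the sample counts in the usage example are invented purely for illustration.

```typescript
// Sketch: unbiased pass@k estimator, assuming n answers were sampled per
// problem and c of them turned out to be correct.
// pass@k = 1 - C(n - c, k) / C(n, k)
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k subset must contain a correct sample
  let prob = 1.0;            // accumulates C(n - c, k) / C(n, k)
  for (let i = n - c + 1; i <= n; i++) {
    prob *= 1 - k / i;
  }
  return 1 - prob;
}

// Hypothetical example: 64 samples per problem, 8 of them correct.
console.log(passAtK(64, 8, 16).toFixed(3)); // ≈ 0.915 (many attempts help a lot)
console.log(passAtK(64, 8, 1).toFixed(3));  // ≈ 0.125 (a single attempt rarely succeeds)
```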
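
Finally, on the code-generation counterpoint: the presenter had Claude write HTML/JavaScript to simulate and solve Tower of Hanoi. The snippet below is an illustrative TypeScript reconstruction of what such a solver looks like, not the code generated in the video. It also shows why 20 disks means "over a million moves": the optimal solution always takes 2^n - 1 of them.

```typescript
// Sketch: plain recursive Tower of Hanoi solver that returns the full move list.
function solveHanoi(
  n: number,
  from: string,
  to: string,
  via: string,
  moves: [string, string][] = []
): [string, string][] {
  if (n === 0) return moves;
  solveHanoi(n - 1, from, via, to, moves); // move the n-1 smaller disks out of the way
  moves.push([from, to]);                  // move the largest remaining disk
  solveHanoi(n - 1, via, to, from, moves); // stack the smaller disks back on top
  return moves;
}

console.log(solveHanoi(10, "A", "C", "B").length); // 1023 (= 2^10 - 1)
console.log(solveHanoi(20, "A", "C", "B").length); // 1048575 (= 2^20 - 1)
```

If an LRM can produce a program like this on demand, the question shifts from whether it can execute a million moves in natural language to whether writing and applying the algorithm counts as "thinking", which is exactly the video's closing argument.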

In summary, while Apple's paper provides valuable insights into the limitations of current LRMs in pure algorithmic reasoning and highlights potential issues with benchmark contamination, it might be overlooking the multifaceted nature of AI intelligence. The ability of these models to generate code to solve complex problems, a capability not tested by Apple, presents a powerful counter-argument to the idea that they are merely "illusions of thinking." We need to rethink how we evaluate AI to truly understand its evolving capabilities.