
Claude 4 is not what you think...
Channel: Matthew Berman
Published: May 22nd, 2025
AI Score: 98
AI Generated Summary
Airdroplet AI v0.2
Here's a summary of the video about Anthropic's Claude 4 models:
Anthropic has launched Claude 4, available in two versions: Opus and Sonnet. The big news is that Anthropic seems to be making a strategic pivot, stepping back from the general chatbot competition dominated by players like OpenAI and Google. Instead, they are focusing heavily on becoming a leader in building sophisticated AI agents, particularly for complex tasks and coding, introducing new features and API capabilities to support this direction.
Here are the key topics and details discussed:
- Anthropic has released Claude 4, available in two sizes: Opus and Sonnet.
- Anthropic bills Opus as the "world's best coding model," which strongly hints at the company's new focus.
- A major emphasis is placed on the models' ability to handle "long horizon tasks," meaning complex jobs that can take tens of minutes or even hours to complete without losing context or failing.
- Both Claude 4 models are "hybrid models," offering the option of near-instant responses or an "extended thinking" mode for more complex problems.
- The extended thinking mode incorporates tool use, which is considered a standard but necessary feature today.
- Available tools currently include web search, Google Drive search, Gmail search, and Calendar search. Anthropic has also integrated its MCP (Model Context Protocol) framework more deeply into the API for connecting other tools.
- A unique and cool feature is that both Opus and Sonnet can use tools in parallel, sending multiple requests simultaneously for potentially much greater efficiency than sequential processing.
- The models are reportedly much better at managing their internal memory, crucial for handling complex, multi-step tasks.
- Anthropic's API adds four features aimed at agent development: a code execution tool (apparently Python-only at first), an MCP connector for linking external tools, a Files API for giving Claude access to local code files and repositories, and prompt caching for up to an hour, which the presenter notes is the way to get the cheapest and most efficient usage.
- The presenter hit his rate limit quickly with only a few prompts, indicating he'll need to subscribe to a higher tier for more thorough testing.
- The core strategic shift is that Anthropic has "given up on the chatbot race," believing OpenAI, Google, and Microsoft have won that battle.
- Anthropic is instead repositioning itself as an "infrastructure company," providing the tools and models necessary to build the best coding and general agents. The presenter feels this focus is "required to win" in a specific niche.
- Claude 4 Sonnet is already available in GitHub Copilot and is their default option, which the presenter calls "incredible."
- Early evaluations in GitHub Copilot show Sonnet 4 delivering up to a 10% improvement over the previous generation in "agentic scenarios," driven by sharper tool use, better instruction following, and stronger coding instincts.
- Claude 4 is also available in other major coding platforms like Cursor and Windsurf.
- The video sponsor, Box AI, is highlighted as a good partner for Claude 4 due to its ability to handle long horizon tasks and memory.
- Box AI helps extract metadata from documents (contracts, invoices, etc.), automate workflows, and allows users to ask questions and do deep dives into their company's data.
- For developers, Box AI handles the complex RAG (Retrieval Augmented Generation) pipeline automatically, so you don't need to manage things like vector databases or chunking. It also offers enterprise-level security and compliance.
- Using Claude Code with Box SDKs is easy, as Claude can process the Box developer documentation to understand how to build with it. A demo was shown of building a backend contract generation tool using Box, DocGen, and Claude Code.
- Benchmark results are presented but should be taken "with a grain of salt." On SWE-bench Verified (software engineering), Opus 4 and Sonnet 4 both show significant improvement, with Sonnet 4 scoring slightly higher than Opus 4 when parallel test-time compute is used.
- The presenter's initial, anecdotal usage found Opus to be faster than Sonnet at outputting code, despite the benchmark showing Sonnet slightly ahead on SWE bench. He plans to test this further.
- On Terminal-bench, Opus 4 wins. The presenter notes that Gemini 2.5 Pro had been his favorite coding model to date, pending more extensive testing of Claude 4.
- Other benchmarks (GPQA Diamond, Multilingual Q&A, High School Math) also show improvements, especially with agentic tool use. Visual reasoning scores were similar to the previous model.
- A surprising point is raised based on analysis by John Shoneth: comparing Sonnet 4 to Sonnet 3.7 on Anthropic's own provided benchmarks, about half actually showed a decrease in performance, while others improved or stayed the same. The presenter finds this "kind of nuts" and "very interesting," noting benchmarks are often the most favorable view before real-world usage ("vibe checks").
- Anthropic claims they've addressed previous issues where models were either too "lazy" or "tried too hard" with coding, believing they've now "dialed it in."
- Safety is a focus: the Claude 4 models are 65% less likely than Sonnet 3.7 to use shortcuts or loopholes on tasks susceptible to such behavior.
- Improved memory capabilities are highlighted again: Opus 4 is said to dramatically outperform previous models. Memory for agents is seen as the "key ingredient" to making them "hyper-personal."
- Claude 4 is designed to get better the more you use it, learning and developing a shorthand with the user through creating and maintaining memory files for better long-term awareness and coherence.
- Thinking processes are now summarized by a smaller model, but the raw, detailed "chains of thought" are not publicly available unless you contact sales, implying you'll likely have to "pay up" to access them.
- Claude Code is now generally available with new extensions for VS Code and JetBrains for direct IDE integration, allowing inline edits.
- An SDK for Claude Code is also being released, further supporting the infrastructure/agent building focus.
- An example shows how Claude Code can be tagged in a GitHub pull request comment to automatically address reviewer feedback, fix errors, and prepare the PR for review, which the presenter thinks is very cool.
- Pricing for Claude 4 Opus is provided: $15 per million input tokens and $75 per million output tokens, with a 200k context window and a 50% discount for batch processing. (Sonnet pricing was not mentioned in the transcript's pricing section).
- The presenter plans to conduct thorough testing and release a video soon.
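
To make the Opus pricing in the summary concrete, here is a minimal Python sketch of a cost estimate. The function name `estimate_opus_cost` is a hypothetical helper, not part of any Anthropic SDK, and it assumes the 50% batch discount applies equally to input and output tokens. It ignores prompt caching, whose pricing the summary does not give.

```python
# Prices quoted in the summary for Claude 4 Opus (USD per million tokens).
OPUS_INPUT_PER_MTOK = 15.00
OPUS_OUTPUT_PER_MTOK = 75.00
BATCH_DISCOUNT = 0.50  # 50% off for batch processing, per the summary


def estimate_opus_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate the cost in USD of one workload.

    Hypothetical helper for illustration only. Assumes the batch discount
    applies to both input and output tokens.
    """
    cost = (input_tokens / 1_000_000) * OPUS_INPUT_PER_MTOK
    cost += (output_tokens / 1_000_000) * OPUS_OUTPUT_PER_MTOK
    if batch:
        cost *= 1 - BATCH_DISCOUNT
    return round(cost, 6)


if __name__ == "__main__":
    # A full 200k-token context plus a 10k-token answer.
    print(estimate_opus_cost(200_000, 10_000))
    # The same request submitted through batch processing.
    print(estimate_opus_cost(200_000, 10_000, batch=True))
```

Because output tokens cost five times as much as input tokens, long generated answers dominate the bill faster than large prompts do.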
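
The efficiency argument for parallel tool use can be illustrated with a small, self-contained Python simulation. Nothing here calls Claude or any real tool: `fake_tool` is a hypothetical stand-in that sleeps to mimic network latency, and the tool names and delays are made up. It only shows why issuing several requests at once can finish sooner than issuing them one after another.

```python
import asyncio
import time


async def fake_tool(name: str, delay: float) -> str:
    """Hypothetical stand-in for a tool call; sleeps to simulate latency."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_sequential(calls):
    # One call at a time: total latency is roughly the sum of the delays.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    # All calls in flight at once: total latency is roughly the longest delay.
    results = await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))
    return list(results)


def timed(coro_fn, calls):
    """Run an async function to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(calls))
    return results, time.perf_counter() - start


# Made-up tool names and latencies, for illustration only.
CALLS = [("web_search", 0.2), ("drive_search", 0.2), ("calendar_search", 0.2)]

if __name__ == "__main__":
    _, seq_time = timed(run_sequential, CALLS)
    _, par_time = timed(run_parallel, CALLS)
    print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both versions return the same results in the same order. Only the waiting time differs, which is the gain the summary attributes to parallel requests.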