
I ranked every AI based on vibes
AI Generated Summary
Airdroplet AI v0.2

Here's a summary of the video ranking various AI models based on personal experience and "vibes."
This video dives into a ranking of numerous AI models from the perspective of someone running a chat application using them. It covers the pros, cons, prices, speeds, and specific use cases for each model, highlighting surprising insights and personal feelings about their performance and impact on the AI landscape. The ranking is based on practical experience and value rather than just raw benchmarks.
Here are the key points about each model discussed:
- GPT-4o: Placed in B tier. It's seen as the standard middle-ground model. It improved on the original GPT-4 in speed and price, setting a baseline, but doesn't necessarily excel beyond that compared to newer options. It's fine as a default but a bit expensive.
- GPT-4o Mini: Initially high A tier, later moved to B tier. It was very underrated for its time, being much faster and cheaper than 4o. It kicked off the movement towards cheaper, faster smaller models, even though it's not necessarily recommended for use now due to newer, better options. Its historical impact was significant.
- Gemini 2.0 Flash: Earned an S tier spot and is the favorite model overall. It's revolutionary due to its incredible price-to-performance ratio, being slightly cheaper than 4o Mini but smarter and comparable or better than 4o. It's insanely fast, making it feel almost free to use and ideal as a default model because you get quick answers and can reroll faster if needed. It excels in the "green" zone on intelligence vs. cost charts.
- O1 (Thinking): This model had a revolutionary impact by introducing prolific "thinking" and chain of thought capabilities, forcing a re-evaluation of AI progress. However, it's not recommended for use now because it's absurdly expensive ($15/million input tokens, $60/million output) and doesn't expose its thinking data over the API, making it difficult to understand why it does what it does.
- QwQ: Ranked in F tier. It's controversial, but the personal experience was extremely negative, requiring excessive effort to get any usable output because of its looping behavior (a joking "wait watcher" metric was invented for how often it second-guesses itself). Despite being reasonably priced, it's considered largely unusable in practice.
- Qwen 2.5 / Llama 3.3: Placed in C tier. These models are not great on their own, providing inconsistent, mediocre answers while potentially using more compute than needed. Their strength is that their architecture works incredibly well on Groq's custom chips, enabling inference speeds up to 100 times faster, making them useful purely for their speed when run on that infrastructure.
- DeepSeek R1 (Standard): Revolutionary impact, considered a game-changer for bringing reasoning power to open source. It's very smart but too slow for practical use because API providers struggle to run it at reasonable speeds (10-20 tokens/sec vs. faster models).
- DeepSeek R1 Distilled (specifically the Llama distill): Placed high in A tier (implicitly, near R1). It distills the knowledge of the larger R1 model onto smaller, easier-to-run bases like Llama and Qwen. It is much faster than the standard R1 and runs well on Groq chips. It's been impressive for coding challenges, though not quite as smart as the full R1.
- O3 Mini: Ranked in S tier. This model was a pleasant surprise, being both cheaper ($1.10/million input, $4.40/million output) and better than 4o, clearly priced to compete with DeepSeek R1. It's very fast, but its major catch is the lack of exposed thinking data over the API, leading to opaque loading states. Despite this, its incredible value (costing 1/5th of Claude for similar usage) makes it the go-to default for hard problems. It demonstrates OpenAI's capability when pushed by competition.
- Claude 3.5 Sonnet: Placed in S tier, despite its high cost. It's more expensive than 4o and benchmarks comparably to much cheaper models like Flash. Its S tier status comes from its massive quality jump for developers, particularly its excellence at code and handling UI tasks without aggressive rewriting. It was the first model genuinely great at using tools and agents, kickstarting the agentic revolution and many AI dev tools. Its absurd cost is a significant pain point, forcing changes in pricing for services that use it.
- Claude 3.7 Sonnet (Reasoning): Placed in A tier. It's smart but can sometimes "go off the rails," and the reasoning version likes to aggressively call tools, increasing costs. A surprising positive is that Anthropic chose to expose its reasoning data, unlike OpenAI and Google, even admitting they don't fully understand why it works, which is seen as a positive move for community understanding.
- Claude 3.7 Sonnet (Standard): Placed in B tier. It performs comparably to the reasoning version but without the exposed thinking data. It remains expensive.
- Gemini 2.5 Pro: A speculative A tier placement. It's a very new model that performs exceptionally well on benchmarks. However, its price is currently unknown, and while reasoning is visible in AI Studio, it's not exposed over the API, which is seen as a weird limitation. Its final tier depends heavily on its eventual pricing relative to other models.
- DeepSeek V3: Considered revolutionary. It kicked off a new era for open AI models, being cheaper than 4o Mini on release and performing close to Claude 3.5 Sonnet. Its initial speed was fast, making it a pioneer for fast models. Its terrible UI was a direct motivation for building T3 Chat. Despite subsequent API performance issues and R1 getting more attention, it's considered incredibly underrated, potentially the best non-thinking model if the hosting were better.
- DeepSeek V3-0324: Placed high in A tier. A quiet but significant update to V3 that performs better than GPT 4.5. Its pricing ($0.27/million input, $1.10/million output) is drastically cheaper than 4.5, highlighting its immense value. This model is expected to power the next generation (R2), potentially making it the best open-source model ever. It fundamentally changed the presenter's view on AI.
- GPT 4.5: Ranked in D tier. Despite early access, it was hard to figure out what it was good at (not great at code). It was marketed as more personal but lost blind preference polls to 4o for creative writing. It's absurdly expensive ($75/million input) and only available via bring-your-own-key on T3 Chat. While it's a massive model with impressive world knowledge and better prose than previous OpenAI models, its performance doesn't justify the cost, and it feels "dumber" than the 75x cheaper O3 Mini. It likely serves as a foundational model for future releases.
- O1 Pro: Placed in a "meme tier" due to its outrageous price ($150/million input, $600/million output). It can solve some very hard problems (5-10% better than the next best), but this marginal gain doesn't justify being 50-100 times more expensive (see the per-request cost sketch after this list). It also provided a terrible user experience on the ChatGPT site, often breaking or failing. Its cost reflects the difficulty/compute required to run it, not necessarily a massive profit margin.
- DeepSeek R1 Qwen Distilled: Placed at the top of D tier. It's a fine model that performs well and runs on Groq, but it's less quirky and seemingly less smart than the Llama distill, making it harder to justify personally.
- Gemini 2.0 Flash-Lite: Placed in B tier. It's a smaller, faster, dumber, and incredibly cheap model ($0.075/million input, $0.30/million output). The issue is that standard 2.0 Flash is only slightly more expensive but much smarter, and it's already very fast and cheap, so the marginal discount for Flash-Lite isn't compelling enough to switch.
- Grok 3: Ranked in F tier. It's literally unusable for developers because the promised API hasn't been released despite being announced over a month ago, with questionable communication and behavior from the company (xAI). The lack of access prevents benchmarking or practical use, leading to suspicion that they are hiding poor performance metrics. Its unavailability to developers earns it the lowest tier.
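
To make the price comparisons above concrete, here is a minimal sketch of the per-request arithmetic, using only the per-million-token prices quoted in this summary. The request size (2,000 input tokens, 1,000 output tokens) is an illustrative assumption, not a figure from the video:

```python
# Per-million-token prices quoted above: (input, output) in USD.
PRICES = {
    "o1": (15.00, 60.00),
    "o1-pro": (150.00, 600.00),
    "o3-mini": (1.10, 4.40),
    "DeepSeek V3-0324": (0.27, 1.10),
    "Gemini 2.0 Flash-Lite": (0.075, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request: tokens / 1M * price per million."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

# Illustrative request: 2,000 input tokens, 1,000 output tokens.
for model in PRICES:
    print(f"{model:>22}: ${request_cost(model, 2_000, 1_000):.5f}")
```

With those prices, a request to O1 Pro comes out roughly two orders of magnitude more expensive than the same request to O3 Mini, which is the gap the tier placements above are reacting to.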
In daily practice, the default model is Gemini 2.0 Flash for most tasks due to its speed, cost, and capabilities (search, image/PDF input). If Flash doesn't give a satisfactory answer or the problem is hard, the next step is usually O3 Mini, potentially with reasoning enabled. For problems specifically involving CSS or UI, Claude 3.5 Sonnet is the go-to, despite its cost, because of its superior performance in those areas. The R1 Llama distill is also used sometimes for coding challenges or run in parallel while waiting for slower models. This tiered approach keeps common tasks fast and cheap while leaving powerful options available for harder ones.
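
As an illustration of that tiered workflow, here is a minimal routing sketch in Python. The function name, task categories, and model identifiers are hypothetical (this is not how T3 Chat actually implements it, and the parallel R1 Llama distill step is omitted), but it follows the escalation order described above: Gemini 2.0 Flash by default, O3 Mini for hard problems, Claude 3.5 Sonnet for CSS/UI work.

```python
from enum import Enum, auto

class Task(Enum):
    GENERAL = auto()   # everyday questions
    HARD = auto()      # known-hard problems
    UI = auto()        # CSS / UI work

def pick_model(task: Task, flash_answer_was_ok: bool = True) -> str:
    """Hypothetical router following the tiered approach described above."""
    if task is Task.UI:
        return "claude-3.5-sonnet"   # best at code and UI, despite the cost
    if task is Task.HARD or not flash_answer_was_ok:
        return "o3-mini"             # cheap reasoning model for hard problems
    return "gemini-2.0-flash"        # fast, cheap default

# Example: escalate after an unsatisfying Flash answer.
print(pick_model(Task.GENERAL, flash_answer_was_ok=False))  # -> o3-mini
```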