Archive

Every piece, ordered by the period it covers: reviews of individual models, head-to-head comparisons, and short essays on the practice of working with these tools. benchr launched on May 30, 2026, so items listed under earlier months are retrospective coverage of that period, and each article's dateline shows its actual publication date. Looking for a pick by task instead? Browse best AI by use case. Subscribe via RSS if you want new pieces without checking back.

July 2026

1 Jul·26

Claude Sonnet 5 launches: Mythos-class architecture at a mid-tier price.

$2/$10 intro pricing, a 128K max output, and a SWE-bench Verified score that edges out last month's Opus 4.8 flagship.

June 2026

28 Jun·26

GPT-5.6: OpenAI ships Sol, Terra, and Luna behind a government gate.

A new frontier series in limited preview to about 20 approved partners. Announced at $5/$30, $2.50/$15, and $1/$6 — with context windows and API IDs still unpublished.
19 Jun·26

Your computer can't run the big open models. Here's what actually works.

Why DeepSeek and Llama 70B won't load on your laptop, and the four real fixes, from quantization to renting a cloud GPU.
19 Jun·26

How to fine-tune an open model on your own data without owning a GPU.

QLoRA puts custom 7B–70B models within reach for a few dollars of rented GPU time. Here's exactly what it takes.
19 Jun·26

“CUDA out of memory”: why it happens, and how to run a model that's too big for your card.

The five real causes of the GPU OOM error, the fixes in order from free to last-resort, and when you simply need a bigger card.
19 Jun·26

Renting a GPU vs. paying per token: when self-hosting an open model is actually cheaper.

The honest break-even math. GPU dollars-per-hour against API dollars-per-million-tokens, and why utilization decides it.
16 Jun·26

The U.S. pulled Claude Fable 5 and Mythos 5.

An export-control order suspended Anthropic's two most capable models worldwide. What happened, who it hits, and what to do.
13 Jun·26

Which Claude model should you use in 2026?

Four models you can buy, one you can't. A plain decision guide to Anthropic's lineup, by task and by budget.
13 Jun·26

Claude Fable 5 vs GPT-5.5 vs Gemini 3.1 Pro.

The three newest frontier models, head to head on price, context, the benchmarks each lab actually published, and which one fits which job.
13 Jun·26

Claude Opus 4.8 vs Gemini 3.1 Pro.

The two strongest frontier models with fully published benchmarks, head to head. Coding accuracy or cheap strong reasoning, that's the choice.
10 Jun·26

GPT-5.4, reviewed: the value pick OpenAI doesn't advertise.

The seven-week flagship, reviewed after losing the crown — when the price-performance story gets honest.
10 Jun·26

Claude Fable 5 is the Mythos-class model you can finally use.

$10/$50, a 1M context, classifiers that hand risky work back to Opus 4.8, and a two-week free window.
09 Jun·26

Claude Opus 4.8 is live, and the pressure is on coding and agents.

Anthropic's new flagship lands at $5/$25, with a fast mode and a Mythos-class security model arriving in the coming weeks.
09 Jun·26

Anthropic's Project Glasswing and the case for Claude as a security tool.

An expansion to ~200 critical-infrastructure orgs, and what Claude Security and the Mythos preview actually do.
09 Jun·26

GPT-5.5 pricing, and the 272K-token cliff that doubles your bill.

Standard rates, the long-context reprice that hits the whole session, and why most teams shouldn't start on GPT-5.5 Pro.
09 Jun·26

Google's developer AI at I/O 2026: from autocomplete to agents.

Gemini 3.5 Flash as the agent default, the Gemini CLI to Antigravity migration on June 18, and what breaks.
09 Jun·26

Grok 4.3 is now xAI's default, and old slugs bill at its prices.

Legacy model aliases now redirect to Grok 4.3 and are billed at its rates. What that means for your API costs.
09 Jun·26

Google's AI Search is becoming an agent layer, not a summary box.

Background information agents and agentic booking, set against the click-loss data from Pew, Ahrefs, and Seer.
09 Jun·26

The UK just forced Google to give publishers AI Search control.

A world-first CMA opt-out: refuse to feed Google's AI features without losing your standard search ranking.
09 Jun·26

WebMCP and the push to make websites agent-readable.

A proposed standard from Google and Microsoft that lets your site hand agents a list of tools instead of making them guess.
06 Jun·26

DeepSeek vs OpenAI Pricing: Cost Comparison & Quality Trade-offs.

DeepSeek V4-Pro is 65% cheaper than GPT-5 and outperforms it on coding benchmarks. Here's the full breakdown.
06 Jun·26

Anthropic Claude API Pricing Guide: Opus 4.8, Sonnet, & Haiku.

Complete Claude API pricing breakdown. Opus 4.8 at $5, Sonnet 4.6 at $3, Haiku 4.5 at $1 per million input tokens. Plus the 90% caching discount explained.
06 Jun·26

OpenAI API Pricing Guide: GPT-5.5, GPT-5, and GPT-5 Mini Costs.

GPT-5.5 at $5, GPT-5 at $1.25, GPT-5 Mini at $0.25 per million input tokens. Batch and caching discounts explained.
06 Jun·26

Cheapest LLM API 2026: Ultra-Low Cost AI Models Ranked.

DeepSeek V4-Flash leads at $0.14 per million tokens. Every sub-$1 model ranked, plus self-hosted open weights.
06 Jun·26

AI Model Pricing Comparison 2026: Cost per Million Tokens.

Complete comparison of API token pricing across OpenAI, Anthropic, Google, DeepSeek, and open-weights. Sourced from official docs.

May 2026

30 May·26

Claude Mythos: the model you can't use.

Anthropic built a frontier model, then said it won't sell it. Here's what Mythos Preview is, and why it's locked away.
30 May·26

Claude Cowork: the desktop agent that isn't for coders.

Give it a goal, point it at your files, let it work. Claude Code's engine, aimed at everyone who isn't a coder.
30 May·26

GPT-5.5, reviewed: is the upgrade off GPT-5 worth it.

OpenAI put the gains into agentic coding and computer use, at roughly double GPT-5's API price. Who moves, who waits.
30 May·26

Gemini 3.1 Pro, reviewed.

The reasoning leap is real. What to watch is the long-context bill and how fast a 3.5 Pro could supersede it.
30 May·26

Gemini 3.5 Flash, reviewed.

Cheap frontier for agent loops, fast on output, and priced to undercut Pro. Just don't confuse it with the old budget Flash.
30 May·26

Grok 4.3, reviewed.

The one model that reads X and the live web on its own. Where that wins outright, and where it doesn't.
30 May·26

DeepSeek-V4, reviewed.

An MIT-licensed model that codes like a paid one and costs nothing to download. The real decision is how you run it.
30 May·26

Qwen3.6, reviewed.

The value here isn't one model. It's a free, Apache-licensed family in two sizes that covers most jobs at zero licensing cost.
30 May·26

Kimi K2.6, reviewed.

An open-weight trillion-parameter model that runs a swarm of sub-agents across thousands of steps. Free to download, cheap on the API.
30 May·26

Llama 4, reviewed.

A 10-million-token context on open weights still turns heads. But Meta has moved on, and Llama 4 is the last open Llama.
30 May·26

Mistral Large 3, reviewed.

The largest open-weight model from a major lab, released under Apache-2.0. Frontier-scale weights you can download.
30 May·26

ChatGPT Images 2.0, reviewed.

The first image model that gets text right. Where GPT Image 2 nails dense layouts, and where it still fumbles.
30 May·26

Opus 4.8 vs GPT-5.5: the coder's flagship vs the daily driver.

Both charge $5 per million input tokens. After that they pull apart fast. Here's where each one wins and where it loses.
30 May·26

Gemini 3.1 Pro vs GPT-5.5: reasoning vs knowledge work.

These two flagships aim at different scoreboards. One chases hardest-mode reasoning, the other all-round professional work. Picking between them starts with that.
30 May·26

Grok 4.3 vs ChatGPT: when live context wins.

Grok wires into the live web and X. ChatGPT is the all-rounder. The choice is whether your question is about right now.
30 May·26

ChatGPT vs Claude vs Gemini: the everyday pick for 2026.

Three subscriptions priced within a dollar of each other, three different default models. Here's which one is worth yours.
30 May·26

Claude vs ChatGPT for long-form writing.

Before voice or style, one boring number decides a lot: how much can each model write in one pass?
30 May·26

AI search engines compared: Perplexity vs ChatGPT Search vs Google AI.

An AI answer is only as good as your ability to check it. The question is which one shows its work.
30 May·26

The best free coding model: DeepSeek vs Qwen vs Kimi.

Open weights, zero dollars, real code. Three families you can download or chat with for free, ranked by what they score.
30 May·26

The best AI for video in 2026: Veo on top, Sora on the way out.

The tool everyone expected to win is leaving the market. The one that leads brings its own soundtrack.
30 May·26

The best AI tools for social media.

Captions, hooks, and repurposing — which tool earns its place for each platform.
30 May·26

The best free AI for coding.

What you actually get at $0 from Copilot, Cursor, and friends, and the moment the meter starts.
30 May·26

The best AI for writing anything long.

Drafting, essays, and long-form, ranked by voice and how far each model holds a thread before the prose sags.
30 May·26

The best AI for students who want to actually learn.

Studying, summarizing, and problem-solving: what is fair game, what gets you in trouble, and the free picks worth using.
30 May·26

The best AI for resumes and cover letters.

Tailoring to the job, getting past the ATS, and the AI habits that quietly sink an application.
30 May·26

The best AI for email, built-in or standalone.

Drafting, replying, and clearing the inbox: when Gmail and Outlook's own AI is enough and when a separate tool wins.
30 May·26

The best AI for spreadsheets and the formulas you hate.

Excel Copilot, Sheets' Gemini, and pasting into a chat model: which one actually gets the formula right.
30 May·26

The best AI for research without the fake citations.

Literature review and summarizing sources, with the tools that cite honestly versus the ones that make references up.
30 May·26

The best AI for Arabic-English translation.

Which models move cleanly between Arabic and English both ways, and the places every one of them still breaks.
30 May·26

The best AI for Saudi and Gulf Arabic.

Where the models hold Khaleeji dialect and where they slide back into MSA or drift toward Egyptian.
30 May·26

The best AI for customer service at a real business.

Off-the-shelf resolution bots, platform agents, or build-your-own: what each costs and which fits your support volume.
30 May·26

The best free AI with no subscription.

The tools that are genuinely free with no credit card in 2026, and the exact point where each free tier taps out.
30 May·26

The AI agent that checks out for you.

How agentic shopping works, who is building it, and where it can go wrong.
30 May·26

Are AI hallucinations fixed yet?

What got better by 2026, what did not, and the setups that cut made-up answers.
30 May·26

Which AI providers train on your chats.

Who learns from your conversations by default, how to opt out, and what stays private.
30 May·26

Do AI text detectors actually work?

The false-positive problem, who gets wrongly flagged, and what to do instead.
30 May·26

Do you actually need a reasoning model?

When the extra cost and latency of a thinking model pays off, and when it's wasted.
30 May·26

How to get cited inside AI answers.

GEO and AEO tactics that get your pages quoted by ChatGPT, Perplexity, and AI Overviews.
30 May·26

When the model remembers you.

How persistent memory works across chats, what it buys you, and the privacy trade.
30 May·26

What zero-click search did to the web.

AI Overviews and chat answers keep users on the results page. The real numbers on clicks lost.
30 May·26

Claude Opus 4.8, reviewed.

Same price as 4.7, a small leaderboard bump, one benchmark it loses, and a real honesty gain that catches its own bugs.
30 May·26

Claude Sonnet 4.6, reviewed.

The $3/$15 daily-driver tier. When it's the right default, when to drop to Haiku, and when to pay for Opus.
30 May·26

Claude Haiku 4.5, reviewed.

The $1/$5 cost-control tier. Where the cheapest Claude is genuinely enough, and where cheap turns expensive.
30 May·26

Cutting your token bill.

Where AI token spend comes from, and the five levers that bring it down: routing, caching, batching, shorter output, lower effort.
21 May·26

Why the benchmarks stopped telling you anything.

MMLU is saturated, HumanEval is gamed. A field guide to what's left worth reading.
16 May·26

The million-token context was always a marketing number.

Most long-context workloads still belong in a retrieval system, with the narrow cases where the long window is worth the bill.
11 May·26

Voice models compared: ElevenLabs, Whisper, OpenAI, Cartesia.

Real latency numbers, Arabic narration tests, and the voice model worth shipping with right now.
8 May·26

The price-per-use-case table.

What you pay for AI in 2026 by workload — chat, RAG, agents, batch — with five commercial models compared.
5 May·26

Prompt engineering did not die. It got narrower.

Three techniques that still consistently improve outputs in 2026, with before-and-after examples.
4 May·26

AI for Arabic content: a working report on five models.

How Modern Standard, Saudi, Egyptian, and Levantine Arabic come out the other side of Claude, GPT-5, Gemini 3, Qwen 3, and Llama 4.
2 May·26

Multimodal capability ranking: twelve images, four models.

Vision tested across Claude, GPT-5, Gemini 3, and Llama 4. The winner is not the one in the marketing campaigns.

April 2026

28 Apr·26

GPT-5 vs Claude Opus 4.7: seven tasks, scored.

A refactor, a landing page, an obscure legal question, a recipe, a paper summary, a difficult email, and a broken script.
22 Apr·26

Claude Opus 4.7, reviewed.

A 1,200-line refactoring task, a 200-page PDF, a multilingual stress test, and what it costs to use the thing daily.
17 Apr·26

RAG vs fine-tuning, with the math.

Cost numbers across both approaches, and the three specific scenarios where fine-tuning still pays off.

March 2026

18 Mar·26

AI agents, eighteen months in.

A skeptic's field report on LangGraph, OpenAI Assistants v2, Anthropic's computer use, and Autogen.
7 Mar·26

Running models on your own machine.

Hardware, software, actual tokens-per-second on three quantizations, and when local is genuinely worth it.
1 Mar·26

Gemini 3 Pro, reviewed.

Brilliant at one specific workflow, competent at most others, and strange in ways the model card does not explain.

February 2026

25 Feb·26

Small language models, in working use.

Phi-4, Gemma 3, and the workloads where sub-10B parameter models quietly win.
11 Feb·26

Context windows compared, across four frontier models.

When the million-token window pays off, and when it's just expensive retrieval done badly.
1 Feb·26

The coding assistants shootout: Cursor, Copilot, Windsurf, Cody.

Four assistants given the same feature on the same codebase. The bugs they shipped were not equally distributed.

January 2026

18 Jan·26

The open-weight tier right now: Llama 4, Mistral, Qwen, DeepSeek.

Where open weights have caught up to closed models, and the two categories where they still haven't.
4 Jan·26

GPT-5, reviewed.

Where GPT-5 differs from Claude in ways that matter — speed, breadth, and the cracks that show on niche technical questions.
