Archive
Every piece, in the order it ran. Reviews of individual models, head-to-head comparisons, and short essays on the practice of working with these tools.
May 2026
-
21 May·26
Why the benchmarks stopped telling you anything.
MMLU is saturated, HumanEval is gamed. A field guide to what's left worth reading.
-
16 May·26
The million-token context was always a marketing number.
Why most long-context workloads still belong in a retrieval system, and the narrow cases where the long window is worth the bill.
-
11 May·26
Voice models compared: ElevenLabs, Whisper, OpenAI, Cartesia.
Real latency numbers, Arabic narration tests, and the voice model worth shipping with right now.
-
5 May·26
Prompt engineering did not die. It got narrower.
Three techniques that still consistently improve outputs in 2026, with before-and-after examples.
April 2026
-
27 Apr·26
AI for Arabic content: a working report on five models.
How Modern Standard, Saudi, Egyptian, and Levantine Arabic come out the other side of Claude, GPT-5, Gemini 3, Qwen 3, and Llama 4.
-
17 Apr·26
RAG vs fine-tuning, with the math.
Cost numbers across both approaches, and the three specific scenarios where fine-tuning still pays off.
-
8 Apr·26
The price-per-use-case table.
What you actually pay for AI in 2026 by workload — chat, RAG, agents, batch — with five commercial models compared.
March 2026
-
28 Mar·26
Multimodal capability ranking: twelve images, four models.
Vision tested across Claude, GPT-5, Gemini 3, and Llama 4. The winner is not the one in the marketing campaigns.
-
18 Mar·26
AI agents, eighteen months in.
A skeptic's field report on LangGraph, OpenAI Assistants v2, Anthropic's computer use, and AutoGen.
-
7 Mar·26
Running models on your own machine.
Hardware, software, actual tokens-per-second on three quantizations, and when local is genuinely worth it.
February 2026
-
25 Feb·26
Small language models, in working use.
Phi-4, Gemma 3, and the workloads where sub-10B-parameter models quietly win.
-
11 Feb·26
Context windows compared, across four frontier models.
When the million-token window pays off, and when it's just expensive retrieval done badly.
-
1 Feb·26
The coding assistants shootout: Cursor, Copilot, Windsurf, Cody.
Four assistants given the same feature on the same codebase. The bugs they shipped were not equally distributed.
January 2026
-
18 Jan·26
The open-weight tier right now: Llama 4, Mistral, Qwen, DeepSeek.
Where open weights have caught up to closed models, and the two categories where they still haven't.
-
4 Jan·26
GPT-5, reviewed.
Where GPT-5 differs from Claude in ways that matter — speed, breadth, and the cracks that show on niche technical questions.
December 2025
-
14 Dec·25
Gemini 3 Pro, reviewed.
Brilliant at one specific workflow, competent at most others, and strange in ways the model card does not explain.
-
1 Dec·25
GPT-5 vs Claude Opus 4.7: seven tasks, scored.
A refactor, a landing page, an obscure legal question, a recipe, a paper summary, a difficult email, and a broken script.