Archive

Every piece, in the order it ran. Reviews of individual models, head-to-head comparisons, and short essays on the practice of working with these tools.

May 2026

21 May·26

Why the benchmarks stopped telling you anything.

MMLU is saturated, HumanEval is gamed. A field guide to what's left worth reading.
16 May·26

The million-token context was always a marketing number.

Why most long-context workloads still belong in a retrieval system, and the narrow cases where the long window is worth the bill.
11 May·26

Voice models compared: ElevenLabs, Whisper, OpenAI, Cartesia.

Real latency numbers, Arabic narration tests, and the voice model worth shipping with right now.
5 May·26

Prompt engineering did not die. It got narrower.

Three techniques that still consistently improve outputs in 2026, with before-and-after examples.

April 2026

March 2026

February 2026

January 2026

December 2025

November 2025

14 Nov·25

Claude Opus 4.7, reviewed.

A 1,200-line refactoring task, a 200-page PDF, a multilingual stress test, and what it costs to use the thing daily.

Archive

May 2026

Why the benchmarks stopped telling you anything.

The million-token context was always a marketing number.

Voice models compared: ElevenLabs, Whisper, OpenAI, Cartesia.

Prompt engineering did not die. It got narrower.

April 2026

AI for Arabic content: a working report on five models.

RAG vs fine-tuning, with the math.

The price-per-use-case table.

March 2026

Multimodal capability ranking: twelve images, four models.

AI agents, eighteen months in.

Running models on your own machine.

February 2026

Small language models, in working use.

Context windows compared, across four frontier models.

The coding assistants shootout: Cursor, Copilot, Windsurf, Cody.

January 2026

The open-weight tier right now: Llama 4, Mistral, Qwen, DeepSeek.

GPT-5, reviewed.

December 2025

Gemini 3 Pro, reviewed.

GPT-5 vs Claude Opus 4.7: seven tasks, scored.

November 2025

Claude Opus 4.7, reviewed.