Essay · Covers March 2026 · Published May 30, 2026

AI agents, eighteen months in

A skeptic's read of LangGraph, OpenAI Assistants v2, Anthropic's computer use, and Autogen, plus the Frankenstein problem of chaining LLM calls.

By the benchr team · Updated May 30, 2026

Frameworks covered: 4 (LangGraph · Assistants · Computer use · Autogen)
Hype-to-ship gap: 18 months since the agent wave broke
Right number of agents: 1, for nearly every workload
Hard caps required: all (token, time, spend, retries)

Eighteen months after the 2024 agentic wave broke, the pitch is mostly the same and the working code mostly is not. Vendor demos show agents booking flights, refactoring whole repositories, running back-office workflows nobody touches, and the demos look magical. Production deployments of the same frameworks look like duct tape: small bounded workflows held together with hard token caps, wall-clock timeouts, and a kill switch a human can reach.

This piece covers the four agentic stacks worth taking seriously right now (LangGraph, OpenAI Assistants v2, Anthropic's computer use, and Microsoft Autogen) and explains which one fits which production shape. The verdicts come from each framework's documented architecture, the labs' positioning, and what open community discussion consistently reports about how each behaves once it ships. None of these stacks produces an agent worth running unattended for high-stakes work. Two of them work fine with significant guardrails. The other two are mostly useful for what they teach you about where the wall currently sits.

The reliability of an agent loop is a frontier-model property more than a framework property. A framework shapes the topology of the loop and how much visibility you get into what the model is doing, but it can't make the model plan better across many tool calls. The underlying capability story is covered in the Opus review and the public benchmark record. For now, the model that holds together longest in an agent loop is Claude Opus 4.7, and if you are building on top of a single model that is the one to pick.

What you should use each framework for

The four frameworks are not direct substitutes for each other. Each one has a different design point and a different cost-of-error profile. The right framework for you is the one whose design point matches the workload you are about to put on it.

The framework choice falls out of the workload, not the other way around. Production loop with bounded risk → LangGraph. Demo speed → Assistants v2. UI with no API → computer use. High-stakes action → human in the loop. Adding more agents rarely helps.
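The mapping above can be written down as a small decision function. This is only a sketch: the `Workload` fields and the priority order of the checks are assumptions made for illustration, not something the frameworks or the essay prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description; field names are illustrative only."""
    high_stakes: bool = False   # a wrong action costs money, trust, or safety
    has_api: bool = True        # the target system exposes a programmatic interface
    production: bool = False    # bounded-risk production loop rather than a demo


def pick_framework(w: Workload) -> str:
    """Return the recommendation for a workload, checking the riskiest case first."""
    if w.high_stakes:
        return "human in the loop"
    if not w.has_api:
        return "computer use"
    if w.production:
        return "LangGraph"
    return "Assistants v2"


print(pick_framework(Workload(production=True)))
print(pick_framework(Workload(has_api=False)))
```

Note that no branch returns a multi-agent setup, which matches the claim that adding more agents rarely helps.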

LangGraph: explicit topology, production-readiness

LangGraph's design point is that the LLM doesn't decide the topology of the loop; you do. The agent is a graph of nodes (each node is either an LLM call, a tool call, or a deterministic function) and the edges between them are explicit. The framework gives you the control surface that lets you keep the loop bounded.
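To make the explicit-topology idea concrete, here is a minimal sketch in plain Python. It deliberately does not use LangGraph's API: the `NODES` and `EDGES` tables, the `run_graph` helper, and the stand-in node functions are all hypothetical. The point it illustrates is that the developer writes the edges, and the model (here faked) only fills in state.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


def classify(state: State) -> State:
    # Stand-in for an LLM call; a real node would call a model here.
    text = str(state["ticket"]).lower()
    return {**state, "label": "refund" if "refund" in text else "other"}


def lookup_order(state: State) -> State:
    # Stand-in for a tool call.
    return {**state, "order_found": True}


def finish(state: State) -> State:
    # Deterministic function node.
    return {**state, "done": True}


NODES: Dict[str, Callable[[State], State]] = {
    "classify": classify,
    "lookup": lookup_order,
    "finish": finish,
}

# Edges are written by the developer; each one returns the next node name or None.
EDGES: Dict[str, Callable[[State], Optional[str]]] = {
    "classify": lambda s: "lookup" if s["label"] == "refund" else "finish",
    "lookup": lambda s: "finish",
    "finish": lambda s: None,
}


def run_graph(state: State, start: str = "classify", max_steps: int = 10) -> State:
    """Walk the graph from `start`, refusing to run more than `max_steps` nodes."""
    current: Optional[str] = start
    path: List[str] = []
    while current is not None:
        if len(path) >= max_steps:
            raise RuntimeError(f"step cap of {max_steps} reached at node {current!r}")
        state = NODES[current](state)
        path.append(current)
        current = EDGES[current](state)
    return {**state, "path": path}


result = run_graph({"ticket": "I want a refund for my order"})
print(result["path"])
```

Because every possible transition is listed in `EDGES`, the set of paths the agent can take is readable from the code, which is where the debugging visibility comes from.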

That's the right design point for production. Discussion on the LangChain forum and among agent researchers more broadly consistently describes LangGraph as the framework most often shipped in production as of 2026. It earns that not by being the easiest to write but by being the thing that survives contact with a real production workload. If you are building an agent that touches money, customer data, or anything else expensive to get wrong, LangGraph gives you the visibility to debug it and the structure to bolt guardrails onto.

The cost is development speed. LangGraph asks you to write more code per agent than the higher-level frameworks, which is a trade worth making in production and a waste of time in a demo.

OpenAI Assistants v2: fast prototype, slower debug

The current production iteration of OpenAI's Assistants API is higher-level than LangGraph. You describe the tools and the instructions; the platform handles the loop. It's the closest thing in the space to "just describe what you want."

The trade is visibility. When an Assistants v2 agent does the wrong thing, debugging means reading the logs of which tools were called in which order. OpenAI exposes those logs but doesn't make them pleasant to navigate. For a working developer that friction outweighs the speed gain you get on the first iteration; for a demo, where the whole point is showing a thing working, speed wins easily. So Assistants v2 is the right pick while you are still figuring out whether the agent should exist at all. Once you've decided it should and you're putting it in production, move it to LangGraph. OpenAI has hinted at a v3 successor; that has not landed in general availability as of this snapshot.

Anthropic's computer use: a different kind of agent

Anthropic's computer use is a different category. Instead of calling APIs, the agent sees a virtual screen, moves the mouse, types, and reads the result. That unlocks tasks with no API surface: a desktop application, a website with no clean programmatic interface, a vendor product that keeps its functionality locked behind the UI.

The community discussion of computer use is consistent on the strength and the weakness. The strength is that the model can drive production software end-to-end on workflows that would otherwise be unreachable. The weakness is that the agent breaks when the UI changes, which UIs do often. A compose interface that ships a redesign breaks every downstream agent until the prompts and the visual selectors are re-tuned. That makes computer use the right tool for long-tail tasks against UIs that rarely change. For a high-volume workflow whose UI ships updates every month, an API path, where one exists, will hold up far better.

Microsoft Autogen: the multi-agent trap

Autogen's pitch is multi-agent. Instead of one agent doing everything, compose a team of specialist agents that collaborate. The community discussion of multi-agent setups across Autogen and the imitators that followed it converges on the same observation: adding agents adds opportunities for them to confuse each other, not intelligence. The conversation between the agents stays internally coherent even as the output drifts further from the goal.

Call it the Frankenstein problem. There may be tasks where the multi-agent decomposition wins, but the community has not produced a convincing example yet. If you find yourself reaching for a multi-agent setup, spend that same effort on a better single agent first. A single agent is almost always good enough, and the multi-agent version almost always ends up worse.

The distance between a demo that dazzles and a deployment you'd trust unattended is the whole problem. Closing it is the next two years of work.

Where each framework lands today

Strength of fit for production work, public-report consensus.

LangGraph (bounded production loops): Strong
Computer use (stable-UI automation): Good
Assistants v2 (prototypes and demos)
Autogen (multi-agent production): Weak

The number of agents you should have: 1, for nearly every workload.

Where agents work in 2026

There are a few categories where an agent in production is worth trusting today.

The first is high-volume, low-stakes classification or routing. When the cost of being wrong on a single case is small and the volume is large, a high-but-imperfect success rate is a measurable productivity gain. The wrong answers get caught downstream, by humans or by retries or by a simple sanity check, and the agent pays for itself on everything else.

The second is tightly-scoped tool calling: single tool, single decision, clear stopping condition. Search the docs and return the relevant section, or look up a customer record, or fetch the weather. These are agents in the most generous sense, closer to an LLM with one function call bolted on, and they work because there's no loop to fall out of.

The third is human-in-the-loop assistance. The agent does the legwork and a human approves the action. That is the model behind every coding assistant that ships production code, and it works because the human catches the failures the agent would otherwise commit. The coding-assistants shootout walks through where the four mainstream products in that pattern land.
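The human-in-the-loop pattern can be sketched as an approval gate in front of the side effect. Everything here is hypothetical: `ProposedAction`, `execute_with_approval`, and the `approve` callback stand in for whatever review step (a UI prompt, a ticket, a pull-request review) a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take; fields are illustrative only."""
    name: str
    amount_usd: float


def execute_with_approval(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], None],
) -> str:
    """Run `execute` only if the human reviewer behind `approve` says yes."""
    if not approve(action):
        return "rejected"
    execute(action)
    return "executed"


executed: List[ProposedAction] = []
charge = ProposedAction(name="charge_card", amount_usd=49.0)

# A reviewer who declines: the side effect never runs.
print(execute_with_approval(charge, approve=lambda a: False, execute=executed.append))
# A reviewer who accepts: the side effect runs exactly once.
print(execute_with_approval(charge, approve=lambda a: True, execute=executed.append))
```

The design choice is that the agent only ever produces a `ProposedAction`; it never holds a reference to the function that performs the side effect.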

1. Observe state. Read the world (API, database, screenshot).
2. Plan and reason. LLM picks the next action from a tool list.
3. Execute tool call. Side effects happen here. This is where money disappears.
4. Loop or terminate. Goal met → done. Else → step 1. Cap the loop.
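The four-step loop above can be sketched as a single function with a step cap and a wall-clock timeout. The `run_agent` signature, the return strings, and the toy planner are assumptions for illustration; in a real loop `plan` would be an LLM call that picks a tool name.

```python
import time
from typing import Callable, Dict, Optional


def run_agent(
    observe: Callable[[], Dict],
    plan: Callable[[Dict], Optional[str]],
    tools: Dict[str, Callable[[], None]],
    max_steps: int = 6,
    timeout_s: float = 30.0,
) -> str:
    """Observe, plan, execute, repeat; stop on goal, step cap, or timeout."""
    deadline = time.monotonic() + timeout_s
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            return "timeout"
        state = observe()          # 1. observe state
        action = plan(state)       # 2. plan: a tool name, or None when the goal is met
        if action is None:
            return "done"          # 4. terminate
        if action not in tools:
            return "unknown_tool"  # refuse anything outside the tool list
        tools[action]()            # 3. execute: side effects happen here
    return "step_cap"              # 4. the loop is capped


# Toy usage: the "goal" is a counter reaching 3.
counter = {"n": 0}
result = run_agent(
    observe=lambda: dict(counter),
    plan=lambda s: None if s["n"] >= 3 else "increment",
    tools={"increment": lambda: counter.update(n=counter["n"] + 1)},
)
print(result, counter["n"])
```

A planner that never returns `None` exits with `"step_cap"` instead of running forever, which is the behavior the cap exists to guarantee.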

Mar 2024: LangChain Agents. First broadly-used framework. Tool-calling templates that hid the loop.
Aug 2024: LangGraph. Explicit graph topology. The framework people ship.
Sep 2024: OpenAI Assistants v2. Higher-level API. Faster to prototype, harder to debug.
Oct 2024: Anthropic computer use. Agent that sees a screen and uses a mouse and keyboard.
2025: Multi-agent everywhere. Autogen and friends. Mostly worse than one well-designed agent.

LangGraph: Production. Best topology control.
Assistants v2: Prototypes. Fastest idea-to-demo.
Computer use: UI tasks. Brittle on dynamic UIs.
Autogen: Research. Multi-agent experiments.

Where agents do not work yet

Long-horizon planning. Anything that needs the agent to hold coherence across more than five or six tool calls. The frontier models are getting better at this, but they aren't there yet. The piece on why the benchmarks stopped telling you anything covers the capability story; in compressed form, the benchmarks measure the cases where the model has already solved the planning problem, and production loops live in the cases where it hasn't.

High-stakes autonomous action. Anything where a wrong action costs money, trust, or safety. A success rate that's fine for email classification is nowhere near fine for charging a customer's credit card. Wherever a single bad action can survive downstream review, put a human in the loop instead of betting on a more confident agent.

Open-ended exploration. Tasks without a clear stopping condition. The agent will eventually do something useful, and then it will keep going, and the trouble starts there. Cap every loop. Wall-clock timeouts and token budgets are non-negotiable.

What to build, if you're building

For a developer who wants to try this, the working setup in 2026 is LangGraph as the topology framework, Claude Opus 4.7 as the backing model, and a single agent with hard caps. Build one agent for one task. Get it to a reliability level you would ship before reaching for the second. Most teams over-reach on agent scope and under-invest in the guardrails. The fix is to invert the ratio.

For the venture-capital pitch that says agents will replace knowledge workers in three years: the pitch is wrong on the timeline. The path from current capability to general-purpose autonomous agents runs through a stretch of work on reliability, recovery, and tool design that is not sexy enough to fund easily. The agents that end up mattering will be the ones built carefully on the narrow set of capabilities the models have today, rather than on the speculative ones they keep getting promised. The prompt-engineering piece covers what your prompts inside the loop should look like across whichever framework you end up using.

The frameworks will keep moving and the model under them will keep improving. What is unlikely to budge is the trade between scope and reliability: widen the scope and reliability drops, narrow it and reliability climbs. Pick narrow and ship. The wide-scope demos will keep coming, and the wide-scope production deployments will stay rare.

Frequently asked

Which AI agent framework should I use in 2026?

LangGraph for production workflows where you need explicit topology control. OpenAI Assistants v2 for fast prototyping. Anthropic computer use for UI automation. Multi-agent frameworks like Autogen usually underperform a single well-designed agent.

Are AI agents production-ready?

Narrowly. Agents work for high-volume low-stakes classification, tightly-scoped single-tool calls, and human-in-the-loop assistance. They fail at long-horizon planning, high-stakes autonomous action, and open-ended exploration.

Why do multi-agent setups underperform single agents?

Adding agents adds opportunities for them to confuse each other in ways a single agent would not. The conversation between agents stays internally coherent while the output drifts further from the goal.

How do I prevent agents from burning my budget?

Hard token caps per session, wall-clock timeouts, automatic alerts when a session cost crosses a low ceiling, and per-day spending limits at the provider level. The surprise invoice is the canonical first lesson for any team running unattended loops.
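A per-session budget guard along these lines might look like the sketch below. The `SessionBudget` class, its field names, and the flat per-token price are assumptions for illustration; real pricing differs by model and by input versus output tokens, and provider-level daily limits are configured outside your code.

```python
from dataclasses import dataclass, field
from typing import List


class BudgetExceeded(RuntimeError):
    """Raised when a session crosses a hard cap and must stop."""


@dataclass
class SessionBudget:
    max_tokens: int            # hard token cap for the session
    max_cost_usd: float        # hard spend cap for the session
    alert_cost_usd: float      # low ceiling that triggers a one-time alert
    usd_per_1k_tokens: float   # assumed flat price, for illustration only
    tokens_used: int = 0
    alerts: List[str] = field(default_factory=list)

    @property
    def cost_usd(self) -> float:
        return self.tokens_used / 1000 * self.usd_per_1k_tokens

    def charge(self, tokens: int) -> None:
        """Record usage after each model call; raise once a hard cap is crossed."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token cap exceeded: {self.tokens_used}")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"spend cap exceeded: ${self.cost_usd:.2f}")
        if self.cost_usd >= self.alert_cost_usd and not self.alerts:
            # A real system would page or message someone here.
            self.alerts.append(f"session cost reached ${self.cost_usd:.2f}")


budget = SessionBudget(
    max_tokens=10_000, max_cost_usd=1.0, alert_cost_usd=0.5, usd_per_1k_tokens=0.1
)
budget.charge(4_000)   # under the alert ceiling
budget.charge(2_000)   # crosses the alert ceiling, still under the hard caps
print(budget.alerts)
```

Calling `budget.charge(...)` after every model response, and letting `BudgetExceeded` propagate out of the loop, turns the caps into something the agent cannot talk its way around.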

Changelog

May 25, 2026 — Rewrote the framework sections so the verdicts are grounded in each framework's documented architecture and the public community discussion. Added a decision-tree SVG mapping workload type to framework choice.
March 18, 2026 — Originally published.

References

LangChain, "LangGraph," langchain.com/langgraph, accessed May 2026.
OpenAI, "Assistants API overview," platform.openai.com/docs/assistants/overview, accessed May 2026.
Anthropic, "Introducing computer use," anthropic.com/news/3-5-models-and-computer-use, October 2024.
Microsoft, "AutoGen," microsoft.github.io/autogen, accessed May 2026.