Agent Framework Benchmarks
Community-reported performance data for major agent frameworks. Numbers are self-reported medians and ranges from community submissions, using gpt-4o unless otherwise noted. Methodology and caveats are described below the table.
Highest success rate
1. Instructor: 95%
2. OpenAI Agents SDK: 94%
3. PydanticAI: 93%

Lowest median latency
1. Instructor: 760 ms
2. Vercel AI SDK: 780 ms
3. OpenAI Agents SDK: 820 ms

Lowest token usage
1. Instructor: 1,080 tokens/run
2. Vercel AI SDK: 1,150 tokens/run
3. PydanticAI: 1,180 tokens/run
| Framework | Task type | Median latency (range) | Tokens / run (range) | Success rate (range) | Reports |
|---|---|---|---|---|---|
| OpenAI Agents SDK (OpenAI) | Tool-calling single agent | 820 ms (680–1,100) | 1,240 (900–1,600) | 94% (91–96%) | 38 |
| PydanticAI (Pydantic) | Structured extraction | 850 ms (700–1,150) | 1,180 (880–1,500) | 93% (90–96%) | 22 |
| LangChain (LangChain) | Tool-calling single agent | 1,050 ms (800–1,600) | 1,580 (1,100–2,200) | 89% (85–93%) | 74 |
| LangGraph (LangChain) | Multi-step graph workflow | 1,120 ms (850–1,700) | 1,650 (1,200–2,400) | 91% (87–95%) | 41 |
| CrewAI (CrewAI) | Multi-agent crew (2 agents) | 3,400 ms (2,100–5,800) | 4,200 (2,800–7,000) | 87% (82–92%) | 53 |
| AutoGen (Microsoft) | Group chat (2 agents + proxy) | 4,100 ms (2,600–7,200) | 5,100 (3,200–9,400) | 85% (79–91%) | 49 |
| smolagents (Hugging Face) | Code agent (Python execution) | 910 ms (720–1,300) | 1,320 (950–1,900) | 88% (83–93%) | 27 |
| LlamaIndex (LlamaIndex) | RAG query + agent reasoning | 1,280 ms (900–2,100) | 2,100 (1,400–3,400) | 92% (88–96%) | 35 |
| Vercel AI SDK (Vercel) | Tool-calling single agent | 780 ms (620–1,050) | 1,150 (850–1,500) | 93% (90–96%) | 31 |
| Mastra (Mastra) | Tool-calling single agent | 920 ms (740–1,350) | 1,380 (980–1,900) | 90% (86–94%) | 18 |
| Semantic Kernel (Microsoft) | Plugin-based agent | 1,150 ms (880–1,700) | 1,520 (1,050–2,200) | 91% (87–95%) | 29 |
| Haystack (deepset) | RAG pipeline + agent | 1,200 ms (850–1,950) | 1,900 (1,300–3,000) | 91% (87–95%) | 24 |
| Instructor (Jason Liu) | Structured extraction | 760 ms (600–1,000) | 1,080 (800–1,400) | 95% (92–98%) | 42 |
| DSPy (Stanford NLP) | Optimized prompt pipeline | 1,400 ms (1,000–2,400) | 1,850 (1,200–3,100) | 89% (84–94%) | 21 |
| Google ADK (Google) | Tool-calling single agent | 950 ms (750–1,400) | 1,300 (900–1,800) | 92% (88–96%) | 26 |
Task type definitions
Tool-calling single agent
Agent receives a natural language query, selects and calls one or more tools, and synthesizes a final response.
search("latest AI news") → summarize → respond
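As a concrete sketch of this loop, a minimal tool-calling agent in the OpenAI Agents SDK (one of the frameworks above); the search tool is a stub and all names are illustrative, not a benchmark submission.

```python
from agents import Agent, Runner, function_tool

@function_tool
def search(query: str) -> str:
    """Stub search tool; a real implementation would call a search API."""
    return f"Placeholder results for: {query}"

agent = Agent(
    name="News agent",
    instructions="Search for the topic, then summarize the results.",
    tools=[search],
)

# The runner loops: model call -> tool call -> model call -> final answer.
result = Runner.run_sync(agent, "latest AI news")
print(result.final_output)
```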
Multi-agent collaboration
Two or more agents collaborate on a task with handoffs or group chat. Measured from input to final output.
researcher → writer → final summary
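A minimal sketch of the researcher → writer handoff using CrewAI's sequential crew; the roles, goals, and task text are illustrative.

```python
from crewai import Agent, Crew, Task

researcher = Agent(
    role="Researcher",
    goal="Gather key facts on the topic",
    backstory="A meticulous analyst.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short summary",
    backstory="A concise technical writer.",
)

research = Task(
    description="Research the topic: {topic}",
    expected_output="Bullet-point notes",
    agent=researcher,
)
write = Task(
    description="Write a three-sentence summary from the notes",
    expected_output="Final summary",
    agent=writer,
)

# Tasks run in order; the writer sees the researcher's output.
crew = Crew(agents=[researcher, writer], tasks=[research, write])
result = crew.kickoff(inputs={"topic": "latest AI news"})
print(result)
```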
RAG query + agent reasoning
Retrieval from a vector store followed by agent reasoning over the retrieved context.
embed query → retrieve chunks → reason → answer
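A minimal LlamaIndex sketch of the same pipeline, assuming an OPENAI_API_KEY in the environment and a local data/ directory of documents to index.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Embed and index the documents; retrieval happens at query time.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# embed query -> retrieve chunks -> reason -> answer
query_engine = index.as_query_engine()
response = query_engine.query("What changed in the latest release?")
print(response)
```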
Structured extraction
Agent extracts structured data from unstructured input and returns a validated Pydantic model.
parse email → extract fields → validate schema
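A sketch of this flow with Instructor, which wraps the OpenAI client so the response comes back as a validated Pydantic model; the Contact schema is illustrative.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Contact(BaseModel):
    name: str
    email: str
    subject: str

client = instructor.from_openai(OpenAI())

# The response is parsed into Contact and validated before being returned.
contact = client.chat.completions.create(
    model="gpt-4o",
    response_model=Contact,
    messages=[{"role": "user", "content": "Extract the sender from this email: ..."}],
)
print(contact.model_dump())
```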
Plugin-based agent
Agent selects and executes plugins (skills) based on a planner strategy. Common in enterprise frameworks with modular tool registration.
receive goal → plan steps → execute plugins → assemble result
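Rather than pin down any one framework's plugin API, here is a framework-agnostic sketch of the loop; registry, plugin, and run are hypothetical names, not a real API.

```python
from typing import Callable

# Hypothetical plugin registry; real frameworks register skills with metadata.
registry: dict[str, Callable[[str], str]] = {}

def plugin(name: str):
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        registry[name] = fn
        return fn
    return register

@plugin("search")
def search(arg: str) -> str:
    return f"results for {arg}"  # stub

@plugin("summarize")
def summarize(arg: str) -> str:
    return arg[:80]  # stub

def run(goal: str) -> str:
    # A real planner would derive these steps from the goal via an LLM call.
    plan = [("search", goal), ("summarize", "")]
    output = ""
    for step, arg in plan:
        output = registry[step](arg or output)  # feed prior output forward
    return output

print(run("latest AI news"))
```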
Optimized prompt pipeline
Compiled prompt chain where prompts are auto-optimized at build time and executed at inference time with minimal overhead.
define signature → compile with examples → execute optimized pipeline
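A sketch of that cycle in DSPy; the signature, metric, and one-example trainset are illustrative placeholders.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o"))

# Define signature: declare the input/output contract, not the prompt text.
class QA(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

# Compile: the optimizer bootstraps few-shot demos into the prompt.
program = dspy.Predict(QA)
compiled = dspy.BootstrapFewShot(metric=exact_match).compile(program, trainset=trainset)

# Execute the optimized pipeline at inference time.
print(compiled(question="What is the capital of France?").answer)
```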
Methodology and limitations
Data source. All numbers are community-reported via self-submission. Reports include the framework version, model used, task description, and measured latency/token counts. Each entry shows the number of community reports it is based on.
Not controlled experiments. Submissions come from different hardware, network conditions, and task implementations. Ranges are intentionally wide; the median is a rough center, not a guarantee.
Multi-agent comparisons are not apples-to-apples. CrewAI and AutoGen numbers measure end-to-end multi-agent runs, while single-agent frameworks are measured per agent. Higher latency and token usage for multi-agent frameworks are expected and do not indicate inefficiency.
Success rate definition. Community-defined task completion. A run is marked successful if it produced the expected output format and did not error, hallucinate tool calls, or loop. Threshold varies by submitter.
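For concreteness, a hypothetical version of such a per-run check; the Run fields and step cutoff are placeholders, not a real submission schema.

```python
from dataclasses import dataclass

MAX_STEPS = 20  # hypothetical loop cutoff; each submitter picks their own

@dataclass
class Run:  # placeholder fields, not a real submission schema
    output_valid: bool            # expected output format produced
    errored: bool                 # unhandled exception during the run
    hallucinated_tool_call: bool  # called a tool that does not exist
    steps: int                    # agent iterations before finishing

def is_success(r: Run) -> bool:
    return (r.output_valid and not r.errored
            and not r.hallucinated_tool_call and r.steps <= MAX_STEPS)

def success_rate(runs: list[Run]) -> float:
    return 100 * sum(map(is_success, runs)) / len(runs)
```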
Data as of April 2026. Framework releases change performance characteristics significantly. Always benchmark against your specific task and model combination.