Community data · April 2026

Agent Framework Benchmarks

Community-reported performance data for major agent frameworks. Each entry shows a median and a self-reported range from community submissions, measured with gpt-4o unless noted otherwise. Methodology and caveats are described below the table.


| Framework | Task type | Median latency (range) | Tokens / run (range) | Success rate (range) | Reports |
| --- | --- | --- | --- | --- | --- |
| — | Tool-calling single agent | 820 ms (680–1,100) | 1,240 (900–1,600) | 94% (91–96%) | 38 |
| PydanticAI (by Pydantic) | Structured extraction | 850 ms (700–1,150) | 1,180 (880–1,500) | 93% (90–96%) | 22 |
| LangChain (by LangChain) | Tool-calling single agent | 1,050 ms (800–1,600) | 1,580 (1,100–2,200) | 89% (85–93%) | 74 |
| LangGraph (by LangChain) | Multi-step graph workflow | 1,120 ms (850–1,700) | 1,650 (1,200–2,400) | 91% (87–95%) | 41 |
| CrewAI (by CrewAI) | Multi-agent crew (2 agents) | 3,400 ms (2,100–5,800) | 4,200 (2,800–7,000) | 87% (82–92%) | 53 |
| AutoGen (by Microsoft) | Group chat (2 agents + proxy) | 4,100 ms (2,600–7,200) | 5,100 (3,200–9,400) | 85% (79–91%) | 49 |
| smolagents (by Hugging Face) | Code agent (Python execution) | 910 ms (720–1,300) | 1,320 (950–1,900) | 88% (83–93%) | 27 |
| LlamaIndex (by LlamaIndex) | RAG query + agent reasoning | 1,280 ms (900–2,100) | 2,100 (1,400–3,400) | 92% (88–96%) | 35 |
| Vercel AI SDK (by Vercel) | Tool-calling single agent | 780 ms (620–1,050) | 1,150 (850–1,500) | 93% (90–96%) | 31 |
| Mastra (by Mastra) | Tool-calling single agent | 920 ms (740–1,350) | 1,380 (980–1,900) | 90% (86–94%) | 18 |
| Semantic Kernel (by Microsoft) | Plugin-based agent | 1,150 ms (880–1,700) | 1,520 (1,050–2,200) | 91% (87–95%) | 29 |
| Haystack (by deepset) | RAG pipeline + agent | 1,200 ms (850–1,950) | 1,900 (1,300–3,000) | 91% (87–95%) | 24 |
| Instructor (by Jason Liu) | Structured extraction | 760 ms (600–1,000) | 1,080 (800–1,400) | 95% (92–98%) | 42 |
| DSPy (by Stanford NLP) | Optimized prompt pipeline | 1,400 ms (1,000–2,400) | 1,850 (1,200–3,100) | 89% (84–94%) | 21 |
| Google ADK (by Google) | Tool-calling single agent | 950 ms (750–1,400) | 1,300 (900–1,800) | 92% (88–96%) | 26 |

Framework-specific notes

LangChain: Overhead from chain construction and callback system.
LangGraph: Higher latency reflects multi-node graph traversal overhead.
CrewAI: Higher numbers expected; measures a full multi-agent run, not a single inference.
AutoGen: Conversation-based overhead. Varies significantly by task complexity.
smolagents: Code execution adds latency on first run due to sandbox startup.
LlamaIndex: Measured on RAG+agent tasks. Retrieval latency included.
Vercel AI SDK: Thin abstraction over provider APIs. Edge-compatible streaming adds minimal overhead.
Mastra: TypeScript-native. Workflow engine adds slight overhead compared to raw SDK calls.
Semantic Kernel: Planner overhead varies by strategy. Measured with sequential planner.
Haystack: Pipeline construction overhead included. Retrieval latency varies by document store.
Instructor: Minimal wrapper overhead. Retry logic adds latency only on validation failures.
DSPy: Compile-time optimization overhead not included. Runtime latency only.
Google ADK: Measured with Gemini 1.5 Pro. Session management adds minimal overhead.

Task type definitions

Tool-calling single agent

Agent receives a natural language query, selects and calls one or more tools, and synthesizes a final response.

search("latest AI news") → summarize → respond
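For concreteness, here is a minimal sketch of the loop that the single-agent frameworks in the table wrap, written against the raw OpenAI Python SDK rather than any one framework; the search tool, its schema, and the example query are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

def search(query: str) -> str:
    # Hypothetical tool; swap in a real search backend.
    return f"Placeholder results for {query!r}."

tools = [{
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search the web for recent news.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the latest AI news?"}]

# First call: the model decides whether (and how) to call the tool.
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
reply = first.choices[0].message

if reply.tool_calls:
    messages.append(reply)  # keep the assistant turn that requested the tool
    for call in reply.tool_calls:
        result = search(**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

# Second call: the model synthesizes a final response from the tool output.
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```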

Multi-agent collaboration

Two or more agents collaborate on a task with handoffs or group chat. Measured from input to final output.

researcher → writer → final summary
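A minimal sketch of the handoff pattern, again with the plain OpenAI SDK; the researcher and writer prompts are invented for illustration. Real multi-agent frameworks add delegation, turn-taking, and shared memory that this omits, which is part of why their end-to-end numbers run higher.

```python
from openai import OpenAI

client = OpenAI()

def run_agent(system_prompt: str, user_content: str) -> str:
    # One "agent" here is just a system prompt plus a single model call.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return resp.choices[0].message.content

task = "Summarize this week's developments in open-source agent frameworks."

# Researcher produces raw notes; the writer turns them into the final summary.
notes = run_agent("You are a researcher. Produce factual bullet-point notes.", task)
summary = run_agent(
    "You are a writer. Turn the notes into a concise summary.",
    f"Task: {task}\n\nNotes:\n{notes}",
)
print(summary)
```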

RAG query + agent reasoning

Retrieval from a vector store followed by agent reasoning over the retrieved context.

embed query → retrieve chunks → reason → answer
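A bare-bones sketch of the same flow, assuming an in-memory document list in place of a real vector store; the documents, the embedding model choice, and the cosine helper are illustrative and not tied to any framework in the table.

```python
from openai import OpenAI

client = OpenAI()

documents = [
    "LangGraph models agent workflows as graphs of nodes and edges.",
    "Instructor validates LLM output against Pydantic models.",
    "CrewAI coordinates multiple role-based agents on a shared task.",
]

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

doc_vectors = embed(documents)                     # index the "store" up front

query = "Which framework turns workflows into graphs?"
query_vector = embed([query])[0]                   # embed query

# Retrieve the best-matching chunk, then let the model reason over it.
best = max(range(len(documents)), key=lambda i: cosine(query_vector, doc_vectors[i]))
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{documents[best]}\n\nQuestion: {query}"},
    ],
)
print(answer.choices[0].message.content)
```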

Structured extraction

Agent extracts structured data from unstructured input and returns a validated Pydantic model.

parse email → extract fields → validate schema
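A short sketch in the style of Instructor's documented usage (PydanticAI follows the same idea with its own API); the MeetingRequest model and the email text are made up for the example.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class MeetingRequest(BaseModel):
    # Illustrative schema; real submissions used whatever fields their task required.
    sender: str
    topic: str
    proposed_date: str

# Patch the OpenAI client so responses are parsed and validated against the model.
client = instructor.from_openai(OpenAI())

email = "Hi team, this is Dana. Can we meet next Tuesday to review the Q2 roadmap?"

request = client.chat.completions.create(
    model="gpt-4o",
    response_model=MeetingRequest,  # schema the output must satisfy
    max_retries=2,                  # re-asks only when validation fails
    messages=[{"role": "user", "content": f"Extract the meeting request:\n{email}"}],
)
print(request.model_dump())
```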

Plugin-based agent

Agent selects and executes plugins (skills) based on a planner strategy. Common in enterprise frameworks with modular tool registration.

receive goal → plan steps → execute plugins → assemble result
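A structural sketch of the control flow only, with a hard-coded plan and a hypothetical plugin registry; a real framework such as Semantic Kernel would have the model, via its planner strategy, choose and order the plugins.

```python
from typing import Callable

# Hypothetical plugin registry: plugin names map to callables, standing in for
# the modular skill/plugin registration these frameworks provide.
PLUGINS: dict[str, Callable[[str], str]] = {
    "fetch_sales_report": lambda goal: "Q2 revenue grew 12% quarter over quarter.",
    "summarize": lambda text: f"Summary: {text}",
}

def plan(goal: str) -> list[str]:
    # Stand-in planner with a fixed plan; a real planner would pick steps dynamically.
    return ["fetch_sales_report", "summarize"]

def run(goal: str) -> str:
    result = goal
    for step in plan(goal):   # execute plugins in planned order
        result = PLUGINS[step](result)
    return result             # assemble the final result

print(run("Report on Q2 sales performance"))
```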

Optimized prompt pipeline

Compiled prompt chain where prompts are auto-optimized at build time and executed at inference time with minimal overhead.

define signature → compile with examples → execute optimized pipeline
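A toy stand-in for the compile step, assuming a tiny labeled set and two candidate instructions; DSPy's actual API (signatures, modules, optimizers) is much richer, but the split between one-off build-time optimization and cheap inference-time execution is the same.

```python
from openai import OpenAI

client = OpenAI()

# Tiny labeled set used only at "compile" (build) time.
trainset = [("2 + 2", "4"), ("10 - 3", "7")]

candidate_instructions = [
    "Answer with only the final number.",
    "Think step by step, then give the number on the last line.",
]

def ask(instruction: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": instruction},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content.strip()

def score(instruction: str) -> int:
    # Exact-match metric over the train set; real optimizers use richer metrics.
    return sum(expected in ask(instruction, question) for question, expected in trainset)

# Compile step: pick the best-scoring instruction once, offline.
best_instruction = max(candidate_instructions, key=score)

# Inference: execute the optimized pipeline with no further optimization cost.
print(ask(best_instruction, "7 * 6"))
```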

Methodology and limitations

Data source. All numbers are community-reported via self-submission. Reports include the framework version, model used, task description, and measured latency/token counts. Each entry shows the number of community reports it is based on.

Not controlled experiments. Submissions come from different hardware, network conditions, and task implementations. Ranges are intentionally wide; the median is a rough center, not a guarantee.
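For concreteness, one plausible shape for a submission record and one way a median and range could be derived from raw reports; the field names and the 10th-90th percentile spread are assumptions, since the exact aggregation method is not specified.

```python
from dataclasses import dataclass
from statistics import median, quantiles

@dataclass
class Report:
    # Hypothetical shape of one community submission; field names are illustrative.
    framework: str
    framework_version: str
    model: str
    task: str
    latency_ms: float
    tokens: int
    success: bool

def summarize_latency(latencies: list[float]) -> tuple[float, float, float]:
    # Median plus a 10th-90th percentile spread: one plausible way to turn
    # raw reports into the published median and range.
    deciles = quantiles(latencies, n=10)
    return median(latencies), deciles[0], deciles[-1]

reported = [820, 760, 910, 1050, 690, 880, 1100, 840, 790, 950]
mid, low, high = summarize_latency(reported)
print(f"median {mid:.0f} ms, typical range {low:.0f}-{high:.0f} ms")
```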

Multi-agent comparisons are not apples-to-apples. CrewAI and AutoGen numbers measure end-to-end multi-agent runs. Single-agent frameworks are measured per-agent. Higher latency and token usage for multi-agent frameworks are expected and do not indicate inefficiency.

Success rate definition. Community-defined task completion. A run is marked successful if it produced the expected output format and did not error, hallucinate tool calls, or loop. Threshold varies by submitter.
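One possible reading of those criteria as a check over a hypothetical run record; every field name and the loop cutoff are illustrative, and as noted, submitters apply their own thresholds.

```python
def is_successful(run: dict) -> bool:
    # One reading of the criteria above: expected output format, no error,
    # no calls to unregistered tools, and no runaway loop.
    expected_format = run.get("output_valid", False)
    errored = run.get("exception") is not None
    hallucinated_tool = any(
        name not in run.get("registered_tools", []) for name in run.get("called_tools", [])
    )
    looped = run.get("steps", 0) > run.get("max_steps", 20)
    return expected_format and not (errored or hallucinated_tool or looped)

example_run = {
    "output_valid": True,
    "exception": None,
    "registered_tools": ["search"],
    "called_tools": ["search"],
    "steps": 4,
    "max_steps": 20,
}
print(is_successful(example_run))  # True
```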

Data as of April 2026. Framework releases change performance characteristics significantly. Always benchmark against your specific task and model combination.