Agent Framework Benchmarks
Community-reported performance data for major agent frameworks. Numbers are self-reported medians and ranges from community submissions, using gpt-4o unless otherwise noted. Methodology and caveats are described below the table.
Highest success rate
1. Instructor: 95%
2. OpenAI Agents SDK: 94%
3. PydanticAI: 93%

Lowest median latency
1. Instructor: 760 ms
2. Vercel AI SDK: 780 ms
3. OpenAI Agents SDK: 820 ms

Lowest token usage
1. Instructor: 1,080 tokens/run
2. Vercel AI SDK: 1,150 tokens/run
3. PydanticAI: 1,180 tokens/run
| Framework | Task type | Median latency (range) | Tokens / run (range) | Success rate (range) | Reports |
|---|---|---|---|---|---|
| OpenAI Agents SDK (OpenAI) | Tool-calling single agent | 820 ms (680–1,100) | 1,240 (900–1,600) | 94% (91–96%) | 38 |
| PydanticAI (Pydantic) | Structured extraction | 850 ms (700–1,150) | 1,180 (880–1,500) | 93% (90–96%) | 22 |
| LangChain (LangChain) | Tool-calling single agent | 1,050 ms (800–1,600) | 1,580 (1,100–2,200) | 89% (85–93%) | 74 |
| LangGraph (LangChain) | Multi-step graph workflow | 1,120 ms (850–1,700) | 1,650 (1,200–2,400) | 91% (87–95%) | 41 |
| CrewAI (CrewAI) | Multi-agent crew (2 agents) | 3,400 ms (2,100–5,800) | 4,200 (2,800–7,000) | 87% (82–92%) | 53 |
| AutoGen (Microsoft) | Group chat (2 agents + proxy) | 4,100 ms (2,600–7,200) | 5,100 (3,200–9,400) | 85% (79–91%) | 49 |
| smolagents (Hugging Face) | Code agent (Python execution) | 910 ms (720–1,300) | 1,320 (950–1,900) | 88% (83–93%) | 27 |
| LlamaIndex (LlamaIndex) | RAG query + agent reasoning | 1,280 ms (900–2,100) | 2,100 (1,400–3,400) | 92% (88–96%) | 35 |
| Vercel AI SDK (Vercel) | Tool-calling single agent | 780 ms (620–1,050) | 1,150 (850–1,500) | 93% (90–96%) | 31 |
| Mastra (Mastra) | Tool-calling single agent | 920 ms (740–1,350) | 1,380 (980–1,900) | 90% (86–94%) | 18 |
| Semantic Kernel (Microsoft) | Plugin-based agent | 1,150 ms (880–1,700) | 1,520 (1,050–2,200) | 91% (87–95%) | 29 |
| Haystack (deepset) | RAG pipeline + agent | 1,200 ms (850–1,950) | 1,900 (1,300–3,000) | 91% (87–95%) | 24 |
| Instructor (Jason Liu) | Structured extraction | 760 ms (600–1,000) | 1,080 (800–1,400) | 95% (92–98%) | 42 |
| DSPy (Stanford NLP) | Optimized prompt pipeline | 1,400 ms (1,000–2,400) | 1,850 (1,200–3,100) | 89% (84–94%) | 21 |
| Google ADK (Google) | Tool-calling single agent | 950 ms (750–1,400) | 1,300 (900–1,800) | 92% (88–96%) | 26 |
Task type definitions
Tool-calling single agent
Agent receives a natural language query, selects and calls one or more tools, and synthesizes a final response.
search("latest AI news") → summarize → respond
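As a concrete sketch of this loop, a minimal tool-calling agent in the OpenAI Agents SDK (one of the frameworks above); the search tool is a stub and all names are illustrative, not a benchmark submission.

```python
from agents import Agent, Runner, function_tool

@function_tool
def search(query: str) -> str:
    """Stub search tool; a real implementation would call a search API."""
    return f"Placeholder results for: {query}"

agent = Agent(
    name="News agent",
    instructions="Search for the topic, then summarize the results.",
    tools=[search],
)

# The runner loops: model call -> tool call -> model call -> final answer.
result = Runner.run_sync(agent, "latest AI news")
print(result.final_output)
```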
Multi-agent collaboration
Two or more agents collaborate on a task with handoffs or group chat. Measured from input to final output.
researcher → writer → final summary
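A minimal sketch of the researcher → writer handoff using CrewAI's sequential crew; the roles, goals, and task text are illustrative.

```python
from crewai import Agent, Crew, Task

researcher = Agent(
    role="Researcher",
    goal="Gather key facts on the topic",
    backstory="A meticulous analyst.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short summary",
    backstory="A concise technical writer.",
)

research = Task(
    description="Research the topic: {topic}",
    expected_output="Bullet-point notes",
    agent=researcher,
)
write = Task(
    description="Write a three-sentence summary from the notes",
    expected_output="Final summary",
    agent=writer,
)

# Tasks run in order; the writer sees the researcher's output.
crew = Crew(agents=[researcher, writer], tasks=[research, write])
result = crew.kickoff(inputs={"topic": "latest AI news"})
print(result)
```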
RAG query + agent reasoning
Retrieval from a vector store followed by agent reasoning over the retrieved context.
embed query → retrieve chunks → reason → answer
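A minimal LlamaIndex sketch of the same pipeline, assuming an OPENAI_API_KEY in the environment and a local data/ directory of documents to index.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Embed and index the documents; retrieval happens at query time.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# embed query -> retrieve chunks -> reason -> answer
query_engine = index.as_query_engine()
response = query_engine.query("What changed in the latest release?")
print(response)
```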
Structured extraction
Agent extracts structured data from unstructured input and returns a validated Pydantic model.
parse email → extract fields → validate schema
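A sketch of this flow with Instructor, which wraps the OpenAI client so the response comes back as a validated Pydantic model; the Contact schema is illustrative.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Contact(BaseModel):
    name: str
    email: str
    subject: str

client = instructor.from_openai(OpenAI())

# The response is parsed into Contact and validated before being returned.
contact = client.chat.completions.create(
    model="gpt-4o",
    response_model=Contact,
    messages=[{"role": "user", "content": "Extract the sender from this email: ..."}],
)
print(contact.model_dump())
```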
Plugin-based agent
Agent selects and executes plugins (skills) based on a planner strategy. Common in enterprise frameworks with modular tool registration.
receive goal → plan steps → execute plugins → assemble result
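Rather than pin down any one framework's plugin API, here is a framework-agnostic sketch of the loop; registry, plugin, and run are hypothetical names, not a real API.

```python
from typing import Callable

# Hypothetical plugin registry; real frameworks register skills with metadata.
registry: dict[str, Callable[[str], str]] = {}

def plugin(name: str):
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        registry[name] = fn
        return fn
    return register

@plugin("search")
def search(arg: str) -> str:
    return f"results for {arg}"  # stub

@plugin("summarize")
def summarize(arg: str) -> str:
    return arg[:80]  # stub

def run(goal: str) -> str:
    # A real planner would derive these steps from the goal via an LLM call.
    plan = [("search", goal), ("summarize", "")]
    output = ""
    for step, arg in plan:
        output = registry[step](arg or output)  # feed prior output forward
    return output

print(run("latest AI news"))
```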
Optimized prompt pipeline
Compiled prompt chain where prompts are auto-optimized at build time and executed at inference time with minimal overhead.
define signature → compile with examples → execute optimized pipeline
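A sketch of that cycle in DSPy; the signature, metric, and one-example trainset are illustrative placeholders.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o"))

# Define signature: declare the input/output contract, not the prompt text.
class QA(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

# Compile: the optimizer bootstraps few-shot demos into the prompt.
program = dspy.Predict(QA)
compiled = dspy.BootstrapFewShot(metric=exact_match).compile(program, trainset=trainset)

# Execute the optimized pipeline at inference time.
print(compiled(question="What is the capital of France?").answer)
```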
Methodology and limitations
Data source. All numbers are community-reported via self-submission. Reports include the framework version, model used, task description, and measured latency/token counts. Each entry shows the number of community reports it is based on.
Not controlled experiments. Submissions come from different hardware, network conditions, and task implementations. Ranges are intentionally wide; the median is a rough center, not a guarantee.
Multi-agent comparisons are not apples-to-apples. CrewAI and AutoGen numbers measure end-to-end multi-agent runs, while single-agent frameworks are measured per agent. Higher latency and token usage for multi-agent frameworks are expected and do not indicate inefficiency.
Success rate definition. Community-defined task completion. A run is marked successful if it produced the expected output format and did not error, hallucinate tool calls, or loop. Threshold varies by submitter.
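For concreteness, a hypothetical version of such a per-run check; the Run fields and step cutoff are placeholders, not a real submission schema.

```python
from dataclasses import dataclass

MAX_STEPS = 20  # hypothetical loop cutoff; each submitter picks their own

@dataclass
class Run:  # placeholder fields, not a real submission schema
    output_valid: bool            # expected output format produced
    errored: bool                 # unhandled exception during the run
    hallucinated_tool_call: bool  # called a tool that does not exist
    steps: int                    # agent iterations before finishing

def is_success(r: Run) -> bool:
    return (r.output_valid and not r.errored
            and not r.hallucinated_tool_call and r.steps <= MAX_STEPS)

def success_rate(runs: list[Run]) -> float:
    return 100 * sum(map(is_success, runs)) / len(runs)
```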
Data as of April 2026. Framework releases change performance characteristics significantly. Always benchmark against your specific task and model combination.