LLM Function Calling Benchmarks¶
~5 minute read
Prerequisites: Function Calling and Tool Use, AI Agent Workflows
A model scoring 92% on MMLU can fail completely when asked to chain three API calls. BFCL tests 2,000 question-function pairs: the top score is 70.85% (GLM-4.5), while GPT-5 manages only 59.22%. MCPMark goes further with real CRUD operations through Notion, GitHub, and PostgreSQL: the best pass@1 is just 52.6% (GPT-5 Medium), and Claude Opus 4.1 scores 29.9% at a cost of $1,165 per run. This is the core gap between a "smart model" and a "working agent".
Part 1: Overview¶
Why Function Calling Benchmarks Matter¶
Key Insight:
A model that scores 90% on a math test might completely fail when asked to chain three API calls, manage context across a 10-turn conversation, or know when not to use a tool.
Traditional Benchmarks Don't Help:
- MMLU and HumanEval don't test tool use
- Production requires multi-step reasoning
- Context must be managed across conversations
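For context, here is a minimal sketch of what a function-calling task looks like: the model receives a JSON schema describing a tool and must emit a structured call instead of prose. The `get_weather` tool and its fields are hypothetical, loosely following the OpenAI-style tool format.

```python
# A minimal, hypothetical tool definition in an OpenAI-style "tools" format.
# Benchmarks like BFCL score whether the model's emitted call matches the
# expected function name and arguments (e.g. via AST comparison).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# What a correct model response is expected to contain (illustrative only):
expected_call = {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}
```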
Part 2: Berkeley Function Calling Leaderboard (BFCL)¶
2.1 What Makes BFCL Different¶
Developed by: UC Berkeley researchers
Scale: 2,000 question-function-answer pairs
Languages: Python, Java, JavaScript, REST API
2.2 Test Categories¶
| Category | What It Measures | Why It Matters |
|---|---|---|
| Simple Function Calling | Single function invocation | Baseline competency |
| Parallel Function Calling | Multiple simultaneous calls | Batch operations efficiently |
| Multiple Function Selection | Choosing correct tool from many | Decision under choice overload |
| Relevance Detection | Knowing when NOT to call | Preventing hallucinated actions |
| Multi-turn Interactions | Sustained conversations with context | Memory and long-horizon planning |
| Multi-step Reasoning | Sequential calls, outputs feed inputs | Complex workflow orchestration |
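The relevance-detection category is the least intuitive one, so here is a hypothetical illustration (not taken from the BFCL dataset): when none of the available tools fit the request, the correct behavior is to make no call at all.

```python
# Hypothetical relevance-detection case: the only available tool cannot
# answer the user's question, so a correct model responds in plain text
# (or asks for clarification) instead of hallucinating a call.
available_tools = ["get_stock_price(ticker: str)"]

user_request = "Summarize the plot of Hamlet in two sentences."

# Expected: no tool call. A model that invokes get_stock_price anyway
# fails the relevance-detection check even if the call is well-formed.
expected_tool_calls = []
```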
2.3 BFCL Leaderboard (October 2025)¶
| Rank | Model | Score |
|---|---|---|
| 1 | GLM-4.5 (FC) | 70.85% |
| 2 | Claude Opus 4.1 | 70.36% |
| 3 | Claude Sonnet 4 | 70.29% |
| 4-6 | Chinese models | 65-70% |
| 7 | GPT-5 | 59.22% |
2.4 Key Findings¶
Split Personality:
Top AIs ace one-shot questions but stumble when they must remember context, manage long conversations, or decide when not to act.
Chinese Models Leading:
- GLM-4.5 (FC) tops the leaderboard
- Strong performance on function calling specifically
Part 3: MCPMark — Real-World Stress Testing¶
3.1 What is MCPMark¶
Purpose: Test models on realistic Model Context Protocol (MCP) use
Tasks: 127 high-quality tasks created by domain experts
Scope: Full CRUD operations (not just read-heavy)
3.2 MCP Environments Tested¶
- Notion
- GitHub
- Filesystem
- PostgreSQL
- Playwright
3.3 MCPMark Leaderboard¶
| Model | Pass@1 | Pass@4 | Pass^4 | Cost/Run | Time |
|---|---|---|---|---|---|
| GPT-5 Medium | 52.6% | 68.5% | 33.9% | $127.46 | 478s |
| Claude Opus 4.1 | 29.9% | - | - | $1,165.45 | 362s |
| Claude Sonnet 4 | 28.1% | 44.9% | 12.6% | $252.41 | 218s |
| o3 | 25.4% | 43.3% | 12.6% | $113.94 | 169s |
| Qwen-3-Coder | 24.8% | 40.9% | 12.6% | $36.46 | 274s |
3.4 Metrics Explained¶
| Metric | Meaning |
|---|---|
| Pass@1 | Success rate on first attempt |
| Pass@4 | Success within four attempts |
| Pass^4 | Consistency (all four succeed) |
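A minimal sketch of how these metrics can be computed from raw attempt results; the task and attempt data below is made up for illustration, and the independence-based estimate in the last comment is only an approximation (real attempts on the same task are correlated, which is why measured pass@4 usually falls below it).

```python
# results[task] = list of booleans, one per attempt (4 attempts per task).
results = {
    "task_a": [True, False, True, True],
    "task_b": [False, False, False, True],
    "task_c": [False, False, False, False],
}

n_tasks = len(results)
pass_at_1 = sum(r[0] for r in results.values()) / n_tasks    # first attempt succeeds
pass_at_4 = sum(any(r) for r in results.values()) / n_tasks  # any of 4 succeeds
pass_pow_4 = sum(all(r) for r in results.values()) / n_tasks # all 4 succeed (consistency)

print(pass_at_1, pass_at_4, pass_pow_4)  # 0.33..., 0.66..., 0.0

# Under an (unrealistic) independence assumption: pass@4 ≈ 1 - (1 - pass@1) ** 4
```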
3.5 Complexity Factor¶
Average per task:
- 16.2 execution turns
- 17.4 tool calls
Typical MCPMark Task (see the sketch below):
1. Read current state from a Notion database
2. Process data through multiple transformations
3. Make decisions based on constraints
4. Update records across multiple systems
5. Verify changes meet specifications
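As a rough illustration of why ~16 turns accumulate, here is a minimal sketch of the agent loop such a task implies; `call_llm` and `execute_tool` are hypothetical stand-ins for a model API and an MCP client, not real library functions.

```python
# Minimal agent loop sketch: each turn, the model either emits a tool call
# (whose result is fed back in) or a final answer. Error turns count too,
# which is how realistic tasks reach ~16 turns and ~17 tool calls on average.
def run_task(task_prompt, call_llm, execute_tool, max_turns=30):
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        reply = call_llm(messages)                  # hypothetical model call
        if reply.get("tool_call") is None:
            return reply["content"]                 # model decided it is done
        result = execute_tool(reply["tool_call"])   # hypothetical MCP execution
        messages.append({"role": "assistant", "tool_call": reply["tool_call"]})
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("Task did not finish within the turn budget")
```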
Part 4: Model Deep Dives¶
4.1 GPT-5: Cost-Effective Generalist¶
Pricing: $1.25/M input, $10/M output
Strengths:
- Best MCPMark performance (52.6% pass@1)
- Reasonable cost structure
- Strong multimodal capabilities
Best For: Production applications requiring reliable multi-step workflows where cost matters
4.2 Claude Family: Premium Reasoning¶
Claude Sonnet 4:
- $3/M input, $15/M output
- MCPMark: 28.1% pass@1
- Strong reasoning on complex problems

Claude Opus 4.1:
- $15/M input, $75/M output
- "Best coding model in the world"
- Autonomously played Pokémon Red for 24 hours
Best For: Enterprise applications where code quality and reasoning depth justify premium pricing
4.3 Gemini 2.5: Multimodal Powerhouse¶
Strengths:
- Native tool calling without prompt engineering
- Excellent multimodal understanding
- Strong agentic capabilities
Best For: Applications requiring multimodal reasoning and Google service integration
4.4 Qwen-3: Efficient Alternative¶
Strengths:
- Best cost efficiency ($36.46 per run)
- Fast execution times
- Hermes-style tool use
Best For: Budget-conscious development and rapid prototyping
Part 5: Cost-Performance Tradeoff¶
5.1 Cost per Successful Task¶
| Model | Cost/Run | Pass@1 | Cost/Success |
|---|---|---|---|
| Qwen-3-Coder | $36.46 | 24.8% | ~$147 |
| GPT-5 Medium | $127.46 | 52.6% | ~$242 |
| Claude Sonnet 4 | $252.41 | 28.1% | ~$898 |
5.2 Monthly Cost Projection (10,000 successful tasks)¶
| Model | Cost/Success | Monthly Cost (10K successes) |
|---|---|---|
| Qwen-3-Coder | ~$147 | ~$1.47M |
| GPT-5 Medium | ~$242 | ~$2.42M |
| Claude Sonnet 4 | ~$898 | ~$8.98M |
5.3 Key Insight¶
A model that's 10% more accurate but 14x more expensive might not be the right choice. Calculate total cost including retries for failed attempts.
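A minimal sketch of that calculation, assuming retries are independent attempts at full run cost and using pass@1 as the per-attempt success probability (both are simplifications; the cost and pass@1 figures come from the MCPMark table above).

```python
# Expected cost per successful task when failed runs are retried.
# With per-attempt success probability p, the expected number of attempts
# is 1/p, so expected cost per success ≈ cost_per_run / p.
models = {
    "Qwen-3-Coder":    {"cost_per_run": 36.46,  "pass_at_1": 0.248},
    "GPT-5 Medium":    {"cost_per_run": 127.46, "pass_at_1": 0.526},
    "Claude Sonnet 4": {"cost_per_run": 252.41, "pass_at_1": 0.281},
}

for name, m in models.items():
    cost_per_success = m["cost_per_run"] / m["pass_at_1"]
    print(f"{name}: ~${cost_per_success:,.0f} per successful task")
# Qwen-3-Coder: ~$147, GPT-5 Medium: ~$242, Claude Sonnet 4: ~$898
```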
Part 6: Infrastructure Layer¶
6.1 Production Challenges¶
Benchmarks assume perfect conditions. Production requires:
| Challenge | What It Means |
|---|---|
| Authentication | Managing OAuth flows across services |
| Error handling | Recovering from transient failures |
| Multi-tenancy | Isolating customer data |
| Monitoring | Tracking success rates and costs |
| Schema management | Keeping function definitions current |
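As a sketch of what "error handling" means in practice, here is a generic retry-with-backoff wrapper around a tool call; `execute_tool` and `TransientError` are hypothetical placeholders, not part of any specific MCP SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for rate limits, timeouts, and flaky upstream APIs."""


def call_tool_with_retry(execute_tool, tool_call, max_retries=3, base_delay=1.0):
    """Retry a tool call with exponential backoff and jitter.

    Without this, a single 429 from GitHub or Notion mid-workflow fails the
    whole multi-step task, which benchmarks like MCPMark count as a failure.
    """
    for attempt in range(max_retries + 1):
        try:
            return execute_tool(tool_call)
        except TransientError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```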
6.2 MCP Server Quality Matters¶
Klavis AI Strata MCP Server:
- 2x success rate vs the official GitHub implementation
- 1.6x better vs the official Notion implementation
- Progressive discovery guides agents through available tools
6.3 Why Infrastructure Affects Scores¶
- Function calling depends on reliable tool availability
- Authentication failures break multi-step workflows
- Rate limit handling prevents cascade failures
- Proper error handling improves agent resilience
Part 7: Interview-Relevant Numbers¶
BFCL Statistics¶
| Metric | Value |
|---|---|
| Question-function pairs | 2,000 |
| Top score | 70.85% (GLM-4.5) |
| GPT-5 score | 59.22% |
| Claude Sonnet 4 score | 70.29% |
MCPMark Statistics¶
| Metric | Value |
|---|---|
| Tasks | 127 |
| Top pass@1 | 52.6% (GPT-5 Medium) |
| Avg execution turns | 16.2 |
| Avg tool calls | 17.4 |
Cost Comparison¶
| Model | Input/M | Output/M | Run Cost |
|---|---|---|---|
| Claude Opus 4.1 | $15 | $75 | $1,165 |
| Claude Sonnet 4 | $3 | $15 | $252 |
| GPT-5 | $1.25 | $10 | $127 |
| Qwen-3-Coder | $0.50 | $2 | $36 |
Part 8: Model Selection Guide¶
By Use Case¶
| Need | Recommended Model | Reason |
|---|---|---|
| Best overall performance | GPT-5 Medium | 52.6% MCPMark pass@1 |
| Best cost-efficiency | Qwen-3-Coder | $36.46 per run |
| Best reasoning depth | Claude Opus 4.1 | Premium tier quality |
| Google ecosystem | Gemini 2.5 Pro | Native integration |
Decision Framework¶
- Match benchmark to application type
- Calculate cost-per-successful-task
- Test on multi-turn scenarios matching your workflows
- Consider infrastructure quality
Misconception: a high BFCL score means a good production model
BFCL tests isolated function calls in a controlled environment. MCPMark provides the reality check: the top BFCL model (GLM-4.5, 70.85%) does not even make the MCPMark top tier. GPT-5 scores 59.22% on BFCL but only 52.6% pass@1 on MCPMark, because real tasks require 16.2 execution turns and 17.4 tool calls on average. Always test on tasks that approximate your production workload.
Misconception: the most expensive model gives the best result for the money
Claude Opus 4.1 costs $1,165 per MCPMark run and achieves 29.9% pass@1, so its cost per success is ~$3,900. GPT-5 Medium costs $127 per run at 52.6% pass@1, for ~$242 per success. Qwen-3-Coder: $36 per run at 24.8% pass@1, ~$147 per success. The cheapest model per success is neither the most expensive nor the most accurate one. Always calculate cost per successful task, including retries.
Misconception: pass@1 is the only metric that matters for agents
Pass@1 shows whether the task succeeded on the first try, but in production retries are the norm. GPT-5 Medium's Pass@4 is 68.5% (vs 52.6% pass@1), a relative gain of roughly 30%. But its Pass^4 (consistency, all four attempts succeed) is only 33.9%, so the model is unstable. For production-critical tasks you need consistency (Pass^4); for tasks where retries are acceptable, look at Pass@4.
Interview Questions¶
Q: How does BFCL differ from MCPMark, and when should each be used?
Red flag: "They are the same kind of function-calling benchmark"
Strong answer: "BFCL (Berkeley) tests 2,000 isolated question-function pairs: simple, parallel, multi-turn, and relevance detection, in a controlled environment across Python/Java/JS/REST. MCPMark tests 127 real CRUD tasks through MCP servers (Notion, GitHub, PostgreSQL, Playwright), averaging 16.2 turns and 17.4 tool calls per task. Use BFCL to compare baseline capabilities and MCPMark to assess an agent's production readiness."
Q: How should models be compared on cost-efficiency for tool use?
Red flag: "Pick the cheapest model by price per token"
Strong answer: "Calculate cost per successful task, not cost per run. Qwen-3-Coder: $36/run at 24.8% pass@1, ~$147/success. GPT-5 Medium: $127/run at 52.6%, ~$242/success. Claude Sonnet 4: $252/run at 28.1%, ~$898/success. The cheapest per-token model is not always the most economical. Also weigh Pass@4 against pass@1: if retries are acceptable, GPT-5 at 68.5% Pass@4 becomes even more attractive."
Q: Why does MCP server infrastructure affect benchmark results?
Red flag: "Infrastructure doesn't matter; the result depends only on the model"
Strong answer: "The Klavis AI Strata MCP Server shows a 2x success rate vs the official GitHub implementation and 1.6x vs Notion. Function calling depends on reliable tool availability: authentication failures break multi-step workflows, rate-limit handling prevents cascade failures, and progressive discovery helps the agent navigate the available tools. The same model will score differently on different MCP servers."
Sources¶
- Klavis AI -- "Function Calling and Agentic AI in 2025"
- Berkeley Gorilla -- Function Calling Leaderboard
- MCPMark -- Official Benchmark
- Berkeley -- BFCL v2 AST Evaluation Paper
See Also¶
- LLM Benchmarks Guide -- situates function calling benchmarks among 30+ other LLM benchmarks
- AI Agent Workflows -- agent architectures for which function calling is a key capability
- LLM API Pricing -- cost per successful task depends directly on model pricing
- Cascading LLM Routing -- routing strategies that optimize cost/quality for tool-use workloads
- ReAct and Reasoning Techniques -- ReAct prompting as the foundation of function calling in agent loops