
LLM Function Calling Benchmarks

~5 minute read

Prerequisites: Function Calling and Tool Use, AI Agent Workflows

A model scoring 92% on MMLU can fail completely on a chain of three API calls. BFCL tests 2,000 question-function pairs: the best score is 70.85% (GLM-4.5), and GPT-5 manages only 59.22%. MCPMark goes further, running real CRUD operations through Notion, GitHub, and PostgreSQL: the best pass@1 is just 52.6% (GPT-5 Medium), while Claude Opus 4.1 scores 29.9% at a cost of $1,165 per run. This is the core gap between a "smart model" and a "working agent".


Part 1: Overview

Why Function Calling Benchmarks Matter

Key Insight:

A model that scores 90% on a math test might completely fail when asked to chain three API calls, manage context across a 10-turn conversation, or know when not to use a tool.

Traditional Benchmarks Don't Help:

  • MMLU and HumanEval don't test tool use
  • Production requires multi-step reasoning
  • Context must be managed across conversations


Part 2: Berkeley Function Calling Leaderboard (BFCL)

2.1 What Makes BFCL Different

Developed by: UC Berkeley researchers
Scale: 2,000 question-function-answer pairs
Languages: Python, Java, JavaScript, REST API

2.2 Test Categories

| Category | What It Measures | Why It Matters |
|---|---|---|
| Simple Function Calling | Single function invocation | Baseline competency |
| Parallel Function Calling | Multiple simultaneous calls | Batching operations efficiently |
| Multiple Function Selection | Choosing the correct tool from many | Decisions under choice overload |
| Relevance Detection | Knowing when NOT to call | Preventing hallucinated actions |
| Multi-turn Interactions | Sustained conversations with context | Memory and long-horizon planning |
| Multi-step Reasoning | Sequential calls, outputs feed inputs | Complex workflow orchestration |
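
To make these categories concrete, here is a rough sketch of what one question-function pair could look like, covering the simple-call and relevance-detection cases. The structure and field names are illustrative only, not BFCL's actual data format.

```python
# Illustrative sketch of a BFCL-style question-function pair.
# Field names and structure are hypothetical, not the actual BFCL schema.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Simple function calling: the model should emit exactly this call.
simple_case = {
    "question": "What's the weather in Berlin in celsius?",
    "functions": [weather_tool],
    "expected_call": {"name": "get_weather",
                      "arguments": {"city": "Berlin", "unit": "celsius"}},
}

# Relevance detection: no listed function applies, so the model
# should answer in plain text and make no tool call at all.
relevance_case = {
    "question": "Who wrote 'War and Peace'?",
    "functions": [weather_tool],
    "expected_call": None,
}
```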

2.3 BFCL Leaderboard (October 2025)

| Rank | Model | Score |
|---|---|---|
| 1 | GLM-4.5 (FC) | 70.85% |
| 2 | Claude Opus 4.1 | 70.36% |
| 3 | Claude Sonnet 4 | 70.29% |
| 4-6 | Chinese models | 65-70% |
| 7 | GPT-5 | 59.22% |

2.4 Key Findings

Split Personality:

Top AIs ace one-shot questions but stumble when they must remember context, manage long conversations, or decide when not to act.

Chinese Models Leading:

  • GLM-4.5 (FC) tops the leaderboard
  • Strong performance on function calling specifically


Part 3: MCPMark — Real-World Stress Testing

3.1 What is MCPMark

Purpose: Test models on realistic Model Context Protocol (MCP) use
Tasks: 127 high-quality tasks created by domain experts
Scope: Full CRUD operations (not just read-heavy)

3.2 MCP Environments Tested

  • Notion
  • GitHub
  • Filesystem
  • PostgreSQL
  • Playwright

3.3 MCPMark Leaderboard

| Model | Pass@1 | Pass@4 | Pass^4 | Cost/Run | Time |
|---|---|---|---|---|---|
| GPT-5 Medium | 52.6% | 68.5% | 33.9% | $127.46 | 478s |
| Claude Opus 4.1 | 29.9% | - | - | $1,165.45 | 362s |
| Claude Sonnet 4 | 28.1% | 44.9% | 12.6% | $252.41 | 218s |
| o3 | 25.4% | 43.3% | 12.6% | $113.94 | 169s |
| Qwen-3-Coder | 24.8% | 40.9% | 12.6% | $36.46 | 274s |

3.4 Metrics Explained

| Metric | Meaning |
|---|---|
| Pass@1 | Success rate on the first attempt |
| Pass@4 | Success within four attempts |
| Pass^4 | Consistency (all four attempts succeed) |
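
A minimal sketch of how these metrics can be computed from raw attempt results, following the plain definitions in the table above (the benchmark's own estimator may differ in detail):

```python
# Compute Pass@1, Pass@4 and Pass^4 from per-task attempt results.
# results[task] is a list of 4 booleans, one per independent attempt.
def pass_metrics(results: list[list[bool]]) -> dict[str, float]:
    n = len(results)
    pass_at_1 = sum(attempts[0] for attempts in results) / n    # first attempt succeeded
    pass_at_4 = sum(any(attempts) for attempts in results) / n  # at least one of four
    pass_pow_4 = sum(all(attempts) for attempts in results) / n # all four succeeded
    return {"pass@1": pass_at_1, "pass@4": pass_at_4, "pass^4": pass_pow_4}

# Toy example: 4 tasks x 4 attempts.
print(pass_metrics([
    [True, True, True, True],     # consistently solved
    [True, False, True, False],   # flaky
    [False, False, True, False],  # solved only on a retry
    [False, False, False, False], # never solved
]))
# -> {'pass@1': 0.5, 'pass@4': 0.75, 'pass^4': 0.25}
```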

3.5 Complexity Factor

Average per task:

  • 16.2 execution turns
  • 17.4 tool calls

Typical MCPMark Task:

  1. Read current state from a Notion database
  2. Process data through multiple transformations
  3. Make decisions based on constraints
  4. Update records across multiple systems
  5. Verify changes meet specifications
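
In practice a task like this runs as an agent loop rather than a single call. The sketch below is a generic illustration with hypothetical `model` and `mcp_client` objects, not MCPMark's actual harness:

```python
# Generic agent loop for a multi-step, tool-using task.
# `model` and `mcp_client` are hypothetical stand-ins for an LLM API
# and an MCP client; this is not MCPMark's actual harness.
def run_task(model, mcp_client, task_prompt: str, max_turns: int = 30):
    messages = [{"role": "user", "content": task_prompt}]
    tools = mcp_client.list_tools()           # e.g. Notion / GitHub / Postgres tools
    turns = tool_calls = 0

    for _ in range(max_turns):
        turns += 1
        reply = model.chat(messages, tools=tools)
        messages.append({"role": "assistant", "content": reply.content,
                         "tool_calls": reply.tool_calls})
        if not reply.tool_calls:              # the model decided the task is done
            return reply.content, turns, tool_calls
        for call in reply.tool_calls:         # execute every requested tool call
            tool_calls += 1
            result = mcp_client.call_tool(call.name, call.arguments)
            messages.append({"role": "tool", "name": call.name,
                             "content": str(result)})
    raise TimeoutError("task did not finish within max_turns")
```

With roughly 16 turns and 17 tool calls per task on average, one wrong call early in the loop can derail everything downstream, which is why multi-step scores sit far below single-call scores.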


Part 4: Model Deep Dives

4.1 GPT-5: Cost-Effective Generalist

Pricing: $1.25/M input, $10/M output

Strengths:

  • Best MCPMark performance (52.6% pass@1)
  • Reasonable cost structure
  • Strong multimodal capabilities

Best For: Production applications requiring reliable multi-step workflows where cost matters

4.2 Claude Family: Premium Reasoning

Claude Sonnet 4:

  • $3/M input, $15/M output
  • MCPMark: 28.1% pass@1
  • Strong reasoning on complex problems

Claude Opus 4.1:

  • $15/M input, $75/M output
  • Marketed as the "best coding model in the world"
  • Autonomously played Pokémon Red for 24 hours

Best For: Enterprise applications where code quality and reasoning depth justify premium pricing

4.3 Gemini 2.5: Multimodal Powerhouse

Strengths:

  • Native tool calling without prompt engineering
  • Excellent multimodal understanding
  • Strong agentic capabilities

Best For: Applications requiring multimodal reasoning and Google service integration

4.4 Qwen-3: Efficient Alternative

Strengths:

  • Best cost efficiency ($36.46 per run)
  • Fast execution times
  • Hermes-style tool use

Best For: Budget-conscious development and rapid prototyping


Part 5: Cost-Performance Tradeoff

5.1 Cost per Successful Task

| Model | Cost/Run | Pass@1 | Cost/Success |
|---|---|---|---|
| Qwen-3-Coder | $36.46 | 24.8% | ~$147 |
| GPT-5 Medium | $127.46 | 52.6% | ~$242 |
| Claude Sonnet 4 | $252.41 | 28.1% | ~$898 |

5.2 Monthly Cost Projection (10,000 successful tasks)

| Model | Cost/Success | Monthly Cost (10K successes) |
|---|---|---|
| Qwen-3-Coder | ~$147 | ~$1.47M |
| GPT-5 Medium | ~$242 | ~$2.42M |
| Claude Sonnet 4 | ~$898 | ~$8.98M |

5.3 Key Insight

A model that's 10% more accurate but 14x more expensive might not be the right choice. Calculate total cost including retries for failed attempts.
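
A quick way to run this comparison yourself. The sketch assumes retries are independent attempts, so the expected number of runs per success is roughly 1 / pass@1; the figures are the ones from the tables above.

```python
# Expected cost of one successful task, assuming independent retries:
# on average 1 / pass@1 runs are needed per success.
def cost_per_success(cost_per_run: float, pass_at_1: float) -> float:
    return cost_per_run / pass_at_1

models = {
    "Qwen-3-Coder":    (36.46, 0.248),
    "GPT-5 Medium":    (127.46, 0.526),
    "Claude Sonnet 4": (252.41, 0.281),
}

for name, (run_cost, p1) in models.items():
    per_success = cost_per_success(run_cost, p1)
    monthly = per_success * 10_000            # 10,000 successful tasks per month
    print(f"{name}: ~${per_success:,.0f}/success, ~${monthly / 1e6:.2f}M/month")
# Qwen-3-Coder: ~$147/success, ~$1.47M/month
# GPT-5 Medium: ~$242/success, ~$2.42M/month
# Claude Sonnet 4: ~$898/success, ~$8.98M/month
```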


Part 6: Infrastructure Layer

6.1 Production Challenges

Benchmarks assume perfect conditions. Production requires:

| Challenge | What It Means |
|---|---|
| Authentication | Managing OAuth flows across services |
| Error handling | Recovering from transient failures |
| Multi-tenancy | Isolating customer data |
| Monitoring | Tracking success rates and costs |
| Schema management | Keeping function definitions current |

6.2 MCP Server Quality Matters

Klavis AI Strata MCP Server:

  • 2x success rate vs the official GitHub implementation
  • 1.6x better vs the official Notion implementation
  • Progressive discovery guides agents through available tools
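
The "progressive discovery" idea can be sketched generically: instead of exposing dozens of endpoints at once, the server first exposes coarse categories and only then the concrete tools for the category the agent picks, which reduces choice overload. This is a generic illustration of the pattern, not Klavis Strata's actual API.

```python
# Generic illustration of progressive tool discovery: the agent first
# asks for broad categories, then drills into one category's tools.
# The catalog and function names are hypothetical.
TOOL_CATALOG = {
    "issues":        ["create_issue", "list_issues", "comment_on_issue"],
    "pull_requests": ["open_pr", "review_pr", "merge_pr"],
    "repository":    ["get_file", "search_code", "list_branches"],
}

def list_tool_categories() -> list[str]:
    """Step 1: expose only coarse categories to keep the tool list small."""
    return list(TOOL_CATALOG)

def list_tools(category: str) -> list[str]:
    """Step 2: expose concrete tools only for the category the agent picked."""
    return TOOL_CATALOG[category]

# The agent sees 3 categories instead of 9 tools, picks one, then sees 3 tools.
print(list_tool_categories())
print(list_tools("pull_requests"))
```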

6.3 Why Infrastructure Affects Scores

  • Function calling depends on reliable tool availability
  • Authentication failures break multi-step workflows
  • Rate limit handling prevents cascade failures
  • Proper error handling improves agent resilience
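
As a minimal sketch of the error-handling and rate-limit points above, a tool call can be wrapped with exponential backoff so a transient failure does not sink a 17-call workflow (the `call_tool` argument and error classes are hypothetical stand-ins):

```python
import random
import time

class TransientError(Exception): ...   # hypothetical: e.g. a network blip
class RateLimitError(Exception): ...   # hypothetical: e.g. HTTP 429 from a tool

def call_with_retry(call_tool, name, args, max_attempts=5, base_delay=1.0):
    """Retry a tool call with exponential backoff plus jitter, so transient
    failures and rate limits do not break a multi-step workflow."""
    for attempt in range(max_attempts):
        try:
            return call_tool(name, args)
        except (TransientError, RateLimitError):
            if attempt == max_attempts - 1:
                raise                              # retries exhausted, surface the error
            delay = base_delay * 2 ** attempt      # 1s, 2s, 4s, 8s, ...
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids thundering herd
```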

Part 7: Interview-Relevant Numbers

BFCL Statistics

| Metric | Value |
|---|---|
| Question-function pairs | 2,000 |
| Top score | 70.85% (GLM-4.5) |
| GPT-5 score | 59.22% |
| Claude Sonnet 4 score | 70.29% |

MCPMark Statistics

| Metric | Value |
|---|---|
| Tasks | 127 |
| Top pass@1 | 52.6% (GPT-5) |
| Avg execution turns | 16.2 |
| Avg tool calls | 17.4 |

Cost Comparison

| Model | Input/M | Output/M | Run Cost |
|---|---|---|---|
| Claude Opus 4.1 | $15 | $75 | $1,165 |
| Claude Sonnet 4 | $3 | $15 | $252 |
| GPT-5 | $1.25 | $10 | $127 |
| Qwen-3-Coder | $0.50 | $2 | $36 |

Part 8: Model Selection Guide

By Use Case

| Need | Recommended Model | Reason |
|---|---|---|
| Best overall performance | GPT-5 Medium | 52.6% MCPMark pass@1 |
| Best cost-efficiency | Qwen-3-Coder | $36.46 per run |
| Best reasoning depth | Claude Opus 4.1 | Premium-tier quality |
| Google ecosystem | Gemini 2.5 Pro | Native integration |

Decision Framework

  1. Match benchmark to application type
  2. Calculate cost-per-successful-task
  3. Test on multi-turn scenarios matching your workflows
  4. Consider infrastructure quality

Misconception: a high BFCL score = a good production model

BFCL tests isolated function calls in a controlled environment. MCPMark provides the reality check: the best model on BFCL (GLM-4.5, 70.85%) does not even make the MCPMark top ranks. GPT-5 scores 59.22% on BFCL but only 52.6% pass@1 on MCPMark, because real tasks require 16.2 execution turns and 17.4 tool calls on average. Always test on tasks that approximate your production workload.

Misconception: an expensive model = the best result for the money

Claude Opus 4.1 costs $1,165 per MCPMark run and delivers 29.9% pass@1, a cost per success of about $3,900. GPT-5 Medium costs $127 per run at 52.6% pass@1, about $242 per success. Qwen-3-Coder costs $36 per run at 24.8% pass@1, about $147 per success. The cheapest model per success is neither the most expensive nor the most accurate one. Always calculate cost per successful task, including retries.

Misconception: pass@1 is the only metric that matters for agents

Pass@1 shows whether a task worked on the first try, but in production retries are the norm. GPT-5 Medium's Pass@4 is 68.5% (vs 52.6% pass@1), a relative gain of about 30%. Its Pass^4 (consistency, all four attempts succeed) is only 33.9%, so the model is unstable. For production-critical tasks you need consistency (Pass^4); for tasks where retries are acceptable, Pass@4 is the relevant metric.


Interview Questions

Q: How does BFCL differ from MCPMark, and when should each be used?

❌ Red flag: "Это одинаковые бенчмарки для function calling"

✅ Strong answer: "BFCL (Berkeley) тестирует 2000 изолированных пар вопрос-функция: simple, parallel, multi-turn, relevance detection. Контролируемая среда, Python/Java/JS/REST. MCPMark тестирует 127 реальных CRUD-задач через MCP-серверы (Notion, GitHub, PostgreSQL, Playwright) -- в среднем 16.2 turns и 17.4 tool calls на задачу. BFCL для сравнения базовых capabilities, MCPMark для оценки production readiness агента."

Q: How should models be compared on cost-efficiency for tool use?

❌ Red flag: "Берём самую дешёвую модель по цене за токен"

✅ Strong answer: "Нужно считать cost per successful task, а не cost per run. Qwen-3-Coder: \(36/run при 24.8% pass@1 = ~\)147/success. GPT-5 Medium: \(127/run при 52.6% = ~\)242/success. Claude Sonnet 4: \(252/run при 28.1% = ~\)898/success. Самая дешёвая per-token модель не всегда самая выгодная. Плюс нужно учитывать pass@4 vs pass@1 -- если retry допустим, GPT-5 при 68.5% pass@4 становится ещё выгоднее."

Q: Why does MCP server infrastructure affect benchmark results?

❌ Red flag: "Инфраструктура не влияет, результат зависит только от модели"

✅ Strong answer: "Klavis AI Strata MCP Server показывает 2x success rate vs официальная реализация GitHub и 1.6x vs Notion. Function calling зависит от reliable tool availability: authentication failures ломают multi-step workflows, rate limit handling предотвращает cascade failures, progressive discovery помогает агенту ориентироваться в доступных инструментах. Одна и та же модель покажет разные результаты на разных MCP-серверах."


Sources

  1. Klavis AI -- "Function Calling and Agentic AI in 2025"
  2. Berkeley Gorilla -- Function Calling Leaderboard
  3. MCPMark -- Official Benchmark
  4. Berkeley -- BFCL v2 AST Evaluation Paper

See Also