
LLM Function Calling Benchmarks

~5 minute read

Prerequisites: Function Calling and Tool Use, AI Agent Workflows

A model scoring 92% on MMLU can fail completely on a chain of three API calls. BFCL tests 2,000 question-function pairs: the best score is 70.85% (GLM-4.5), and GPT-5 manages only 59.22%. MCPMark goes further, running real CRUD operations through Notion, GitHub, and PostgreSQL: the best pass@1 is just 52.6% (GPT-5 Medium), while Claude Opus 4.1 scores 29.9% at a cost of $1,165 per run. This is the core gap between a "smart model" and a "working agent".


Part 1: Overview

Why Function Calling Benchmarks Matter

Key Insight:

A model that scores 90% on a math test might completely fail when asked to chain three API calls, manage context across a 10-turn conversation, or know when not to use a tool.

Traditional Benchmarks Don't Help:

  • MMLU and HumanEval don't test tool use
  • Production requires multi-step reasoning
  • Context must be managed across conversations


Part 2: Berkeley Function Calling Leaderboard (BFCL)

2.1 What Makes BFCL Different

Developed by: UC Berkeley researchers
Scale: 2,000 question-function-answer pairs
Languages: Python, Java, JavaScript, REST API

2.2 Test Categories

| Category | What It Measures | Why It Matters |
|---|---|---|
| Simple Function Calling | Single function invocation | Baseline competency |
| Parallel Function Calling | Multiple simultaneous calls | Batching operations efficiently |
| Multiple Function Selection | Choosing the correct tool from many | Decisions under choice overload |
| Relevance Detection | Knowing when NOT to call | Preventing hallucinated actions |
| Multi-turn Interactions | Sustained conversations with context | Memory and long-horizon planning |
| Multi-step Reasoning | Sequential calls, outputs feed inputs | Complex workflow orchestration |
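
To make these categories concrete, here is a rough sketch of what one question-function pair could look like, covering the simple-call and relevance-detection cases. The structure and field names are illustrative only, not BFCL's actual data format.

```python
# Illustrative sketch of a BFCL-style question-function pair.
# Field names and structure are hypothetical, not the actual BFCL schema.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Simple function calling: the model should emit exactly this call.
simple_case = {
    "question": "What's the weather in Berlin in celsius?",
    "functions": [weather_tool],
    "expected_call": {"name": "get_weather",
                      "arguments": {"city": "Berlin", "unit": "celsius"}},
}

# Relevance detection: no listed function applies, so the model
# should answer in plain text and make no tool call at all.
relevance_case = {
    "question": "Who wrote 'War and Peace'?",
    "functions": [weather_tool],
    "expected_call": None,
}
```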

2.3 BFCL Leaderboard (October 2025)

| Rank | Model | Score |
|---|---|---|
| 1 | GLM-4.5 (FC) | 70.85% |
| 2 | Claude Opus 4.1 | 70.36% |
| 3 | Claude Sonnet 4 | 70.29% |
| 4-6 | Chinese models | 65-70% |
| 7 | GPT-5 | 59.22% |

2.4 Key Findings

Split Personality:

Top AIs ace one-shot questions but stumble when they must remember context, manage long conversations, or decide when not to act.

Chinese Models Leading:

  • GLM-4.5 (FC) tops the leaderboard
  • Strong performance on function calling specifically


Part 3: MCPMark — Real-World Stress Testing

3.1 What is MCPMark

Purpose: Test models on realistic Model Context Protocol (MCP) use
Tasks: 127 high-quality tasks created by domain experts
Scope: Full CRUD operations (not just read-heavy)

3.2 MCP Environments Tested

  • Notion
  • GitHub
  • Filesystem
  • PostgreSQL
  • Playwright

3.3 MCPMark Leaderboard

| Model | Pass@1 | Pass@4 | Pass^4 | Cost/Run | Time |
|---|---|---|---|---|---|
| GPT-5 Medium | 52.6% | 68.5% | 33.9% | $127.46 | 478s |
| Claude Opus 4.1 | 29.9% | - | - | $1,165.45 | 362s |
| Claude Sonnet 4 | 28.1% | 44.9% | 12.6% | $252.41 | 218s |
| o3 | 25.4% | 43.3% | 12.6% | $113.94 | 169s |
| Qwen-3-Coder | 24.8% | 40.9% | 12.6% | $36.46 | 274s |

3.4 Metrics Explained

| Metric | Meaning |
|---|---|
| Pass@1 | Success rate on the first attempt |
| Pass@4 | Success within four attempts |
| Pass^4 | Consistency (all four attempts succeed) |
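
A minimal sketch of how these metrics can be computed from raw attempt results, following the plain definitions in the table above (the benchmark's own estimator may differ in detail):

```python
# Compute Pass@1, Pass@4 and Pass^4 from per-task attempt results.
# results[task] is a list of 4 booleans, one per independent attempt.
def pass_metrics(results: list[list[bool]]) -> dict[str, float]:
    n = len(results)
    pass_at_1 = sum(attempts[0] for attempts in results) / n    # first attempt succeeded
    pass_at_4 = sum(any(attempts) for attempts in results) / n  # at least one of four
    pass_pow_4 = sum(all(attempts) for attempts in results) / n # all four succeeded
    return {"pass@1": pass_at_1, "pass@4": pass_at_4, "pass^4": pass_pow_4}

# Toy example: 4 tasks x 4 attempts.
print(pass_metrics([
    [True, True, True, True],     # consistently solved
    [True, False, True, False],   # flaky
    [False, False, True, False],  # solved only on a retry
    [False, False, False, False], # never solved
]))
# -> {'pass@1': 0.5, 'pass@4': 0.75, 'pass^4': 0.25}
```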

3.5 Complexity Factor

Average per task:

  • 16.2 execution turns
  • 17.4 tool calls

Typical MCPMark Task:

  1. Read current state from a Notion database
  2. Process data through multiple transformations
  3. Make decisions based on constraints
  4. Update records across multiple systems
  5. Verify changes meet specifications
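
In practice a task like this runs as an agent loop rather than a single call. The sketch below is a generic illustration with hypothetical `model` and `mcp_client` objects, not MCPMark's actual harness:

```python
# Generic agent loop for a multi-step, tool-using task.
# `model` and `mcp_client` are hypothetical stand-ins for an LLM API
# and an MCP client; this is not MCPMark's actual harness.
def run_task(model, mcp_client, task_prompt: str, max_turns: int = 30):
    messages = [{"role": "user", "content": task_prompt}]
    tools = mcp_client.list_tools()           # e.g. Notion / GitHub / Postgres tools
    turns = tool_calls = 0

    for _ in range(max_turns):
        turns += 1
        reply = model.chat(messages, tools=tools)
        messages.append({"role": "assistant", "content": reply.content,
                         "tool_calls": reply.tool_calls})
        if not reply.tool_calls:              # the model decided the task is done
            return reply.content, turns, tool_calls
        for call in reply.tool_calls:         # execute every requested tool call
            tool_calls += 1
            result = mcp_client.call_tool(call.name, call.arguments)
            messages.append({"role": "tool", "name": call.name,
                             "content": str(result)})
    raise TimeoutError("task did not finish within max_turns")
```

With roughly 16 turns and 17 tool calls per task on average, one wrong call early in the loop can derail everything downstream, which is why multi-step scores sit far below single-call scores.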


Part 4: Model Deep Dives

4.1 GPT-5: Cost-Effective Generalist

Pricing: $1.25/M input, $10/M output

Strengths:

  • Best MCPMark performance (52.6% pass@1)
  • Reasonable cost structure
  • Strong multimodal capabilities

Best For: Production applications requiring reliable multi-step workflows where cost matters

4.2 Claude Family: Premium Reasoning

Claude Sonnet 4:

  • $3/M input, $15/M output
  • MCPMark: 28.1% pass@1
  • Strong reasoning on complex problems

Claude Opus 4.1:

  • $15/M input, $75/M output
  • Marketed as the "best coding model in the world"
  • Autonomously played Pokémon Red for 24 hours

Best For: Enterprise applications where code quality and reasoning depth justify premium pricing

4.3 Gemini 2.5: Multimodal Powerhouse

Strengths:

  • Native tool calling without prompt engineering
  • Excellent multimodal understanding
  • Strong agentic capabilities

Best For: Applications requiring multimodal reasoning and Google service integration

4.4 Qwen-3: Efficient Alternative

Strengths:

  • Best cost efficiency ($36.46 per run)
  • Fast execution times
  • Hermes-style tool use

Best For: Budget-conscious development and rapid prototyping


Part 5: Cost-Performance Tradeoff

5.1 Cost per Successful Task

| Model | Cost/Run | Pass@1 | Cost/Success |
|---|---|---|---|
| Qwen-3-Coder | $36.46 | 24.8% | ~$147 |
| GPT-5 Medium | $127.46 | 52.6% | ~$242 |
| Claude Sonnet 4 | $252.41 | 28.1% | ~$898 |

5.2 Monthly Cost Projection (10,000 successful tasks)

| Model | Cost/Success | Monthly Cost (10K successes) |
|---|---|---|
| Qwen-3-Coder | ~$147 | ~$1.47M |
| GPT-5 Medium | ~$242 | ~$2.42M |
| Claude Sonnet 4 | ~$898 | ~$8.98M |

5.3 Key Insight

A model that's 10% more accurate but 14x more expensive might not be the right choice. Calculate total cost including retries for failed attempts.
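
A quick way to run this comparison yourself. The sketch assumes retries are independent attempts, so the expected number of runs per success is roughly 1 / pass@1; the figures are the ones from the tables above.

```python
# Expected cost of one successful task, assuming independent retries:
# on average 1 / pass@1 runs are needed per success.
def cost_per_success(cost_per_run: float, pass_at_1: float) -> float:
    return cost_per_run / pass_at_1

models = {
    "Qwen-3-Coder":    (36.46, 0.248),
    "GPT-5 Medium":    (127.46, 0.526),
    "Claude Sonnet 4": (252.41, 0.281),
}

for name, (run_cost, p1) in models.items():
    per_success = cost_per_success(run_cost, p1)
    monthly = per_success * 10_000            # 10,000 successful tasks per month
    print(f"{name}: ~${per_success:,.0f}/success, ~${monthly / 1e6:.2f}M/month")
# Qwen-3-Coder: ~$147/success, ~$1.47M/month
# GPT-5 Medium: ~$242/success, ~$2.42M/month
# Claude Sonnet 4: ~$898/success, ~$8.98M/month
```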


Part 6: Infrastructure Layer

6.1 Production Challenges

Benchmarks assume perfect conditions. Production requires:

| Challenge | What It Means |
|---|---|
| Authentication | Managing OAuth flows across services |
| Error handling | Recovering from transient failures |
| Multi-tenancy | Isolating customer data |
| Monitoring | Tracking success rates and costs |
| Schema management | Keeping function definitions current |

6.2 MCP Server Quality Matters

Klavis AI Strata MCP Server:

  • 2x success rate vs the official GitHub implementation
  • 1.6x better vs the official Notion implementation
  • Progressive discovery guides agents through available tools
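
The "progressive discovery" idea can be sketched generically: instead of exposing dozens of endpoints at once, the server first exposes coarse categories and only then the concrete tools for the category the agent picks, which reduces choice overload. This is a generic illustration of the pattern, not Klavis Strata's actual API.

```python
# Generic illustration of progressive tool discovery: the agent first
# asks for broad categories, then drills into one category's tools.
# The catalog and function names are hypothetical.
TOOL_CATALOG = {
    "issues":        ["create_issue", "list_issues", "comment_on_issue"],
    "pull_requests": ["open_pr", "review_pr", "merge_pr"],
    "repository":    ["get_file", "search_code", "list_branches"],
}

def list_tool_categories() -> list[str]:
    """Step 1: expose only coarse categories to keep the tool list small."""
    return list(TOOL_CATALOG)

def list_tools(category: str) -> list[str]:
    """Step 2: expose concrete tools only for the category the agent picked."""
    return TOOL_CATALOG[category]

# The agent sees 3 categories instead of 9 tools, picks one, then sees 3 tools.
print(list_tool_categories())
print(list_tools("pull_requests"))
```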

6.3 Why Infrastructure Affects Scores

  • Function calling depends on reliable tool availability
  • Authentication failures break multi-step workflows
  • Rate limit handling prevents cascade failures
  • Proper error handling improves agent resilience
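
As a minimal sketch of the error-handling and rate-limit points above, a tool call can be wrapped with exponential backoff so a transient failure does not sink a 17-call workflow (the `call_tool` argument and error classes are hypothetical stand-ins):

```python
import random
import time

class TransientError(Exception): ...   # hypothetical: e.g. a network blip
class RateLimitError(Exception): ...   # hypothetical: e.g. HTTP 429 from a tool

def call_with_retry(call_tool, name, args, max_attempts=5, base_delay=1.0):
    """Retry a tool call with exponential backoff plus jitter, so transient
    failures and rate limits do not break a multi-step workflow."""
    for attempt in range(max_attempts):
        try:
            return call_tool(name, args)
        except (TransientError, RateLimitError):
            if attempt == max_attempts - 1:
                raise                              # retries exhausted, surface the error
            delay = base_delay * 2 ** attempt      # 1s, 2s, 4s, 8s, ...
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids thundering herd
```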

Part 7: Interview-Relevant Numbers

BFCL Statistics

| Metric | Value |
|---|---|
| Question-function pairs | 2,000 |
| Top score | 70.85% (GLM-4.5) |
| GPT-5 score | 59.22% |
| Claude Sonnet 4 score | 70.29% |

MCPMark Statistics

| Metric | Value |
|---|---|
| Tasks | 127 |
| Top pass@1 | 52.6% (GPT-5) |
| Avg execution turns | 16.2 |
| Avg tool calls | 17.4 |

Cost Comparison

| Model | Input/M | Output/M | Run Cost |
|---|---|---|---|
| Claude Opus 4.1 | $15 | $75 | $1,165 |
| Claude Sonnet 4 | $3 | $15 | $252 |
| GPT-5 | $1.25 | $10 | $127 |
| Qwen-3-Coder | $0.50 | $2 | $36 |

Part 8: Model Selection Guide

By Use Case

| Need | Recommended Model | Reason |
|---|---|---|
| Best overall performance | GPT-5 Medium | 52.6% MCPMark pass@1 |
| Best cost-efficiency | Qwen-3-Coder | $36.46 per run |
| Best reasoning depth | Claude Opus 4.1 | Premium-tier quality |
| Google ecosystem | Gemini 2.5 Pro | Native integration |

Decision Framework

  1. Match benchmark to application type
  2. Calculate cost-per-successful-task
  3. Test on multi-turn scenarios matching your workflows
  4. Consider infrastructure quality

Misconception: a high BFCL score = a good production model

BFCL tests isolated function calls in a controlled environment. MCPMark provides the reality check: the best model on BFCL (GLM-4.5, 70.85%) does not even make the MCPMark top ranks. GPT-5 scores 59.22% on BFCL but only 52.6% pass@1 on MCPMark, because real tasks require 16.2 execution turns and 17.4 tool calls on average. Always test on tasks that approximate your production workload.

Misconception: an expensive model = the best result for the money

Claude Opus 4.1 costs $1,165 per MCPMark run and delivers 29.9% pass@1, a cost per success of about $3,900. GPT-5 Medium costs $127 per run at 52.6% pass@1, about $242 per success. Qwen-3-Coder costs $36 per run at 24.8% pass@1, about $147 per success. The cheapest model per success is neither the most expensive nor the most accurate one. Always calculate cost per successful task, including retries.

Misconception: pass@1 is the only metric that matters for agents

Pass@1 shows whether a task worked on the first try, but in production retries are the norm. GPT-5 Medium's Pass@4 is 68.5% (vs 52.6% pass@1), a relative gain of about 30%. Its Pass^4 (consistency, all four attempts succeed) is only 33.9%, so the model is unstable. For production-critical tasks you need consistency (Pass^4); for tasks where retries are acceptable, Pass@4 is the relevant metric.


Interview Questions

Q: How does BFCL differ from MCPMark, and when should each be used?

❌ Red flag: "Это одинаковые бенчмарки для function calling"

✅ Strong answer: "BFCL (Berkeley) тестирует 2000 изолированных пар вопрос-функция: simple, parallel, multi-turn, relevance detection. Контролируемая среда, Python/Java/JS/REST. MCPMark тестирует 127 реальных CRUD-задач через MCP-серверы (Notion, GitHub, PostgreSQL, Playwright) -- в среднем 16.2 turns и 17.4 tool calls на задачу. BFCL для сравнения базовых capabilities, MCPMark для оценки production readiness агента."

Q: How should models be compared on cost-efficiency for tool use?

❌ Red flag: "Берём самую дешёвую модель по цене за токен"

✅ Strong answer: "Нужно считать cost per successful task, а не cost per run. Qwen-3-Coder: \(36/run при 24.8% pass@1 = ~\)147/success. GPT-5 Medium: \(127/run при 52.6% = ~\)242/success. Claude Sonnet 4: \(252/run при 28.1% = ~\)898/success. Самая дешёвая per-token модель не всегда самая выгодная. Плюс нужно учитывать pass@4 vs pass@1 -- если retry допустим, GPT-5 при 68.5% pass@4 становится ещё выгоднее."

Q: Why does MCP server infrastructure affect benchmark results?

❌ Red flag: "Инфраструктура не влияет, результат зависит только от модели"

✅ Strong answer: "Klavis AI Strata MCP Server показывает 2x success rate vs официальная реализация GitHub и 1.6x vs Notion. Function calling зависит от reliable tool availability: authentication failures ломают multi-step workflows, rate limit handling предотвращает cascade failures, progressive discovery помогает агенту ориентироваться в доступных инструментах. Одна и та же модель покажет разные результаты на разных MCP-серверах."


Sources

  1. Klavis AI -- "Function Calling and Agentic AI in 2025"
  2. Berkeley Gorilla -- Function Calling Leaderboard
  3. MCPMark -- Official Benchmark
  4. Berkeley -- BFCL v2 AST Evaluation Paper

See Also