# TÜBİTAK Math Olympiad Benchmark Test: In-Depth Cost and Performance Comparison of 8 Large Models

> A benchmark test of 8 mainstream large language models (LLMs) on high school math Olympiad questions reveals the complex trade-off between cost and performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-25T00:32:00.000Z
- Last activity: 2026-05-25T00:51:00.253Z
- Popularity: 159.7
- Keywords: LLM, benchmark, math reasoning, cost-performance, DeepSeek, GPT-4, Claude, Gemini
- Page link: https://www.zingnex.cn/en/forum/thread/tubitak-8
- Canonical: https://www.zingnex.cn/forum/thread/tubitak-8

---

## Introduction

This test compares 8 mainstream large language models (LLMs) on 32 multiple-choice questions from the 34th TÜBİTAK High School Math Olympiad 2026. Key findings: performance converges at the top (5 models scored full marks), but costs differ sharply, with the most expensive full-score model costing about 22 times as much as the cheapest. Cost-effectiveness therefore becomes a critical factor in LLM selection. The test was published by BYALPERENK on GitHub on May 25, 2026, and aims to fill the gap left by traditional benchmarks, which report accuracy and ignore cost.

## Project Background and Motivation

With the rapid development of large language models, developers and enterprises face a selection problem: traditional benchmarks report accuracy but ignore cost. For the same set of correct answers, one model costs about $8 while another costs only $0.36. This project took 32 multiple-choice questions from the 2026 TÜBİTAK Math Olympiad as the test set, ran 8 mainstream models on them, and analyzed the actual relationship between cost and performance.

## Test Design and Methodology

### Dataset Construction
Three layers of verification were used when converting the official TÜBİTAK PDF into structured JSON:
1. Cross-model conversion verification (comparing the content extracted by GPT-4.5, Gemini 3.5 Flash, and Claude Sonnet 4.6)
2. Manual visual review (an HTML viewer with MathJax to check formulas and catch OCR errors)
3. Structural verification (a Python script that checks field integrity and similar properties)
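
The structural check in step 3 might look like the following minimal sketch. The field names (`id`, `question`, `options`, `answer`) and the five-option A to E layout are assumptions made for illustration; the project's actual JSON schema is not described in this post.

```python
from typing import Any

# Hypothetical schema: these field names are assumptions, not the project's format.
REQUIRED_FIELDS = {"id", "question", "options", "answer"}
OPTION_LABELS = ["A", "B", "C", "D", "E"]  # assumed five-choice layout


def validate_questions(questions: list[dict[str, Any]], expected_count: int = 32) -> list[str]:
    """Return a list of readable problems; an empty list means the data passed."""
    problems: list[str] = []
    if len(questions) != expected_count:
        problems.append(f"expected {expected_count} questions, found {len(questions)}")

    seen_ids = set()
    for index, q in enumerate(questions):
        missing = REQUIRED_FIELDS - q.keys()
        if missing:
            problems.append(f"question #{index}: missing fields {sorted(missing)}")
            continue  # the remaining checks need every field to be present
        if q["id"] in seen_ids:
            problems.append(f"question #{index}: duplicate id {q['id']!r}")
        seen_ids.add(q["id"])
        if not str(q["question"]).strip():
            problems.append(f"question #{index}: empty question text")
        options = q["options"]
        if not isinstance(options, dict) or sorted(options) != OPTION_LABELS:
            problems.append(f"question #{index}: options must be keyed A to E")
        if q["answer"] not in OPTION_LABELS:
            problems.append(f"question #{index}: answer {q['answer']!r} is not a valid label")
    return problems
```

Returning a list of problems instead of raising on the first one lets a reviewer fix every conversion error in a single pass.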

### Evaluation Method
Questions were sent through the OpenRouter API with a single shared prompt that required the model to reason mathematically, select exactly one answer, and output it in a fixed format. Reasoning mode was enabled for every model, and the temperature parameter was left unset for comparability. The answer was extracted with a regular expression, taking the last match in the response.
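
A minimal sketch of the "last match" extraction is shown here. The exact output format required by the prompt is not given in this post, so the `ANSWER: X` pattern and the A to E option range are assumptions.

```python
import re
from typing import Optional

# Assumed output format: a line such as "ANSWER: C". The real prompt may differ.
ANSWER_PATTERN = re.compile(r"ANSWER\s*:\s*([A-E])\b")


def extract_final_answer(response_text: str) -> Optional[str]:
    """Return the last answer label found in the response, or None if there is none."""
    matches = ANSWER_PATTERN.findall(response_text)
    return matches[-1] if matches else None
```

Taking the last match rather than the first matters for reasoning models, which often state a tentative answer mid-derivation and then revise it.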

## Key Findings

1. **DeepSeek v4 Pro: King of Cost-Effectiveness**: Achieved 100% accuracy at a total cost of $0.36, about 22 times less than Claude Sonnet 4.6 ($8.01), which also scored 100%.
2. **5 models scored full marks**: DeepSeek v4 Pro, GPT-4.5, Mistral Medium 3.5, Qwen 3.7 Max, and Claude Sonnet 4.6 all achieved 100% accuracy, which suggests that model capability has saturated at this difficulty level.
3. **Large differences in token efficiency**: GPT-4.5 used only 81K output tokens, while Mistral Medium 3.5 used 769K (about 9.4 times as many), which affects latency and quotas.
4. **Gemini 3.5 Flash: A Balanced Choice**: Achieved 96.88% accuracy at a cost of $1.22, suitable for scenarios that do not require a perfect score.
5. **Cheapest ≠ Most Cost-Effective**: Grok 4.3 had the lowest cost per correct answer ($0.0107), but its accuracy was only 87.5%, so the cost of wrong answers has to be weighed against the savings.

## Complete Test Results

| Model | Accuracy | Input Tokens | Output Tokens | Total Cost | Cost per Correct Answer |
|------|--------|-----------|-----------|--------|------------------|
| Claude Sonnet 4.6 | 100.00% | 8,597 | 532,192 | $8.01 | $0.2503 |
| DeepSeek v4 Pro | 100.00% | 8,087 | 407,400 | $0.36 | $0.0112 |
| Mistral Medium 3.5 | 100.00% | 7,954 | 769,192 | $5.78 | $0.1807 |
| GPT-4.5 | 100.00% | 7,425 | 81,633 | $2.49 | $0.0777 |
| Qwen 3.7 Max | 100.00% | 7,967 | 379,019 | $2.86 | $0.0895 |
| Gemini 3.5 Flash | 96.88% | 7,520 | 134,209 | $1.22 | $0.0393 |
| GLM 5.1 | 93.75% | 7,555 | 582,747 | $1.80 | $0.0601 |
| Grok 4.3 | 87.50% | 11,174 | 114,316 | $0.30 | $0.0107 |

The total cost for testing all 8 models is approximately $22.82 (prices as of May 2026).
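
The "Cost per Correct Answer" column can be reproduced from the accuracy and total cost columns, as in the sketch below. It assumes the column is defined as total cost divided by the number of correct answers out of 32; because it starts from the rounded totals in the table, the fourth decimal place can differ slightly from the published figures.

```python
TOTAL_QUESTIONS = 32

# (accuracy in percent, total cost in USD), copied from the results table.
RESULTS = {
    "Claude Sonnet 4.6": (100.00, 8.01),
    "DeepSeek v4 Pro": (100.00, 0.36),
    "Mistral Medium 3.5": (100.00, 5.78),
    "GPT-4.5": (100.00, 2.49),
    "Qwen 3.7 Max": (100.00, 2.86),
    "Gemini 3.5 Flash": (96.88, 1.22),
    "GLM 5.1": (93.75, 1.80),
    "Grok 4.3": (87.50, 0.30),
}


def correct_answers(accuracy_pct: float) -> int:
    """Convert a percentage back into a count of correct answers out of 32."""
    return round(accuracy_pct / 100 * TOTAL_QUESTIONS)


def cost_per_correct(accuracy_pct: float, total_cost: float) -> float:
    """Total cost divided by the number of correctly answered questions."""
    return total_cost / correct_answers(accuracy_pct)


if __name__ == "__main__":
    for model, (accuracy, cost) in RESULTS.items():
        print(f"{model:<20} {correct_answers(accuracy):>2}/32  ${cost_per_correct(accuracy, cost):.4f}")
    print(f"Total spend: ${sum(cost for _, cost in RESULTS.values()):.2f}")
```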

## Cost Calculation Method

Costs are based on each model's token usage and OpenRouter pricing in May 2026 (input / output, in USD per million tokens):
- DeepSeek v4 Pro: $0.435 / $0.87
- GLM 5.1: $0.98 / $3.08
- Grok 4.3: $1.25 / $2.50
- Gemini 3.5 Flash: $1.50 / $9.00
- Mistral Medium 3.5: $1.50 / $7.50
- Qwen 3.7 Max: $2.50 / $7.50
- Claude Sonnet 4.6: $3.00 / $15.00
- GPT-4.5: $5.00 / $30.00

Note: In the OpenAI/OpenRouter API, `completion_tokens` already includes reasoning tokens, so the output cost is computed from the completion token count alone. Reasoning tokens are not added a second time, which avoids double billing.
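
The calculation can be sketched as follows, using the prices listed in this section and the token counts from the results table. The function name and dictionary layout are illustrative rather than taken from the project's code, and the output count is assumed to be the completion token total, billed once.

```python
# USD per million tokens: (input price, output price), from the list in this section.
PRICING = {
    "DeepSeek v4 Pro": (0.435, 0.87),
    "GLM 5.1": (0.98, 3.08),
    "Grok 4.3": (1.25, 2.50),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Mistral Medium 3.5": (1.50, 7.50),
    "Qwen 3.7 Max": (2.50, 7.50),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-4.5": (5.00, 30.00),
}

# (input tokens, output tokens) per model, from the results table.
USAGE = {
    "Claude Sonnet 4.6": (8_597, 532_192),
    "DeepSeek v4 Pro": (8_087, 407_400),
    "Mistral Medium 3.5": (7_954, 769_192),
    "GPT-4.5": (7_425, 81_633),
    "Qwen 3.7 Max": (7_967, 379_019),
    "Gemini 3.5 Flash": (7_520, 134_209),
    "GLM 5.1": (7_555, 582_747),
    "Grok 4.3": (11_174, 114_316),
}


def run_cost(model: str) -> float:
    """Cost in USD for one model's full run; output tokens are counted once."""
    input_price, output_price = PRICING[model]
    input_tokens, output_tokens = USAGE[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for model in USAGE:
        print(f"{model:<20} ${run_cost(model):.2f}")
```

Output tokens dominate every total here: each model read fewer than 12K input tokens but generated between 81K and 769K output tokens.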

## Limitations and Practical Application Insights

### Limitations
1. Small sample size (n=32), so confidence intervals are wide
2. Single run per model; results may still differ between runs even at low temperature
3. Price fluctuations (costs reflect pricing on the test day)
4. OpenRouter routing differences may affect quality
5. Capability ceiling effect (with 5 models at full marks, the test cannot separate cutting-edge models)
6. Only final answers are scored; the reasoning process is not evaluated
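
To make the first limitation concrete, the sketch below computes a 95% Wilson score interval for an accuracy measured on 32 questions. The Wilson interval is chosen here for illustration because it behaves sensibly at 100% accuracy; the original project does not state which interval, if any, it used.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denominator = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denominator
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    for correct in (32, 31, 30, 28):
        low, high = wilson_interval(correct, 32)
        print(f"{correct}/32: {low:.1%} to {high:.1%}")
```

Under this method a 32/32 score gives an interval of roughly 89% to 100%, while 28/32 gives roughly 72% to 95%. The two intervals overlap, so the ranking between full-score and lower-scoring models is less certain than the raw percentages suggest.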

### Application Insights
- Cost-sensitive scenarios: Choose DeepSeek v4 Pro (full marks for $0.36)
- Latency-sensitive scenarios: Choose GPT-4.5 (fewest output tokens, 81K)
- Scenarios that tolerate minor errors: Choose Gemini 3.5 Flash ($1.22, 96.88% accuracy)
- Exploratory applications: Choose Grok 4.3 (low cost per attempt)
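
These recommendations amount to filtering by an accuracy floor and then minimizing one metric. The following sketch expresses that rule with figures from the results table; the `pick_model` helper is hypothetical, and using output tokens as a stand-in for latency is an assumption, since no latency was measured.

```python
from typing import Callable, NamedTuple


class ModelResult(NamedTuple):
    name: str
    accuracy_pct: float
    output_tokens: int
    total_cost: float  # USD


RESULTS = [
    ModelResult("Claude Sonnet 4.6", 100.00, 532_192, 8.01),
    ModelResult("DeepSeek v4 Pro", 100.00, 407_400, 0.36),
    ModelResult("Mistral Medium 3.5", 100.00, 769_192, 5.78),
    ModelResult("GPT-4.5", 100.00, 81_633, 2.49),
    ModelResult("Qwen 3.7 Max", 100.00, 379_019, 2.86),
    ModelResult("Gemini 3.5 Flash", 96.88, 134_209, 1.22),
    ModelResult("GLM 5.1", 93.75, 582_747, 1.80),
    ModelResult("Grok 4.3", 87.50, 114_316, 0.30),
]


def pick_model(
    results: list[ModelResult],
    min_accuracy_pct: float,
    metric: Callable[[ModelResult], float],
) -> ModelResult:
    """Return the model with the lowest metric among those meeting the accuracy floor."""
    eligible = [r for r in results if r.accuracy_pct >= min_accuracy_pct]
    if not eligible:
        raise ValueError("no model meets the accuracy floor")
    return min(eligible, key=metric)


if __name__ == "__main__":
    print(pick_model(RESULTS, 100.0, lambda r: r.total_cost).name)     # cost-sensitive
    print(pick_model(RESULTS, 100.0, lambda r: r.output_tokens).name)  # latency proxy
    print(pick_model(RESULTS, 0.0, lambda r: r.total_cost).name)       # cheapest run overall
```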

## Conclusion and Outlook

This test reveals a trend: LLM performance tends to saturate on tasks of specific difficulty, and cost efficiency becomes a key differentiating dimension. Enterprises should select models based on task difficulty and cost sensitivity rather than blindly pursuing the strongest model. Model providers need to optimize reasoning efficiency and pricing strategies. The project's open-source code provides a reusable framework; future extensions can cover more disciplines and difficulty levels to track the evolution of LLM cost-effectiveness.