
TÜBİTAK Math Olympiad Benchmark Test: In-Depth Cost and Performance Comparison of 8 Large Models

A benchmark test of 8 mainstream large language models (LLMs) on high school math Olympiad questions reveals the complex trade-off between cost and performance.

LLM, benchmark, math reasoning, cost-performance, DeepSeek, GPT-4, Claude, Gemini

Published 2026-05-25 08:32 | Recent activity 2026-05-25 08:51 | Estimated read 9 min

Section 01

Introduction

This test compares 8 mainstream large language models (LLMs) on 32 multiple-choice questions from the 34th TÜBİTAK High School Math Olympiad (2026). Key findings: accuracy has converged at the top (5 models scored full marks), but costs differ widely (the most expensive full-score model cost about 22 times as much as the cheapest). Cost-effectiveness therefore becomes a critical factor in LLM selection. The test was published by BYALPERENK on GitHub on May 25, 2026, and aims to fill the gap left by traditional benchmarks that report accuracy but ignore cost.

Section 02

Project Background and Motivation

With the rapid development of large language models, developers and enterprises face a selection problem: traditional benchmarks report accuracy but ignore cost. For the same perfect score on this test set, one model cost about $8 while another cost only $0.36. This project selected 32 multiple-choice questions from the 2026 TÜBİTAK Math Olympiad as the test set, compared 8 mainstream models, and analyzed the actual relationship between cost and performance.

Section 03

Test Design and Methodology

Dataset Construction

Three layers of verification were used to convert the official TÜBİTAK PDF into structured JSON:

Cross-model conversion verification (comparing the content extracted by GPT-4.5, Gemini 3.5 Flash, and Claude Sonnet 4.6)
Manual visual review (HTML viewer + MathJax to check formulas and OCR errors)
Structural verification (Python script to check field integrity, etc.)
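
The structural verification layer could look roughly like the sketch below. The field names (`id`, `question`, `options`, `answer`) and the five-option A to E layout are assumptions made for illustration; the source does not describe the project's actual JSON schema.

```python
import json

# Hypothetical schema: the real field names used by the project are not given in the source.
REQUIRED_FIELDS = {"id": int, "question": str, "options": dict, "answer": str}
OPTION_LABELS = ("A", "B", "C", "D", "E")  # assumed five-option layout


def validate_questions(questions, expected_count=32):
    """Return a list of human-readable problems; an empty list means the set passed."""
    problems = []
    if len(questions) != expected_count:
        problems.append(f"expected {expected_count} questions, found {len(questions)}")
    seen_ids = set()
    for index, item in enumerate(questions):
        label = f"question[{index}]"
        # Every required field must be present and have the expected type.
        for field, field_type in REQUIRED_FIELDS.items():
            if field not in item:
                problems.append(f"{label}: missing field '{field}'")
            elif not isinstance(item[field], field_type):
                problems.append(f"{label}: field '{field}' has wrong type")
        # The option keys must match the assumed label set exactly.
        options = item.get("options")
        if isinstance(options, dict) and set(options) != set(OPTION_LABELS):
            problems.append(f"{label}: options must be exactly {OPTION_LABELS}")
        # The stored answer must be one of the option labels.
        answer = item.get("answer")
        if isinstance(answer, str) and answer not in OPTION_LABELS:
            problems.append(f"{label}: answer '{answer}' is not a valid option")
        # Question ids must be unique.
        item_id = item.get("id")
        if isinstance(item_id, int):
            if item_id in seen_ids:
                problems.append(f"{label}: duplicate id {item_id}")
            seen_ids.add(item_id)
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '[{"id": 1, "question": "2 + 2 = ?", '
        '"options": {"A": "3", "B": "4", "C": "5", "D": "6", "E": "7"}, "answer": "B"}]'
    )
    print(validate_questions(sample, expected_count=1))
```

A check like this catches missing or malformed fields, but not a wrong formula or a mistyped digit, which is why it is paired with the cross-model comparison and the manual review.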

Evaluation Method

Questions were sent via the OpenRouter API with a unified prompt that required the model to reason mathematically, select exactly one answer, and output it in a specific format. Reasoning mode was enabled for all models, and the temperature parameter was left unset for every model to keep the setup comparable. Answers were extracted with a regular expression, taking the last match in the response.
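
The extraction step can be sketched as follows. The source only says the prompt requires "a specific format", so the `ANSWER: <letter>` convention and the A to E range used here are hypothetical stand-ins, not the project's actual pattern.

```python
import re

# Assumed output format: "ANSWER: <letter>" with options A-E (hypothetical).
ANSWER_PATTERN = re.compile(r"ANSWER:\s*([A-E])\b", re.IGNORECASE)


def extract_final_answer(response_text):
    """Return the last answer letter found in the response, or None if absent."""
    matches = ANSWER_PATTERN.findall(response_text)
    return matches[-1].upper() if matches else None


if __name__ == "__main__":
    text = "A first guess would be ANSWER: A. After rechecking the algebra, ANSWER: C"
    print(extract_final_answer(text))
```

Taking the last match rather than the first matters for reasoning models, which may state a tentative answer mid-response and then revise it.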

Section 04

Key Findings

DeepSeek v4 Pro: King of Cost-Effectiveness: Achieved 100% accuracy at a total cost of $0.36, about 22 times cheaper than Claude Sonnet 4.6 ($8.01), which also scored full marks.
5 models scored full marks: DeepSeek v4 Pro, GPT-4.5, Mistral Medium 3.5, Qwen 3.7 Max, and Claude Sonnet 4.6 all achieved 100% accuracy, indicating that model capabilities have saturated at this difficulty level.
Large difference in token efficiency: GPT-4.5 used only 81K output tokens, while Mistral Medium 3.5 used 769K (more than 9 times as many), which affects latency and quotas.
Gemini 3.5 Flash: A Balanced Choice: Achieved 96.88% accuracy at a cost of $1.22, suitable for scenarios where perfect performance is not required.
Cheapest ≠ Most Cost-Effective: Grok 4.3 had the lowest cost per correct answer ($0.0107), but its accuracy was only 87.5%, so the cost of errors needs to be weighed.
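
One way to weigh error costs is to attach a penalty to each wrong answer and see where the ranking flips. The sketch below uses the costs reported in this article and wrong-answer counts derived from the reported accuracies (out of 32); the per-error penalty is a hypothetical parameter, not something the benchmark measured.

```python
# Total API cost (USD) and wrong answers out of 32, derived from the reported results.
RESULTS = {
    "DeepSeek v4 Pro": {"cost": 0.36, "wrong": 0},
    "Gemini 3.5 Flash": {"cost": 1.22, "wrong": 1},
    "Grok 4.3": {"cost": 0.30, "wrong": 4},
}


def effective_cost(model, penalty_per_error):
    """API cost plus a hypothetical downstream cost for each wrong answer."""
    entry = RESULTS[model]
    return entry["cost"] + entry["wrong"] * penalty_per_error


def break_even_penalty(cheaper, more_accurate):
    """Penalty per error at which the more accurate model becomes the cheaper choice."""
    a, b = RESULTS[cheaper], RESULTS[more_accurate]
    return (b["cost"] - a["cost"]) / (a["wrong"] - b["wrong"])


if __name__ == "__main__":
    threshold = break_even_penalty("Grok 4.3", "DeepSeek v4 Pro")
    print(f"Break-even penalty per error: ${threshold:.3f}")
```

With these figures, once a single wrong answer costs more than about 1.5 cents downstream, DeepSeek v4 Pro is the cheaper choice than Grok 4.3 on this test set.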

Section 05

Complete Test Results

Model	Accuracy	Input Tokens	Output Tokens	Total Cost	Cost per Correct Answer
Claude Sonnet 4.6	100.00%	8,597	532,192	$8.01	$0.2503
DeepSeek v4 Pro	100.00%	8,087	407,400	$0.36	$0.0112
Mistral Medium 3.5	100.00%	7,954	769,192	$5.78	$0.1807
GPT-4.5	100.00%	7,425	81,633	$2.49	$0.0777
Qwen 3.7 Max	100.00%	7,967	379,019	$2.86	$0.0895
Gemini 3.5 Flash	96.88%	7,520	134,209	$1.22	$0.0393
GLM 5.1	93.75%	7,555	582,747	$1.80	$0.0601
Grok 4.3	87.50%	11,174	114,316	$0.30	$0.0107

The total cost for testing all 8 models is approximately $22.82 (prices as of May 2026).
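
The last column of the table can be re-derived from the others: accuracy times 32 gives the number of correct answers, and total cost divided by that count gives the cost per correct answer. A minimal sketch over a few rows follows; small differences in the last digit are expected because the published total costs are rounded to cents.

```python
TOTAL_QUESTIONS = 32

# (accuracy in percent, total cost in USD) copied from the results table.
ROWS = {
    "Claude Sonnet 4.6": (100.00, 8.01),
    "DeepSeek v4 Pro": (100.00, 0.36),
    "Gemini 3.5 Flash": (96.88, 1.22),
    "GLM 5.1": (93.75, 1.80),
    "Grok 4.3": (87.50, 0.30),
}


def correct_answers(accuracy_percent):
    """Convert a reported accuracy back into a whole number of correct answers."""
    return round(accuracy_percent / 100 * TOTAL_QUESTIONS)


def cost_per_correct(model):
    """Total cost divided by the number of correctly answered questions."""
    accuracy, cost = ROWS[model]
    return cost / correct_answers(accuracy)


if __name__ == "__main__":
    for name in ROWS:
        print(f"{name}: {correct_answers(ROWS[name][0])} correct, "
              f"${cost_per_correct(name):.4f} per correct answer")
    ratio = ROWS["Claude Sonnet 4.6"][1] / ROWS["DeepSeek v4 Pro"][1]
    print(f"Claude Sonnet 4.6 vs DeepSeek v4 Pro cost ratio: {ratio:.2f}x")
```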

Section 06

Cost Calculation Method

Costs are based on each model's token usage and OpenRouter pricing in May 2026 (input / output, in USD per million tokens):

DeepSeek v4 Pro: $0.435 / $0.87
GLM 5.1: $0.98 / $3.08
Grok 4.3: $1.25 / $2.50
Gemini 3.5 Flash: $1.50 / $9.00
Mistral Medium 3.5: $1.50 / $7.50
Qwen 3.7 Max: $2.50 / $7.50
Claude Sonnet 4.6: $3.00 / $15.00
GPT-4.5: $5.00 / $30.00

Note: In the OpenAI/OpenRouter API, completion_tokens already includes reasoning tokens, so the output cost is computed from that single output count (response_tokens) and reasoning tokens are not added on top, which avoids double counting.
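
The calculation amounts to multiplying each token count by its per-million price. The sketch below reproduces three of the published totals from the prices in this section and the token counts in the results table; the function and variable names are illustrative, not taken from the project's code.

```python
# (input price, output price) in USD per million tokens, from the pricing list.
PRICES = {
    "DeepSeek v4 Pro": (0.435, 0.87),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-4.5": (5.00, 30.00),
}

# (input tokens, output tokens) from the results table.
# Output tokens already include reasoning tokens, so they are counted once.
USAGE = {
    "DeepSeek v4 Pro": (8_087, 407_400),
    "Claude Sonnet 4.6": (8_597, 532_192),
    "GPT-4.5": (7_425, 81_633),
}


def total_cost(model):
    """Return the total cost in USD for one model's full run over the test set."""
    input_price, output_price = PRICES[model]
    input_tokens, output_tokens = USAGE[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for name in PRICES:
        print(f"{name}: ${total_cost(name):.2f}")
```

Because output tokens outnumber input tokens by one to two orders of magnitude here, the output price and the length of the reasoning trace dominate the bill.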

Section 07

Limitations and Practical Application Insights

Limitations

Small sample size (n=32), so confidence intervals are wide
Single run; model outputs are not deterministic, so results may differ between runs
Price fluctuations (costs reflect pricing on the test day)
OpenRouter routing differences may affect quality
Ceiling effect (with 5 models at full marks, the test cannot separate the strongest models)
Only final answers are scored; the reasoning process is not evaluated
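
To make the small-sample limitation concrete, a confidence interval can be computed for each accuracy. The sketch below uses the Wilson score interval, which is one common choice for binomial proportions; the source does not say which interval, if any, the project itself reports.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for correct in (32, 31, 30, 28):
        low, high = wilson_interval(correct, 32)
        print(f"{correct}/32: {low:.3f} to {high:.3f}")
```

With n=32, the 95% interval for a perfect score still reaches down to about 89%, and it overlaps the interval for 28/32 (roughly 72% to 95%), so the accuracy ranking alone is weak evidence of a real difference between models.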

Application Insights

Cost-sensitive scenarios: Choose DeepSeek v4 Pro (full marks for $0.36)
Latency-sensitive scenarios: Choose GPT-4.5 (fewest output tokens, 81K)
Scenarios that tolerate occasional errors: Choose Gemini 3.5 Flash ($1.22, 96.88% accuracy)
Exploratory applications: Choose Grok 4.3 (low cost per attempt)

Section 08

Conclusion and Outlook

This test reveals a trend: LLM performance tends to saturate on tasks of specific difficulty, and cost efficiency becomes a key differentiating dimension. Enterprises should select models based on task difficulty and cost sensitivity rather than blindly pursuing the strongest model. Model providers need to optimize reasoning efficiency and pricing strategies. The project's open-source code provides a reusable framework; future extensions can cover more disciplines and difficulty levels to track the evolution of LLM cost-effectiveness.
