# Mathematical Boundaries of Repeated Sampling Voting: How Two Calls Predict the Accuracy Curve of LLM Reasoning

> This article introduces a mathematical study of the repeated-sampling voting mechanism of large language models (LLMs), revealing a method to predict the accuracy bounds of majority voting using only two independent calls and providing a new theoretical framework for test-time compute optimization.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-05T05:40:09.000Z
- Last activity: 2026-05-06T03:50:12.357Z
- Popularity: 126.8
- Keywords: large language models, repeated sampling, majority voting, test-time compute, statistical learning theory, reasoning optimization, uncertainty quantification
- Page link: https://www.zingnex.cn/en/forum/thread/llm-56d69adf
- Canonical: https://www.zingnex.cn/forum/thread/llm-56d69adf
- Markdown source: floors_fallback

---

## Introduction

This article presents a mathematical study of the repeated-sampling voting mechanism for large language model (LLM) reasoning. The core finding: the accuracy of majority voting under any vote budget can be bounded using only two independent calls per sample, yielding a new theoretical framework for test-time compute optimization.

## Research Background: The Test-Time Compute Dilemma

Current LLM reasoning faces a trade-off: a single call can be unstable, while sampling multiple times and voting improves accuracy at a higher compute cost. In practice there is little theoretical guidance on when to stop sampling or what benefit to expect, so practitioners either over-spend on compute or under-sample. Moreover, the gains from repeated sampling are uneven: some samples carry irreducible uncertainty, and simply adding calls cannot fix systematic errors.
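A quick simulation makes this asymmetry concrete. The per-sample correctness probabilities (0.7 and 0.3) are hypothetical values, not from the paper: voting amplifies whatever tendency a sample already has, so it helps the noisy-but-usually-right case and hurts the systematically wrong one.

```python
import random

random.seed(0)

def majority_vote(q: float, k: int) -> bool:
    """One trial: k independent calls, each correct with probability q."""
    return sum(random.random() < q for _ in range(k)) > k / 2

def vote_accuracy(q: float, k: int, trials: int = 20_000) -> float:
    """Monte Carlo estimate of majority-vote accuracy at budget k."""
    return sum(majority_vote(q, k) for _ in range(trials)) / trials

# q = 0.7: noisy but usually right -> voting helps.
# q = 0.3: systematically wrong    -> voting makes it worse.
for q in (0.7, 0.3):
    print(q, [round(vote_accuracy(q, k), 3) for k in (1, 3, 5, 9)])
```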

## Core Finding: Two-Moment Theory

The study establishes a concise mathematical framework and finds that the distribution of per-sample correctness under repeated sampling is characterized by two statistics:
1. First moment: the average accuracy of a single call, which reflects the model's overall mastery but cannot distinguish stably correct samples from occasionally correct ones;
2. Second moment: the correctness correlation between calls, estimable from just two independent calls per sample, which separates recoverable noise from stable systematic error.
From these two moments alone, the accuracy under any voting budget can be bounded, so the effect of voting can be predicted without actually making the extra calls.
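As a minimal sketch of how such moments could be estimated, the following simulation draws two independent correctness indicators per sample from hypothetical latent probabilities (a Beta(2, 2) population, my assumption, not the paper's setup): the average single-call accuracy estimates the first moment, the rate at which both calls are correct estimates the second moment, and a between-call correlation follows.

```python
import random

random.seed(1)

def two_call_moments(latent_qs):
    """Estimate the first moment E[q] and second moment E[q^2]
    from two independent correctness indicators per sample."""
    c1 = [random.random() < q for q in latent_qs]
    c2 = [random.random() < q for q in latent_qs]
    n = len(latent_qs)
    m1 = (sum(c1) + sum(c2)) / (2 * n)             # average single-call accuracy
    m2 = sum(a and b for a, b in zip(c1, c2)) / n  # P(both calls correct) = E[q^2]
    return m1, m2

# Hypothetical latent per-sample probabilities: Beta(2, 2), so the
# true values are E[q] = 0.5 and E[q^2] = 0.3.
qs = [random.betavariate(2, 2) for _ in range(50_000)]
m1, m2 = two_call_moments(qs)
rho = (m2 - m1**2) / (m1 * (1 - m1))  # between-call correlation (true value 0.2)
print(round(m1, 3), round(m2, 3), round(rho, 3))
```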

## Technical Approach: Breakthrough with Three-Atom Extremizers

The study addresses the infinite-dimensional moment problem using convex optimization duality theory, proving that the optimal boundary for any finite voting budget can be achieved by a discrete distribution containing only three atoms. This finding has three implications:
- Computational tractability: Simplified to a low-dimensional optimization problem;
- Accuracy guarantee: The boundary is exact rather than approximate;
- Interpretability: The three atoms correspond to the intuitive classification of easy, medium, and hard samples.
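The three-atom reduction can be illustrated with a brute-force sketch (this is an illustration of the idea, not the paper's convex-duality algorithm): enumerate three-atom supports on a grid, solve for weights that match hypothetical moments m1 = 0.60 and m2 = 0.42 (made-up values), and extremize the expected majority-vote accuracy over the valid ones.

```python
import itertools
import math

import numpy as np

def maj_acc(q: float, k: int) -> float:
    """P(strict majority of k independent calls is correct)."""
    need = k // 2 + 1
    return sum(math.comb(k, j) * q**j * (1 - q)**(k - j)
               for j in range(need, k + 1))

def three_atom_bounds(m1: float, m2: float, k: int,
                      grid=np.linspace(0.0, 1.0, 41)):
    """Scan three-atom supports, solve for weights matching
    (1, m1, m2), and track the extreme expected accuracies."""
    lo, hi = 1.0, 0.0
    for a, b, c in itertools.combinations(grid, 3):
        A = np.array([[1.0, 1.0, 1.0],
                      [a, b, c],
                      [a * a, b * b, c * c]])
        try:
            w = np.linalg.solve(A, np.array([1.0, m1, m2]))
        except np.linalg.LinAlgError:
            continue
        if (w < -1e-9).any():  # not a valid probability vector
            continue
        val = float(w @ [maj_acc(q, k) for q in (a, b, c)])
        lo, hi = min(lo, val), max(hi, val)
    return lo, hi

# Hypothetical moments m1 = 0.60, m2 = 0.42; three-vote budget.
lo, hi = three_atom_bounds(0.60, 0.42, k=3)
print(round(lo, 3), round(hi, 3))
```

For these made-up moments the resulting interval is noticeably narrower than 1/8, consistent with the bound discussed below.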

## Practical Results: Closed-Form Solution and Certification Criterion for Three-Vote Majority Voting

The study focuses on three-vote majority voting (the smallest useful budget) and derives a concise closed-form solution, with the width of the predicted accuracy interval strictly bounded by 1/8. It also proposes a 'Certified Improvement Criterion': when the two moments satisfy specific conditions, three-vote voting is provably better than a single call. This gives practitioners a decision tool: two calls suffice to determine whether adding votes is worthwhile and to bound the expected gain.
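For a single sample whose per-call correctness probability is q, the three-vote success rate has the standard binomial closed form 3q^2 - 2q^3. A short check (the example values are illustrative, and this is not the paper's full certification criterion, which works at the population level) shows that voting helps pointwise exactly when q > 1/2:

```python
def maj3(q: float) -> float:
    """Closed-form P(majority of 3 calls correct): 3q^2 - 2q^3.
    Standard binomial identity, not the paper's notation."""
    return 3 * q**2 - 2 * q**3

# Three votes help exactly when a single call beats a coin flip.
for q in (0.3, 0.5, 0.7, 0.9):
    print(q, round(maj3(q), 3), maj3(q) > q)
```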

## Experimental Validation and Model Mixing Effects

Validated on the QNLI and QQP datasets, the observed three-vote and five-vote accuracies fall within the theoretically predicted intervals. When the sampling temperature is varied or models are mixed at random, a 'weaker' configuration can sometimes surpass a 'stronger' single-call configuration through voting, suggesting new directions for model ensembling strategies.
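The mixing effect can be sketched with a toy simulation (the accuracy values 0.75 and 0.70 are invented, not from the paper): a 'weaker' model whose per-sample probability is stable at 0.7 overtakes a 'stronger' 0.75 single-call model once three votes are taken, since 3 * 0.7^2 - 2 * 0.7^3 = 0.784.

```python
import random

random.seed(2)

def maj3_correct(q: float) -> bool:
    """One three-vote trial with per-call correctness probability q."""
    return sum(random.random() < q for _ in range(3)) >= 2

N = 50_000
# "Stronger" model, one call per question (accuracy 0.75, invented).
strong_single = sum(random.random() < 0.75 for _ in range(N)) / N
# "Weaker" model (accuracy 0.70, invented) with three-vote majority:
# expected accuracy 3 * 0.7**2 - 2 * 0.7**3 = 0.784.
weak_voted = sum(maj3_correct(0.7) for _ in range(N)) / N
print(round(strong_single, 3), round(weak_voted, 3))
```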

## Practical Significance and Future Outlook

Significance for engineering practitioners:
1. Cost-benefit analysis: predict the return on extra compute before spending it;
2. Dynamic budget allocation: adjust the number of samples to each sample's difficulty;
3. Model selection guidance: choose between one strong model and a voting ensemble of several mid-sized models.
More broadly, the study pushes test-time compute decisions from experience-driven toward theory-guided, and the two-moment framework may become a fundamental tool in the field.
