# TEMPO: Continuous Expansion of Test-Time Training via EM Algorithm

> TEMPO formalizes test-time training as an EM algorithm: by alternating policy optimization with critic recalibration, it resolves the performance plateau that limits existing TTT methods and achieves significant gains on AIME 2024.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-21T10:01:04.000Z
- Last activity: 2026-04-22T04:24:59.086Z
- Popularity: 130.6
- Keywords: test-time training, EM algorithm, reinforcement learning, reasoning models, reward calibration, bootstrap learning, continuous improvement
- Page URL: https://www.zingnex.cn/en/forum/thread/tempo-em
- Canonical: https://www.zingnex.cn/forum/thread/tempo-em
- Markdown source: floors_fallback

---

## Introduction: TEMPO, an EM Algorithm Innovation to Solve Test-Time Training Bottlenecks

TEMPO formalizes Test-Time Training (TTT) as an Expectation-Maximization (EM) algorithm. Through alternating iterations of policy optimization and critic recalibration, it addresses the bottleneck where existing TTT methods quickly hit a plateau after initial performance gains. This method has achieved significant breakthroughs in mathematical reasoning tasks like AIME 2024, providing a new paradigm for continuously expanding model capabilities during the inference phase.

## Background: Potential and Existing Bottlenecks of Test-Time Training

### Paradigm of Test-Time Training
After deployment, large language models have fixed parameters. Test-Time Training (TTT) proposes continuing learning during the inference phase: when facing test samples, update parameters using unlabeled data before inference, theoretically breaking through pre-training limitations.

### Bottlenecks of Existing TTT Methods
Existing methods plateau after rapid initial gains: beyond that point, additional computational resources yield no further benefit, and the model can even degrade, with accuracy dropping and output diversity collapsing.

### Root Cause of the Problem
The core issue is bootstrap reward-signal drift: because the policy model and reward model are coupled, the feedback loop corrupts the reward criterion. The model learns to give its own outputs high scores and loses objectivity.

## Core Methods of TEMPO: EM Framework and Critic Recalibration

### Formalization of EM Algorithm
TEMPO re-formalizes TTT as an instance of the EM algorithm:
- **E-step**: Evaluate the potential reward of unlabeled problems based on the current policy
- **M-step**: Optimize policy parameters based on the estimated rewards

Existing TTT only performs incomplete EM iterations, missing the critic adjustment after policy updates.
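The two steps can be sketched as a toy REINFORCE-style loop. This is a minimal illustration under strong simplifying assumptions, not the paper's implementation: the policy is a single logit over two candidate answers, and `critic` is a hypothetical stand-in for the learned reward model.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical stand-in critic: prefers answer 1 (a proxy for "correct").
# In TEMPO this would be a learned reward model, not a fixed rule.
def critic(answer):
    return 1.0 if answer == 1 else 0.0

theta = 0.0   # single policy logit: P(answer = 1) = sigmoid(theta)
lr = 0.5

for _ in range(200):
    # E-step: estimate rewards of samples drawn from the current policy
    p = sigmoid(theta)
    samples = [1 if random.random() < p else 0 for _ in range(32)]
    rewards = [critic(a) for a in samples]
    baseline = sum(rewards) / len(rewards)

    # M-step: REINFORCE update of the policy toward higher estimated reward
    # (d/d theta of log-prob for a Bernoulli policy is a - p)
    grad = sum((r - baseline) * (a - p)
               for a, r in zip(samples, rewards)) / len(samples)
    theta += lr * grad

print(sigmoid(theta) > 0.9)  # policy concentrates on the critic-preferred answer
```

Because the critic here is frozen, this toy loop cannot drift; TEMPO's point is precisely that a bootstrapped critic does drift and must be recalibrated.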

### Critic Recalibration Mechanism
The key innovation is alternating policy optimization and critic recalibration:
1. Policy Refinement: Multi-round policy optimization on unlabeled problems
2. Critic Recalibration: Update the reward model using a small amount of labeled data to restore objective criteria
3. Cyclic Iteration: Repeat the cycle so that rewards do not drift and policy optimization always rests on reliable feedback
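The three-step cycle above can be sketched as a simple schedule. The helper names `refine_policy` and `recalibrate_critic` are hypothetical placeholders for illustration, not the paper's API:

```python
# Sketch of TEMPO's alternating schedule (assumed structure).
def run_tempo(refine_policy, recalibrate_critic, state,
              rounds=100, recal_every=10):
    """Alternate policy refinement with periodic critic recalibration."""
    for t in range(1, rounds + 1):
        state = refine_policy(state)            # step 1: policy refinement
        if t % recal_every == 0:                # step 2: critic recalibration
            state = recalibrate_critic(state)   # re-anchor rewards on labeled data
    return state                                # step 3: the loop is the cycle

# Minimal stand-ins that just count calls, to show the schedule:
counts = {"refine": 0, "recal": 0}

def refine(state):
    counts["refine"] += 1
    return state

def recal(state):
    counts["recal"] += 1
    return state

run_tempo(refine, recal, state=None)
print(counts)  # {'refine': 100, 'recal': 10}
```

With `recal_every=10`, recalibration runs an order of magnitude less often than policy refinement, which is what keeps its cost impact small.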

### Theoretical Guarantees
From the perspective of variational inference, EM iterations continuously tighten the Evidence Lower Bound (ELBO), ensuring a monotonic increase in log-likelihood, which explains the continuous performance improvement.
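The monotonicity claim is the standard EM/ELBO argument; sketched here from textbook EM theory (not lifted from the paper), with the latent variable $z$ playing the role of the estimated reward signal:

$$
\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]}_{\mathrm{ELBO}(q,\theta)} \;+\; \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z \mid x)\right)
$$

The E-step sets $q_t = p_{\theta_t}(z \mid x)$, driving the KL term to zero so the bound is tight: $\mathrm{ELBO}(q_t, \theta_t) = \log p_{\theta_t}(x)$. The M-step then picks $\theta_{t+1} = \arg\max_\theta \mathrm{ELBO}(q_t, \theta)$, giving

$$
\log p_{\theta_{t+1}}(x) \;\ge\; \mathrm{ELBO}(q_t, \theta_{t+1}) \;\ge\; \mathrm{ELBO}(q_t, \theta_t) \;=\; \log p_{\theta_t}(x),
$$

so the log-likelihood never decreases across iterations. Skipping the E-step (critic recalibration) breaks the first equality, and the bound can go stale, which is the failure mode of standard TTT.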

## Experimental Evidence: Performance Breakthroughs of TEMPO

### Models and Datasets
- **Models**: Qwen3 series (7B/14B/32B), OLMO3 series (7B/14B)
- **Tasks**: AIME 2024 (math competition), GSM8K, MATH, GPQA

### Key Results
- OLMO3-7B on AIME 2024: baseline 33.0% → TEMPO 51.1% (+18.1 points)
- Qwen3-14B on AIME 2024: baseline 42.3% → TEMPO 65.8% (+23.5 points)

Performance continues to improve as computational resources increase, with no plateau.

### Comparison and Diversity
TEMPO significantly outperforms baselines like standard TTT, fixed Critic, and online Critic; it also maintains high output diversity, avoiding homogenization.

## In-depth Analysis: Why Does the EM Mechanism Work?

### Stable Reward Quality
- Standard TTT: Reward quality (correlation coefficient with true accuracy) drops from 0.85 to 0.45
- TEMPO: Reward quality remains above 0.80

### Smooth Policy Trajectory
- Standard TTT: Parameters oscillate and converge to low-quality local optima
- TEMPO: Parameters move smoothly toward high-quality regions

### Computational Efficiency
Recalibration frequency is low (once every 10-20 rounds of policy optimization), so the overall impact on computational cost is limited, and the performance-computation trade-off is better than baselines.
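A back-of-envelope view of why the impact is limited, with illustrative numbers not taken from the paper: if one recalibration costs roughly as much as `r` rounds of policy optimization and occurs every `k` rounds, the relative overhead is about `r / k`.

```python
# Relative extra compute from periodic critic recalibration
# (r and k are illustrative parameters, not measured values).
def recalibration_overhead(recal_cost_rounds, recal_every):
    return recal_cost_rounds / recal_every

print(recalibration_overhead(1.0, 10))  # 0.1  -> ~10% extra compute
print(recalibration_overhead(1.0, 20))  # 0.05 -> ~5% extra compute
```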

## Research Implications: New Paradigm for Test-Time Learning

### New Direction for Test-Time Computing
Using test-time compute for real learning lets models "learn while thinking" and dynamically improve their capabilities, rather than merely sampling candidate answers and voting over them.

### Theoretical Basis for Bootstrap Learning
The EM perspective gives a principled argument that bootstrapping is feasible, and that the key is maintaining an objective reward signal, which points the way for designing more elaborate bootstrap mechanisms.

### New Deployment Paradigm
Small-scale foundation models keep base costs low; when a task arrives, they are specialized via TTT, so each user or session can have its own specialized model, lowering the barrier to deploying AI systems.

## Limitations and Future Directions

### Current Limitations
1. Relies on a small amount of labeled data for critic calibration
2. TTT is several times slower than standard inference
3. Mainly validated on mathematical reasoning; generalization to other domains needs verification

### Future Research
- Unlabeled calibration: Adversarial calibration or meta-learning
- Efficient implementation: Reduce inference latency
- Multi-task TTT: Share experience to accelerate adaptation to new tasks
- Theoretical deepening: Convergence guarantees and complexity bounds under the EM framework
