Zing Forum

TEMPO: Continuous Expansion of Test-Time Training via EM Algorithm

TEMPO formalizes test-time training as an EM algorithm, solving the performance bottleneck of existing TTT methods through alternating iterations of policy optimization and critic recalibration, and achieving significant breakthroughs at AIME 2024.

Tags: Test-Time Training · EM Algorithm · Reinforcement Learning · Reasoning Models · Reward Calibration · Bootstrap Learning · Continuous Improvement
Published 2026-04-21 18:01 · Recent activity 2026-04-22 12:24 · Estimated read: 8 min

Section 01

Introduction: TEMPO—An EM Algorithm Innovation to Solve Test-Time Training Bottlenecks

TEMPO formalizes Test-Time Training (TTT) as an Expectation-Maximization (EM) algorithm. Through alternating iterations of policy optimization and critic recalibration, it addresses the bottleneck where existing TTT methods quickly hit a plateau after initial performance gains. This method has achieved significant breakthroughs in mathematical reasoning tasks like AIME 2024, providing a new paradigm for continuously expanding model capabilities during the inference phase.


Section 02

Background: Potential and Existing Bottlenecks of Test-Time Training

Paradigm of Test-Time Training

After deployment, large language models have fixed parameters. Test-Time Training (TTT) proposes continuing learning during the inference phase: when facing test samples, update parameters using unlabeled data before inference, theoretically breaking through pre-training limitations.

Bottlenecks of Existing TTT Methods

Existing methods hit a plateau after rapid initial gains: additional compute yields no further benefit, and performance can even degrade, with accuracy dropping and output diversity collapsing.

Root Cause of the Problem

The core issue is bootstrap reward signal drift: the policy model and reward model are coupled, and their feedback loop corrupts the reward criterion. The model tends to score its own outputs highly, losing objectivity.


Section 03

Core Methods of TEMPO: EM Framework and Critic Recalibration

Formalization of EM Algorithm

TEMPO re-formalizes TTT as an instance of the EM algorithm:

  • E-step: Evaluate the potential reward of unlabeled problems based on the current policy
  • M-step: Optimize policy parameters based on the estimated rewards

Existing TTT methods perform only an incomplete EM iteration: they skip the critic adjustment that should follow each policy update.
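The E/M split above can be made concrete with a toy sketch. The dictionary-based "policy", `e_step`, `m_step`, and the interpolation update are illustrative stand-ins, not the paper's implementation:

```python
def e_step(policy, critic, problems):
    """E-step: score every candidate answer with the current critic.

    `policy` maps problem -> {answer: probability}; `critic` plays
    the role of the (possibly drifting) reward model.
    """
    return {p: {a: critic(p, a) for a in policy[p]} for p in problems}

def m_step(policy, rewards, lr=0.5):
    """M-step: shift policy mass toward the highest-reward answer.

    A deliberately simple interpolation update, standing in for
    full policy optimization.
    """
    new_policy = {}
    for p, probs in policy.items():
        best = max(rewards[p], key=rewards[p].get)
        new_policy[p] = {
            a: (1 - lr) * q + (lr if a == best else 0.0)
            for a, q in probs.items()
        }
    return new_policy
```

One such round moves probability mass toward answers the critic prefers; repeating it is the basic TTT loop the section describes.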

Critic Recalibration Mechanism

The key innovation is alternating policy optimization and critic recalibration:

  1. Policy Refinement: Multi-round policy optimization on unlabeled problems
  2. Critic Recalibration: Update the reward model using a small amount of labeled data to restore objective criteria
  3. Cyclic Iteration: Ensure rewards do not drift, and policy optimization is based on reliable feedback
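The three-step cycle can be sketched as a loop. `tempo_loop`, the stub classes, and the `recalib_every` schedule are hypothetical illustrations of the alternation, not TEMPO's actual code:

```python
class StubPolicy:
    """Counts refinement rounds; stands in for a policy model."""
    def __init__(self):
        self.refinements = 0

    def refine(self, unlabeled, critic):
        self.refinements += 1
        return self

class StubCritic:
    """Counts recalibrations; stands in for a reward model."""
    def __init__(self):
        self.recalibrations = 0

    def recalibrate(self, labeled):
        self.recalibrations += 1
        return self

def tempo_loop(policy, critic, unlabeled, labeled, rounds, recalib_every=10):
    """Alternate policy refinement with periodic critic recalibration,
    so the reward signal is regularly re-anchored on labeled data."""
    for t in range(1, rounds + 1):
        policy = policy.refine(unlabeled, critic)   # policy refinement
        if t % recalib_every == 0:                  # periodic recalibration
            critic = critic.recalibrate(labeled)
    return policy, critic
```

The point of the schedule is that recalibration happens often enough to stop reward drift, but rarely enough to keep its cost small.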

Theoretical Guarantees

From the perspective of variational inference, EM iterations continuously tighten the Evidence Lower Bound (ELBO), ensuring a monotonic increase in log-likelihood, which explains the continuous performance improvement.
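The ELBO argument alluded to here is the standard EM decomposition (generic notation, not necessarily the paper's):

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \tfrac{p_\theta(x,z)}{q(z)}\right]}_{\mathrm{ELBO}(q,\theta)}
  + \mathrm{KL}\!\left(q(z)\,\middle\|\,p_\theta(z \mid x)\right)
```

The E-step closes the KL gap so the bound is tight, and the M-step raises the ELBO in $\theta$; since the KL term is non-negative, each full iteration leaves $\log p_\theta(x)$ non-decreasing.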


Section 04

Experimental Evidence: Performance Breakthroughs of TEMPO

Models and Datasets

  • Models: Qwen3 series (7B/14B/32B), OLMO3 series (7B/14B)
  • Tasks: AIME 2024 (math competition), GSM8K, MATH, GPQA

Key Results

  • OLMO3-7B on AIME 2024: Baseline 33.0% → TEMPO 51.1% (+18.1 points)
  • Qwen3-14B on AIME 2024: Baseline 42.3% → TEMPO 65.8% (+23.5 points)

Performance continues to improve as computational resources increase, with no plateau.

Comparison and Diversity

TEMPO significantly outperforms baselines like standard TTT, fixed Critic, and online Critic; it also maintains high output diversity, avoiding homogenization.


Section 05

In-depth Analysis: Why Does the EM Mechanism Work?

Stable Reward Quality

  • Standard TTT: Reward quality (correlation coefficient with true accuracy) drops from 0.85 to 0.45
  • TEMPO: Reward quality remains above 0.80

Smooth Policy Trajectory

  • Standard TTT: Parameters oscillate and converge to low-quality local optima
  • TEMPO: Parameters move smoothly toward high-quality regions

Computational Efficiency

Recalibration frequency is low (once every 10-20 rounds of policy optimization), so the overall impact on computational cost is limited, and the performance-computation trade-off is better than baselines.
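The claimed limited cost impact follows from simple arithmetic; a hypothetical sketch of the overhead fraction (names and cost units are illustrative):

```python
def recalibration_overhead(cost_policy_round, cost_recalibration, every=10):
    """Fraction of extra compute added by recalibrating the critic
    once every `every` rounds of policy optimization."""
    return cost_recalibration / (every * cost_policy_round)
```

For example, even if one recalibration costs twice a policy round, recalibrating every 10 rounds adds only 20% overhead.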


Section 06

Research Implications: New Paradigm for Test-Time Learning

New Direction for Test-Time Computing

Using test-time compute for genuine learning lets models "learn while thinking" and dynamically improve their capabilities, rather than merely sampling more candidates and voting over them.

Theoretical Basis for Bootstrap Learning

The EM perspective proves that bootstrapping is feasible; the key is to maintain reward objectivity, providing direction for the design of complex bootstrap mechanisms.

New Deployment Paradigm

Small-scale foundation models reduce costs; when a task arrives, TTT specializes the model to it, so each user or session can have its own specialized model, lowering the barrier to deploying AI systems.


Section 07

Limitations and Future Directions

Current Limitations

  1. Relies on a small amount of labeled data for critic calibration
  2. TTT is several times slower than standard inference
  3. Mainly validated on mathematical reasoning; generalization to other domains needs verification

Future Research

  • Unlabeled calibration: Adversarial calibration or meta-learning
  • Efficient implementation: Reduce inference latency
  • Multi-task TTT: Share experience to accelerate adaptation to new tasks
  • Theoretical deepening: Convergence guarantees and complexity bounds under the EM framework