Zing Forum

TEMPO: Continuous Expansion of Test-Time Training via EM Algorithm

TEMPO formalizes test-time training as an EM algorithm, solving the performance bottleneck of existing TTT methods through alternating iterations of policy optimization and critic recalibration, and achieving significant breakthroughs at AIME 2024.

Tags: Test-Time Training · EM Algorithm · Reinforcement Learning · Reasoning Models · Reward Calibration · Bootstrap Learning · Continuous Improvement
Published 2026-04-21 18:01 · Recent activity 2026-04-22 12:24 · Estimated read: 8 min

Section 01

Introduction: TEMPO—An EM Algorithm Innovation to Solve Test-Time Training Bottlenecks

TEMPO formalizes Test-Time Training (TTT) as an Expectation-Maximization (EM) algorithm. Through alternating iterations of policy optimization and critic recalibration, it addresses the bottleneck where existing TTT methods quickly hit a plateau after initial performance gains. This method has achieved significant breakthroughs in mathematical reasoning tasks like AIME 2024, providing a new paradigm for continuously expanding model capabilities during the inference phase.


Section 02

Background: Potential and Existing Bottlenecks of Test-Time Training

Paradigm of Test-Time Training

After deployment, large language models have fixed parameters. Test-Time Training (TTT) proposes continuing learning during the inference phase: when facing test samples, update parameters using unlabeled data before inference, theoretically breaking through pre-training limitations.

Bottlenecks of Existing TTT Methods

Existing methods hit a plateau after rapid initial gains: additional compute yields no further benefit, and performance can even degrade, with accuracy dropping and output diversity collapsing.

Root Cause of the Problem

The core issue is bootstrap reward signal drift: the policy model and reward model are coupled, and their feedback loop corrupts the reward criterion. The model tends to score its own outputs highly, losing objectivity.


Section 03

Core Methods of TEMPO: EM Framework and Critic Recalibration

Formalization of EM Algorithm

TEMPO re-formalizes TTT as an instance of the EM algorithm:

  • E-step: Evaluate the potential reward of unlabeled problems based on the current policy
  • M-step: Optimize policy parameters based on the estimated rewards

Existing TTT methods perform only an incomplete EM iteration: they skip the critic adjustment that should follow each policy update.
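The E/M split above can be made concrete with a toy sketch. The dictionary-based "policy", `e_step`, `m_step`, and the interpolation update are illustrative stand-ins, not the paper's implementation:

```python
def e_step(policy, critic, problems):
    """E-step: score every candidate answer with the current critic.

    `policy` maps problem -> {answer: probability}; `critic` plays
    the role of the (possibly drifting) reward model.
    """
    return {p: {a: critic(p, a) for a in policy[p]} for p in problems}

def m_step(policy, rewards, lr=0.5):
    """M-step: shift policy mass toward the highest-reward answer.

    A deliberately simple interpolation update, standing in for
    full policy optimization.
    """
    new_policy = {}
    for p, probs in policy.items():
        best = max(rewards[p], key=rewards[p].get)
        new_policy[p] = {
            a: (1 - lr) * q + (lr if a == best else 0.0)
            for a, q in probs.items()
        }
    return new_policy
```

One such round moves probability mass toward answers the critic prefers; repeating it is the basic TTT loop the section describes.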

Critic Recalibration Mechanism

The key innovation is alternating policy optimization and critic recalibration:

  1. Policy Refinement: Multi-round policy optimization on unlabeled problems
  2. Critic Recalibration: Update the reward model using a small amount of labeled data to restore objective criteria
  3. Cyclic Iteration: Ensure rewards do not drift, and policy optimization is based on reliable feedback
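The three-step cycle can be sketched as a loop. `tempo_loop`, the stub classes, and the `recalib_every` schedule are hypothetical illustrations of the alternation, not TEMPO's actual code:

```python
class StubPolicy:
    """Counts refinement rounds; stands in for a policy model."""
    def __init__(self):
        self.refinements = 0

    def refine(self, unlabeled, critic):
        self.refinements += 1
        return self

class StubCritic:
    """Counts recalibrations; stands in for a reward model."""
    def __init__(self):
        self.recalibrations = 0

    def recalibrate(self, labeled):
        self.recalibrations += 1
        return self

def tempo_loop(policy, critic, unlabeled, labeled, rounds, recalib_every=10):
    """Alternate policy refinement with periodic critic recalibration,
    so the reward signal is regularly re-anchored on labeled data."""
    for t in range(1, rounds + 1):
        policy = policy.refine(unlabeled, critic)   # policy refinement
        if t % recalib_every == 0:                  # periodic recalibration
            critic = critic.recalibrate(labeled)
    return policy, critic
```

The point of the schedule is that recalibration happens often enough to stop reward drift, but rarely enough to keep its cost small.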

Theoretical Guarantees

From the perspective of variational inference, EM iterations continuously tighten the Evidence Lower Bound (ELBO), ensuring a monotonic increase in log-likelihood, which explains the continuous performance improvement.
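The ELBO argument alluded to here is the standard EM decomposition (generic notation, not necessarily the paper's):

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \tfrac{p_\theta(x,z)}{q(z)}\right]}_{\mathrm{ELBO}(q,\theta)}
  + \mathrm{KL}\!\left(q(z)\,\middle\|\,p_\theta(z \mid x)\right)
```

The E-step closes the KL gap so the bound is tight, and the M-step raises the ELBO in $\theta$; since the KL term is non-negative, each full iteration leaves $\log p_\theta(x)$ non-decreasing.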


Section 04

Experimental Evidence: Performance Breakthroughs of TEMPO

Models and Datasets

  • Models: Qwen3 series (7B/14B/32B), OLMO3 series (7B/14B)
  • Tasks: AIME 2024 (math competition), GSM8K, MATH, GPQA

Key Results

  • OLMO3-7B on AIME 2024: Baseline 33.0% → TEMPO 51.1% (+18.1 points)
  • Qwen3-14B on AIME 2024: Baseline 42.3% → TEMPO 65.8% (+23.5 points)

Performance continues to improve as computational resources increase, with no plateau.

Comparison and Diversity

TEMPO significantly outperforms baselines like standard TTT, fixed Critic, and online Critic; it also maintains high output diversity, avoiding homogenization.


Section 05

In-depth Analysis: Why Does the EM Mechanism Work?

Stable Reward Quality

  • Standard TTT: Reward quality (correlation coefficient with true accuracy) drops from 0.85 to 0.45
  • TEMPO: Reward quality remains above 0.80

Smooth Policy Trajectory

  • Standard TTT: Parameters oscillate and converge to low-quality local optima
  • TEMPO: Parameters move smoothly toward high-quality regions

Computational Efficiency

Recalibration frequency is low (once every 10-20 rounds of policy optimization), so the overall impact on computational cost is limited, and the performance-computation trade-off is better than baselines.
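The claimed limited cost impact follows from simple arithmetic; a hypothetical sketch of the overhead fraction (names and cost units are illustrative):

```python
def recalibration_overhead(cost_policy_round, cost_recalibration, every=10):
    """Fraction of extra compute added by recalibrating the critic
    once every `every` rounds of policy optimization."""
    return cost_recalibration / (every * cost_policy_round)
```

For example, even if one recalibration costs twice a policy round, recalibrating every 10 rounds adds only 20% overhead.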


Section 06

Research Implications: New Paradigm for Test-Time Learning

New Direction for Test-Time Computing

Using test-time compute for genuine learning lets models "learn while thinking" and dynamically improve their capabilities, rather than merely sampling more candidates and voting over them.

Theoretical Basis for Bootstrap Learning

The EM perspective proves that bootstrapping is feasible; the key is to maintain reward objectivity, providing direction for the design of complex bootstrap mechanisms.

New Deployment Paradigm

Small-scale foundation models reduce costs; when a task arrives, TTT specializes the model to it, so each user or session can have its own specialized model, lowering the barrier to deploying AI systems.


Section 07

Limitations and Future Directions

Current Limitations

  1. Relies on a small amount of labeled data for critic calibration
  2. TTT is several times slower than standard inference
  3. Mainly validated on mathematical reasoning; generalization to other domains needs verification

Future Research

  • Unlabeled calibration: Adversarial calibration or meta-learning
  • Efficient implementation: Reduce inference latency
  • Multi-task TTT: Share experience to accelerate adaptation to new tasks
  • Theoretical deepening: Convergence guarantees and complexity bounds under the EM framework