ZAYA1-8B: A Small-Scale MoE Model Setting a New Benchmark by Challenging Large Models' Reasoning Performance

This article introduces ZAYA1-8B, a MoE reasoning model with only 700M active parameters. Through four-stage RL training and the Markovian RSA test-time computation method, it achieves 91.9% on AIME'25, approaching the performance of ultra-large models like Gemini-2.5 Pro.

Tags: ZAYA1-8B, Mixture of Experts, Reasoning Model, Test-Time Computation, MoE, AIME, Mathematical Reasoning, AMD
Published 2026-05-07 02:44 · Recent activity 2026-05-08 12:20 · Estimated read 9 min

Section 01

[Introduction] ZAYA1-8B: A Small-Scale MoE Model Setting a New Benchmark by Challenging Large Models' Reasoning Performance

ZAYA1-8B is a Mixture of Experts (MoE) reasoning model with only 700M active parameters out of 8B total. Through four-stage Reinforcement Learning (RL) training and the Markovian RSA test-time computation method, it achieves a score of 91.9% on the AIME'25 benchmark, with performance approaching that of ultra-large models such as Gemini-2.5 Pro. Built on the MoE++ architecture and trained entirely on AMD's full-stack computing platform, the model challenges the traditional belief that "reasoning ability is positively correlated with model size."


Section 02

Background and Model Architecture

In the field of large language models, the traditional belief is that reasoning ability is positively correlated with model size; top models like DeepSeek-R1 and Gemini-2.5 Pro often have tens or even hundreds of billions of parameters. ZAYA1-8B adopts Zyphra's MoE++ architecture, whose core features include sparse activation (only about 700M parameters are activated per inference), an optimized expert routing mechanism (ensuring load balancing), and dynamic expert offloading (supporting deployment in resource-constrained environments). In addition, its entire training pipeline, from pre-training to post-training, was completed on AMD's full-stack platform (Instinct accelerators, the ROCm software stack, etc.), demonstrating AMD's competitiveness in large model training.
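To make the sparse-activation idea concrete, the minimal top-k MoE layer below shows how a router sends each token to only k of the available expert MLPs, which is why the active parameter count (about 700M for ZAYA1-8B) can be far smaller than the total count (8B). This is a generic PyTorch sketch, not Zyphra's MoE++ implementation; the layer sizes, the choice of k=2, and the simplified load-balancing term are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative, not MoE++).
    Each token is routed to only k of n_experts expert MLPs, so the
    parameters actually used per token stay far below the total."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, n_experts)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_val, dim=-1)              # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e            # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(x[mask])
        # Simplified load-balancing signal: penalize uneven token counts per expert
        # (real MoE balancing losses are more elaborate).
        load = torch.bincount(topk_idx.flatten(), minlength=len(self.experts)).float()
        aux_loss = (load / load.sum()).pow(2).sum() * len(self.experts)
        return out, aux_loss
```

The routing decision is also the lever for the other two features mentioned above: a balancing objective keeps experts evenly used, and experts that are not selected for the current tokens can remain offloaded to cheaper memory in resource-constrained deployments.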


Section 03

Reasoning-Oriented Training Strategy

ZAYA1-8B adopts a reasoning-oriented strategy for training from scratch:

  1. Pre-training phase: Introduce reasoning data and design an "answer-preserving cropping scheme" to ensure no key information is lost, enabling the model to establish a reasoning mindset early on.
  2. Four-stage RL post-training:
    • Stage 1: Reasoning warm-up, using PPO/GRPO algorithms on math problems and logic puzzles to activate basic reasoning abilities (a minimal GRPO-style advantage sketch follows this list);
    • Stage 2: RLVE-Gym curriculum learning, covering 400 tasks ranging from basic to advanced;
    • Stage 3: Math- and code-specific RL, combining test-time computation trajectories and synthetic code environments to improve the stability of long reasoning chains;
    • Stage 4: Behavioral RL, optimizing dialogue and instruction-following capabilities while taking the interactive experience into account.
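For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage idea behind the Stage 1 warm-up: sample a group of answers per prompt, score each with a verifiable reward, and normalize the rewards against the group's own mean and standard deviation, so no separate value network is needed. The toy reward function, the sample completions, and the gold answer are hypothetical examples, not data from the ZAYA1 report.

```python
import re
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages (the core of GRPO): each sampled answer
    is scored against the mean/std of its own group, so no critic model
    is required."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def verifiable_math_reward(completion, gold_answer):
    """Toy verifiable reward: 1.0 if the last number in the completion
    matches the reference answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

# One prompt, a group of sampled completions, reference answer "42".
group = [
    "... so the result is 42",
    "... therefore the answer is 41",
    "... which gives 42",
    "... I get 40",
]
rewards = [verifiable_math_reward(c, "42") for c in group]
advantages = grpo_advantages(rewards)
print(rewards)      # [1.0, 0.0, 1.0, 0.0]
print(advantages)   # positive for correct completions, negative for incorrect ones
```

Correct completions receive positive advantages and incorrect ones negative, which is the signal the policy update then amplifies.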

Section 04

Markovian RSA: An Efficient Test-Time Computation Method

One of ZAYA1-8B's key innovations is Markovian RSA (Recursive Self-Aggregation), a test-time computation (TTC) method that addresses the context-window exhaustion problem of conventional TTC approaches:

  • Core idea: Generate multiple reasoning trajectories in parallel and recursively aggregate their information each round; only a fixed-length reasoning tail (e.g., 4K tokens) is passed forward instead of the complete history, so each round's decision depends only on the current tail (the Markov property), keeping the context burden bounded (see the sketch after this list).
  • Performance improvement: With this method, ZAYA1-8B reaches 91.9% on AIME'25 and 89.6% on HMMT'25, approaching ultra-large models at much higher efficiency.
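The control flow of such a loop is easy to sketch. In the illustrative code below, generate and aggregate stand in for model calls; the number of parallel trajectories, the number of rounds, and the 4K-token tail length are assumptions used only to show the structure (parallel sampling, recursive aggregation, and a fixed-length tail as the Markov state), not the actual ZAYA1-8B implementation.

```python
from typing import Callable, List

def truncate_to_tail(text: str, max_tokens: int) -> str:
    """Crude whitespace tokenization; a real system would use the model's tokenizer."""
    tokens = text.split()
    return " ".join(tokens[-max_tokens:])

def markovian_rsa(problem: str,
                  generate: Callable[[str], str],         # model call: prompt -> reasoning trace
                  aggregate: Callable[[List[str]], str],  # merge several traces into one
                  n_parallel: int = 4,
                  n_rounds: int = 3,
                  tail_tokens: int = 4096) -> str:
    """Illustrative recursive self-aggregation with a Markov-style state:
    each round conditions only on a fixed-length tail of the aggregated
    reasoning, never on the full history, so the context window does not
    grow with the number of rounds."""
    state = ""                                            # the fixed-length tail carried between rounds
    for _ in range(n_rounds):
        prompt = f"{problem}\n\nPrevious reasoning tail:\n{state}"
        traces = [generate(prompt) for _ in range(n_parallel)]   # parallel trajectories
        merged = aggregate(traces)                                # recursive aggregation step
        state = truncate_to_tail(merged, tail_tokens)             # keep only the tail (Markov property)
    return state

# Toy usage with stub model calls (real calls would hit the LLM):
answer_tail = markovian_rsa(
    "What is 17 * 23?",
    generate=lambda prompt: "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391, so the answer is 391.",
    aggregate=lambda traces: max(traces, key=len),        # e.g., keep the most detailed trace
)
print(answer_tail)
```

Because the state passed between rounds never exceeds tail_tokens, the total context cost stays bounded no matter how many aggregation rounds are run, which is exactly the property that lets a small model spend more test-time computation without exhausting its context window.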

Section 05

Benchmark Test Result Comparison

ZAYA1-8B performs strongly across multiple high-difficulty benchmarks:

  • Mathematical reasoning: AIME'25 (91.9%), HMMT'25 (89.6%)—equivalent to or exceeding DeepSeek-R1-0528;
  • Code generation: Can compete with larger-scale dedicated code models on programming competition benchmarks;
  • Comprehensive reasoning: Stable performance on multi-step complex tasks with strong generalization.

These results far surpass models of the same scale and even approach ultra-large models like Gemini-2.5 Pro.

Section 06

Technical Insights

The success of ZAYA1-8B brings the following insights:

  1. Size is not the only answer: Through architecture optimization and training strategies, small models can challenge the reasoning performance of large models, providing possibilities for resource-constrained scenarios;
  2. Training quality matters more than size: The four-stage RL cascade and reasoning-oriented pre-training indicate that training data quality and process design are more important than parameter count;
  3. New paradigm for test-time computation: Markovian RSA demonstrates an efficient way to extend reasoning under limited context;
  4. Hardware ecosystem diversity: Successful training on the full AMD platform demonstrates the feasibility of non-NVIDIA ecosystems and reduces the industry's dependence on a single vendor.

Section 07

Application Scenarios

ZAYA1-8B's compact size makes it suitable for a variety of scenarios:

  • Edge device deployment: 700M active parameters can run efficiently on consumer GPUs or high-end CPUs;
  • Real-time reasoning services: Low latency makes it suitable for real-time applications like online Q&A and code completion;
  • Cost-sensitive scenarios: Inference cost is much lower than ultra-large models, suitable for large-scale deployment;
  • Research benchmark: Provides the community with a strong reference point for small and medium-sized reasoning models.

Section 08

Conclusion

ZAYA1-8B proves that "efficiency does not mean compromise." Through innovations like the MoE architecture, reasoning-oriented training, and Markovian RSA, a model with 700M active parameters achieves the performance of models tens of times its size. This not only sets a new benchmark for small model research but also provides a feasible path for the popularization and democratization of AI.