# ZAYA1-8B: A Small-Scale MoE Model Setting a New Benchmark by Challenging Large Models' Reasoning Performance

> This article introduces ZAYA1-8B, a MoE reasoning model with only 700M active parameters. Through four-stage RL training and the Markovian RSA test-time computation method, it achieves 91.9% on AIME'25, approaching the performance of ultra-large models like Gemini-2.5 Pro.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-06T18:44:08.000Z
- Last activity: 2026-05-08T04:20:53.131Z
- Popularity: 126.4
- Keywords: ZAYA1-8B, Mixture of Experts, reasoning model, test-time computation, MoE, AIME, mathematical reasoning, AMD
- Page link: https://www.zingnex.cn/en/forum/thread/zaya1-8b-moe
- Canonical: https://www.zingnex.cn/forum/thread/zaya1-8b-moe
- Markdown source: floors_fallback

---

## [Introduction] ZAYA1-8B: A Small-Scale MoE Model Setting a New Benchmark by Challenging Large Models' Reasoning Performance

ZAYA1-8B is a Mixture of Experts (MoE) reasoning model with only 700M active parameters (8B total). Through four-stage Reinforcement Learning (RL) training and the Markovian RSA test-time computation method, it scores 91.9% on the AIME'25 benchmark, approaching that of ultra-large models like Gemini-2.5 Pro. Built on the MoE++ architecture and trained entirely on AMD's full-stack computing platform, the model challenges the traditional belief that reasoning ability is positively correlated with model size.

## Background and Model Architecture

In the field of large language models, the traditional belief is that reasoning ability is positively correlated with model size; top models like DeepSeek-R1 and Gemini-2.5 Pro often have tens or even hundreds of billions of parameters. ZAYA1-8B adopts Zyphra's MoE++ architecture, whose core features include sparse activation (only about 700M parameters are activated per inference step), an optimized expert-routing mechanism that ensures load balancing, and dynamic expert offloading for deployment in resource-constrained environments. Its entire training pipeline, from pre-training through post-training, was completed on AMD's full-stack platform (Instinct accelerators, the ROCm software stack, etc.), demonstrating AMD's competitiveness in large-model training.
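To make the sparse-activation idea concrete, here is a minimal, illustrative sketch of top-K expert routing, the mechanism that lets an MoE layer activate only a fraction of its parameters per token. The function and weight names (`moe_layer`, `W_gate`, etc.) are hypothetical; the article does not describe MoE++'s exact router, so this shows only the generic pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, E, K = 16, 32, 8, 2  # model dim, expert hidden dim, num experts, top-K

W_gate = rng.normal(size=(D, E)) * 0.02       # router weights
W_in = rng.normal(size=(E, D, H)) * 0.02      # per-expert FFN input weights
W_out = rng.normal(size=(E, H, D)) * 0.02     # per-expert FFN output weights

def moe_layer(x):
    """Route each token to its top-K experts; only those K experts run,
    so the active parameter count is K/E of the total expert parameters."""
    logits = x @ W_gate                          # (T, E) router scores
    topk = np.argsort(logits, axis=-1)[:, -K:]   # indices of the K best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                             # softmax over selected experts only
        for weight, e in zip(w, topk[t]):
            h = np.maximum(x[t] @ W_in[e], 0.0)  # expert FFN (ReLU)
            out[t] += weight * (h @ W_out[e])
    return out

tokens = rng.normal(size=(4, D))
y = moe_layer(tokens)
print(y.shape)  # (4, 16)
```

With E = 8 experts and K = 2, only a quarter of the expert weights participate per token, which is the same principle that gives ZAYA1-8B its 700M-active / 8B-total split.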

## Reasoning-Oriented Training Strategy

ZAYA1-8B adopts a reasoning-oriented strategy for training from scratch:
1. **Pre-training phase**: Introduce reasoning data and design an "answer-preserving cropping scheme" to ensure no key information is lost, enabling the model to establish a reasoning mindset early on.
2. **Four-stage RL post-training**: 
   - Stage 1: Reasoning warm-up—use PPO/GRPO algorithms on math problems and logic puzzles to activate basic reasoning abilities;
   - Stage 2: RLVE-Gym curriculum learning—cover 400 tasks from basic to advanced;
   - Stage 3: Math and code-specific RL—combine test-time computation trajectories and synthetic code environments to enhance the stability of long reasoning chains;
   - Stage 4: Behavioral RL—optimize dialogue and instruction-following capabilities while considering interactive experience.
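Stage 1 names GRPO, whose defining trait is computing advantages relative to a group of sampled responses rather than a learned value network. A minimal sketch of that group-relative advantage (the function name `grpo_advantages` and the toy rewards are illustrative, not from the article):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: each sampled response
    is scored against the mean/std of its own group, so no separate
    value model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled reasoning traces scored by a verifier (1 = correct).
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # correct traces get positive advantage, incorrect ones negative
```

Verifiable rewards like these (a math answer either matches or it doesn't) are what make the RLVE-Gym curriculum in Stage 2 scalable across many tasks.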

## Markovian RSA: An Efficient Test-Time Computation Method

One of ZAYA1-8B's innovations is the Markovian RSA (Recursive Self-Aggregation) test-time computation method, which solves the context window exhaustion problem of traditional TTC:
- **Core idea**: Generate multiple reasoning trajectories in parallel and recursively aggregate them each round; pass forward only a fixed-length (e.g., 4K-token) reasoning tail instead of the complete history, so each round's decision depends only on the current tail (the Markov property), keeping the context burden constant.
- **Performance improvement**: After using this method, ZAYA1-8B achieves 91.9% on AIME'25 and 89.6% on HMMT'25, approaching ultra-large models with higher efficiency.
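The control flow above can be sketched as a toy harness. This is an assumption-laden illustration: `generate` is a stand-in for a real LLM call, `aggregate` here just concatenates (a real aggregator would prompt the model to reconcile the trajectories), and the constants are scaled down from the article's ~4K-token tail.

```python
import random

TAIL = 64     # fixed-length tail carried between rounds (article: ~4K tokens)
N_TRAJ = 4    # parallel reasoning trajectories per round
ROUNDS = 3

def generate(prompt, tail, rng):
    """Stand-in for a model call: returns a fresh reasoning segment.
    A real system would sample an LLM continuation of prompt + tail."""
    return f"step[{rng.randint(0, 999)}] " * 8

def aggregate(tails):
    """Merge the parallel tails into one summary; here, naive concatenation."""
    return " | ".join(tails)

def markovian_rsa(prompt, seed=0):
    rng = random.Random(seed)
    tail = ""  # only this fixed-size state crosses round boundaries (Markov property)
    for _ in range(ROUNDS):
        # Each round sees the prompt plus the current tail, never the full history.
        trajs = [generate(prompt, tail, rng) for _ in range(N_TRAJ)]
        merged = aggregate([t[-TAIL:] for t in trajs])
        tail = merged[-TAIL:]  # truncate: per-round context cost stays constant
    return tail

ans = markovian_rsa("Solve: ...")
print(len(ans) <= TAIL)  # True — the carried state never grows
```

The key property the sketch demonstrates is that the state passed between rounds has a hard size cap, so total reasoning can be extended indefinitely without exhausting the context window.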

## Benchmark Test Result Comparison

ZAYA1-8B performs excellently on multiple high-difficulty benchmarks:
- **Mathematical reasoning**: AIME'25 (91.9%), HMMT'25 (89.6%)—matching or exceeding DeepSeek-R1-0528;
- **Code generation**: Can compete with larger-scale dedicated code models on programming competition benchmarks;
- **Comprehensive reasoning**: Stable performance on multi-step complex tasks with strong generalization. These results far surpass models of the same scale and even approach ultra-large models like Gemini-2.5 Pro.

## Technical Insights

The success of ZAYA1-8B brings the following insights:
1. **Size is not the only answer**: Through architecture optimization and training strategies, small models can challenge the reasoning performance of large models, providing possibilities for resource-constrained scenarios;
2. **Training quality matters more than size**: The four-stage RL cascade and reasoning-oriented pre-training indicate that training data quality and process design are more important than parameter count;
3. **New paradigm for test-time computation**: Markovian RSA demonstrates an efficient way to extend reasoning under limited context;
4. **Hardware ecosystem diversity**: Successful training on the full AMD platform proves the feasibility of non-NVIDIA ecosystems, reducing the industry's single dependency.

## Application Scenarios

ZAYA1-8B's compact size is suitable for various scenarios:
- **Edge device deployment**: 700M active parameters can run efficiently on consumer GPUs or high-end CPUs;
- **Real-time reasoning services**: Low latency makes it suitable for real-time applications like online Q&A and code completion;
- **Cost-sensitive scenarios**: Inference cost is much lower than ultra-large models, suitable for large-scale deployment;
- **Research benchmark**: Provides a strong benchmark for small and medium-sized reasoning models for the community.

## Conclusion

ZAYA1-8B proves that "efficiency does not mean compromise." Through innovations like the MoE architecture, reasoning-oriented training, and Markovian RSA, a model with 700M active parameters achieves the performance of models tens of times its size. This not only sets a new benchmark for small model research but also provides a feasible path for the popularization and democratization of AI.
