ICLR 2026 Oral: Optimal Solution to Sparsity in Mixture-of-Experts Models—A New Paradigm for Reasoning Tasks

A joint team from the Tokyo Institute of Technology and RIKEN proposes a new theory for optimizing sparsity in Mixture-of-Experts (MoE) models, reveals distinct scaling laws for reasoning and memorization, and releases 65 open-source checkpoints.

Tags: MoE · Mixture-of-Experts · Sparsity · ICLR 2026 · Scaling Laws · Reasoning · LLM · Open-Source Models · Tokyo Institute of Technology · RIKEN
Published 2026-05-01 00:03 · Recent activity 2026-05-01 00:17 · Estimated read 5 min

Section 01

[Introduction] ICLR 2026 Oral Research Reveals Optimal Sparsity Solution for MoE and Capability Scaling Laws

A study on optimizing sparsity in Mixture-of-Experts (MoE) models, by a joint team from the Tokyo Institute of Technology and RIKEN, has been accepted as an Oral paper at ICLR 2026. The work proposes that reasoning and memorization in MoE models follow distinct scaling laws, and it fully open-sources 65 pre-trained checkpoints and the accompanying code, providing a new paradigm for MoE architecture design.

Section 02

Research Background: Challenges of MoE Architecture and Issues with Capability Dimensions

Through sparse activation, MoE expands model capacity while keeping inference cost low, and it has become a standard component of top systems such as GPT-4. However, scaling laws derived for dense models do not carry over to sparse architectures. Moreover, large language models (LLMs) exhibit two distinct capabilities, memorization (fitting the training data) and reasoning (solving complex tasks), so the impact of sparsity on each, and the differences in their scaling laws, need to be examined separately.
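
As a concrete picture of the sparse-activation mechanism described above, here is a minimal top-k routing layer in PyTorch. It is a sketch, not the paper's implementation; the layer sizes, expert count, and Top-K value are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k MoE layer: only k of n_experts run for each token."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # routing probabilities
        weights, idx = gates.topk(self.k, dim=-1)          # keep only the top-k experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top-k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to e
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

x = torch.randn(4, 512)
print(TopKMoE()(x).shape)  # torch.Size([4, 512])
```

Each token pays only for the k experts it is routed to, which is why total parameter count and per-token compute can scale independently.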

Section 03

Key Findings: Independent Scaling Principles for Reasoning and Memory Capabilities

The team proposes two robust principles: (1) activated FLOPs determine reasoning capability, so among models with the same training loss, those with higher activated compute reason better; (2) tokens-per-parameter (TPP) needs balancing, since memorization tasks favor more total parameters while reasoning performance peaks at an optimal TPP. These laws continue to hold under Group Relative Policy Optimization (GRPO) post-training and test-time compute scaling, so the optimal sparsity must be chosen during the pre-training phase.
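
A back-of-the-envelope sketch of the two quantities these principles refer to. All configuration numbers below are hypothetical, chosen only to show how activated compute per token and tokens-per-parameter are derived; the 6N FLOPs-per-token rule is the common approximation for a combined forward and backward pass, not a figure from the paper.

```python
# Back-of-the-envelope MoE budget arithmetic (all numbers hypothetical).

def moe_budget(total_params, active_params, training_tokens):
    """Return (activated FLOPs per token, tokens per total parameter).

    Uses the common ~6 * N_active FLOPs-per-token approximation for a
    combined forward and backward pass over N_active activated parameters.
    """
    activated_flops_per_token = 6 * active_params
    tokens_per_param = training_tokens / total_params
    return activated_flops_per_token, tokens_per_param

# Two hypothetical 20B-total-parameter MoE models trained with the same
# compute budget: the sparser one activates fewer parameters per token,
# so the same budget buys it more training tokens (and a higher TPP).
configs = {
    "denser MoE (Top-K high)": moe_budget(20e9, 8e9, 1e12),
    "sparser MoE (Top-K low)": moe_budget(20e9, 2e9, 4e12),
}
for name, (flops, tpp) in configs.items():
    print(f"{name}: ~{flops:.1e} activated FLOPs/token, TPP = {tpp:.0f}")
```

Under a fixed training budget, lowering the activated parameter count buys more training tokens and hence a higher TPP; that is exactly the trade-off the second principle says must be balanced against the first principle's preference for higher activated FLOPs.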

Section 04

Experimental Validation: Systematic Support from 65 Open-Source Checkpoints

The team releases 65 pre-trained checkpoints (covering different hidden dimensions, numbers of experts, and Top-K configurations), trained with NVIDIA Megatron-LM and evaluated on benchmarks such as GSM8K and MATH (mathematical reasoning) and HumanEval and MBPP (code generation). Evaluation runs through EleutherAI's lm-evaluation-harness, and the data, code, and logs are all open-sourced to make the results reproducible.
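
For illustration, a minimal evaluation call through the harness's Python API might look like the following; the checkpoint name is a placeholder, not one of the 65 released models.

```python
# Sketch: scoring a Hugging Face checkpoint on GSM8K with lm-evaluation-harness
# (pip install lm-eval). "ORG/example-moe-checkpoint" is a placeholder name.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=ORG/example-moe-checkpoint",  # hypothetical checkpoint
    tasks=["gsm8k"],                                     # math-reasoning benchmark
    batch_size=8,
)
print(results["results"]["gsm8k"])                       # accuracy metrics
```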

Section 05

Practical Implications: Re-thinking the MoE Design Paradigm

The findings revise the traditional compute-optimal scaling picture: activated FLOPs and TPP must be considered together. For resource-constrained teams, this argues for prioritizing activated compute and data efficiency rather than blindly expanding parameter counts. The open-source checkpoints also give the community a shared research resource, accelerating iteration on MoE architectures.

Section 06

Technical Details and Open-Source Contributions

The project builds on NVIDIA Megatron-LM and the volcengine/verl framework. Pre-training scripts live in scripts/pre-training/, with corresponding checkpoints on Hugging Face. It integrates EleutherAI's lm-evaluation-harness and provides evaluation scripts for math and code tasks; the taskloss-eval/ directory contains instructions for task-loss evaluation, and test-time-compute/ supports self-consistency decoding.
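
To make the self-consistency idea concrete, here is a minimal majority-vote sketch; the sampling function is a stand-in for any stochastic generate-and-parse step, not the repository's actual API.

```python
import random
from collections import Counter

def self_consistency(sample_answer, question, n_samples=16):
    """Sample n reasoning paths and return the majority-vote answer.

    `sample_answer` is a stand-in for any stochastic generate-and-parse
    step (e.g. temperature sampling followed by final-answer extraction).
    """
    answers = [sample_answer(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # answer and its vote share

# Toy usage: a fake sampler that is right about 70% of the time.
fake = lambda q: "42" if random.random() < 0.7 else str(random.randint(0, 9))
print(self_consistency(fake, "What is 6 * 7?"))
```

Spending more samples per question trades extra test-time compute for accuracy, which is the axis the paper's test-time-compute experiments vary.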

Section 07

Conclusion and Outlook: Capability Decoupling Drives Architectural Innovation

This research offers a new perspective on decoupling LLM capabilities: future architectures may shift from 'one-size-fits-all' designs to 'divide and conquer'. Optimizing MoE sparsity configurations will become a key topic, and the open-source resources lay the groundwork for exploring this direction and for building compute-optimal AI systems.