# ICLR 2026 Oral: Optimal Solution to Sparsity in Mixture-of-Experts Models—A New Paradigm for Reasoning Tasks

> A joint team from Tokyo Institute of Technology and RIKEN proposes a new theory for optimizing sparsity in Mixture-of-Experts (MoE) models, reveals distinct scaling laws for reasoning and memory capabilities, and releases 65 open-source checkpoints.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-30T16:03:49.000Z
- Last activity: 2026-04-30T16:17:19.418Z
- Popularity: 154.8
- Keywords: MoE, Mixture-of-Experts, sparsity, ICLR 2026, scaling laws, reasoning capability, LLM, open-source models, Tokyo Institute of Technology, RIKEN
- Page link: https://www.zingnex.cn/en/forum/thread/iclr-2026-oral
- Canonical: https://www.zingnex.cn/forum/thread/iclr-2026-oral
- Markdown source: floors_fallback

---

## [Introduction] ICLR 2026 Oral Research Reveals Optimal Sparsity Solution for MoE and Capability Scaling Laws

The research on optimizing sparsity in Mixture-of-Experts (MoE) models by the joint team from Tokyo Institute of Technology and RIKEN has been accepted as an Oral paper at ICLR 2026. This study proposes that reasoning and memory capabilities in MoE follow distinct scaling laws, and fully open-sources 65 pre-trained checkpoints and related code, providing a new paradigm for MoE architecture design.

## Research Background: Challenges of MoE Architecture and Issues with Capability Dimensions

MoE expands model capacity while maintaining inference efficiency through sparse activation, and has become a standard component in frontier systems such as GPT-4. However, scaling laws established for dense models do not transfer directly to sparse architectures. Moreover, large language models (LLMs) exhibit two distinct capabilities, memorization (fitting the training data) and reasoning (solving complex tasks), so how sparsity affects each, and whether the two follow different scaling laws, needs to be examined.
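The sparse-activation mechanism described above can be sketched as a top-K routing layer: a router scores all experts, but only the K highest-scoring experts actually run for each token. This is a minimal illustrative sketch, not the paper's implementation; all names and shapes are hypothetical.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Sparse MoE layer sketch: route one token to its top_k experts.

    x:         (d,) one token's hidden state
    gate_w:    (d, E) router weights
    expert_ws: list of E (d, d) expert weight matrices
    (all names are illustrative, not from the paper's code)
    """
    scores = x @ gate_w                      # router logits over the E experts
    top = np.argsort(scores)[-top_k:]        # indices of the top_k experts
    probs = np.exp(scores[top])
    probs /= probs.sum()                     # softmax over the selected experts only
    # Only top_k of E experts execute, so activated compute is ~top_k/E of dense.
    return sum(p * (x @ expert_ws[i]) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
d, E = 8, 4
out = moe_forward(rng.normal(size=d),
                  rng.normal(size=(d, E)),
                  [rng.normal(size=(d, d)) for _ in range(E)])
print(out.shape)  # (8,)
```

The key design point is that total capacity grows with E while per-token compute grows only with `top_k`, which is exactly the gap that makes dense scaling laws inapplicable.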

## Key Findings: Independent Scaling Principles for Reasoning and Memory Capabilities

The team proposes two robust principles. First, activated FLOPs determine reasoning capability: among models with the same training loss, those with more activated computation perform better on reasoning tasks. Second, tokens-per-parameter (TPP) must be balanced: memory tasks favor more parameters, while reasoning performance peaks at an optimal TPP. These laws hold under Group Relative Policy Optimization (GRPO) post-training and test-time compute scaling, implying that the optimal sparsity must be chosen during the pre-training phase.
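The two quantities in these principles are easy to compute from a model's configuration. The sketch below, with purely illustrative numbers and helper names not taken from the paper, shows how a sparse model can have large total parameters (driving TPP down) while keeping activated parameters, and hence activated FLOPs per token, small.

```python
def activated_params(shared, per_expert, top_k):
    """Parameters active per token in a sparse MoE (illustrative helper)."""
    return shared + top_k * per_expert

def tokens_per_parameter(train_tokens, total_params):
    """TPP: ratio of training tokens to total model parameters."""
    return train_tokens / total_params

# Hypothetical config: 64 experts, Top-2 routing, 1T training tokens.
shared = 2e9            # attention + embeddings, always active
per_expert = 0.5e9      # parameters per expert
total = shared + 64 * per_expert               # 34B total parameters
active = activated_params(shared, per_expert, top_k=2)  # 3B active per token
tpp = tokens_per_parameter(1e12, total)
print(active, round(tpp, 2))  # 3000000000.0 29.41
```

Comparing such numbers across configurations is what separates the two axes: reasoning tracks `active` (activated compute), while memorization tracks `total` and the resulting TPP.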

## Experimental Validation: Systematic Support from 65 Open-Source Checkpoints

The team releases 65 pre-trained checkpoints covering different hidden dimensions, expert counts, and Top-K configurations, trained with NVIDIA Megatron-LM and evaluated on benchmarks such as GSM8K/MATH (mathematical reasoning) and HumanEval/MBPP (code generation) via EleutherAI's lm-evaluation-harness. Data, code, and logs are all open-sourced to ensure the reproducibility of the research.

## Practical Implications: Re-thinking the MoE Design Paradigm

The findings revise the traditional compute-optimal scaling picture: architecture choices must weigh activated FLOPs and TPP jointly. For resource-constrained teams, the guidance is to prioritize activated computation and data efficiency rather than blindly expanding total parameters. The open-source checkpoints also give the community a shared resource for iterating on MoE architecture design.

## Technical Details and Open-Source Contributions

The project builds on NVIDIA Megatron-LM and the volcengine/verl framework. Pre-training scripts live in scripts/pre-training/ (matching the Hugging Face checkpoints); evaluation integrates EleutherAI's lm-evaluation-harness with scripts for math and code tasks; the taskloss-eval/ directory documents task-loss evaluation; and test-time-compute/ supports self-consistency decoding.

## Conclusion and Outlook: Capability Decoupling Drives Architectural Innovation

This research offers a new perspective on decoupling LLM capabilities: future architectures may shift from a 'one-size-fits-all' approach to 'divide and conquer'. Optimizing MoE sparsity configuration will become a key topic, and the open-source resources lay the groundwork for exploration in this direction, supporting the development of compute-optimal AI systems.
