Zing Forum

SETA: A Mixture of Sparse Experts Architecture to Solve the Dilemma of Continual Learning in Large Models

This article introduces the SETA framework, which addresses the conflict between plasticity and stability in continual learning for large language models. It uses adaptive sparse subspace decomposition and expert routing so that a model can keep learning new knowledge while avoiding catastrophic forgetting.

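Tags: continual learning, large language models, sparse experts, catastrophic forgetting, machine learning, parameter-efficient, lifelong learning
Published 2026-06-06 01:53 | Recent activity 2026-06-08 09:26 | Estimated read 6 min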

Section 01

SETA: A Mixture of Sparse Experts Architecture to Solve the Dilemma of Continual Learning in Large Models

This article introduces the SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning) framework. SETA addresses the plasticity-stability conflict in continual learning for large language models through adaptive sparse subspace decomposition and expert routing, so that the model learns new knowledge without catastrophically forgetting old knowledge. The framework divides the parameter space into unique experts (task-specific) and shared experts (general across tasks), and combines them with a dynamic routing mechanism.

Section 02

Core Dilemma of Continual Learning and Limitations of Existing Methods

Continual learning for large language models faces a dilemma between plasticity and stability: learning a new task requires updating parameters, but those updates can overwrite old knowledge and cause catastrophic forgetting. Existing methods treat parameters as a homogeneous resource and do not distinguish task-specific knowledge from shared knowledge. New and old tasks therefore compete for the same parameters, and gains on one come at the expense of the other.

Section 03

Core Architecture Design of the SETA Framework

The core innovation of SETA is separating the parameter space into two parts:

  • Unique Experts: Each new task gets an independent module that learns task-specific patterns without interfering with other tasks;
  • Shared Experts: Capture general cross-task features and knowledge, and are shared by all tasks so that general capabilities are reused.

This separation avoids parameter competition between new and old tasks, which is how SETA addresses the plasticity-stability conflict.
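The split into shared and unique experts can be illustrated with a toy layer. This is only a sketch under stated assumptions, not SETA's actual implementation: the class name `ToyExpertLayer`, the use of plain linear maps as experts, the top-k softmax gate, and all sizes are hypothetical choices made for illustration. It shows that adding a task appends a new unique expert and a new gate row while the shared experts are left untouched, and that routing needs no task identifier.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyExpertLayer:
    """Hypothetical layer: shared experts plus one unique expert per task."""

    def __init__(self, dim, num_shared, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.shared = [self._matrix() for _ in range(num_shared)]
        self.unique = []  # grows by one expert per new task
        # One gate row per expert, ordered shared first, then unique.
        self.gate = [self._row() for _ in range(num_shared)]

    def _row(self):
        return [self.rng.gauss(0.0, 0.1) for _ in range(self.dim)]

    def _matrix(self):
        return [self._row() for _ in range(self.dim)]

    def add_task(self):
        """Add a unique expert and its gate row; shared experts are not modified."""
        self.unique.append(self._matrix())
        self.gate.append(self._row())

    def forward(self, x, top_k=2):
        """Route x to the top_k experts chosen by the gate (no task id used)."""
        experts = self.shared + self.unique
        scores = matvec(self.gate, x)
        k = min(top_k, len(scores))
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        probs = softmax([scores[i] for i in keep])
        weights = [0.0] * len(scores)
        for i, p in zip(keep, probs):
            weights[i] = p
        y = [0.0] * self.dim
        for i in keep:
            out = matvec(experts[i], x)
            y = [acc + weights[i] * o for acc, o in zip(y, out)]
        return y, weights


layer = ToyExpertLayer(dim=4, num_shared=2)
layer.add_task()  # first task
layer.add_task()  # second task
output, routing_weights = layer.forward([1.0, 0.5, -0.5, 2.0])
```

Here `routing_weights` has one entry per expert, and only the selected experts receive a nonzero weight. That sparsity is the assumption this sketch makes about what "sparse" routing means.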

Section 04

Key Technical Implementation of SETA

SETA relies on three techniques:

  1. Adaptive Elastic Anchoring Mechanism: Applies soft constraints to shared expert parameters, allowing necessary adjustments while preventing catastrophic parameter drift;
  2. Routing-Aware Regularization: Protects shared knowledge at both the weight level and the routing level, so that the gating network does not drastically change how it calls the shared experts;
  3. Unified Gating Network: Dynamically activates the relevant unique and shared experts during inference, invoking the right knowledge automatically without task identifiers.
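The first two techniques can be read as extra terms added to the training loss. The sketch below is an assumption about their general shape, not the formulation used by SETA: the anchoring term is written as an importance-weighted quadratic penalty (similar in spirit to EWC), and the routing term as a KL divergence between the gate's distribution over shared experts before and after learning a new task. The function names and the coefficients `lam_anchor` and `lam_route` are hypothetical.

```python
import math


def anchoring_penalty(params, anchors, importance):
    """Soft constraint: penalize drift of shared parameters from their anchors.

    Each parameter has its own importance weight, so important parameters
    are held tightly while unimportant ones remain free to adapt.
    """
    return sum(w * (p - a) ** 2 for p, a, w in zip(params, anchors, importance))


def routing_penalty(old_probs, new_probs, eps=1e-12):
    """KL(old || new) between gate distributions over the shared experts."""
    return sum(
        o * math.log((o + eps) / (n + eps))
        for o, n in zip(old_probs, new_probs)
        if o > 0
    )


def total_loss(task_loss, params, anchors, importance, old_probs, new_probs,
               lam_anchor=1.0, lam_route=1.0):
    """New-task loss plus the two protection terms (assumed additive form)."""
    return (
        task_loss
        + lam_anchor * anchoring_penalty(params, anchors, importance)
        + lam_route * routing_penalty(old_probs, new_probs)
    )


# Illustrative numbers only: one shared parameter drifts, and the gate
# shifts probability mass away from the first shared expert.
loss = total_loss(
    task_loss=0.40,
    params=[1.0, 3.0], anchors=[1.0, 2.0], importance=[0.5, 2.0],
    old_probs=[0.9, 0.1], new_probs=[0.5, 0.5],
)
```

Both penalties are zero when nothing has changed and grow as the shared weights or the routing pattern move away from their earlier state, which matches the "soft constraint" behavior described in the list above.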

Section 05

Experimental Validation and Performance Analysis

Experiments used models such as LLaMA-2 7B and Qwen3-4B and were evaluated on multi-domain benchmarks covering text classification, question answering, and generation:

  • Overall Performance: Comparable to or better than state-of-the-art baselines;
  • Knowledge Retention: Effectively mitigates catastrophic forgetting, maintaining good performance on early tasks;
  • Backward Transfer: Learning new tasks sometimes improves performance on old tasks.

Compared to existing methods, SETA provides stronger protection than regularization methods (EWC, SI), is more parameter-efficient than architectural methods (Progressive Networks), and, unlike replay methods, does not need to store old data.
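Knowledge retention and backward transfer are usually quantified from a matrix of accuracies recorded after each training stage. The sketch below uses one common pair of definitions from the continual-learning literature. The summary does not say which metrics the SETA experiments report, so treat the choice of formulas as an assumption. The accuracy values are made up purely to show the arithmetic.

```python
def backward_transfer(acc):
    """Average change on earlier tasks after training on the final task.

    acc[i][j] is the accuracy on task j measured after training on task i
    (0-indexed). A positive value means later learning helped old tasks.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("need at least two tasks")
    final = acc[num_tasks - 1]
    return sum(final[j] - acc[j][j] for j in range(num_tasks - 1)) / (num_tasks - 1)


def average_forgetting(acc):
    """Average drop from each old task's best earlier accuracy to its final one."""
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("need at least two tasks")
    final = acc[num_tasks - 1]
    total = 0.0
    for j in range(num_tasks - 1):
        best_before = max(acc[i][j] for i in range(j, num_tasks - 1))
        total += best_before - final[j]
    return total / (num_tasks - 1)


# Made-up accuracies for three sequential tasks (rows: after training task i).
example_acc = [
    [0.80, 0.00, 0.00],
    [0.78, 0.70, 0.00],
    [0.82, 0.69, 0.75],
]
bwt = backward_transfer(example_acc)
forgetting = average_forgetting(example_acc)
```

In this made-up example, task 0 ends slightly above its original accuracy and task 1 slightly below, so `bwt` comes out marginally positive. That is the situation the "Backward Transfer" bullet describes.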

Section 06

Technical Insights and Implications of SETA

SETA highlights three points. First, it suggests a property of the LLM parameter space: knowledge of different tasks occupies different subspaces. Second, it allocates capacity dynamically, adapting how much is exclusive to a task and how much is shared. Third, its task-agnostic design, which needs no task identifier at inference time, makes it more practical for real-world scenarios.

Section 07

Limitations and Future Research Directions

SETA still has open issues:

  • Balancing the number of experts and model size;
  • Exploring expert merging and compression to improve parameter efficiency;
  • Finer-grained subspace decomposition;
  • Combining SETA with techniques such as knowledge distillation and meta-learning to enhance its capabilities.

Section 08

Practical Application Value and Conclusion

SETA's potential applications include personalized model services, domain adaptation, privacy-preserving learning, and lifelong learning systems. In conclusion, SETA offers a novel approach to continual learning for LLMs, supported by both its design rationale and its experimental results, and it opens up new directions for research in this field.