Zing Forum

Instruction-Aware Pruning: Enabling Large Language Models to Activate Parameters On-Demand

An innovative dynamic pruning method that uses a small predictor network to determine which neurons should be activated based on input instructions. It achieves 50% parameter pruning while maintaining model performance, providing new ideas for deploying large models on edge devices.

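Tags: model pruning, large language models, IFPruning, sparsity prediction, parameter efficiency, model compression, dynamic inference, Qwen, edge deployment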
Published 2026-05-27 21:13 | Recent activity 2026-05-27 21:20 | Estimated read 6 min

Section 01

Instruction-Aware Pruning (IFPruning): An Innovative Method to Enable Large Models to Activate Parameters On-Demand

Core Insights

Instruction-Aware Pruning (IFPruning) is a dynamic pruning method in which a small predictor network reads the input instruction and decides which FFN neurons the large model should activate. The target configuration keeps 50% of the neurons in each FFN layer while maintaining model performance, which makes the approach a candidate for deploying large models on edge devices.


Section 02

Problem Background: Limitations of Static Pruning

Traditional model pruning is static: a fixed set of weights is removed once, and the same pruned model serves every input. This has the following drawbacks:

  1. Ignores input heterogeneity: simple Q&A and complex reasoning place very different demands on the model
  2. Hard to balance performance and efficiency: over-pruning harms performance, while conservative pruning wastes resources
  3. Lacks adaptability: the compute budget cannot be adjusted at inference time

The ideal strategy should be dynamic and input-aware: use fewer parameters for simple inputs and retain more capabilities for complex inputs.


Section 03

Core Architecture and Ideas of IFPruning

IFPruning consists of three core components:

  1. Large model to be pruned: for example Qwen2.5-3B-Instruct, with a target of 50% active parameters
  2. Sparsity predictor: a lightweight model (e.g., SmolLM2-360M) that reads the instruction and produces the representation used to build neuron masks for the FFN layers
  3. Mask head network: a two-layer MLP that converts the predictor's representation into top-k selection decisions (retaining 50% of neurons per layer)
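
To make the top-k selection concrete, here is a minimal sketch in plain Python. It assumes the mask head has already produced one score per FFN neuron, and it uses a simple two-matrix ReLU FFN rather than the gated FFN that Qwen models actually use. The function names and toy weights are illustrative and are not taken from the project.

```python
def top_k_mask(scores, keep_ratio=0.5):
    """Return a 0/1 mask that keeps the highest-scoring fraction of neurons."""
    k = max(1, int(len(scores) * keep_ratio))
    # Indices sorted by descending score; the sort is stable, so ties
    # go to the lower index.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    for i in order[:k]:
        mask[i] = 1
    return mask


def masked_ffn(x, w_in, w_out, mask):
    """Two-matrix ReLU FFN that only computes neurons whose mask is 1.

    w_in has shape (d_ff, d_model); w_out has shape (d_model, d_ff).
    """
    active = [j for j, m in enumerate(mask) if m]
    hidden = {
        j: max(0.0, sum(w * xi for w, xi in zip(w_in[j], x)))
        for j in active
    }
    return [sum(row[j] * hidden[j] for j in active) for row in w_out]


# Toy example: 4 FFN neurons, 2-dimensional input (hypothetical values).
scores = [0.1, 0.9, 0.5, 0.3]          # assumed output of the mask head
mask = top_k_mask(scores)              # [0, 1, 1, 0]
x = [1.0, 2.0]
w_in = [[1, 0], [0, 1], [1, 1], [-1, 0]]
w_out = [[1, 1, 1, 1], [0, 1, 0, 1]]
y = masked_ffn(x, w_in, w_out, mask)   # [5.0, 2.0]
```

Because masked neurons are skipped rather than multiplied by zero, the FFN cost scales with the number of retained neurons, which is where the inference saving would come from.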

Section 04

Two-Stage Training Strategy

Stage 1: Continual Pre-training

  • Corpus: SlimPajama
  • Data Organization: (Current chunk, next chunk) pairs
  • Training Details: bf16 mixed precision, 4-GPU DDP, main model lr=1e-6, predictor lr=1e-4, gradient checkpointing disabled
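
The (current chunk, next chunk) organization can be sketched as follows. The assumption here, not stated in the post, is that the predictor reads the current chunk and the resulting mask is applied while the main model is trained on the next chunk. The function name and the chunk length are hypothetical; the two learning rates are the ones listed above.

```python
def make_chunk_pairs(token_ids, chunk_len):
    """Split a token stream into (current chunk, next chunk) pairs."""
    chunks = [
        token_ids[i:i + chunk_len]
        for i in range(0, len(token_ids), chunk_len)
    ]
    # Drop a trailing partial chunk so every pair has full-length members.
    chunks = [c for c in chunks if len(c) == chunk_len]
    return list(zip(chunks[:-1], chunks[1:]))


# Separate learning rates for the two trainable parts, as listed above.
LEARNING_RATES = {"main_model": 1e-6, "predictor": 1e-4}

pairs = make_chunk_pairs(list(range(10)), chunk_len=3)
# pairs == [([0, 1, 2], [3, 4, 5]), ([3, 4, 5], [6, 7, 8])]
```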

Stage 2: Instruction Fine-tuning (SFT)

  • Dataset: Tulu-v2 + FLAN-V2
  • Template: Qwen2.5 chat template
  • Loss Calculation: Only on assistant response tokens
  • Goal: Align with dialogue scenarios
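
Computing the loss only on assistant tokens is usually done by masking the labels of all other tokens. The sketch below assumes the conversation is already tokenized into per-role segments. The helper name is hypothetical, and -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    Only tokens in assistant segments keep their id as a label; every
    other position is set to IGNORE_INDEX and contributes no loss.
    """
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels


input_ids, labels = build_labels([("user", [1, 2]), ("assistant", [3, 4, 5])])
# input_ids == [1, 2, 3, 4, 5]
# labels    == [-100, -100, 3, 4, 5]
```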

Section 05

Highlights of Technical Implementation

  1. Dual tokenizer processing: the main model and the predictor use different tokenizers, so each sample is encoded separately for each of them during preprocessing, with both encodings kept aligned to the same source text
  2. Multi-family model support: compatible with the Llama series and the Qwen2 series, switchable via configuration
  3. Complete evaluation system: Integrates lm-evaluation-harness, supports tasks like MMLU and HellaSwag, and compares with baselines such as dense models and random pruning
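
A random-pruning baseline only makes sense as a comparison if it keeps the same number of neurons as the learned mask. A minimal sketch of such a baseline mask, with a hypothetical function name and a fixed seed for reproducibility, might look like this:

```python
import random


def random_mask(num_neurons, keep_ratio=0.5, seed=0):
    """Return a 0/1 mask that keeps a random subset of neurons.

    The number of kept neurons matches what a top-k mask with the same
    keep_ratio would retain, so the comparison is at equal sparsity.
    """
    rng = random.Random(seed)
    k = max(1, int(num_neurons * keep_ratio))
    keep = set(rng.sample(range(num_neurons), k))
    return [1 if i in keep else 0 for i in range(num_neurons)]


baseline = random_mask(8)  # 8 entries, exactly 4 of them set to 1
```

If the learned masks do not beat this baseline on the evaluation tasks, the predictor is not contributing useful input-dependent information.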

Section 06

Experimental Findings and Key Insights

  1. Learning rate sensitivity: joint training is sensitive to the combination of learning rates; a predictor learning rate that is too high easily leads to mask collapse
  2. Advantage of frozen main model: Freezing the main model (only training the predictor and mask head) yields better results over long training periods, avoiding representation drift
  3. Mask effectiveness: Different inputs activate different neuron subsets, validating the dynamic pruning hypothesis
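
One way to check both observations, mask collapse and input-dependent masks, is to measure how much the masks for different inputs overlap. The sketch below takes "mask collapse" to mean that the predictor emits nearly the same mask for every input, which is an assumption about how the term is used here. The metric (mean pairwise Jaccard overlap) and the function names are illustrative choices, not the project's own diagnostics.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two 0/1 masks of equal length."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0


def mean_pairwise_overlap(masks):
    """Average Jaccard overlap over all pairs; needs at least two masks."""
    pairs = list(combinations(masks, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


collapsed = [[1, 1, 0, 0]] * 3               # same mask for every input
diverse = [[1, 1, 0, 0], [0, 0, 1, 1]]       # disjoint neuron subsets

mean_pairwise_overlap(collapsed)  # 1.0
mean_pairwise_overlap(diverse)    # 0.0
```

An overlap close to 1.0 across varied instructions would suggest collapse, while clearly lower values are consistent with input-dependent neuron selection.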

Section 07

Practical Significance and Application Prospects

  1. Edge device deployment: activating only 50% of the parameters reduces inference cost, which helps run large models on resource-constrained devices
  2. Adaptive compute budget: the fixed 50% ratio could be extended to an adjustable pruning rate, so that compute is allocated according to the input
  3. Research implications: small-scale training is prone to mask collapse, so larger datasets and more stable training strategies are needed
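
The effect of an adjustable pruning rate on the active parameter count is simple arithmetic. The sketch below counts FFN weights only and assumes a gated FFN with three weight matrices per layer. The dimensions are illustrative rather than those of any specific model, and the post does not say whether its 50% figure also covers attention and embedding parameters.

```python
def neurons_to_keep(d_ff, keep_ratio):
    """Number of FFN neurons retained per layer for a given keep ratio."""
    return max(1, int(d_ff * keep_ratio))


def active_ffn_params(d_model, d_ff, num_layers, keep_ratio, matrices_per_ffn=3):
    """Active FFN weights across all layers (biases ignored)."""
    k = neurons_to_keep(d_ff, keep_ratio)
    return num_layers * matrices_per_ffn * d_model * k


# Illustrative dimensions, not taken from the post.
dense = active_ffn_params(1024, 4096, 12, keep_ratio=1.0)     # 150_994_944
half = active_ffn_params(1024, 4096, 12, keep_ratio=0.5)      # 75_497_472
quarter = active_ffn_params(1024, 4096, 12, keep_ratio=0.25)  # 37_748_736
```

Under these assumptions the active FFN weight count is linear in the keep ratio, so an adaptive scheme could trade compute against quality simply by varying k.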

Section 08

Conclusion: Future Directions of Dynamic Pruning

Instruction-aware pruning marks a shift in model compression from static to dynamic and from general-purpose to input-adaptive. Training stability remains an open challenge, but the approach opens up new possibilities for efficient large model deployment, and input-aware dynamic computation is likely to receive more attention.