Instruction-Aware Pruning: Enabling Large Language Models to Activate Parameters On-Demand

An innovative dynamic pruning method that uses a small predictor network to determine which neurons should be activated based on input instructions. It achieves 50% parameter pruning while maintaining model performance, providing new ideas for deploying large models on edge devices.

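Tags: model pruning, large language models, IFPruning, sparsity prediction, parameter-efficient, model compression, dynamic inference, Qwen, edge deployment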

Published 2026-05-27 21:13 | Recent activity 2026-05-27 21:20 | Estimated read 6 min

Section 01

Instruction-Aware Pruning (IFPruning): An Innovative Method to Enable Large Models to Activate Parameters On-Demand

Core Insights

IFPruning is a dynamic pruning method: a small predictor network reads the input instruction and decides which neurons of the large model to activate. It retains 50% of the FFN neurons in each layer while maintaining model performance, which makes it a promising approach for deploying large models on edge devices.

Original Authors and Sources

Original Author/Maintainer: wonjin0403
Source Platform: GitHub
Original Title: IFPruning-Implementation
Original Link: https://github.com/wonjin0403/IFPruning-Implementation
Release Time: May 27, 2026

Section 02

Problem Background: Limitations of Static Pruning

Traditional model pruning is static: the same parameters are removed for every input. This has the following drawbacks:

Ignores input heterogeneity: Simple Q&A and complex reasoning place very different demands on the model
Hard to balance performance and efficiency: Over-pruning harms performance, while conservative pruning wastes resources
Lacks adaptability: Computing resources cannot be adjusted at inference time

An ideal strategy would be dynamic and input-aware: use fewer parameters for simple inputs and retain more capacity for complex ones.

Section 03

Core Architecture and Ideas of IFPruning

IFPruning consists of three core components:

Large model to be pruned: For example Qwen2.5-3B-Instruct, pruned so that 50% of its parameters remain active
Sparsity predictor: A lightweight model (e.g., SmolLM2-360M) reads the instruction and produces the representation from which the FFN neuron masks are derived
Mask head network: A two-layer MLP converts the predictor's representation into top-k selection decisions (retaining 50% of neurons per layer)
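
The top-k selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's code: the function names `topk_mask` and `masked_ffn_hidden` and the example scores are hypothetical, and the two-layer MLP that would produce the scores is omitted. It covers only hard selection at inference time.

```python
def topk_mask(scores, keep_ratio=0.5):
    """Return a 0/1 mask that keeps the highest-scoring fraction of neurons."""
    k = max(1, int(len(scores) * keep_ratio))
    # Indices of the k largest scores; ties keep the lower index first.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in keep:
        mask[i] = 1
    return mask


def masked_ffn_hidden(hidden, mask):
    """Zero out the intermediate FFN activations of pruned neurons."""
    return [h * m for h, m in zip(hidden, mask)]


# Hypothetical mask-head scores for one FFN layer with 8 intermediate neurons.
scores = [0.9, -1.2, 0.3, 2.1, -0.5, 0.0, 1.4, -2.0]
mask = topk_mask(scores, keep_ratio=0.5)

# Hypothetical intermediate activations, all set to 1.0 for readability.
active = masked_ffn_hidden([1.0] * 8, mask)
```

With a keep ratio of 0.5, four of the eight example neurons stay active and the others contribute nothing to the FFN output.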

Section 04

Two-Stage Training Strategy

Stage 1: Continual Pre-training

Corpus: SlimPajama
Data Organization: (Current chunk, next chunk) pairs
Training Details: bf16 mixed precision, 4-GPU DDP, main model lr=1e-6, predictor lr=1e-4, gradient checkpointing disabled
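
The (current chunk, next chunk) data organization can be sketched as follows. This is an assumption-laden illustration: the helper name `make_chunk_pairs`, the fixed chunk length, and the choice to drop a trailing short chunk are hypothetical, and the source does not state how each chunk is consumed (presumably the predictor reads the current chunk and the pruned model is trained on the next one).

```python
def make_chunk_pairs(token_ids, chunk_len):
    """Split a token sequence into consecutive (current, next) chunk pairs."""
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids), chunk_len)]
    # Drop a trailing short chunk so that every chunk has the same length.
    chunks = [c for c in chunks if len(c) == chunk_len]
    # Pair each chunk with the chunk that immediately follows it.
    return list(zip(chunks[:-1], chunks[1:]))


# Toy "document" of 10 token ids, split into chunks of 3.
pairs = make_chunk_pairs(list(range(10)), chunk_len=3)
```

Each chunk except the first and last appears twice: once as the "next" element of one pair and once as the "current" element of the following pair.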

Stage 2: Instruction Fine-tuning (SFT)

Dataset: Tulu-v2 + FLAN-V2
Template: Qwen2.5 chat template
Loss Calculation: Only on assistant response tokens
Goal: Align with dialogue scenarios
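
Computing the loss only on assistant response tokens usually comes down to masking the labels of all other positions. The sketch below shows the idea in plain Python. The names `build_labels` and `masked_nll` are hypothetical, the value -100 mirrors the default `ignore_index` of PyTorch's cross-entropy loss and is an assumption about this project, and the usual one-position shift between inputs and labels is omitted for brevity.

```python
import math

IGNORE_INDEX = -100  # assumed "skip this position" label value


def build_labels(token_ids, is_assistant):
    """Copy token ids as labels, masking every non-assistant position."""
    return [t if a else IGNORE_INDEX for t, a in zip(token_ids, is_assistant)]


def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over positions that are not ignored.

    log_probs[i] maps candidate token ids to log-probabilities at position i.
    """
    kept = [-lp[y] for lp, y in zip(log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


# Toy sequence: two prompt tokens followed by two assistant tokens.
token_ids = [5, 6, 7, 8]
is_assistant = [False, False, True, True]
labels = build_labels(token_ids, is_assistant)

# Hypothetical model log-probabilities for the target token at each position.
log_probs = [
    {5: math.log(0.1)},
    {6: math.log(0.1)},
    {7: math.log(0.5)},
    {8: math.log(0.25)},
]
loss = masked_nll(log_probs, labels)
```

The two prompt positions have low probability in this toy example, yet they do not affect the loss, since only the two assistant positions are averaged.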

Section 05

Highlights of Technical Implementation

Dual tokenizer processing: The main model and the predictor use different tokenizers, so the same text is encoded separately with each tokenizer during preprocessing and the two encodings are kept aligned
Multiple model families: Compatible with the Llama and Qwen2 series, switchable via configuration
Evaluation: Integrates lm-evaluation-harness, supports tasks such as MMLU and HellaSwag, and compares against baselines such as the dense model and random pruning
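
A random pruning baseline of the kind mentioned in the evaluation item can be as simple as the following sketch. The function name `random_mask`, the fixed seed, and the per-layer 0/1 mask format are assumptions for illustration; the repository's actual baseline may differ.

```python
import random


def random_mask(num_neurons, keep_ratio=0.5, seed=0):
    """Keep a uniformly random subset of neurons, ignoring the input."""
    rng = random.Random(seed)  # local generator so results are reproducible
    k = int(num_neurons * keep_ratio)
    keep = set(rng.sample(range(num_neurons), k))
    return [1 if i in keep else 0 for i in range(num_neurons)]


# One hypothetical FFN layer with 8 intermediate neurons, half kept.
baseline = random_mask(8, keep_ratio=0.5, seed=0)
```

Because this mask has the same sparsity as a learned mask but carries no information about the input, any gap between the two reflects what the predictor has learned.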

Section 06

Experimental Findings and Key Insights

Learning rate sensitivity: Joint training is sensitive to the combination of learning rates; a predictor learning rate that is too high easily leads to mask collapse
Advantage of a frozen main model: Freezing the main model (training only the predictor and mask head) yields better results over long training runs and avoids representation drift
Mask effectiveness: Different inputs activate different neuron subsets, which supports the dynamic pruning hypothesis
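
One way to check whether different inputs really activate different neuron subsets is to measure how much their masks overlap. The sketch below uses the Jaccard index for this. The metric, the function names, and the example masks are all hypothetical, and it assumes that "mask collapse" means the predictor emits nearly the same mask for every input, which the source does not define explicitly.

```python
def jaccard(mask_a, mask_b):
    """Jaccard overlap between the active-neuron sets of two 0/1 masks."""
    a = {i for i, m in enumerate(mask_a) if m}
    b = {i for i, m in enumerate(mask_b) if m}
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_overlap(masks):
    """Average Jaccard overlap over all distinct pairs of masks."""
    if len(masks) < 2:
        raise ValueError("need at least two masks to compare")
    pairs = [(i, j) for i in range(len(masks)) for j in range(i + 1, len(masks))]
    return sum(jaccard(masks[i], masks[j]) for i, j in pairs) / len(pairs)


# Hypothetical masks predicted for three different instructions.
masks = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
]
overlap = mean_pairwise_overlap(masks)

# A collapsed predictor would return the same mask for every input.
collapsed_overlap = mean_pairwise_overlap([masks[0]] * 3)
```

An average overlap close to 1.0 across many inputs would suggest collapse under this assumption, while clearly lower values indicate that the masks depend on the input.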

Section 07

Practical Significance and Application Prospects

Edge device deployment: Pruning 50% of parameters reduces inference cost, which helps run large models on resource-constrained devices
Adaptive computing budget: The method could be extended to vary the pruning rate, allocating more compute where it is needed
Research implications: Small-scale training is prone to mask collapse, so larger datasets and more stable training strategies are required

Section 08

Conclusion: Future Directions of Dynamic Pruning

Instruction-aware pruning moves model compression from static to dynamic and from one-size-fits-all to adaptive. Training stability remains a challenge, but the approach opens up new possibilities for deploying large models efficiently, and input-aware dynamic computation is likely to receive more attention.