Zing Forum

SubFit: A New Paradigm for LLM Compression at the Submodule Level, Breaking Hierarchical and Continuity Constraints

SubFit retains 84.6% of downstream accuracy at 25% sparsity by selecting submodules at non-contiguous positions and replacing them with lightweight fitted residual bypasses. It outperforms traditional layer-level (hierarchical) compression methods and offers a more efficient route to deploying large models.

Tags: model compression, large language models, sparsification, post-training compression, Transformer, Attention, FeedForward, model deployment
Published 2026-06-02 01:52 | Recent activity 2026-06-02 13:53 | Estimated read 8 min

Section 01

SubFit: Introduction to the New Paradigm of LLM Compression at the Submodule Level

SubFit is a post-training LLM compression method that operates at the submodule level. Traditional layer-level (hierarchical) compression treats a whole Transformer layer as its unit and removes layers in one contiguous run. SubFit instead selects individual submodules at non-contiguous positions and replaces each with a lightweight fitted residual bypass. At 25% sparsity it retains 84.6% of downstream accuracy, ahead of the strongest baseline's 81.6%, which makes it an efficient option for large model deployment.


Section 02

Research Background: Limitations of Traditional LLM Compression and Redundancy Analysis

Post-training compression of large language models aims to reduce inference costs, but existing replacement-based methods share two constraints: full-layer granularity (an entire Transformer layer is the smallest unit that can be compressed) and contiguous selection (the removed components must form one continuous run of adjacent layers).

The authors' analysis found that redundancy in pre-trained Transformers is unevenly distributed:

  1. Uneven spatial distribution: Redundancy is scattered across different depths of the network
  2. Component type differences: Attention and FeedForward submodules show different redundancy characteristics
  3. Non-contiguous patterns: Removable components are not necessarily adjacent to one another

Layer-level compression is therefore too coarse and misses these fine-grained optimization opportunities.
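
To make the two constraints concrete, the sketch below compares a layer-level, contiguous selection with a submodule-level, non-contiguous one on a toy model. It is only an illustration under stated assumptions: the importance scores are invented, the function names are hypothetical, and summing scores as the "cost" of a removal is a simplification, not SubFit's actual importance metric.

```python
# Hypothetical importance scores on a 0-100 scale (lower = more redundant).
# Illustrative values only; they are not taken from the paper.
# One (attention, feed-forward) pair per Transformer layer.
SCORES = [
    (90, 80),  # layer 0
    (20, 70),  # layer 1
    (60, 10),  # layer 2
    (15, 80),  # layer 3
    (85, 30),  # layer 4
    (95, 90),  # layer 5
]


def remove_contiguous_layers(scores, n_layers):
    """Layer-level baseline: drop the cheapest run of n_layers adjacent layers."""
    layer_cost = [attn + ffn for attn, ffn in scores]
    starts = range(len(scores) - n_layers + 1)
    start = min(starts, key=lambda s: sum(layer_cost[s:s + n_layers]))
    removed = [(i, kind) for i in range(start, start + n_layers)
               for kind in ("attn", "ffn")]
    return removed, sum(layer_cost[start:start + n_layers])


def remove_submodules(scores, n_submodules):
    """Submodule-level selection: drop the n least important submodules anywhere."""
    flat = []
    for i, (attn, ffn) in enumerate(scores):
        flat.append((attn, i, "attn"))
        flat.append((ffn, i, "ffn"))
    chosen = sorted(flat)[:n_submodules]
    return [(i, kind) for _, i, kind in chosen], sum(s for s, _, _ in chosen)


# Both strategies remove 4 of the 12 submodules.
layer_removed, layer_cost = remove_contiguous_layers(SCORES, n_layers=2)
sub_removed, sub_cost = remove_submodules(SCORES, n_submodules=4)
print(layer_removed, layer_cost)
print(sub_removed, sub_cost)
```

With these toy scores, the contiguous choice (layers 1 and 2) discards a total importance of 160, while the submodule-level choice picks four scattered submodules with a total of 75 for the same number of removals.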


Section 03

Detailed Explanation of SubFit Method: Submodule-Level Non-Contiguous Compression and Residual Replacement

Core design principles of SubFit (Submodule-level Fitted residual replacement):

  1. Submodule granularity: The compression unit is refined from a full layer to the individual Attention and FeedForward submodules, and the importance of each is evaluated independently
  2. Non-contiguous selection: Submodules at any position may be compressed, so redundancy can be targeted wherever it occurs
  3. Lightweight residual replacement: Each selected submodule is replaced by a fitted residual bypass. The residual connection is kept, a lightweight fitting module stands in for the submodule, and that module is fitted on calibration data

Implementation flow: Importance evaluation → Submodule selection → Residual bypass design → Calibration training → Iterative optimization.
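
The residual bypass idea can be sketched in a few lines. This summary does not state what form SubFit's fitting module takes, so the per-feature affine map below is an assumption chosen for brevity, and `fit_affine_bypass`, `apply_bypass`, and `toy_submodule` are hypothetical names. The sketch fits the bypass to a submodule's residual-branch output on a handful of calibration inputs by ordinary least squares.

```python
def fit_affine_bypass(inputs, branch_outputs):
    """Fit g(h)_j = a_j * h_j + b_j per feature by least squares.

    inputs:         calibration hidden states entering the submodule
    branch_outputs: what the original submodule adds to the residual stream
    ASSUMPTION: a per-feature affine map; the real fitting module may differ.
    """
    n, d = len(inputs), len(inputs[0])
    params = []
    for j in range(d):
        xs = [row[j] for row in inputs]
        ys = [row[j] for row in branch_outputs]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        a = cov / var if var > 0 else 0.0
        params.append((a, mean_y - a * mean_x))
    return params


def apply_bypass(h, params):
    """Keep the residual connection: output = h + g(h)."""
    return [x + a * x + b for x, (a, b) in zip(h, params)]


def toy_submodule(h):
    """Stand-in for the residual branch of a real Attention or FeedForward block."""
    return [0.5 * h[0] + 1.0, -0.25 * h[1]]


# Tiny synthetic calibration set (a real run would use sampled hidden states).
calib_inputs = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [4.0, 0.0]]
calib_outputs = [toy_submodule(h) for h in calib_inputs]

params = fit_affine_bypass(calib_inputs, calib_outputs)
bypassed = apply_bypass([2.0, 4.0], params)
print(params)
print(bypassed)
```

Because the toy submodule is itself affine, the fit recovers it almost exactly. A real Attention or FeedForward block is nonlinear, so a bypass of this kind would only approximate it, which is why the choice of submodules to replace matters.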


Section 04

Experimental Validation: SubFit Outperforms Traditional Methods

Experimental Setup: 10 LLMs (5 base and 5 instruction-tuned) are compressed at sparsity levels from 12.5% to 37.5% and compared against 4 baseline methods on perplexity and downstream accuracy.

Key Results:

  • At 25% sparsity: 84.6% downstream accuracy retention versus 81.6% for the strongest baseline (+3.0 percentage points), and perplexity degradation of 2.42x versus 4.34x (about 44% lower)
  • Inference efficiency: Faster inference and lower KV cache memory use, which makes the compressed model easier to deploy
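
The two comparisons in the key results can be checked with simple arithmetic. The snippet uses only the four figures reported above; the variable names are illustrative.

```python
# Figures reported at 25% sparsity.
subfit_retention = 84.6      # % of downstream accuracy retained
baseline_retention = 81.6    # strongest baseline
subfit_ppl_ratio = 2.42      # perplexity relative to the uncompressed model
baseline_ppl_ratio = 4.34

gain_points = subfit_retention - baseline_retention
ppl_reduction = 1 - subfit_ppl_ratio / baseline_ppl_ratio

print(f"retention gain: {gain_points:.1f} percentage points")   # 3.0
print(f"lower perplexity degradation: {ppl_reduction:.1%}")     # 44.2%
```

Note that the accuracy gain is a difference in percentage points, whereas the 44% figure is a relative reduction in the perplexity degradation factor.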

Ablation Experiments: Submodule granularity, non-contiguous selection, and residual replacement each contribute to the final result.


Section 05

Technical Advantages and Comparison with Other Compression Methods

Technical Advantages:

  1. Fine-grained optimization: Locates redundancy precisely, applies type-aware strategies to Attention and FeedForward submodules, and retains key capabilities
  2. Post-training friendly: Needs no retraining and only a small amount of calibration data, works plug-and-play, and supports progressive compression

Comparison with Other Methods:

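  • vs. Pruning: SubFit needs no fine-tuning to maintain performance
  • vs. Quantization: SubFit compresses model structure rather than numeric precision, so the two can be complementary
  • vs. Distillation: SubFit compresses the original model directly, retaining its architecture and weights instead of training a separate student model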

Section 06

Application Prospects and Deployment Recommendations

Applicable Scenarios: Resource-constrained deployment (edge/mobile), high-throughput services, long-context applications, cost-sensitive applications
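
For long-context use, the KV cache saving can be estimated with the standard size formula: two tensors (keys and values) per attention layer, each of size KV heads × head dimension × sequence length. The sketch below assumes that a replaced Attention submodule no longer stores keys or values. The model shape and the number of removed Attention submodules are hypothetical, since this summary does not report how removals split between Attention and FeedForward.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Size of the KV cache: keys + values for every layer that still has attention."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 32k context.
full = kv_cache_bytes(32, 8, 128, 32768)
# ASSUMPTION: 8 Attention submodules replaced by bypasses that cache nothing.
compressed = kv_cache_bytes(32 - 8, 8, 128, 32768)

print(full / GIB, compressed / GIB)  # 4.0 3.0
```

Under these assumptions the cache shrinks in direct proportion to the number of Attention submodules removed, and the absolute saving grows linearly with context length.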

Deployment Recommendations:

  1. Start at 25% sparsity and adjust from there
  2. Prepare a small amount of target-domain calibration data (thousands of samples)
  3. Validate performance on your downstream tasks
  4. Combine with quantization when more aggressive compression is needed
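
One way to act on these recommendations is a small sweep that picks the highest sparsity still meeting an accuracy-retention target on your own validation tasks. The helper below is a sketch: `pick_sparsity` is a hypothetical name, and `measure_retention` stands for whatever routine you use to compress the model and evaluate it. In the demo, only the 25% value comes from the reported results; the other two are placeholders so the example runs.

```python
def pick_sparsity(candidates, measure_retention, min_retention):
    """Return the highest candidate sparsity whose measured retention meets the target.

    measure_retention: callable mapping a sparsity level to % accuracy retained
                       (hypothetical; supply your own compress-and-evaluate step).
    Returns None if no candidate meets the target.
    """
    best = None
    for sparsity in sorted(candidates):
        # Check every level: retention is not guaranteed to fall monotonically.
        if measure_retention(sparsity) >= min_retention:
            best = sparsity
    return best


# Demo stub. 84.6 at 25% is the reported figure; the values at 12.5% and
# 37.5% are PLACEHOLDERS, not reported results.
stub_results = {0.125: 95.0, 0.25: 84.6, 0.375: 70.0}

chosen = pick_sparsity(stub_results.keys(), stub_results.get, min_retention=80.0)
strict = pick_sparsity(stub_results.keys(), stub_results.get, min_retention=99.0)
print(chosen, strict)
```

In practice each call to `measure_retention` is expensive, so it is reasonable to evaluate 25% first and only then try the neighbouring levels.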

Section 07

Current Limitations and Future Research Directions

Current Limitations:

  1. Performance drops significantly at extremely high sparsity (>50%)
  2. Tasks that are sensitive to specific submodules are affected more
  3. Results depend on the quality of the calibration data

Future Directions:

  1. Dynamic compression (input-adaptive submodule activation)
  2. Mixed granularity compression
  3. Adaptive sparsity learning
  4. Multi-task joint compression optimization

Section 08

Significance and Prospects of SubFit

SubFit removes the layer-level and contiguity constraints of earlier methods and shows that fine-grained submodule compression can improve results while keeping the convenience of post-training compression. With LLM deployment cost a growing concern, it offers a practical and efficient option that could lower deployment thresholds and widen the range of applications.