SubFit: A New Paradigm for LLM Compression at the Submodule Level, Breaking Hierarchical and Continuity Constraints

SubFit retains 84.6% of downstream accuracy at 25% sparsity through submodule-level non-continuous selection and lightweight residual replacement, compared with 81.6% for the strongest baseline. It outperforms traditional layer-level (hierarchical) compression methods and offers a more efficient compression option for large model deployment.

Tags: Model compression, large language models, sparsification, post-training compression, Transformer, Attention, FeedForward, model deployment

Published 2026-06-02 01:52 | Recent activity 2026-06-02 13:53 | Estimated read 8 min

Section 01

SubFit: Introduction to the New Paradigm of LLM Compression at the Submodule Level

SubFit is a post-training LLM compression method that operates at the submodule level. Traditional layer-level (hierarchical) compression removes whole Transformer layers and requires the removed layers to be adjacent. SubFit drops both constraints: it selects individual submodules at any depth and replaces them with lightweight fitted residual bypasses. At 25% sparsity it retains 84.6% of downstream accuracy, compared with 81.6% for the strongest baseline, which makes it a practical option for large model deployment.

Basic Information:

Original author team (arXiv submission)
Source: arXiv, original title: From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
Release date: June 1, 2026
Open-source code: https://github.com/eliacunegatti/SubFit
Original link: http://arxiv.org/abs/2606.02559v1

Section 02

Research Background: Limitations of Traditional LLM Compression and Redundancy Analysis

Post-training compression of large language models aims to reduce inference costs, but existing replacement-based methods operate under two constraints: full-layer granularity (entire Transformer layers are the unit of removal) and continuous selection (the removed components must form one contiguous block).

The authors' analysis found that redundancy in pre-trained Transformers is distributed non-uniformly:

Uneven spatial distribution: Redundancy is scattered across different depths
Component type differences: Attention and FeedForward have different redundancy characteristics
Non-continuous patterns: Removable components do not need to be continuous

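Traditional layer-level (hierarchical) compression is therefore too coarse: it cannot remove a redundant submodule without also removing the other, possibly important, submodule in the same layer, so it misses fine-grained optimization opportunities.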
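To make the contrast concrete, here is a minimal sketch with made-up importance scores. The scores, the 4-layer model, and the helper names are illustrative assumptions, not values or code from the paper, and the sketch treats every submodule as equally expensive, which real attention and feed-forward blocks are not. It compares removing one whole layer with removing the same number of submodules chosen freely across depths.

```python
# Hypothetical importance scores for a 4-layer model (higher = more important).
# Each layer has an attention ("attn") and a feed-forward ("ffn") submodule.
scores = {
    (0, "attn"): 0.9, (0, "ffn"): 0.8,
    (1, "attn"): 0.1, (1, "ffn"): 0.7,
    (2, "attn"): 0.65, (2, "ffn"): 0.2,
    (3, "attn"): 0.4, (3, "ffn"): 0.9,
}
NUM_LAYERS = 4
KINDS = ("attn", "ffn")


def best_contiguous_layers(scores, num_layers, n_remove):
    """Layer-level baseline: remove n_remove adjacent whole layers with the lowest total score."""
    def block_cost(start):
        return sum(scores[(layer, kind)]
                   for layer in range(start, start + n_remove)
                   for kind in KINDS)

    start = min(range(num_layers - n_remove + 1), key=block_cost)
    removed = [(layer, kind)
               for layer in range(start, start + n_remove)
               for kind in KINDS]
    return removed, block_cost(start)


def best_submodules(scores, n_remove):
    """Submodule-level selection: remove the n_remove lowest-scoring submodules at any depth."""
    removed = sorted(scores, key=scores.get)[:n_remove]
    return removed, sum(scores[m] for m in removed)


# Both strategies remove two submodules in total.
layer_removed, layer_cost = best_contiguous_layers(scores, NUM_LAYERS, n_remove=1)
sub_removed, sub_cost = best_submodules(scores, n_remove=2)

print("layer-level removes:", layer_removed, "importance lost:", round(layer_cost, 2))
print("submodule-level removes:", sub_removed, "importance lost:", round(sub_cost, 2))
```

With these toy scores, the layer-level choice has to take the fairly important feed-forward block of layer 1 along with its redundant attention block, while the submodule-level choice picks the two least important submodules from different layers.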

Section 03

Detailed Explanation of SubFit Method: Submodule-Level Non-Continuous Compression and Residual Replacement

Core design principles of SubFit (Submodule-level Fitted residual replacement):

Submodule granularity: Refine the compression unit to Attention and FeedForward submodules, and evaluate importance independently
Non-continuous selection: Allow submodule compression at any position to accurately locate redundancy
Lightweight residual replacement: Replace each selected submodule with a fitted residual bypass. The residual connection is kept, and a lightweight module fitted on calibration data stands in for the removed submodule

Implementation flow: Importance evaluation → Submodule selection → Residual bypass design → Calibration training → Iterative optimization.
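The idea of a fitted residual bypass can be sketched in a few lines. The summary does not specify the form of SubFit's replacement module, so the per-dimension affine map below, its closed-form least-squares fit, and the toy submodule are all assumptions chosen to keep the example dependency-free. Real submodules, attention in particular, mix information across dimensions and tokens, which this toy map cannot capture.

```python
import math
import random


def fit_diagonal_affine(inputs, targets):
    """Least-squares fit of targets[:, j] ~ a[j] * inputs[:, j] + b[j], one pair per dimension."""
    n, dim = len(inputs), len(inputs[0])
    a, b = [], []
    for j in range(dim):
        xs = [row[j] for row in inputs]
        ys = [row[j] for row in targets]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = cov_xy / var_x if var_x > 0 else 0.0
        a.append(slope)
        b.append(mean_y - slope * mean_x)
    return a, b


def toy_submodule(x):
    """Hypothetical stand-in for an attention or feed-forward submodule."""
    return [0.5 * v + 0.1 * math.tanh(v) for v in x]


def mse(pred, ref):
    """Mean squared error between two equally shaped lists of vectors."""
    total = sum((p - r) ** 2 for pr, rr in zip(pred, ref) for p, r in zip(pr, rr))
    return total / (len(ref) * len(ref[0]))


random.seed(0)
# Calibration activations entering the submodule (random placeholders).
calib = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(256)]
# What the original submodule adds to the residual stream.
sub_out = [toy_submodule(x) for x in calib]
# Original block output: residual input plus submodule output.
block_out = [[x + f for x, f in zip(xr, fr)] for xr, fr in zip(calib, sub_out)]

a, b = fit_diagonal_affine(calib, sub_out)
# Bypass output: residual input plus the fitted lightweight module.
fitted_out = [[x + a[j] * x + b[j] for j, x in enumerate(xr)] for xr in calib]

err_dropped = mse(calib, block_out)      # submodule removed, residual path only
err_fitted = mse(fitted_out, block_out)  # submodule replaced by the fitted bypass
print("error when dropped:", err_dropped)
print("error with fitted bypass:", err_fitted)
```

Because the zero map is one of the candidates the least-squares fit considers, the fitted bypass can never do worse on the calibration set than simply dropping the submodule, which is the motivation for fitting a replacement instead of deleting outright.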

Section 04

Experimental Validation: SubFit Outperforms Traditional Methods

Experimental Setup: The evaluation covers 10 LLMs (5 base and 5 instruction-tuned) at sparsity levels from 12.5% to 37.5%, compares against 4 baseline methods, and reports both perplexity and downstream accuracy.

Key Results:

At 25% sparsity: 84.6% downstream accuracy retention (strongest baseline: 81.6%, a gain of 3.0 percentage points) and a perplexity degradation of 2.42x (baseline: 4.34x, about 44% lower)
Inference efficiency: Faster inference and lower KV cache memory use, which makes the compressed models easier to deploy
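The two comparisons in the 25% sparsity result are different kinds of quantities: the accuracy gap is a difference in percentage points, while the perplexity figure is a relative reduction. A quick check using only the numbers reported above (the variable names are ours):

```python
subfit_retention = 84.6     # % downstream accuracy retained by SubFit
baseline_retention = 81.6   # % retained by the strongest baseline
subfit_ppl_factor = 2.42    # SubFit perplexity degradation factor
baseline_ppl_factor = 4.34  # strongest baseline perplexity degradation factor

# Absolute gap, in percentage points.
retention_gain_pp = round(subfit_retention - baseline_retention, 1)
# Relative reduction of the degradation factor, in percent.
ppl_reduction_pct = round(100 * (1 - subfit_ppl_factor / baseline_ppl_factor), 1)

print(retention_gain_pp, ppl_reduction_pct)  # prints: 3.0 44.2
```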

Ablation Experiments: Each of the three components (submodule granularity, non-continuous selection, and residual replacement) contributes to the final result.

Section 05

Technical Advantages and Comparison with Other Compression Methods

Technical Advantages:

Fine-grained optimization: Accurate redundancy localization, type-aware strategy, retain key capabilities
Post-training friendly: No full retraining needed, only a small amount of calibration data to fit the replacement modules; plug-and-play, and supports progressive compression

Comparison with Other Methods:

vs Pruning: Maintains performance without a fine-tuning stage
vs Quantization: Compresses model structure rather than numeric precision, so the two can be combined
vs Distillation: Compresses the original model directly, keeping its architecture and the weights of the retained submodules

Section 06

Application Prospects and Deployment Recommendations

Applicable Scenarios: Resource-constrained deployment (edge/mobile), high-throughput services, long-context applications, cost-sensitive applications
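Long-context serving is where a KV cache saving matters most, because cache size grows linearly with sequence length. The estimate below uses the standard KV cache size formula. The model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is a hypothetical example, and the sketch assumes that a replaced attention submodule no longer stores keys and values, which the reported KV cache saving implies but this summary does not detail.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size: keys and values (factor 2) for every attention layer, head, and token."""
    return (2 * n_attn_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 2 ** 30

# Hypothetical model: 32 attention layers, 8 KV heads of dimension 128, 32k-token context.
full = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)
# Same model after 8 of the 32 attention submodules are replaced (assumed cache-free).
compressed = kv_cache_bytes(n_attn_layers=24, n_kv_heads=8, head_dim=128, seq_len=32768)

print(full / GIB, compressed / GIB)  # prints: 4.0 3.0
```

Under these assumptions, the saving scales with the number of attention submodules selected for replacement; replacing only feed-forward submodules would leave the cache unchanged.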

Deployment Recommendations:

Start at 25% sparsity and adjust from there
Prepare a small calibration set from the target domain (thousands of samples)
Validate performance on the downstream tasks you care about
Combine with quantization when more aggressive compression is needed

Section 07

Current Limitations and Future Research Directions

Current Limitations:

Significant performance drop at extremely high sparsity (>50%)
Greater impact on tasks sensitive to specific submodules
Dependence on calibration data quality

Future Directions:

Dynamic compression (input-adaptive submodule activation)
Mixed granularity compression
Adaptive sparsity learning
Multi-task joint compression optimization

Section 08

Significance and Prospects of SubFit

SubFit drops the whole-layer and continuity constraints of traditional layer-level (hierarchical) compression and shows that fine-grained submodule compression can improve accuracy retention while keeping the convenience of a post-training method. With LLM deployment cost a growing concern, it offers a practical way to lower deployment thresholds and widen the range of settings where large models can run.
