Zing Forum


Compressing Large Language Models by Replacing MLP Blocks: A New Alternative to Quantization and Pruning

A study from Comenius University in Bratislava explores a large language model compression method that does not rely on traditional quantization or pruning techniques. By replacing MLP blocks in Transformers with smaller, more efficient alternative structures, it significantly reduces memory usage and inference latency while maintaining the model's expressive power.

Tags: Large language models · Model compression · MLP block replacement · Transformer optimization · Inference acceleration · Thesis
Published 2026-05-15 03:24 · Recent activity 2026-05-15 03:28 · Estimated read 6 min
1

Section 01

[Introduction] Replacing MLP Blocks: A New Approach to Large Language Model Compression

A study from Comenius University in Bratislava explores a large language model compression method that does not rely on traditional quantization or pruning techniques. By replacing MLP blocks in Transformers with smaller, more efficient alternative structures, this research aims to significantly reduce memory usage and inference latency while preserving the model's expressive power, providing a new direction for large model compression.

2

Section 02

Background: Parameter Inflation of Large Models and Limitations of Traditional Compression Techniques

Under the Transformer architecture, the parameter counts of large language models have soared from hundreds of millions to hundreds of billions or even trillions, driving up memory usage and slowing inference (e.g., GPT-3's 175 billion parameters occupy over 350 GB even when stored in half precision). Among traditional compression techniques, quantization can cost accuracy (especially at low bit widths), and pruning tends to produce irregular sparsity patterns that are hard to accelerate on standard hardware. A third, alternative path is therefore needed.
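As a rough sanity check on those memory figures, the weight footprint of a dense model is simply parameter count times bytes per parameter. The snippet below is a back-of-the-envelope estimate (not code from the thesis), assuming a GPT-3-scale count of 175 billion parameters:

```python
# Back-of-the-envelope estimate of weight memory for a dense model.
# Illustrative only; real inference also needs activations and KV cache.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

n_params = 175e9  # assumed GPT-3-scale parameter count
for label, bytes_pp in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label:>5}: {weight_memory_gb(n_params, bytes_pp):,.0f} GB")
# fp16 alone is ~350 GB of weights, which is why quantization helps
# but still leaves a very large model to serve.
```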

3

Section 03

Core Insight: MLP Blocks Account for a Large Proportion of Parameters

The study observes that MLP blocks in the standard Transformer architecture account for approximately 80% of the total parameters (the attention mechanism accounts for only about 20%), making them the main source of memory and compute bottlenecks. The core hypothesis: treat each MLP block as an independent function and replace it with a smaller, efficient approximation, achieving customized, block-by-block compression in a divide-and-conquer fashion rather than applying a one-size-fits-all global scheme.
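The exact split depends on the architecture, but a quick count for a vanilla Transformer block shows why the MLP dominates. The sketch below is illustrative only (it assumes a 4x FFN expansion and ignores embeddings, biases, and layer norms); wider or gated FFN variants push the MLP share closer to the ~80% figure reported in the study:

```python
# Rough per-layer parameter split for a vanilla Transformer block.
# Illustrative assumption: FFN expansion factor of 4, dense attention.
def block_params(d_model: int, ffn_mult: int = 4):
    attn = 4 * d_model * d_model            # W_q, W_k, W_v, W_o
    mlp = 2 * d_model * ffn_mult * d_model  # up- and down-projections
    return attn, mlp

attn, mlp = block_params(4096)
total = attn + mlp
print(f"attention: {attn / total:.0%}, MLP: {mlp / total:.0%}")
# With a 4x expansion the MLP already holds about two thirds of the block's
# parameters; gated or wider FFN variants push its share even higher.
```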

4

Section 04

Methodology: Replacing Large MLP Blocks with Small Networks

1. Capture input-output pairs of each MLP block from a frozen pre-trained model to use as calibration data.
2. Train smaller networks (such as shallower MLPs, pure linear layers, or hybrid architectures) to mimic the original MLP's function by minimizing the difference between their outputs.
3. The modular nature of the approach allows blocks to be processed in parallel, lets each block receive its own replacement strategy, and makes it possible to roll back any individual replacement that performs poorly.
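A minimal sketch of this per-block distillation loop is shown below, assuming a PyTorch setting; the module names, layer sizes, and random calibration inputs are placeholders for illustration, not code from the thesis:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one frozen MLP block of a pre-trained Transformer.
original_mlp = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).eval()
for p in original_mlp.parameters():
    p.requires_grad_(False)

# Smaller candidate replacement: a shallower, narrower MLP.
student = nn.Sequential(nn.Linear(1024, 512), nn.GELU(), nn.Linear(512, 1024))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Step 1: calibration data -- hidden states that would normally enter this
# block; random tensors stand in for activations captured from real text.
calibration = [torch.randn(32, 1024) for _ in range(100)]

# Step 2: train the student to reproduce the frozen block's outputs.
for x in calibration:
    with torch.no_grad():
        target = original_mlp(x)
    loss = nn.functional.mse_loss(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 3: keep the student only if its approximation error is acceptable;
# otherwise roll back to the original block (the modular fallback).
```

Because each block is trained independently against frozen targets, the loops for different blocks can run in parallel, and any block whose replacement fails to converge can simply keep its original MLP.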

5

Section 05

Experimental Design and Evaluation: Trade-off Analysis Between Compression and Performance

Experiments use Transformer models at multiple scales as benchmarks, with evaluation metrics covering model size, inference speed, and performance on GLUE benchmark tasks. By varying the complexity of the replacement structures, the study traces a Pareto frontier of compression versus performance, helping practitioners choose the best configuration under a given resource budget; it also finds that early and late layers differ markedly in how sensitive they are to compression.
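To make the Pareto-frontier selection concrete, the sketch below filters a set of hypothetical (parameter count, score) configurations down to the non-dominated ones; every name and number here is an invented placeholder rather than a result from the thesis:

```python
# Illustrative Pareto-frontier selection over hypothetical (size, score)
# results; the entries below are made-up placeholders, not thesis results.
configs = [
    ("hidden=2048", 210e6, 0.86),
    ("hidden=1024", 150e6, 0.85),
    ("hidden=768",  160e6, 0.80),  # dominated: larger and worse than hidden=1024
    ("hidden=512",  120e6, 0.81),
    ("linear-only", 100e6, 0.74),
]

def pareto_frontier(points):
    """Keep configs not dominated by a smaller-or-equal, better-or-equal config."""
    frontier = []
    for name, size, score in points:
        dominated = any(
            other_size <= size and other_score >= score
            and (other_size, other_score) != (size, score)
            for _, other_size, other_score in points
        )
        if not dominated:
            frontier.append((name, size, score))
    return frontier

for name, size, score in pareto_frontier(configs):
    print(f"{name}: {size / 1e6:.0f}M params, avg task score {score:.2f}")
```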

6

Section 06

Practical Significance: Edge Deployment and New Thinking Framework

This method could make it feasible to deploy LLMs on edge devices (smartphones, IoT hardware) and to cut inference costs for cloud services, translating directly into economic benefits. It also offers a new way of thinking: treating compression as 'function-preserving architecture search', a framing that connects to neural architecture search but focuses on compressing an existing model rather than designing one from scratch.

7

Section 07

Limitations and Future Directions

Limitations: training the replacement structures requires additional one-time compute; highly complex MLP blocks are difficult to approximate with simple structures; the method currently targets MLP blocks in encoder-decoder architectures, and its applicability to other variants (such as sparse attention or mixture-of-experts models) remains to be verified. Future directions: exploring more expressive replacements (such as small Transformer blocks), hybrid compression strategies, extension to vision Transformers, and dynamic replacement mechanisms.