Zing Forum


Compressing Large Language Models via MLP Block Replacement: A Module-Level Knowledge Distillation Approach

A graduation thesis from Comenius University in Bratislava explores model compression by replacing MLP blocks in Transformers with smaller approximate networks, offering a new approach to LLM compression that differs from quantization and pruning.

Tags: LLM, model compression, MLP, Transformer, knowledge distillation, function approximation, model lightweighting, edge deployment, neural network architecture, graduation thesis
Published 2026-04-01 06:42 · Recent activity 2026-04-01 06:57 · Estimated read: 7 min

Section 01

[Main Floor/Introduction] MLP Block Replacement: A New Module-Level Knowledge Distillation Approach for LLM Compression

A graduation thesis from Comenius University in Bratislava proposes an approach to LLM compression that differs from quantization and pruning: treating the MLP blocks in a Transformer as independent functions and replacing them one by one with smaller approximating networks. This module-level knowledge distillation opens up new possibilities for model compression, eliminating the need for end-to-end retraining of the entire model and offering modularity, controllability, and interpretability.


Section 02

Research Background: Bottlenecks of MLP Blocks and Limitations of Traditional Compression Methods

In LLMs built on modern Transformer architectures, MLP blocks account for approximately 80% of the total parameters and are the main bottleneck for memory footprint and inference latency. Traditional compression techniques such as quantization (reducing numerical precision) and structured/unstructured pruning (removing neurons or sparsifying weights) shrink parameters or precision while keeping the original structure; this thesis instead proposes changing the structure itself by replacing MLP blocks with small substitute modules.
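As a sanity check on the magnitude of this bottleneck, here is a back-of-the-envelope count for a single hypothetical GPT-style layer. The hidden size, the expansion factor of 4, and the attention layout are illustrative assumptions, not numbers from the thesis; the exact share varies with the architecture (gated MLP variants and vocabulary embeddings shift it):

```python
# Rough parameter accounting for one Transformer layer, assuming a
# GPT-style layout: hidden size d, MLP expansion factor 4.
# All numbers are illustrative; the exact share depends on the model.
d = 4096
attn_params = 4 * d * d          # Q, K, V, and output projections
mlp_params = 2 * d * (4 * d)     # up- and down-projection of the MLP block
share = mlp_params / (attn_params + mlp_params)
print(f"MLP share of per-layer parameters: {share:.0%}")  # -> 67%
```

Even this conservative layout puts the MLP at two-thirds of a layer's weights, which is why it is the natural target for module-level compression.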


Section 03

Core Idea: Four-Step Strategy for Function-Level Replacement

The key idea is to treat each MLP block as an independent function-approximation problem. The steps are: 1. Freeze the attention layers, normalization layers, and other components of the pre-trained model; 2. Collect input-output pairs from the original MLP blocks as training data; 3. Train smaller substitute networks (e.g., a shallow MLP or a linear layer) to approximate the original outputs; 4. Swap in the substitutes one block at a time while keeping the overall architecture unchanged.
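Steps 2 to 4 can be sketched in a few lines of NumPy. The "teacher" MLP, its dimensions, and the choice of a pure linear student fitted in closed form are all illustrative assumptions, not the thesis's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: a frozen toy "teacher" MLP block (hypothetical dimensions),
# and input-output pairs collected from a forward pass over sample inputs.
d_model, d_ff, n_samples = 64, 256, 2048
W1 = rng.normal(0, 0.05, (d_ff, d_model))
W2 = rng.normal(0, 0.05, (d_model, d_ff))

def teacher_mlp(x):                       # frozen original MLP block
    return W2 @ np.maximum(W1 @ x, 0)     # ReLU as a stand-in activation

X = rng.normal(size=(d_model, n_samples))  # collected inputs
Y = teacher_mlp(X)                         # collected outputs

# Step 3: fit a smaller substitute -- here a single linear map y ≈ A @ x,
# obtained in closed form by least squares (i.e., minimizing the MSE).
A, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
A = A.T

# Step 4: A now drops in wherever teacher_mlp was called.
mse = np.mean((A @ X - Y) ** 2)
compression = A.size / (W1.size + W2.size)
print(f"params kept: {compression:.3f}, fit MSE: {mse:.4f}")
```

With these toy dimensions the linear student keeps 12.5% of the block's parameters; the residual MSE reflects how much of the block's nonlinearity a linear map cannot capture, which is exactly the trade-off the thesis studies.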


Section 04

Technical Scheme: Alternative Network Architectures and Training Strategies

Candidate substitute architectures include: 1. Shallow MLPs (a single layer, or two narrower layers); 2. Pure linear projections (low-rank approximation); 3. Hybrid structures (e.g., attention-enhanced MLPs, depthwise separable convolutions, MoE-style sparse activation). Training minimizes an MSE or cosine-similarity loss between the substitute's output and the original MLP block's output, with training data collected via a single forward pass over representative samples.
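As a concrete instance of option 2, here is a minimal sketch of a low-rank linear student trained by gradient descent on the MSE, with cosine similarity reported as a secondary metric. Dimensions, the rank, the tanh stand-in for the teacher, and the learning rate are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, r = 64, 512, 8                          # model dim, samples, student rank

X = rng.normal(size=(n, d))                   # collected MLP-block inputs
Y = np.tanh(X @ rng.normal(0, 0.2, (d, d)))   # collected outputs (stand-in)

# Low-rank linear student y ≈ x @ U @ V, trained by gradient descent on MSE.
U = rng.normal(0, 0.1, (d, r))
V = rng.normal(0, 0.1, (r, d))
lr = 0.05

def loss():
    return np.mean((X @ U @ V - Y) ** 2)

start = loss()
for _ in range(300):
    G = 2 * (X @ U @ V - Y) / (n * d)         # dMSE / d(prediction)
    U, V = U - lr * (X.T @ G @ V.T), V - lr * (U.T @ X.T @ G)
end = loss()

# Cosine similarity between student and teacher outputs (per-sample mean).
P = X @ U @ V
cos = np.mean(np.sum(P * Y, axis=1) /
              (np.linalg.norm(P, axis=1) * np.linalg.norm(Y, axis=1) + 1e-8))
print(f"MSE {start:.3f} -> {end:.3f}, cosine similarity {cos:.3f}")
```

The rank r directly controls the compression ratio of this substitute (2·d·r parameters versus the original block's), which makes the low-rank option easy to sweep during evaluation.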


Section 05

Evaluation Dimensions and Challenges

Evaluation covers the trade-off between compression ratio and performance (parameter compression ratio, inference speed, downstream-task performance), layer-wise sensitivity analysis (compression tolerance of early vs. late layers, identification of critical blocks), and the combinatorial problem of choosing which blocks to replace (greedy strategies, heuristic configurations, automatic search for optimal combinations). The main challenges lie in designing the substitute networks and handling inter-block dependencies.
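The greedy strategy mentioned above can be sketched in a few lines: replace the blocks with the smallest measured approximation error first, until a quality budget is exhausted. The block names, per-block error scores, and the budget are all invented for illustration:

```python
# Greedy block selection sketch: replace cheapest-to-approximate blocks
# first, subject to a total error budget. All values are hypothetical.
errors = {"block_0": 0.31, "block_5": 0.04, "block_11": 0.09, "block_23": 0.02}
budget = 0.20   # hypothetical total tolerated approximation error

replaced, spent = [], 0.0
for name, err in sorted(errors.items(), key=lambda kv: kv[1]):
    if spent + err <= budget:
        replaced.append(name)
        spent += err
print(replaced)  # -> ['block_23', 'block_5', 'block_11']
```

A real evaluation would measure each block's error on held-out data and validate the chosen configuration end-to-end, since inter-block dependencies mean per-block errors do not simply add up.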


Section 06

Comparison with Existing Compression Methods

| Method | Compression granularity | Retraining needed? | Change to original structure | Main challenges |
|---|---|---|---|---|
| Quantization | Weight-level | No (PTQ) / Yes (QAT) | None | Precision loss, calibration sensitivity |
| Pruning | Neuron/layer-level | Usually yes | Structural change | Sparse-computation efficiency, irregular memory access |
| MLP replacement | Module-level | Partial (substitute networks only) | Structural replacement | Substitute-network design, inter-block dependencies |
A further advantage of MLP replacement is its structural interpretability: the substitutes compute standard dense matrix operations, so no specialized hardware support is required.

Section 07

Potential Impact and Future Research Directions

If effective, this method could enable: 1. Progressive compression (dynamically selecting the compression level); 2. Edge-device deployment (more aggressive compression ratios); 3. Integration with NAS (automatically discovering optimal substitute architectures); 4. Stacking with quantization/pruning (for higher overall compression ratios).


Section 08

Research Summary and Project Resources

This method re-examines LLM compression from the perspective of function approximation and is complementary to quantization and pruning. The project is hosted on GitHub, with configs, docs, notebooks, and scripts, and is under active development; researchers interested in this direction may want to track it.