Zing Forum


Compressing Large Language Models by Replacing MLP Blocks: A New Alternative to Quantization and Pruning

A study from Comenius University in Bratislava explores a large language model compression method that does not rely on traditional quantization or pruning techniques. By replacing MLP blocks in Transformers with smaller, more efficient alternative structures, it significantly reduces memory usage and inference latency while maintaining the model's expressive power.

Tags: Large language models · Model compression · MLP block replacement · Transformer optimization · Inference acceleration · Thesis
Published 2026-05-15 03:24 · Recent activity 2026-05-15 03:28 · Estimated read 6 min
1

Section 01

[Introduction] Replacing MLP Blocks: A New Approach to Large Language Model Compression

A study from Comenius University in Bratislava explores a large language model compression method that does not rely on traditional quantization or pruning techniques. By replacing MLP blocks in Transformers with smaller, more efficient alternative structures, this research aims to significantly reduce memory usage and inference latency while preserving the model's expressive power, providing a new direction for large model compression.

2

Section 02

Background: Parameter Inflation of Large Models and Limitations of Traditional Compression Techniques

Under the Transformer architecture, the parameter counts of large language models have soared from hundreds of millions to hundreds of billions or even trillions, driving up memory usage and slowing inference (e.g., GPT-3's 175 billion parameters occupy over 350 GB even when stored in half precision). Among traditional compression techniques, quantization can cost accuracy (especially at low bit widths), and pruning tends to produce irregular sparsity patterns that are hard to accelerate on standard hardware. A third, alternative path is therefore needed.
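As a rough sanity check on those memory figures, the weight footprint of a dense model is simply parameter count times bytes per parameter. The snippet below is a back-of-the-envelope estimate (not code from the thesis), assuming a GPT-3-scale count of 175 billion parameters:

```python
# Back-of-the-envelope estimate of weight memory for a dense model.
# Illustrative only; real inference also needs activations and KV cache.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

n_params = 175e9  # assumed GPT-3-scale parameter count
for label, bytes_pp in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label:>5}: {weight_memory_gb(n_params, bytes_pp):,.0f} GB")
# fp16 alone is ~350 GB of weights, which is why quantization helps
# but still leaves a very large model to serve.
```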

3

Section 03

Core Insight: MLP Blocks Account for a Large Proportion of Parameters

The study observes that MLP blocks in the standard Transformer architecture account for approximately 80% of the total parameters (the attention mechanism accounts for only about 20%), making them the main source of memory and compute bottlenecks. The core hypothesis: treat each MLP block as an independent function and replace it with a smaller, efficient approximation, achieving customized, block-by-block compression in a divide-and-conquer fashion rather than applying a one-size-fits-all global scheme.
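The exact split depends on the architecture, but a quick count for a vanilla Transformer block shows why the MLP dominates. The sketch below is illustrative only (it assumes a 4x FFN expansion and ignores embeddings, biases, and layer norms); wider or gated FFN variants push the MLP share closer to the ~80% figure reported in the study:

```python
# Rough per-layer parameter split for a vanilla Transformer block.
# Illustrative assumption: FFN expansion factor of 4, dense attention.
def block_params(d_model: int, ffn_mult: int = 4):
    attn = 4 * d_model * d_model            # W_q, W_k, W_v, W_o
    mlp = 2 * d_model * ffn_mult * d_model  # up- and down-projections
    return attn, mlp

attn, mlp = block_params(4096)
total = attn + mlp
print(f"attention: {attn / total:.0%}, MLP: {mlp / total:.0%}")
# With a 4x expansion the MLP already holds about two thirds of the block's
# parameters; gated or wider FFN variants push its share even higher.
```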

4

Section 04

Methodology: Replacing Large MLP Blocks with Small Networks

1. Capture input-output pairs of each MLP block from a frozen pre-trained model to use as calibration data.
2. Train smaller networks (such as shallower MLPs, pure linear layers, or hybrid architectures) to mimic the original MLP's function by minimizing the difference between their outputs.
3. The modular nature of the approach allows blocks to be processed in parallel, lets each block receive its own replacement strategy, and makes it possible to roll back any individual replacement that performs poorly.
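A minimal sketch of this per-block distillation loop is shown below, assuming a PyTorch setting; the module names, layer sizes, and random calibration inputs are placeholders for illustration, not code from the thesis:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one frozen MLP block of a pre-trained Transformer.
original_mlp = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).eval()
for p in original_mlp.parameters():
    p.requires_grad_(False)

# Smaller candidate replacement: a shallower, narrower MLP.
student = nn.Sequential(nn.Linear(1024, 512), nn.GELU(), nn.Linear(512, 1024))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Step 1: calibration data -- hidden states that would normally enter this
# block; random tensors stand in for activations captured from real text.
calibration = [torch.randn(32, 1024) for _ in range(100)]

# Step 2: train the student to reproduce the frozen block's outputs.
for x in calibration:
    with torch.no_grad():
        target = original_mlp(x)
    loss = nn.functional.mse_loss(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 3: keep the student only if its approximation error is acceptable;
# otherwise roll back to the original block (the modular fallback).
```

Because each block is trained independently against frozen targets, the loops for different blocks can run in parallel, and any block whose replacement fails to converge can simply keep its original MLP.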

5

Section 05

Experimental Design and Evaluation: Trade-off Analysis Between Compression and Performance

Experiments use Transformer models at multiple scales as benchmarks, with evaluation metrics covering model size, inference speed, and performance on GLUE benchmark tasks. By varying the complexity of the replacement structures, the study traces a Pareto frontier of compression versus performance, helping practitioners choose the best configuration under a given resource budget; it also finds that early and late layers differ markedly in how sensitive they are to compression.
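To make the Pareto-frontier selection concrete, the sketch below filters a set of hypothetical (parameter count, score) configurations down to the non-dominated ones; every name and number here is an invented placeholder rather than a result from the thesis:

```python
# Illustrative Pareto-frontier selection over hypothetical (size, score)
# results; the entries below are made-up placeholders, not thesis results.
configs = [
    ("hidden=2048", 210e6, 0.86),
    ("hidden=1024", 150e6, 0.85),
    ("hidden=768",  160e6, 0.80),  # dominated: larger and worse than hidden=1024
    ("hidden=512",  120e6, 0.81),
    ("linear-only", 100e6, 0.74),
]

def pareto_frontier(points):
    """Keep configs not dominated by a smaller-or-equal, better-or-equal config."""
    frontier = []
    for name, size, score in points:
        dominated = any(
            other_size <= size and other_score >= score
            and (other_size, other_score) != (size, score)
            for _, other_size, other_score in points
        )
        if not dominated:
            frontier.append((name, size, score))
    return frontier

for name, size, score in pareto_frontier(configs):
    print(f"{name}: {size / 1e6:.0f}M params, avg task score {score:.2f}")
```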

6

Section 06

Practical Significance: Edge Deployment and New Thinking Framework

This method could make it feasible to deploy LLMs on edge devices (smartphones, IoT hardware) and to cut inference costs for cloud services, translating directly into economic benefits. It also offers a new way of thinking: treating compression as 'function-preserving architecture search', a framing that connects to neural architecture search but focuses on compressing an existing model rather than designing one from scratch.

7

Section 07

Limitations and Future Directions

Limitations: training the replacement structures requires additional one-time compute; highly complex MLP blocks are difficult to approximate with simple structures; the method currently targets MLP blocks in encoder-decoder architectures, and its applicability to other variants (such as sparse attention or mixture-of-experts models) remains to be verified. Future directions: exploring more expressive replacements (such as small Transformer blocks), hybrid compression strategies, extension to vision Transformers, and dynamic replacement mechanisms.