Zing Forum

MADS: A Model-Aware Neural Activation-Based Core Set Selection Method for Instruction Fine-Tuning

Researchers propose MADS, a method that selects a diverse core training set by analyzing the neural activation states of a large language model (LLM) during inference. Using only 15% of the data, it outperforms full-data training on multiple benchmarks, and core sets chosen with a small model transfer well to larger models.

Tags: instruction fine-tuning, data selection, core set, neural activation, model-aware, coverage maximization, data diversity, Alpaca
Published 2026-05-29 13:28 | Recent activity 2026-06-01 10:20 | Estimated read 4 min

Section 01

[Main Post / Introduction] MADS: A Model-Aware Neural Activation-Based Core Set Selection Method for Instruction Fine-Tuning

This post introduces MADS in four parts: the research background, the method itself, the experimental results, and its practical value.

Section 02

Research Background: Dilemmas in Data Selection for Instruction Fine-Tuning and Limitations of Existing Methods

Instruction fine-tuning is a key technique for improving an LLM's instruction-following ability, but as instruction datasets grow, choosing a core set becomes a problem in its own right: the subset has to reduce training and storage cost without sacrificing quality. Traditional methods rely on surface-level text signals such as embeddings, clustering, or uncertainty scores. These signals are disconnected from how the model internally processes the data, so they struggle to guarantee a diverse core set.

Section 03

Core Idea of MADS: Model-Aware Neural Activation Features and Workflow

MADS selects core sets based on the neural activation states of an LLM during inference. The workflow has three steps:

1. Run a small LLM (e.g., 3B parameters) over the candidate data and record the activations at each layer.
2. Convert the activations into feature vectors.
3. Apply a coverage maximization strategy to select a diverse core set.
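
To make steps 2 and 3 concrete, here is a minimal sketch of one possible reading of "coverage maximization". The post does not say how MADS turns activations into features or how it defines coverage, so everything below is an assumption for illustration: each sample is reduced to the set of neurons that fire above a threshold, and a greedy loop repeatedly picks the sample that covers the most not-yet-covered neurons. The names `binarize`, `greedy_coverage_select`, and the toy activation values are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Set


def binarize(activations: Sequence[float], threshold: float = 0.0) -> Set[int]:
    """Assumed feature step: indices of neurons whose activation exceeds `threshold`."""
    return {i for i, a in enumerate(activations) if a > threshold}


def greedy_coverage_select(features: List[Set[int]], budget: int) -> List[int]:
    """Greedy max-coverage: pick up to `budget` samples, each time taking the
    sample that adds the most neurons not covered by earlier picks."""
    covered: Set[int] = set()
    selected: List[int] = []
    remaining = set(range(len(features)))
    while remaining and len(selected) < budget:
        # Iterate in index order so ties resolve to the lowest index.
        best = max(sorted(remaining), key=lambda i: len(features[i] - covered))
        if not features[best] - covered:
            break  # no remaining sample adds new coverage
        selected.append(best)
        covered |= features[best]
        remaining.remove(best)
    return selected


# Toy activations (invented): samples 0 and 1 trigger the same neurons,
# so only one of them should end up in the core set.
toy_activations = [
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.0],
    [0.0, 0.1, 0.0, 0.5],
]
features = [binarize(row, threshold=0.05) for row in toy_activations]
core_set = greedy_coverage_select(features, budget=3)
print(core_set)  # [0, 2, 3]
```

Sample 1 is skipped because it activates the same neurons as sample 0, which is the kind of redundancy a coverage objective is meant to remove. In a real pipeline the activation rows would come from the small model's forward passes rather than a hand-written list.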

Section 04

Experimental Validation: Cross-Model Transferability and Performance on Alpaca Dataset

MADS was evaluated on six benchmarks. Two results stand out:

1. Cross-model transferability: core sets selected with the 3B model remain effective when fine-tuning 7B, 8B, and 13B models.
2. Alpaca-GPT4 dataset (52k entries): fine-tuning on a 15% core set (7.8k entries) achieves an average improvement of 2.5% over full-data training.
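
The 7.8k figure follows directly from the selection ratio. As a quick check, the sketch below recomputes it from the rounded dataset size quoted in the post; `core_set_budget` is a hypothetical helper, and 52,000 is the post's rounded number rather than an exact example count.

```python
def core_set_budget(total_examples: int, percent: int) -> int:
    """Number of examples kept when selecting `percent`% of the pool (integer math)."""
    return total_examples * percent // 100


alpaca_gpt4_size = 52_000  # rounded size quoted in the post
budget = core_set_budget(alpaca_gpt4_size, 15)
dropped = alpaca_gpt4_size - budget
print(budget, dropped)  # 7800 44200
```

Because the selection is made once by the small model, the same 7.8k-example subset can be reused for each larger target model without repeating the activation extraction.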

Section 05

Why Does MADS Work? Internal Perspective and Strategic Advantages

The post gives three reasons for MADS's effectiveness:

1. It captures diversity from the model's perspective: two samples with similar text are not necessarily perceived as similar by the model.
2. The coverage maximization strategy pushes the selected set to cover as many activation patterns as possible.
3. Activations are extracted with a small model, which keeps the computational overhead manageable.
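
The first point can be illustrated with a toy comparison. In the sketch below, two instructions share most of their words, so a surface-level measure (Jaccard overlap of tokens) rates them as similar, while their activation vectors point in different directions. The activation vectors are invented for illustration and were not measured on any model; `jaccard` and `cosine` are ordinary helper functions, not part of MADS.

```python
import math
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Surface similarity: overlap of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


text_a = "Translate the sentence into French"
text_b = "Translate the sentence into Python"

# Hypothetical activation features: a translation task and a coding task
# are assumed to excite different neurons despite near-identical wording.
act_a = [0.9, 0.1, 0.0, 0.2]
act_b = [0.1, 0.0, 0.8, 0.3]

text_sim = jaccard(text_a, text_b)
act_sim = cosine(act_a, act_b)
print(f"text similarity: {text_sim:.2f}, activation similarity: {act_sim:.2f}")
```

A text-based selector would likely treat these two samples as near-duplicates and keep only one, whereas an activation-based selector would have a reason to keep both.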

Section 06

Practical Application Value: Resource-Constrained Environments, Rapid Prototyping, and Data Auditing

MADS is useful in three scenarios:

1. Resource-constrained environments: achieve good results with less training data.
2. Rapid prototyping: speed up experimental iteration.
3. Data quality auditing: gain insight into a dataset's composition and its problems.

Section 07

Limitations and Future Research Directions

Limitations of MADS: activation extraction adds overhead, the method is sensitive to hyperparameters, and it adapts poorly to some specialized, domain-specific tasks. Future directions: more efficient activation extraction, adaptive coverage strategies, and combining MADS with other selection techniques.