# MADS: A Model-Aware Neural Activation-Based Core Set Selection Method for Instruction Fine-Tuning

> Researchers propose the MADS method, which selects a diverse core training set by analyzing the neural activation states of large language models (LLMs) during inference. Using only 15% of the data, it outperforms full-data training on multiple benchmarks and demonstrates good model scale transferability.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-29T05:28:36.000Z
- Last activity: 2026-06-01T02:20:06.131Z
- Popularity: 82.1
- Keywords: instruction fine-tuning, data selection, core set, neural activation, model-aware, coverage maximization, data diversity, Alpaca
- Page link: https://www.zingnex.cn/en/forum/thread/mads
- Canonical: https://www.zingnex.cn/forum/thread/mads
- Markdown source: floors_fallback

---

## Introduction

MADS is a core set selection method for instruction fine-tuning. Rather than comparing examples by their text, it picks a diverse training subset based on the neural activation states an LLM produces during inference. With only 15% of the data, the reported results exceed full-data training on multiple benchmarks, and core sets selected with a small model transfer to larger ones. This post covers the background, method, experiments, and practical value.

## Research Background: Dilemmas in Data Selection for Instruction Fine-Tuning and Limitations of Existing Methods

Instruction fine-tuning is a key technique for improving an LLM's ability to follow instructions. As instruction datasets grow, training on all of the data becomes costly, which raises the core set selection problem: choosing a small subset that balances training efficiency, storage, and data quality. Traditional methods rely on surface text features (embeddings, clustering, uncertainty scores). These features are disconnected from how the model internally processes the data, so they struggle to guarantee that the selected core set is diverse from the model's point of view.

## Core Idea of MADS: Model-Aware Neural Activation Features and Workflow

MADS selects core sets based on the neural activation states of LLMs during inference. The workflow has three steps:

1. Use a small LLM (e.g., 3B parameters) to process the candidate data and record the activations at each layer.
2. Convert the activations into feature vectors.
3. Apply a coverage maximization strategy to select a diverse core set.
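Steps 2 and 3 can be illustrated with a minimal sketch. It does not reproduce the MADS implementation: the post does not specify the feature format or the selection algorithm, so this sketch assumes each example's activations are binarized into the set of units that fired, and it uses plain greedy coverage maximization. The names `active_units`, `greedy_coverage_select`, and `threshold` are hypothetical, and the activation values are made up.

```python
from typing import List, Sequence, Set


def active_units(activations: Sequence[float], threshold: float = 0.0) -> Set[int]:
    """Assumed featurization: the set of unit indices whose activation exceeds the threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}


def greedy_coverage_select(features: List[Set[int]], budget: int) -> List[int]:
    """Pick up to `budget` examples, each time adding the one that covers the most new units."""
    covered: Set[int] = set()
    selected: List[int] = []
    remaining = set(range(len(features)))
    while remaining and len(selected) < budget:
        # Ties are broken in favor of the lowest example index.
        best = max(sorted(remaining), key=lambda i: len(features[i] - covered))
        if not features[best] - covered:
            break  # no remaining example adds a new unit
        selected.append(best)
        covered |= features[best]
        remaining.remove(best)
    return selected


# Toy activations: 4 examples, 6 hidden units each (made-up numbers).
toy_activations = [
    [0.9, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.0, 0.0],  # fires the same units as example 0
    [0.0, 0.0, 0.5, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.0],
]
features = [active_units(a) for a in toy_activations]
core_set = greedy_coverage_select(features, budget=2)
print(core_set)  # [0, 2]
```

In this toy run, example 1 is never chosen because it fires the same units as example 0, so it adds no new coverage. In a real pipeline the activations would come from the small LLM in step 1 rather than from hand-written lists.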

## Experimental Validation: Cross-Model Transferability and Performance on Alpaca Dataset

MADS was evaluated on 6 benchmarks, with two main findings:

1. Cross-model transferability: core sets selected with the 3B model remain effective when used to fine-tune 7B, 8B, and 13B models.
2. Alpaca-GPT4 dataset (52k entries): fine-tuning on a 15% core set (7.8k entries) achieved an average improvement of 2.5% over full-data training.

## Why Does MADS Work? Internal Perspective and Strategic Advantages

Three factors explain why MADS is effective:

1. It captures diversity from the model's perspective: two examples with similar text are not necessarily similar to the model, and vice versa.
2. The coverage maximization strategy aims to cover the full range of activation patterns.
3. Activations are extracted with a small model, which keeps the computational overhead manageable.
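One way to make the coverage idea concrete is to write it as a set-coverage objective. This is an illustrative formalization, not the objective stated by the MADS authors: the symbols $D$ (candidate data), $A(x)$ (the set of activation features that example $x$ triggers), and $k$ (the core set budget) are assumptions introduced here.

$$
S^{*} = \arg\max_{S \subseteq D,\ |S| \le k} \left| \bigcup_{x \in S} A(x) \right|
$$

Under this formulation, the benefit of adding an example $x$ to a partial selection $S$ is its marginal gain:

$$
\Delta(x \mid S) = \left| A(x) \setminus \bigcup_{y \in S} A(y) \right|
$$

Two examples whose text differs but whose activation features coincide have $\Delta = 0$ for the second one once the first is selected, which is the sense in which text similarity and model-perceived similarity can diverge. An objective of this form is monotone and submodular, so a simple greedy selection is guaranteed to reach at least a $1 - 1/e$ fraction of the optimal coverage.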

## Practical Application Value: Resource-Constrained Environments, Rapid Prototyping, and Data Auditing

MADS is useful in three scenarios:

1. Resource-constrained environments: achieve good results with less training data.
2. Rapid prototyping: shorten each fine-tuning run and speed up experimental iteration.
3. Data quality auditing: gain insight into the composition of a dataset and its problems.

## Limitations and Future Research Directions

MADS has three limitations: activation extraction adds overhead, the method is sensitive to hyperparameters, and it adapts poorly to some specialized tasks. Future directions include more efficient activation extraction, adaptive coverage strategies, and combining MADS with other selection techniques.
