# Vulnerability of Instruction-Tuned Models: A Single Punctuation Mark Can Cause Responses to Collapse

> This article reports a fundamental vulnerability in instruction-tuned large language models: simple lexical constraints (such as banning a single punctuation mark or a common word) can make responses collapse, with a 14-48% loss of comprehensiveness. The vulnerability stems from instruction tuning itself, not from model size or architecture.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-14T17:40:01.000Z
- Last activity: 2026-04-15T02:55:10.604Z
- Popularity: 148.8
- Keywords: instruction tuning, large language models, model robustness, constrained generation, GPT-4o, mechanism analysis, evaluation methods
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-13006v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-13006v1
- Markdown source: floors_fallback

---

## [Introduction] Vulnerability of Instruction-Tuned Models: A Single Punctuation Mark Can Cause Responses to Collapse

This article reports a fundamental vulnerability in instruction-tuned large language models: simple lexical constraints (such as banning a single punctuation mark or a common word) can cost 14-48% of response comprehensiveness. The vulnerability originates in the instruction tuning paradigm itself, not in model size or architecture. Both open-source models and the closed-source GPT-4o-mini are affected, which makes robustness under constraints a concern in its own right.

## [Background] Vulnerability of Instruction-Tuned Models Under Simple Constraints

Instruction tuning lets large language models generate useful responses, but the research team asks whether that usefulness is fragile under simple constraints. The experiments show that banning a single punctuation mark or a common word causes responses to collapse: judges prefer the unconstrained baseline response in 77%-100% of cases. GPT-4o-mini is affected as well, losing 31% of comprehensiveness with a 99% baseline win rate. The authors trace the root cause to the instruction tuning paradigm.

## [Experimental Methods] Design of Model Testing Under Simple Constraints

### Constraint Types
- Punctuation constraints: Ban a single punctuation mark (comma, period, etc.)
- Lexical constraints: Ban common words (e.g., "the", "is")
- Format constraints: Restrict specific output formats
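
To make the punctuation and lexical constraints concrete, here is a minimal sketch of a compliance check. The function name `violates_constraint` and the matching rules (case-insensitive whole-word matching for words, plain substring matching for punctuation) are assumptions for illustration, not the checker used in the study.

```python
import re


def violates_constraint(text: str, banned: str) -> bool:
    """Return True if `text` contains the banned punctuation mark or word."""
    if banned.isalpha():
        # Words are matched case-insensitively on word boundaries,
        # so banning "the" does not flag "another".
        pattern = rf"\b{re.escape(banned)}\b"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    # Anything else (e.g. "," or ".") is matched as a plain substring.
    return banned in text


print(violates_constraint("Cats sleep, dogs bark.", ","))
print(violates_constraint("Another option exists", "the"))
```

A check like this only verifies that the constraint was obeyed; it says nothing about whether the response stayed comprehensive, which is what the evaluation below measures.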

### Evaluation Methods
Responses are compared pairwise: free generation (baseline) versus constrained generation, judged blind by GPT-4o-mini and GPT-4, for a total of 1,920 pairwise comparisons.
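
A blind pairwise comparison of this kind can be sketched as follows. The `judge` callable is hypothetical (it stands in for a call to a judge model and returns `"A"` or `"B"`), and randomizing the presentation order is one common way to blind the judge; the study's exact protocol is not described here.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def baseline_win_rate(
    prompts: Sequence[str],
    baseline: Sequence[str],
    constrained: Sequence[str],
    judge: Judge,
    seed: int = 0,
) -> float:
    """Fraction of pairs in which the judge prefers the baseline response."""
    rng = random.Random(seed)
    wins = 0
    for prompt, base, cons in zip(prompts, baseline, constrained):
        # Shuffle which response is shown first so the judge cannot
        # infer the condition from its position.
        base_first = rng.random() < 0.5
        a, b = (base, cons) if base_first else (cons, base)
        verdict = judge(prompt, a, b)
        # The baseline wins if the judge picked the slot it was placed in.
        wins += (verdict == "A") == base_first
    return wins / len(prompts)


def longer_wins(prompt: str, a: str, b: str) -> str:
    """Toy stand-in judge: prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


rate = baseline_win_rate(
    ["q1", "q2"],
    ["a long detailed answer", "another detailed answer"],
    ["short", "tiny"],
    longer_wins,
)
print(rate)
```

The sketch does not handle ties; a real judge prompt would need a rule for them.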

### Test Models
The tests cover three open-source model families plus the closed-source GPT-4o-mini, to check that the results generalize.

## [Experimental Evidence] Data Performance of Model Collapse Under Constraints

- **Comprehensiveness loss**: Under constraints, response comprehensiveness drops by 14%-48%, and key information goes missing.
- **Baseline win rate**: The unconstrained baseline response is judged better in 77%-100% of cases.
- **Closed-source model vulnerability**: GPT-4o-mini loses 31% of comprehensiveness, with a 99% baseline win rate, showing the problem is not unique to open-source models.
- **MT-Bench reproduction**: The collapse is observed across 8 task categories, including writing, reasoning, and mathematics, which suggests the effect is broad.

## [Mechanism Analysis] Why Do Instruction-Tuned Models Collapse?

### Planning Failure, Not Generation Failure
- Two-pass generation recovery: Generating freely first and then rewriting under the constraint restores 59%-96% of response length. The model can therefore generate under constraints; the problem lies in the initial planning.
- Linear probe prediction: A probe applied before generation predicts response length (R²=0.51-0.93), and R² is positively correlated with the degree of collapse, which suggests that the short response is already decided at the planning stage.

### Instruction Tuning Is the Culprit
- Base models have no systematic collapse: Under the same constraints, the effect on base models without instruction tuning is small and bidirectional.
- Probe fails in base models: The prompt representation of base models cannot predict response length (negative R²), indicating that instruction tuning creates a fragile representation structure.
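
For readers unfamiliar with negative R² values, the sketch below uses the standard coefficient of determination, R² = 1 - SS_res / SS_tot. It is negative whenever the probe's predictions are worse than always predicting the mean length. The numbers are invented for illustration and are not data from the study.

```python
from statistics import fmean
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative response lengths (in tokens) and two hypothetical probes.
lengths = [100.0, 200.0, 300.0, 400.0]
informative_probe = [110.0, 190.0, 310.0, 390.0]    # tracks the true lengths
uninformative_probe = [400.0, 100.0, 350.0, 150.0]  # worse than the mean

print(r_squared(lengths, informative_probe))
print(r_squared(lengths, uninformative_probe))
```

Read this way, a negative R² on base models means the prompt representation carries no linearly decodable signal about the eventual response length.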

Conclusion: Instruction tuning couples task capability to surface-form templates, so capability degrades when the required format deviates from those templates.

## [Evaluation Insights] Blind Spots and Reflections on Current Evaluation Methods

- **Independent evaluation vs pairwise evaluation**: Standard independent LLM-as-judge evaluation, which scores each response on its own, detects an average quality drop of only 3.5%, while pairwise evaluation reveals a 23% drop. Independent evaluation therefore severely underestimates the impact of constraints.
- Insight: Research on constrained generation needs to carefully choose evaluation methods; pairwise evaluation is more sensitive.
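
One plausible reason for the gap is that a coarse absolute scale hides small but consistent differences that a side-by-side comparison picks up. The toy numbers below are invented to illustrate that effect; they are not the study's data, and the helper names `relative_drop` and `win_rate` are assumptions.

```python
from statistics import fmean
from typing import Sequence


def relative_drop(baseline_scores: Sequence[float], constrained_scores: Sequence[float]) -> float:
    """Relative drop in mean score from independent (absolute) ratings."""
    base = fmean(baseline_scores)
    return (base - fmean(constrained_scores)) / base


def win_rate(preferences: Sequence[str]) -> float:
    """Share of pairs in which the judge preferred the baseline."""
    return sum(p == "baseline" for p in preferences) / len(preferences)


# Invented 1-10 ratings: most constrained responses get the same coarse score.
independent_baseline = [8, 9, 8, 9]
independent_constrained = [8, 8, 8, 9]
# Invented side-by-side verdicts for the same four pairs.
pairwise_preferences = ["baseline", "baseline", "baseline", "constrained"]

print(f"independent drop: {relative_drop(independent_baseline, independent_constrained):.1%}")
print(f"baseline win rate: {win_rate(pairwise_preferences):.0%}")
```

In this toy case the absolute ratings barely move while the baseline still wins most comparisons, which is the pattern the thread describes.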

## [Mitigation Directions] Possible Solutions and Future Research

### Mitigation Strategies
- Two-pass generation: First generate freely, then rewrite under constraints to restore quality (though it increases computational cost).
- Diversify training data: Introduce diverse format constraints during instruction tuning to decouple content and form.
- Explicit planning module: Separate planning and generation; first abstractly plan content, then handle format.
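
The two-pass strategy can be sketched in a few lines. The `generate` callable is a hypothetical stand-in for any model call (prompt in, text out), and the wording of the rewrite prompt is an assumption, not the prompt used in the study.

```python
from typing import Callable

# Hypothetical model interface: prompt -> response text.
Generate = Callable[[str], str]


def two_pass_generate(task: str, constraint: str, generate: Generate) -> str:
    """Answer freely first, then rewrite the draft under the constraint."""
    # Pass 1: no constraint, so the model plans a full-length answer.
    draft = generate(task)
    # Pass 2: the constraint is applied to existing content only.
    rewrite_prompt = (
        "Rewrite the text below so that it satisfies this constraint: "
        f"{constraint}\n"
        "Keep all of the content.\n\n"
        f"{draft}"
    )
    return generate(rewrite_prompt)


def echo_model(prompt: str) -> str:
    """Toy stand-in model that reports how long its prompt was."""
    return f"[model output for a prompt of {len(prompt)} characters]"


print(two_pass_generate("Explain how vaccines work.", "Do not use commas.", echo_model))
```

The cost is two model calls per request, which matches the computational overhead noted in the list above.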

### Limitations and Future Work
- Constraint scope: Only lexical-level constraints are tested; need to study the impact of semantic and style constraints.
- Model scope: Need to track the performance of new architectures and training methods.
- Mechanism depth: Need to deeply study how instruction tuning creates a fragile representation structure.

## [Conclusion] Warning Significance of Instruction-Tuned Model Robustness

The paper's title, "One Token Away from Collapse", sums up the finding: a single-token constraint can degrade the performance of an instruction-tuned model. The warning is that benchmark scores are not enough; AI systems also need to keep their capabilities stable under real-world constraints. For practitioners: handle output constraints with care during deployment, and consider the two-pass generation strategy. For researchers: the results open new directions for understanding and improving the mechanisms of instruction tuning.