Zing Forum


SilentBench: A Systematic Benchmark Revealing the "Output Suppression" Phenomenon in Large Language Models

The first benchmark dedicated to studying output suppression patterns in large language models. By comparing base models with instruction-tuned models, it finds that RLHF training produces consistent suppression signatures across specific categories.

Large Language Models · RLHF · Output Suppression · Benchmark · Model Alignment · AI Safety · Instruction Tuning · Model Evaluation
Published 2026-05-06 20:14 · Recent activity 2026-05-06 20:23 · Estimated read 5 min

Section 01

SilentBench: A Systematic Benchmark Revealing the "Output Suppression" Phenomenon in Large Language Models (Introduction)

SilentBench is the first open-source benchmark dedicated to studying the "output suppression" phenomenon in large language models. By comparing base models with their instruction-tuned counterparts, it reveals that RLHF training produces consistent suppression signatures in specific categories. This article covers its background, methodology, key evidence, conclusions, and future directions.


Section 02

Background: Definition and Research Motivation of Output Suppression

Output suppression refers to the phenomenon where a model internally forms a tendency toward an answer but ultimately does not emit it, which is distinct from an explicit refusal to answer. Traditional research focuses on what a model says while ignoring what it almost says but doesn't. The research team poses a core question: do RLHF and instruction tuning systematically change a model's generation boundaries?


Section 03

Methodology: Design and Implementation of the SilentBench Benchmark

  • Dataset: 35,000 records covering paired tests of base and instruction-tuned versions from 4 model families (OPT, Gemma, Llama3.1, Mistral).
  • Test Categories: Safety, Factual, Controversial Factual, Knowledge Boundary, Creative.
  • Model Comparison Matrix:
    | Model Family | Base Model               | Instruction-Tuned Version          |
    | OPT          | facebook/opt-1.3b        | facebook/opt-iml-1.3b              |
    | Gemma        | google/gemma-2b          | google/gemma-2b-it                 |
    | Llama        | meta-llama/Llama-3.1-8B  | meta-llama/Llama-3.1-8B-Instruct   |
    | Mistral      | mistralai/Mistral-7B-v0.1 | mistralai/Mistral-7B-Instruct-v0.2 |
  • Technical Implementation: Reproducible code is provided, including environment setup (pip install -r requirements.txt), experiment script execution, and result analysis (results stored in results/complete_stats_table.csv).
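The comparison matrix and the result-analysis step can be sketched in stdlib Python. Note that the benchmark only specifies the output path results/complete_stats_table.csv, so the CSV column names used here (category, base_score, instruct_score) are assumptions, not the real schema:

```python
import csv
from collections import defaultdict

# Base / instruction-tuned pairs from the comparison matrix above.
MODEL_PAIRS = {
    "OPT": ("facebook/opt-1.3b", "facebook/opt-iml-1.3b"),
    "Gemma": ("google/gemma-2b", "google/gemma-2b-it"),
    "Llama": ("meta-llama/Llama-3.1-8B", "meta-llama/Llama-3.1-8B-Instruct"),
    "Mistral": ("mistralai/Mistral-7B-v0.1", "mistralai/Mistral-7B-Instruct-v0.2"),
}


def suppression_by_category(rows):
    """Average base-minus-instruct score gap per test category.

    Each row is a dict with hypothetical keys: category, base_score,
    instruct_score (the real CSV schema is not documented in this summary).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        gap = float(row["base_score"]) - float(row["instruct_score"])
        sums[row["category"]] += gap
        counts[row["category"]] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def load_stats(path="results/complete_stats_table.csv"):
    """Read the benchmark's aggregated statistics table and summarize it."""
    with open(path, newline="") as f:
        return suppression_by_category(csv.DictReader(f))
```

A larger gap for a category would indicate that instruction tuning suppresses more of the base model's answers there.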

Section 04

Evidence: Key Findings of Output Suppression

  1. Suppression is perfectly consistent across runs (standard deviation = 0.000), indicating a deterministic behavioral pattern;
  2. RLHF produces category-specific suppression signatures, with similar distributions across model families;
  3. The suppression effect is strongest in the Safety (Cohen's d = 1.73) and Controversial Factual (d = 1.49) categories;
  4. Small models (1B-8B parameters) achieve alignment through output suppression, with almost no hard refusals.
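The effect sizes in finding 3 are Cohen's d values, the standardized difference between two group means. As a reminder of what that measures, here is a minimal stdlib sketch (not the benchmark's own code):

```python
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# A reported d of 1.73 (Safety) means the base and instruction-tuned score
# distributions are separated by roughly 1.7 pooled standard deviations,
# which is conventionally a very large effect.
```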

Section 05

Conclusions: Research Significance and Impact of SilentBench

  • AI Safety: Provides a new analytical dimension for the internal knowledge boundaries of models;
  • Model Development: Reveals that RLHF may cause knowledge suppression, raising the ethical question of "who has the right to decide which topics the model should remain silent on";
  • Transparency: The open-source tool promotes comprehensive evaluation of model behavioral characteristics, aligning with the spirit of the AI transparency movement.

Section 06

Recommendations: Current Limitations and Future Research Directions

  • Limitations: The paper is pending publication; tests are based on English corpora only; the static prompt sets do not cover dynamic, multi-turn dialogue;
  • Future Directions: Multilingual expansion; research on cumulative effects of suppression in dialogue contexts; exploration of suppression removal methods; user perception research.