Zing Forum

Reading

TrustMH-Bench: A Credibility Evaluation Benchmark for Large Models in Mental Health Counseling Scenarios

TrustMH-Bench is a credibility evaluation benchmark specifically designed for large language models (LLMs) in the mental health counseling domain. It systematically assesses LLMs' performance in sensitive counseling scenarios across four dimensions: privacy protection, safety, jailbreak resistance, and fairness.

大语言模型心理健康AI咨询可信度评估隐私保护AI安全越狱攻击公平性基准测试开源数据集
Published 2026-05-04 17:14Recent activity 2026-05-04 17:20Estimated read 6 min
TrustMH-Bench: A Credibility Evaluation Benchmark for Large Models in Mental Health Counseling Scenarios
1

Section 01

[Introduction] TrustMH-Bench: A Credibility Evaluation Benchmark for Large Models in Mental Health Counseling Scenarios

TrustMH-Bench is a credibility evaluation benchmark for large language models (LLMs) designed for the mental health counseling domain. It systematically assesses LLMs' performance in sensitive counseling scenarios across four dimensions: privacy protection, safety, jailbreak resistance, and fairness. It fills the gap where traditional general-purpose evaluation benchmarks (such as MMLU, HumanEval) fail to capture the unique risks in mental health scenarios. As an open-source comprehensive evaluation dataset, it provides a specialized assessment tool for researchers, developers, and regulators.

2

Section 02

Background: The Rise of AI Mental Health Counseling and Trust Challenges

In recent years, LLMs have shown great potential in the mental health counseling domain and have become an important supplement to global services. However, the risks of sensitive privacy sharing and secondary harm from inappropriate responses pose trust challenges. Traditional benchmarks focus on general knowledge and reasoning and are difficult to cover the unique risks in mental health scenarios, which led to the emergence of TrustMH-Bench.

3

Section 03

Core Evaluation Dimensions: A Comprehensive Review of Credibility Across Four Dimensions

TrustMH-Bench evaluates across four dimensions:

  1. Privacy Protection: Identify and handle sensitive information, avoid leaks, remind of boundaries, resist privacy extraction attacks, and comply with regulations such as GDPR/HIPAA;
  2. Safety: Identify crisis signals, avoid inappropriate advice, exercise caution with medical recommendations, and maintain a professional stance;
  3. Jailbreak Resistance: Resist attacks such as inducing the generation of psychological manipulation strategies, dangerous advice, and bypassing safety guardrails;
  4. Fairness: Detect issues like stereotypes, cultural biases, neglect of minority needs, and language discrimination.
4

Section 04

Dataset Construction: Multi-source Integration and Ethical Assurance

The dataset adopts a multi-source integration strategy: expert annotation (scenarios designed by counselors/psychologists), literature mining (typical counseling situations), adversarial generation (red-team boundary use cases), and desensitized real cases (privacy-processed dialogue fragments). Each use case undergoes multiple rounds of review to ensure evaluation value and ethical compliance.

5

Section 05

Application Value: An Evaluation Tool Benefiting Multiple Parties

TrustMH-Bench provides tools for multiple parties:

  • Model developers: Conduct safety assessments during training and fine-tuning phases to fix potential issues;
  • Application developers: Perform security audits before product launch;
  • Researchers: Use a standardized comparison framework to support horizontal comparisons;
  • Regulators: Refer to it for compliance evaluation.
6

Section 06

Limitations and Outlook: Directions for Continuous Improvement

Current limitations: Focuses mainly on English scenarios, uses static test cases (insufficient long-term safety evaluation of dynamic dialogues), and needs to expand coverage of cultural fairness. Future plans: Incorporate community feedback, expand the dataset's coverage and depth, and explore integration with real clinical environments.

7

Section 07

Conclusion: Credible Evolution in the AI Mental Health Domain

TrustMH-Bench marks the evolution of AI mental health from 'functionally usable' to 'safe and credible'. Domain-specific evaluation benchmarks have become an important guarantee for applications in sensitive scenarios. This open-source project is worth attention and participation.