# Controlled Study on Interpretability and Robustness of Large Language Models: How Faithfulness Training Affects Adversarial Safety

> The research project from IIT Jodhpur uses a three-arm controlled experimental design to explore the impact of explanation faithfulness training on the adversarial robustness of large language models, with systematic evaluations conducted on GSM8K, AdvBench, and MT-Bench.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-18T05:10:32.000Z
- Last activity: 2026-04-18T05:23:09.076Z
- Popularity: 141.8
- Keywords: faithfulness, robustness, adversarial attacks, LLM safety, AI alignment, chain-of-thought, explainable AI, AI safety
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-netajik-mtp-faithfulness-robustness
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-netajik-mtp-faithfulness-robustness

---

## [Introduction] Core Overview of Controlled Study on Interpretability and Robustness of Large Language Models

The master's thesis research, conducted at IIT Jodhpur, uses a three-arm controlled experiment to investigate how explanation faithfulness training affects the adversarial robustness of large language models (LLMs). Systematic evaluations are performed on three benchmarks: GSM8K (mathematical reasoning), AdvBench (adversarial safety), and MT-Bench (dialogue utility). The study aims to determine which relationship holds between faithfulness training and robustness (synergy, decoupling, or trade-off), providing guidance for designing safer, more interpretable AI systems.

## Research Background: Cross-cutting Challenges Between Interpretability and Safety

As LLM capabilities grow, their "black-box" nature raises two core challenges: interpretability (whether the stated reasoning is faithful to the model's internal computations) and safety (resistance to adversarial attacks). Prior research has largely treated the two separately; this study targets the key question: does faithfulness training affect adversarial robustness? The research was conducted by Kancharapu Netaji at IIT Jodhpur under the guidance of Dr. Deeksha Varshney.

## Experimental Design and Technical Implementation

A three-arm controlled experiment is used to ensure comparability:
- Arm A (Baseline): Cross-entropy loss only (answer)
- Arm B (Reasoning): Cross-entropy (answer + reasoning process)
- Arm C (Faithfulness): Cross-entropy (answer) + contrastive faithfulness loss

Statistical reliability is supported by 3 random seeds × 3 arms = 9 checkpoints. Methodological rigor is reinforced through pre-registration (e.g., committing evaluation scripts before training). Technically, LoRA is used for parameter-efficient fine-tuning, and the code is organized modularly (e.g., train/, eval/, and scripts/ directories).
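The three arms differ only in how the training objective is composed. A minimal sketch in plain Python with hypothetical scalar inputs; the thesis does not publish its loss, so the hinge-style contrastive form and the weight `lam` below are assumptions:

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target_idx])

def contrastive_faithfulness_loss(sim_faithful, sim_unfaithful, margin=1.0):
    """Hinge-style contrastive term (assumed form): push the faithful
    explanation's similarity above the unfaithful one's by >= margin."""
    return max(0.0, margin - (sim_faithful - sim_unfaithful))

def arm_loss(arm, answer_ce, reasoning_ce=0.0, contrastive=0.0, lam=0.5):
    """Compose the per-example loss for each experimental arm."""
    if arm == "A":   # Baseline: answer cross-entropy only
        return answer_ce
    if arm == "B":   # Reasoning: answer + chain-of-thought cross-entropy
        return answer_ce + reasoning_ce
    if arm == "C":   # Faithfulness: answer CE + weighted contrastive term
        return answer_ce + lam * contrastive
    raise ValueError(f"unknown arm: {arm}")
```

Keeping the answer cross-entropy identical across arms means any robustness difference can be attributed to the added term rather than to a changed base objective.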

## Multi-dimensional Evaluation Dimensions (Evidence)

Each checkpoint is evaluated along three dimensions:
1. Faithfulness (GSM8K): measure the consistency between the generated reasoning and the model's actual computation;
2. Adversarial robustness (AdvBench, 200 prompts): evaluate against a fixed snapshot, with a committed hash to ensure reproducibility (the original prompts are released only after verification);
3. Utility (MT-Bench, 80 prompts): evaluate helpfulness in dialogue scenarios.

An analysis of internal representations via the residual stream and the refusal direction is also planned.
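The hash-commitment step for the frozen AdvBench snapshot can be sketched as follows. The prompts shown are placeholders, not the actual benchmark data, and the canonicalization choice (UTF-8 JSON) is an assumption:

```python
import hashlib
import json

def snapshot_hash(prompts):
    """SHA-256 over a canonical serialization of an ordered prompt list.
    Publishing this digest before training shows the evaluation set
    was frozen in advance and not changed afterwards."""
    canonical = json.dumps(list(prompts), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Placeholder prompts for illustration only.
snapshot = ["example adversarial prompt 1", "example adversarial prompt 2"]
digest = snapshot_hash(snapshot)
```

At evaluation time, recomputing the digest over the shipped snapshot and comparing it with the pre-registered value detects any tampering or accidental edits to the prompt set.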

## Research Significance and Potential Impact

- Theoretical: if faithfulness training is shown to improve robustness, the result supports the "interpretability-safety synergy" hypothesis and encourages integration of the two fields;
- Practical: organizations deploying LLMs could use interpretability-oriented training to improve safety at the same time;
- Methodological: the project demonstrates how a master's thesis can deliver high-quality AI safety research; the three-arm control and pre-registration are worth emulating.

## Limitations and Future Directions

**Limitations**:

- Model scale: based on medium-sized open-source models (e.g., the Llama series); verification on larger commercial models is needed;
- Task scope: focuses on mathematical reasoning and safety refusal; other tasks (code, medical Q&A) remain to be explored;
- Faithfulness measurement: definitions and measurement methods are still open questions, which may affect the conclusions.

**Future Directions**:

- Reproduce on larger models; expand task domains; analyze internal representation mechanisms in depth; develop joint optimization objectives.
