Zing Forum


Controlled Study on Interpretability and Robustness of Large Language Models: How Faithfulness Training Affects Adversarial Safety

The research project from IIT Jodhpur uses a three-arm controlled experimental design to explore the impact of explanation faithfulness training on the adversarial robustness of large language models, with systematic evaluations conducted on GSM8K, AdvBench, and MT-Bench.

faithfulness, robustness, adversarial attacks, LLM safety, AI alignment, chain-of-thought, interpretable AI, AI safety
Published 2026-04-18 13:10 · Recent activity 2026-04-18 13:23 · Estimated read: 6 min

Section 01

[Introduction] Core Overview of Controlled Study on Interpretability and Robustness of Large Language Models

This master's thesis research, conducted at IIT Jodhpur, uses a three-arm controlled experiment to examine how explanation faithfulness training affects the adversarial robustness of large language models (LLMs). Systematic evaluations are performed on three benchmarks: GSM8K (mathematical reasoning), AdvBench (adversarial safety), and MT-Bench (dialogue utility). The study aims to determine which relationship holds between faithfulness training and robustness (synergy, decoupling, or trade-off), providing guidance for designing safer, more interpretable AI systems.


Section 02

Research Background: Cross-cutting Challenges Between Interpretability and Safety

As LLM capabilities improve, their "black-box" nature raises two core challenges: interpretability (whether the stated reasoning is faithful to the model's internal computations) and safety (resistance to adversarial attacks). Prior research has largely treated the two separately; this study targets the key question: does faithfulness training affect adversarial robustness? The research was conducted by Kancharapu Netaji at IIT Jodhpur under the guidance of Dr. Deeksha Varshney.


Section 03

Experimental Design and Technical Implementation

A three-arm controlled experiment is used to ensure comparability:

  • Arm A (Baseline): cross-entropy loss on the answer only
  • Arm B (Reasoning): cross-entropy loss on the answer plus the reasoning process
  • Arm C (Faithfulness): cross-entropy loss on the answer plus a contrastive faithfulness loss

Statistical significance is supported by training with 3 random seeds × 3 arms = 9 checkpoints. Methodological rigor is reinforced through pre-registration (e.g., submitting evaluation scripts before training). Technically, LoRA is used for parameter-efficient fine-tuning, and the code is organized modularly (e.g., train/, eval/, and scripts/ directories).
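The article does not specify the exact form of the contrastive faithfulness loss, so the Arm C objective can only be sketched. A minimal, hypothetical version combines answer cross-entropy with a hinge-style contrastive term that pushes a faithful explanation's score above an unfaithful one's by a margin (the function names, the margin, and the weighting `lam` are all illustrative assumptions, not the thesis's actual formulation):

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target answer (Arms A, B, C)."""
    return -math.log(probs[target_idx])

def contrastive_faithfulness(score_faithful, score_unfaithful, margin=1.0):
    """Hypothetical hinge-style contrastive term: penalize the model unless
    the faithful explanation outscores the unfaithful one by `margin`."""
    return max(0.0, margin - (score_faithful - score_unfaithful))

def arm_c_loss(probs, target_idx, score_faithful, score_unfaithful, lam=0.5):
    """Arm C sketch: answer cross-entropy + weighted contrastive term."""
    return (cross_entropy(probs, target_idx)
            + lam * contrastive_faithfulness(score_faithful, score_unfaithful))

# When the faithful explanation already wins by more than the margin,
# the contrastive term vanishes and only the cross-entropy remains.
loss = arm_c_loss([0.1, 0.8, 0.1], 1, score_faithful=2.0, score_unfaithful=0.2)
```

Under this sketch, Arm A corresponds to `cross_entropy` alone, and Arm C adds the contrastive term; Arm B would instead extend the cross-entropy target to include the reasoning tokens.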

Section 04

Multi-dimensional Evaluation (Evidence)

Each checkpoint is evaluated along three dimensions:

  1. Faithfulness (GSM8K): compare the consistency between the generated reasoning and the actual computation process;
  2. Adversarial robustness (AdvBench, 200 prompts): evaluate against a fixed snapshot, with hash values submitted to ensure reproducibility (the original prompts are released only after verification);
  3. Utility (MT-Bench, 80 prompts): evaluate helpfulness in dialogue scenarios.

The study also plans to analyze internal representation mechanisms via the residual stream and the refusal direction.
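The article says hash values of the fixed AdvBench snapshot are submitted for reproducibility but does not describe the procedure; a minimal sketch (with a hypothetical helper name and canonicalization choice) is to commit a SHA-256 digest of the prompt set before any evaluation runs:

```python
import hashlib

def snapshot_hash(prompts):
    """SHA-256 digest of a prompt set, as a hex string.

    Sorting before joining makes the digest independent of prompt order,
    so anyone holding the same snapshot computes the same hash even if
    their file lists the prompts differently."""
    canonical = "\n".join(sorted(prompts)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Commit this digest alongside the pre-registered evaluation scripts;
# reviewers can later verify the released prompts against it.
digest = snapshot_hash(["example adversarial prompt A", "example adversarial prompt B"])
```

Publishing only the digest (not the prompts) matches the article's note that the original prompts are obtained after verification: the hash pins the snapshot without disclosing its contents.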

Section 05

Research Significance and Potential Impact

  • Theoretical: if the study confirms that faithfulness training improves robustness, it will support the "interpretability-safety synergy" hypothesis and promote the integration of the two fields;
  • Practical: organizations deploying LLMs could improve safety through the same interpretability tooling;
  • Methodological: it demonstrates how a master's project can conduct high-quality AI safety research; the three-arm control and pre-registration are practices worth emulating.

Section 06

Limitations and Future Directions

Limitations:

  • Model scale: based on medium-sized open-source models (e.g., the Llama series); verification on larger commercial models is needed;
  • Task scope: focuses on mathematical reasoning and safety refusal; other tasks (code, medical Q&A) remain to be explored;
  • Faithfulness measurement: definitions and measurement methods are still open questions, which may affect the conclusions.

Future directions:

  • Reproduce the experiments on larger models; expand the task domains; analyze internal representation mechanisms in depth; develop joint optimization objectives.