Section 01
[Introduction] Core Overview of Controlled Study on Interpretability and Robustness of Large Language Models
This master's thesis research, conducted at IIT Jodhpur, uses a three-arm controlled experiment to examine how explanation-faithfulness training affects the adversarial robustness of large language models (LLMs). Systematic evaluations are performed on three benchmarks: GSM8K (mathematical reasoning), AdvBench (adversarial safety), and MT-Bench (dialogue utility). The study aims to determine which relationship holds between faithfulness training and robustness (synergy, decoupling, or trade-off), providing guidance for designing safer and more interpretable AI systems.
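The design above can be pictured as a small evaluation grid plus an outcome classifier. The sketch below is purely illustrative, not the study's actual code: the arm names (`control`, `faithfulness_trained`, `faithfulness_plus_adversarial`) and the threshold `eps` are assumptions, since the source does not specify them; only the three benchmark names come from the text.

```python
from itertools import product

# Hypothetical arm names -- the source describes a three-arm design
# but does not name the arms.
ARMS = ["control", "faithfulness_trained", "faithfulness_plus_adversarial"]

# Benchmark names and their roles, taken directly from the study description.
BENCHMARKS = {
    "GSM8K": "mathematical reasoning",
    "AdvBench": "adversarial safety",
    "MT-Bench": "dialogue utility",
}

def evaluation_grid(arms, benchmarks):
    """Enumerate every (arm, benchmark) cell the study would evaluate."""
    return list(product(arms, benchmarks))

def classify_relationship(d_faithfulness, d_robustness, eps=0.01):
    """Label the relationship between faithfulness training and robustness.

    d_faithfulness / d_robustness are score deltas of a trained arm
    versus the control arm; eps is an assumed significance threshold.
    """
    if d_faithfulness > eps and d_robustness > eps:
        return "synergy"          # both metrics improve together
    if d_faithfulness > eps and d_robustness < -eps:
        return "trade-off"        # faithfulness gains cost robustness
    return "decoupling"           # metrics move independently

grid = evaluation_grid(ARMS, BENCHMARKS)
print(len(grid))  # 9 cells: each of 3 arms on each of 3 benchmarks
```

The grid makes explicit that every arm is scored on all three benchmarks, so the three outcome patterns can be distinguished by comparing deltas against the control arm.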