Section 01
Introduction: A New Path for Safety Alignment During Inference
Introduces Robust Deliberative Alignment, a new technique for improving large language model (LLM) safety at inference time. By tracing unsafe behaviors back to characteristics of the underlying model, it strengthens safety without retraining, addressing the limitations of traditional training-phase alignment (e.g., RLHF): high cost, incomplete coverage, and rigidity.
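To make the contrast with training-phase alignment concrete, the sketch below illustrates the general idea of an inference-time safety step: the frozen model is first asked to deliberate over a safety policy, and its answer is gated on that deliberation. This is a generic, hypothetical illustration, not the Robust Deliberative Alignment procedure itself; the `generate` callable, `SAFETY_POLICY` text, and `deliberate_then_answer` function are all assumed names introduced here for clarity.

```python
# A minimal, generic sketch of inference-time safety deliberation.
# NOTE: this is NOT the Robust Deliberative Alignment algorithm described
# above; it only illustrates adding a safety check at inference time
# instead of retraining. `generate` is a hypothetical stand-in for any
# text-generation call (local model or hosted API).

from typing import Callable

SAFETY_POLICY = (
    "Refuse requests that seek instructions for wrongdoing; "
    "answer benign requests helpfully."
)


def deliberate_then_answer(user_prompt: str,
                           generate: Callable[[str], str]) -> str:
    # Step 1: ask the frozen, unmodified model to reason about the request
    # against the safety policy before producing a user-facing answer.
    deliberation = generate(
        f"Policy: {SAFETY_POLICY}\n"
        f"Request: {user_prompt}\n"
        "Does answering this request violate the policy? "
        "Reply SAFE or UNSAFE with a one-sentence justification."
    )

    # Step 2: gate the final answer on the deliberation outcome.
    if deliberation.strip().upper().startswith("UNSAFE"):
        return "I can't help with that request."
    return generate(user_prompt)


if __name__ == "__main__":
    # Toy generator so the sketch runs without any model or API key.
    def toy_generate(prompt: str) -> str:
        if "Policy:" in prompt:
            return "SAFE: the request is benign."
        return "Here is a simple pancake recipe..."

    print(deliberate_then_answer("Give me a pancake recipe.", toy_generate))
```

Because the safety check lives entirely in the inference loop, the policy text can be revised or extended without touching model weights, which is the property the section contrasts with the cost and rigidity of training-phase alignment.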