Section 01
[Introduction] Core Overview of Research on Implicit Ethical Alignment of Large Language Models
This research examines the internal activation patterns of large language models (LLMs) in policy-selection tasks, investigates their implicit ethical alignment mechanisms, and compares them with classic ethical frameworks such as utilitarianism, fairness and justice, and the categorical imperative. The study aims to reveal whether implicit ethical representations form inside these models, offering new directions for safe AI deployment, improved interpretability, and the correction of value bias.
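One common way to test whether such implicit representations exist is to train a linear probe on hidden activations: if a simple classifier can separate activations elicited by prompts framed under different ethical stances, the model plausibly encodes that distinction internally. The sketch below is illustrative only, with synthetic data standing in for real LLM activations; the cluster means, dimensions, and the name `probe_accuracy` are assumptions, not part of the study described above.

```python
# Minimal sketch of a linear probe on hidden activations.
# ASSUMPTION: real usage would extract per-layer activations from an LLM
# for prompts framed under two ethical stances (e.g. utilitarian vs.
# deontological); here synthetic Gaussian clusters stand in for them.
import numpy as np

rng = np.random.default_rng(0)

d = 64  # hypothetical activation dimension
utilitarian = rng.normal(loc=0.3, scale=1.0, size=(200, d))
deontological = rng.normal(loc=-0.3, scale=1.0, size=(200, d))
X = np.vstack([utilitarian, deontological])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Shuffle, then split into train and test sets.
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    z = X_train @ w + b
    p = 1.0 / (1.0 + np.exp(-z))          # sigmoid probabilities
    grad_w = X_train.T @ (p - y_train) / len(y_train)
    grad_b = np.mean(p - y_train)
    w -= lr * grad_w
    b -= lr * grad_b

def probe_accuracy(X, y, w, b):
    """Fraction of examples the linear probe classifies correctly."""
    preds = (X @ w + b) > 0
    return float(np.mean(preds == y))

acc = probe_accuracy(X_test, y_test, w, b)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy on held-out data would suggest the two framings are linearly separable in activation space; in practice one would run this probe layer by layer to locate where such a distinction emerges.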