Section 01
LLMSurgeon: A New Approach to Reverse-Engineering an LLM's "Digital DNA"
LLMSurgeon is a framework that infers the domain distribution of an LLM's pre-training data solely from text the model generates. It targets the "black box" problem of LLMs, where the composition of the training data is often undisclosed, and it supports auditing, transparency, and accountability for AI models. Its key contributions are solving the data mixture inference problem with calibrated confusion matrices and constrained optimization, and a verifiable evaluation suite (LLMScan) that supports its effectiveness.
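To make the idea of combining a confusion matrix with constrained optimization concrete, here is a minimal sketch in Python. It is not LLMSurgeon's actual implementation. It assumes a domain classifier whose confusion matrix C has been estimated on labeled data (C[i][j] is the probability that text from true domain i is predicted as domain j), and it assumes q is the distribution of predicted domains over the model's generated text. The mixture p is then recovered by minimizing ||C^T p - q||^2 subject to p lying on the probability simplex. The function names, the example numbers, and the choice of projected gradient descent as the solver are all assumptions made for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (entries >= 0, sum to 1)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold from the last index that satisfies the condition
    return [max(x - theta, 0.0) for x in v]


def estimate_mixture(confusion, observed, steps=2000, lr=0.2):
    """Hypothetical helper: estimate the domain mixture p from predicted-label frequencies.

    confusion[i][j] = P(classifier predicts domain j | true domain i)
    observed[j]     = fraction of generated samples predicted as domain j
    Minimizes ||C^T p - q||^2 over the simplex with projected gradient descent.
    """
    n_true = len(confusion)
    n_pred = len(observed)
    p = [1.0 / n_true] * n_true  # start from the uniform mixture
    for _ in range(steps):
        # Predicted-label distribution implied by the current mixture estimate.
        implied = [sum(p[i] * confusion[i][j] for i in range(n_true)) for j in range(n_pred)]
        residual = [implied[j] - observed[j] for j in range(n_pred)]
        # Gradient of the squared error with respect to p.
        grad = [2.0 * sum(confusion[i][j] * residual[j] for j in range(n_pred))
                for i in range(n_true)]
        p = project_to_simplex([p[i] - lr * grad[i] for i in range(n_true)])
    return p


# Illustrative numbers only (assumed, not taken from the source).
confusion = [
    [0.80, 0.10, 0.10],
    [0.15, 0.70, 0.15],
    [0.05, 0.15, 0.80],
]
true_mixture = [0.5, 0.3, 0.2]
# Simulate what a classifier would report for text drawn from true_mixture.
observed = [sum(true_mixture[i] * confusion[i][j] for i in range(3)) for j in range(3)]

estimated = estimate_mixture(confusion, observed)
print([round(x, 4) for x in estimated])
```

The simplex constraint is what makes the optimization "constrained": without it, noise in the observed frequencies could push the estimate to negative or non-normalized proportions. In practice the observed frequencies come from a finite sample, so the recovered mixture is an estimate rather than an exact inversion.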