Section 01
Introduction: Microglia-Inspired Dynamic Pruning Optimizes Inference Models
Inspired by the way microglia selectively prune synapses in the brain, researchers developed a dynamic attention-head pruning system. On the Phi-3-Mini model, the system prunes 20-30% of attention heads with minimal accuracy loss and reduces inference latency by 10-15%, offering a new way to address the inference cost of large language models.
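
To make the idea of attention-head pruning concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation: the function names, the importance scores, and the "drop the lowest-scoring fraction" rule are all assumptions chosen for illustration. In a dynamic system the scores would presumably be recomputed for each input; here they are fixed values.

```python
from typing import List


def head_keep_mask(importance: List[float], prune_fraction: float) -> List[bool]:
    """Return a keep-mask that drops the lowest-scoring heads (assumed rule)."""
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")
    n_prune = int(len(importance) * prune_fraction)
    # Indices sorted by ascending importance; the first n_prune are pruned.
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(importance))]


def apply_head_mask(head_outputs: List[List[float]], keep: List[bool]) -> List[List[float]]:
    """Zero the output vector of every pruned head; kept heads pass through."""
    return [vec if k else [0.0] * len(vec) for vec, k in zip(head_outputs, keep)]


# Hypothetical importance scores for 8 heads in one layer (illustrative only).
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.6]
keep = head_keep_mask(scores, prune_fraction=0.25)  # prunes heads 3 and 1
```

Note that zeroing a head's output only simulates pruning. A latency gain of the kind reported above would require actually skipping the computation for pruned heads, which this sketch does not attempt.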