Checkpoint loading is often a performance bottleneck when deploying Large Language Models (LLMs) for inference. As models grow, checkpoint files reach tens or even hundreds of gigabytes, and traditional synchronous loading causes significant startup delays. The best I/O strategy depends heavily on the scenario: synchronous reads may be fastest for a small, single-shard model, while large, multi-shard models call for techniques such as asynchronous I/O or memory mapping.
Developers usually have to choose manually among several I/O backends, including synchronous POSIX reads, Tokio-based asynchronous I/O, Linux io_uring, and memory mapping, each of which suits some scenarios and has its own limitations. This complexity makes deployment harder and easily leads to suboptimal choices.
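To make the manual decision concrete, here is a minimal sketch of the kind of heuristic a developer might hand-write today. It is an illustration only: the `IoBackend`, `CheckpointInfo`, and `choose_backend` names, the 4 GiB threshold, and the ordering of the fallback rules are all assumptions and do not come from any particular library or from measurements.

```rust
/// The four backends named in the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IoBackend {
    SyncPosix,
    TokioAsync,
    IoUring,
    Mmap,
}

/// Hypothetical summary of a checkpoint on disk.
struct CheckpointInfo {
    total_bytes: u64,
    shard_count: usize,
}

/// Assumed cutoff for a "small" model; illustrative, not a measured value.
const SMALL_MODEL_BYTES: u64 = 4 * 1024 * 1024 * 1024;

/// Hand-written heuristic of the kind the text describes.
/// `io_uring_available` is supplied by the caller (for example, from a
/// platform or kernel check performed elsewhere).
fn choose_backend(info: &CheckpointInfo, io_uring_available: bool) -> IoBackend {
    // Small, single-shard checkpoints: plain synchronous reads may be fastest.
    if info.shard_count <= 1 && info.total_bytes <= SMALL_MODEL_BYTES {
        return IoBackend::SyncPosix;
    }
    // Assumption: prefer io_uring for larger checkpoints when it is available.
    if io_uring_available {
        return IoBackend::IoUring;
    }
    // Assumption: without io_uring, read many shards concurrently via Tokio,
    // and fall back to memory mapping for a single large file.
    if info.shard_count > 1 {
        IoBackend::TokioAsync
    } else {
        IoBackend::Mmap
    }
}
```

Even this small function hard-codes a threshold and a preference order that may be wrong for a given disk, filesystem, or kernel, which is exactly why hand-picked rules tend to produce suboptimal choices.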