Modern large language model deployment faces a common challenge: loading very large model weights into GPU memory as quickly as possible. Storage formats such as Hugging Face's SafeTensors and the ServerlessLLM format have different access patterns, and the hardware platform and file system configuration both significantly affect I/O performance.
Traditional approaches usually commit to a fixed loading strategy, such as always using memory mapping (mmap) or always using asynchronous I/O. A fixed strategy cannot adapt to diverse workloads. Tensora's core insight is that no single I/O strategy is optimal in all scenarios, so the choice should be made dynamically based on checkpoint size, shard structure, and platform capabilities.
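To make the idea of dynamic selection concrete, the following is a minimal sketch of what such a decision function could look like. It is not Tensora's actual policy: the names `CheckpointInfo`, `PlatformInfo`, `choose_strategy`, the 8 GiB threshold, and the direction of each rule are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class IOStrategy(Enum):
    MMAP = "mmap"
    ASYNC_IO = "async_io"


@dataclass(frozen=True)
class CheckpointInfo:
    """Hypothetical summary of a checkpoint on disk."""
    total_bytes: int
    num_shards: int


@dataclass(frozen=True)
class PlatformInfo:
    """Hypothetical summary of what the host supports."""
    supports_async_io: bool


# Illustrative threshold only; a real system would calibrate this by measurement.
LARGE_CHECKPOINT_BYTES = 8 * 1024**3


def choose_strategy(ckpt: CheckpointInfo, platform: PlatformInfo) -> IOStrategy:
    """Pick an I/O strategy from checkpoint size, shard structure, and platform."""
    # Platform capability: fall back to mmap when async I/O is unavailable.
    if not platform.supports_async_io:
        return IOStrategy.MMAP
    # Assumed rule: large or multi-shard checkpoints benefit from async reads.
    if ckpt.total_bytes >= LARGE_CHECKPOINT_BYTES or ckpt.num_shards > 1:
        return IOStrategy.ASYNC_IO
    # Small single-file checkpoints: keep the simpler mmap path.
    return IOStrategy.MMAP
```

The point of the sketch is only the shape of the decision: the three inputs named above feed a single function, so the policy can be tuned or replaced without touching the loaders themselves.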