Section 01
[Introduction] PMC-InterCPT: Interleaved Medical Multimodal Data for Better Performance with Fewer Tokens
PMC-InterCPT is a medical multimodal pre-training dataset introduced in a paper posted to arXiv on May 31, 2026. Its core goal is to address the quality and efficiency problems of traditional medical multimodal data. Key innovations include: integrating the text passages that reference each figure or table so that images come with complete context, recovering missing titles, improving data quality through LLM-supervised filtering, and adopting a four-bucket evidence classification method to resolve modality imbalance. Validation on the Qwen3.5-4B-Base model shows that the dataset significantly improves medical multimodal performance while using fewer pre-training tokens and maintaining general multimodal capabilities. Original paper link: http://arxiv.org/abs/2606.01049v1