PMC-InterCPT: Interleaved Medical Multimodal Pre-training Data for Stronger Medical Understanding with Fewer Tokens

PMC-InterCPT improves medical multimodal performance on Qwen3.5-4B-Base while using fewer pre-training tokens. It does so by integrating the text that references each figure, recovering missing captions, and resampling the data with a four-bucket evidence classification.

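Tags: PMC-InterCPT, medical multimodal, continual pre-training, interleaved data, four-bucket classification, LLM-supervised filtering, medical VLM, data quality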
Published 2026-05-31 14:38 | Recent activity 2026-06-02 11:30 | Estimated read 8 min

Section 01

[Introduction] PMC-InterCPT: Interleaved Medical Multimodal Data for Better Performance with Fewer Tokens

PMC-InterCPT is a medical multimodal pre-training dataset introduced in an arXiv preprint (May 31, 2026). Its core goal is to address the quality and efficiency problems of traditional medical multimodal data. Key innovations include: integrating the text that references each figure to provide complete context, recovering missing captions, improving data quality via LLM-supervised filtering, and adopting a four-bucket evidence classification method to resolve modality imbalance. Validation on the Qwen3.5-4B-Base model shows that the dataset significantly improves medical multimodal performance with fewer pre-training tokens while preserving general multimodal capabilities. Original paper link: http://arxiv.org/abs/2606.01049v1

Section 02

Background: Data Pain Points in Medical Multimodal Pre-training

Medical multimodal models rely on large-scale image-text data, but traditional data construction has the following issues:

  1. Caption Limitations: Figure captions are short, carry limited information, depend on the surrounding context, and lack the explanation given in the body text;
  2. Structural Noise: Automatic extraction introduces problems such as missing captions, residual tags, and repeated context;
  3. Continual Pre-training Needs: Base models require more specialized, high-quality data, and noise can interfere with representations they have already learned.

Section 03

Methodology: Core Design and Processing Pipeline of PMC-InterCPT

Core Innovations

Integrate the text that references each figure with the figure itself to form interleaved image-text sequences, mirroring how a human reads a paper.
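To make the idea of an interleaved sequence concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Segment` type, the `interleave` function, the figure-reference pattern ("Figure 1", "Fig. 1"), and the choice to drop paragraphs that cite no figure are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str     # "text" or "image"
    content: str  # paragraph text, or an image identifier


# Assumed citation style: "Figure 2", "Fig. 2", "Fig 2" (case-insensitive).
FIG_REF = re.compile(r"\bfig(?:ure)?\.?\s*(\d+)", re.IGNORECASE)


def interleave(paragraphs, figures):
    """Build an interleaved image-text sequence in reading order.

    paragraphs: list of paragraph strings in document order.
    figures: dict mapping figure number (int) to an image identifier.

    Each figure is placed right after the first paragraph that cites it.
    Paragraphs that cite no known figure are skipped (an assumption here).
    """
    sequence, placed = [], set()
    for para in paragraphs:
        cited = [int(n) for n in FIG_REF.findall(para)]
        cited = [n for n in cited if n in figures]
        if not cited:
            continue
        sequence.append(Segment("text", para))
        for n in cited:
            if n not in placed:
                sequence.append(Segment("image", figures[n]))
                placed.add(n)
    return sequence


paragraphs = [
    "Intro text.",
    "As shown in Figure 1, the lesion is visible.",
    "Fig. 2 and Figure 1 compare outcomes.",
]
figures = {1: "f1.png", 2: "f2.png"}
sequence = interleave(paragraphs, figures)
```

In this toy input the result alternates text and image segments: the second paragraph is followed by `f1.png`, and the third by `f2.png` only, because Figure 1 has already been placed.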

Data Construction Pipeline

  1. Caption Recovery: Generate or recover captions for images whose captions are missing;
  2. Text Cleaning: Remove residual tags and standardize formats;
  3. Interleaved Reconstruction: Arrange images and the text that references them in their original order to preserve logical coherence;
  4. LLM Filtering: Screen samples twice, once with a medical relevance classifier and once with a quality classifier.
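The cleaning and filtering steps above can be sketched as follows. This is a sketch under stated assumptions, not the paper's pipeline: the regular expressions, the function names, the 0.5 thresholds, and the two toy scorers are hypothetical stand-ins. In the described pipeline the relevance and quality judgments come from LLM-supervised classifiers, which are replaced here by trivial heuristics so the example runs on its own.

```python
import re

TAG = re.compile(r"<[^>]+>")  # leftover markup such as <italic> or <xref ...>
SPACE = re.compile(r"\s+")


def clean_text(text):
    """Strip residual tags and collapse runs of whitespace."""
    return SPACE.sub(" ", TAG.sub("", text)).strip()


def dedupe(paragraphs):
    """Drop empty and verbatim-repeated paragraphs, keeping first occurrences."""
    seen, kept = set(), []
    for p in paragraphs:
        if p and p not in seen:
            seen.add(p)
            kept.append(p)
    return kept


def passes_filters(text, relevance_score, quality_score,
                   min_relevance=0.5, min_quality=0.5):
    """Double screening: a sample must clear BOTH classifiers to be kept."""
    return (relevance_score(text) >= min_relevance
            and quality_score(text) >= min_quality)


# Toy scorers (assumptions): stand-ins for the LLM-supervised classifiers.
MEDICAL_TERMS = {"ct", "mri", "lesion", "tumor", "patient", "biopsy"}


def toy_relevance(text):
    words = set(re.findall(r"[a-z]+", text.lower()))
    return 1.0 if words & MEDICAL_TERMS else 0.0


def toy_quality(text):
    return 1.0 if len(text.split()) >= 5 else 0.0


raw = 'The <italic>CT</italic> scan   shows <xref rid="f1">Fig. 1</xref>.'
cleaned = clean_text(raw)
keep = passes_filters("The CT scan shows a small lesion.",
                      toy_relevance, toy_quality)
```

Requiring both scores to clear a threshold means a fluent but off-topic passage and an on-topic but fragmentary one are both rejected, which is the point of screening twice.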

Modality Balance Solution

Classify each sample into one of four evidence buckets (visual-dominant, text-dominant, balanced, weakly associated), then apply modality-aware resampling so that no single evidence type dominates the training mix.
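A minimal sketch of bucket-aware resampling is shown below. The bucket names come from the text above; everything else is an assumption, including the `resample` function, the target shares, and the rule of downsampling without replacement and upsampling with replacement. The summary does not state the ratios or sampling rule used in the paper.

```python
import random
from collections import Counter, defaultdict

BUCKETS = ("visual-dominant", "text-dominant", "balanced", "weakly associated")


def resample(samples, target_share, total, seed=0):
    """Draw about `total` samples so each bucket matches its target share.

    samples: list of (sample_id, bucket) pairs.
    target_share: dict mapping bucket name to desired fraction (hypothetical).
    Over-represented buckets are downsampled without replacement;
    under-represented buckets are upsampled with replacement.
    """
    rng = random.Random(seed)
    by_bucket = defaultdict(list)
    for sample_id, bucket in samples:
        by_bucket[bucket].append(sample_id)

    out = []
    for bucket in BUCKETS:
        pool = by_bucket.get(bucket, [])
        quota = round(total * target_share.get(bucket, 0.0))
        if not pool or quota == 0:
            continue
        if quota <= len(pool):
            chosen = rng.sample(pool, quota)
        else:
            chosen = rng.choices(pool, k=quota)
        out.extend((sample_id, bucket) for sample_id in chosen)
    rng.shuffle(out)
    return out


# Toy pool dominated by text-dominant samples.
pool = ([(f"t{i}", "text-dominant") for i in range(80)]
        + [(f"v{i}", "visual-dominant") for i in range(10)]
        + [(f"b{i}", "balanced") for i in range(5)]
        + [(f"w{i}", "weakly associated") for i in range(5)])

# Hypothetical target mix, chosen only for illustration.
shares = {"visual-dominant": 0.40, "text-dominant": 0.30,
          "balanced": 0.25, "weakly associated": 0.05}

balanced_mix = resample(pool, shares, total=100)
counts = Counter(bucket for _, bucket in balanced_mix)
```

With this toy pool, the text-dominant bucket shrinks from 80 samples to 30, while the visual-dominant and balanced buckets are upsampled, so repeated samples appear in those two buckets.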

Section 04

Experimental Validation: Win-Win Results in Quality and Efficiency

Experimental Setup

  • Base model: Qwen3.5-4B-Base;
  • Training pipeline: Continual Pre-training (CPT) followed by Supervised Fine-tuning (SFT);
  • Comparison baseline: Original data source pool.

Key Results

  1. Better Performance with Fewer Tokens: Outperforms training on the original data source pool while using fewer CPT tokens;
  2. Improved Medical Performance: Significant improvements in medical image understanding, terminology usage, and clinical reasoning;
  3. General Performance Preservation: General multimodal capabilities are not degraded;
  4. Complementarity: Data-quality improvements and modality balancing each help, and their gains are complementary.

Section 05

Application Scenarios and Deployment Recommendations

Applicable Scenarios

  • Medical multimodal model training;
  • Medical education (generating teaching materials);
  • Clinical assistance (supporting decision-making systems);
  • Medical research (literature analysis and knowledge mining).

Usage Recommendations

  • CPT phase: Use to build a foundation of medical knowledge;
  • SFT phase: Fine-tune with instruction data;
  • Further filtering: Optimize data according to application scenarios.

Ethical Considerations

  • Privacy protection: Ensure patient information is de-identified;
  • Accuracy: Strictly verify the correctness of medical information;
  • Responsibility boundary: Make clear that the model plays an assistive role only.

Section 06

Limitations and Future Directions

Current Limitations

  1. Language Limitation: Mainly based on English literature;
  2. Modality Limitation: Focuses on image-text data, with insufficient coverage of video, audio, and other modalities;
  3. Domain Coverage: Inadequate coverage of some medical specialties.

Future Directions

  1. Multilingual Expansion: Incorporate medical literature in other languages;
  2. Multimodal Expansion: Integrate data such as pathology slides and genomic data;
  3. Dynamic Updates: Establish a continuous update mechanism;
  4. Fine-grained Annotation: Add detailed medical annotations.

Section 07

Conclusion: A Paradigm for Medical Multimodal Construction Prioritizing Data Quality

PMC-InterCPT is a significant advance in medical multimodal data construction. Through context integration, quality filtering, and modality balancing, it improves both data quality and training efficiency. Core insight: in continual pre-training, data quality matters more than quantity. The four-bucket classification method offers a new approach to modality imbalance and can be extended to other multimodal domains. The dataset serves as an example of high-quality data for the development of medical AI.