Zing Forum

CAST: A Novel Topology Fusion Approach for Core Selection in Multimodal Datasets

To address the challenge of data selection in large-scale multimodal model training, researchers propose the CAST framework. It combines per-modality topology construction, multi-scale distribution matching, and a soft relationship coverage mechanism to select high-information core sets while keeping their distribution equivalent to that of the original data, and it significantly outperforms existing baselines on the Flickr30K and MS-COCO datasets.

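Tags: CAST, multimodal, core set, data selection, topology fusion, diffusion wavelets, distribution matching, cross-modal, dataset optimization
Published 2026-05-12 15:59 | Recent activity 2026-05-13 11:54 | Estimated read 9 min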

Section 01

CAST Framework: A Novel Topology Fusion Approach for Core Selection in Multimodal Datasets (Introduction)

To address the challenge of data selection in large-scale multimodal model training, researchers propose the CAST (Collapse-Aware multi-Scale Topology fusion) framework. It combines per-modality topology construction, multi-scale distribution matching, and a soft relationship coverage mechanism to select high-information core sets while keeping their distribution equivalent to that of the original data, addressing the single-modal bias and distribution shift of existing methods. Experiments show that CAST significantly outperforms existing baselines on the Flickr30K and MS-COCO datasets, with advantages in both performance and efficiency.


Section 02

Background of Multimodal Data Selection and Limitations of Existing Methods

Data Dilemma in Multimodal Model Training

Large-scale multimodal models (e.g., CLIP, LLaVA) rely on massive image-text paired data, and training them is extremely expensive (thousands of GPU hours), which makes dataset selection a key lever for reducing cost.

Dual Limitations of Existing Methods

  1. Single-modal Dominated Sampling Bias: Selection is driven by one modality while cross-modal information imbalance is ignored, leading to semantic loss in the other modality.
  2. Distribution Shift Caused by Coarse-grained Scoring: Coarse scores make it difficult to ensure distribution equivalence between the core set and the original dataset, which hurts model generalization; existing strategies also fail to balance global structure, local details, and redundancy-aware coverage.

Section 03

Three Core Innovations of the CAST Framework

The CAST framework includes three core innovations:

  1. Local Collapse-Aware Cross-modal Topology Fusion: Construct image and text topologies separately, identify and handle local collapse regions, then unify them into a comprehensive topology via cross-modal fusion, preserving key information from both modalities.
  2. Multi-scale Distribution Matching in Diffusion Wavelet Domain: Leverage the multi-scale analysis, geometric structure preservation, and smooth frequency domain decomposition capabilities of diffusion wavelets to ensure the core set is distributionally equivalent to the original data across multiple scales.
  3. Local Soft Relationship Coverage Mechanism: Extend direct coverage to relation-aware indirect coverage, and introduce soft coverage and redundancy penalties to avoid redundancy in dense regions and ensure core set diversity.

Section 04

Experimental Validation: Dual Improvements in Performance and Efficiency

Experiments on the Flickr30K and MS-COCO datasets report:

  • Core Set Quality: Models trained on CAST-selected core sets significantly outperform existing baselines.
  • Cross-Architecture Generalization: The selected core set transfers to different model architectures, which suggests it captures essential information in the data.
  • Energy Efficiency: While maintaining performance, CAST is more energy-efficient than state-of-the-art data synthesis methods.

Section 05

In-depth Technical Details of CAST

Topology Construction Method

Modal topologies are constructed using graph neural networks. For the image modality, a k-nearest neighbor graph is built on visual features; for the text modality, a similar graph is built on language features. In both graphs, edge weights reflect semantic similarity.
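As a rough illustration of the k-nearest neighbor step only (not the graph neural network component, and not CAST's actual code), the sketch below builds a symmetric cosine-similarity k-NN graph with NumPy. The function name `knn_graph`, the use of cosine similarity as the edge weight, the clipping of negative similarities, and the random stand-in features are all assumptions made for this example.

```python
import numpy as np

def knn_graph(features: np.ndarray, k: int) -> np.ndarray:
    """Return a symmetric (n, n) weight matrix of a cosine-similarity k-NN graph.

    Hypothetical helper for illustration; requires k <= n - 1.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops from the neighbour search

    n = sim.shape[0]
    # Indices of the k most similar neighbours of every node (unordered).
    neighbours = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = neighbours.ravel()

    weights = np.zeros((n, n))
    # Assumption: negative similarities are clipped so weights stay non-negative.
    weights[rows, cols] = np.clip(sim[rows, cols], 0.0, None)
    # Symmetrise: keep an edge if either endpoint selected the other.
    return np.maximum(weights, weights.T)

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(8, 16))  # stand-in for visual embeddings
text_feats = rng.normal(size=(8, 16))   # stand-in for language embeddings
W_img = knn_graph(image_feats, k=3)
W_txt = knn_graph(text_feats, k=3)
```

The dense (n, n) similarity matrix keeps the example short; at dataset scale an approximate nearest-neighbor index and a sparse matrix would be the more realistic choice.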

Cross-modal Fusion Strategy

An attention mechanism adaptively adjusts the fusion ratio between the image and text topologies based on each sample's cross-modal alignment quality.
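The summary does not say how the attention scores are computed, so the following is only a sketch of the general idea: a softmax over two per-sample modality scores gives each sample an image share and a text share, and the two adjacency matrices are mixed edge by edge. The function `fuse_topologies`, the `modality_logits` input, and the averaging of the two endpoints' shares are hypothetical choices for this example.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_topologies(W_img: np.ndarray, W_txt: np.ndarray,
                    modality_logits: np.ndarray) -> np.ndarray:
    """Mix two (n, n) graphs with per-sample attention weights.

    modality_logits has shape (n, 2): hypothetical [image, text] scores per
    sample, e.g. derived from how well each image-text pair is aligned.
    """
    alpha_img = softmax(modality_logits, axis=1)[:, 0]  # image share per sample
    # Average the shares of both endpoints so the fused graph stays symmetric.
    edge_alpha = 0.5 * (alpha_img[:, None] + alpha_img[None, :])
    return edge_alpha * W_img + (1.0 - edge_alpha) * W_txt

W_img = np.array([[0.0, 1.0], [1.0, 0.0]])
W_txt = np.array([[0.0, 0.5], [0.5, 0.0]])
# Equal logits mean both modalities are trusted equally for every sample.
fused = fuse_topologies(W_img, W_txt, np.zeros((2, 2)))
```

With equal logits the fused edge weight is the plain average of the two graphs (0.75 here); raising a sample's image logit pulls its edges toward the image topology.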

Diffusion Wavelet Implementation

Diffusion wavelets are defined by simulating heat diffusion on the graph. They therefore adapt to graph structure and avoid the dependence of traditional wavelets on regular grids.
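To make the multi-scale idea concrete, here is a minimal sketch that uses the simplified band-pass form of diffusion wavelets, built from dyadic powers of a lazy random-walk operator T (I - T at the finest scale, then T^(2^(j-1)) - T^(2^j)). This is not the full orthogonalized construction and may differ from what CAST implements. The discrepancy measure, the function names, and the ring graph are assumptions for illustration, and the code assumes every node has at least one edge.

```python
import numpy as np

def lazy_diffusion_operator(W: np.ndarray) -> np.ndarray:
    """Column-stochastic lazy random walk T = (I + W D^-1) / 2."""
    deg = W.sum(axis=0)  # column sums; assumed > 0 for every node
    return 0.5 * (np.eye(W.shape[0]) + W / deg)  # divides column j by deg[j]

def diffusion_wavelets(T: np.ndarray, num_scales: int):
    """Return band-pass filters [I - T, T - T^2, T^2 - T^4, ...] and the low-pass."""
    bank = [np.eye(T.shape[0]) - T]
    power = T                        # holds T^(2^(j-1)) at the start of step j
    for _ in range(1, num_scales):
        next_power = power @ power   # T^(2^j)
        bank.append(power - next_power)
        power = next_power
    return bank, power               # power is the coarsest low-pass filter

def multiscale_discrepancy(W: np.ndarray, subset, num_scales: int = 4):
    """L1 gap between subset and full-data distributions at each scale."""
    n = W.shape[0]
    bank, low_pass = diffusion_wavelets(lazy_diffusion_operator(W), num_scales)
    full = np.full(n, 1.0 / n)             # uniform mass on all samples
    sub = np.zeros(n)
    sub[list(subset)] = 1.0 / len(subset)  # uniform mass on selected samples
    diff = sub - full
    return [float(np.abs(op @ diff).sum()) for op in bank + [low_pass]]

# Ring graph with 6 nodes as a toy topology.
n = 6
W_ring = np.zeros((n, n))
for i in range(n):
    W_ring[i, (i + 1) % n] = W_ring[(i + 1) % n, i] = 1.0

gap_all = multiscale_discrepancy(W_ring, range(n))  # subset equals the full set
gap_one = multiscale_discrepancy(W_ring, [0])       # a single selected sample
```

Because the filters telescope to the identity, a subset whose distribution differs from the full data must show a nonzero gap at some scale, while selecting every sample gives a zero gap at all scales.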

Optimization Algorithm

Core set selection is formulated as a combinatorial optimization problem and solved efficiently with a strategy that combines greedy algorithms and convex relaxation.
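The greedy half of such a strategy can be sketched as follows; the convex relaxation is not shown. The objective is an assumption for this example: soft coverage is modeled in facility-location style (each sample is covered to the degree of its highest similarity to any selected sample), and a redundancy penalty subtracts the candidate's total similarity to samples already chosen. The name `greedy_soft_coverage`, the parameter `redundancy_weight`, and the toy similarity matrix are hypothetical.

```python
import numpy as np

def greedy_soft_coverage(K: np.ndarray, budget: int,
                         redundancy_weight: float = 0.5) -> list:
    """Greedily pick `budget` samples from an (n, n) similarity matrix K in [0, 1]."""
    n = K.shape[0]
    covered = np.zeros(n)      # current soft coverage of every sample
    redundancy = np.zeros(n)   # similarity of each candidate to the selected set
    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # Marginal coverage gain of candidate c: sum_i max(0, K[i, c] - covered[i]).
        gain = np.clip(K - covered[:, None], 0.0, None).sum(axis=0)
        score = gain - redundancy_weight * redundancy
        score[~available] = -np.inf  # never pick the same sample twice
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        covered = np.maximum(covered, K[:, best])
        redundancy += K[:, best]
    return selected

# Toy data: a dense cluster {0, 1, 2} and a smaller cluster {3, 4}.
K = np.eye(5)
K[:3, :3] = 0.9
K[3:, 3:] = 0.9
np.fill_diagonal(K, 1.0)
selected = greedy_soft_coverage(K, budget=2)
```

On this toy matrix the first pick comes from the dense cluster and the second from the smaller one, since another sample from the dense cluster adds little coverage and incurs the redundancy penalty.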


Section 06

Implications of CAST for Multimodal Research and Conclusions

Implications for Multimodal Research

  1. Modal Balance: All modalities need to be considered simultaneously to avoid single-modal dominance.
  2. Distribution Equivalence: The core set must represent the complete distribution of the original data; otherwise, generalization is affected.
  3. Multi-scale Perspective: Semantic information at different scales (from global themes to local details) needs to be captured.
  4. Topological Structure: Topological representations better capture the intrinsic geometry and relationships of data than feature vectors.

Conclusion

CAST addresses key limitations of existing methods, provides a feasible path to balance performance and cost in large-scale multimodal model training, and lays a technical foundation for efficient data selection.


Section 07

Limitations of CAST and Future Research Directions

Limitations

  1. Computational Complexity: Topology construction and multi-scale analysis increase initial selection overhead.
  2. Hyperparameter Sensitivity: Hyperparameters of the diffusion wavelets and the coverage mechanism must be tuned per dataset.
  3. Theoretical Analysis: Theoretical understanding of why this combination of techniques is effective remains limited.

Future Directions

  • Develop more efficient topology construction algorithms.
  • Explore automatic hyperparameter selection methods.
  • Extend to more modalities such as audio and video.
  • Apply to other types of models like generative models.
  • Develop an online version supporting streaming data.