Zing Forum

TIME-IMM Time Series Prediction Framework Reproduction: A Practice of Multimodal Fusion for Multi-source Asynchronous Data

This project reproduces and extends the IMM-TSF benchmark framework from NeurIPS 2025, running time series prediction experiments on EPA-Air, a multi-source asynchronous dataset. It reproduces 7 baseline models, completes three ablation experiments (text encoder selection, architecture family effect, and placebo test), and checks how much multimodal fusion actually helps in irregular time series prediction.

Tags: TIME-IMM, time series prediction, multimodal fusion, irregular data, EPA-Air, IMM-TSF, NeurIPS, RAG, vector embeddings, ablation experiments
Published 2026-05-13 01:39 | Recent activity 2026-05-13 01:52 | Estimated read 8 min

Section 01

[Introduction] Core Summary of Reproduction and Extension Research on the TIME-IMM Time Series Prediction Framework

This project reproduces and extends the IMM-TSF benchmark framework from NeurIPS 2025, conducting time series prediction experiments on the EPA-Air multi-source asynchronous dataset. It successfully reproduces 7 baseline models, completes three ablation experiments (text encoder selection, architecture family effect, and placebo test), verifies the effect of multimodal fusion in irregular time series prediction, and reveals the impact of key factors such as encoder selection and architecture matching on fusion effectiveness.

Section 02

Research Background and Characteristics of the EPA-Air Dataset

Research Background and Challenges: Real-world time series are often irregular, for example because of sensor failures. Multi-source asynchrony, where several data sources are sampled independently of one another, is a typical form of this irregularity. The TIME-IMM dataset and the IMM-TSF framework target this setting.

Characteristics of the EPA-Air Dataset: Drawn from the TIME-IMM benchmark, the dataset covers 4 environmental indicators (AQI, ozone, PM2.5, temperature) across 8 U.S. counties. It holds approximately 49,552 observations spread over 6,587 timestamps, with an average interval of 1.02 hours and a feature observability entropy of 0.3777 (high sparsity), plus 1,244 weather news summaries. The context window and the forecast horizon are both 7 days, and the data is split chronologically into training/validation/test sets at 60%/20%/20%.
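
The chronological 60%/20%/20% split and the 7-day context and horizon windows can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the function names are hypothetical, timestamps are assumed to be in hours, and the split is assumed to cut the covered time span (rather than the observation count) at 60% and 80%.

```python
import numpy as np

HOURS_PER_DAY = 24.0

def chronological_split(timestamps, train_frac=0.6, val_frac=0.2):
    """Return boolean masks (train, val, test) that cut the time span in order.

    Assumption: the cut points are placed at fractions of the covered time
    range, so irregular sampling can put unequal numbers of rows in each part.
    """
    ts = np.asarray(timestamps, dtype=float)
    t0, t1 = ts.min(), ts.max()
    train_end = t0 + train_frac * (t1 - t0)
    val_end = t0 + (train_frac + val_frac) * (t1 - t0)
    train = ts < train_end
    val = (ts >= train_end) & (ts < val_end)
    test = ts >= val_end
    return train, val, test

def sliding_windows(t_start, t_end, context_days=7, horizon_days=7, stride_days=7):
    """List (context_start, forecast_start, forecast_end) triples in hours."""
    context = context_days * HOURS_PER_DAY
    horizon = horizon_days * HOURS_PER_DAY
    stride = stride_days * HOURS_PER_DAY
    windows = []
    t = float(t_start)
    # Keep only windows whose context and horizon both fit inside the range.
    while t + context + horizon <= t_end:
        windows.append((t, t + context, t + context + horizon))
        t += stride
    return windows

# Toy usage with synthetic hourly timestamps (not EPA-Air data).
train, val, test = chronological_split(np.arange(100))
windows = sliding_windows(0.0, 28 * HOURS_PER_DAY)
```

Because observations arrive irregularly, each window is defined by time boundaries rather than by a fixed number of rows; the rows falling inside a window can then be selected with a mask on the timestamp column.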

Section 03

Reproduction Results of Baseline Models

Seven baseline models from the original paper were reproduced. Multimodal fusion reduced the average MSE by approximately 2.1%, below the 6.7% cross-dataset average reported in the original paper. Some models benefited clearly: DLinear (-2.6%), Informer (-8.3%), and TimeMixer (-7.3%). Others got worse with text; TimesNet's MSE, for example, increased by 5.8%, possibly because the text is only weakly aligned semantically with the sensor signals. Attempts to reproduce four further models ran into technical obstacles: for example, TimeLLM required a patch_size adjustment, and CRU failed because of numerical instability in the ODE solver.
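
The percentages above are relative MSE changes between the unimodal and multimodal version of the same model. A minimal sketch of that arithmetic follows; the function names are hypothetical, and averaging the per-model percentages is an assumption, since the write-up does not say how the 2.1% average was aggregated.

```python
def relative_change_pct(mse_unimodal, mse_multimodal):
    """Relative MSE change in percent; negative means text fusion helped."""
    if mse_unimodal <= 0:
        raise ValueError("mse_unimodal must be positive")
    return 100.0 * (mse_multimodal - mse_unimodal) / mse_unimodal

def mean_relative_change(pairs):
    """Average the per-model relative changes (assumed aggregation)."""
    changes = [relative_change_pct(uni, multi) for uni, multi in pairs]
    return sum(changes) / len(changes)

# Hypothetical MSE values, chosen only to show the sign convention.
helped = relative_change_pct(1.0, 0.9)   # about -10: multimodal is better
hurt = relative_change_pct(2.0, 2.2)     # about +10: multimodal is worse
```

Under this convention DLinear's -2.6% is an improvement and TimesNet's +5.8% is a regression.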

Section 04

Ablation Experiment 1: Impact of Text Encoder Selection

The original paper claims that the choice of text encoder has a negligible effect. Comparing GPT-2 and BERT on EPA-Air, this experiment found the opposite: encoder choice moved a single model's result by up to 13.3 percentage points. For the TTM model, text fusion made performance 6.0% worse with GPT-2 but 7.3% better with BERT. This challenges the original paper's conclusion and indicates that encoder selection can be a key factor on specific datasets.
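
The 13.3-point figure is the gap between the two TTM outcomes. The short sketch below makes that explicit, assuming the reported percentages are relative MSE changes (positive meaning worse); the helper name is hypothetical.

```python
# Relative change (%) from adding text to TTM, as reported above.
# Sign convention (assumed): positive = worse, negative = better.
ttm_change_pct = {"GPT-2": 6.0, "BERT": -7.3}

def encoder_swing(change_by_encoder):
    """Spread, in percentage points, between the worst and best encoder."""
    values = list(change_by_encoder.values())
    return max(values) - min(values)

swing = round(encoder_swing(ttm_change_pct), 1)
```

Reporting the spread alongside the per-encoder numbers shows at a glance when a fusion result depends more on the encoder than on the text itself.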

Section 05

Ablation Experiment 2: Correlation Between Architecture Family and Text Fusion Benefits

The benefit of text fusion follows the model's architecture family. Non-patch models (DLinear, Informer, etc.) benefited with both encoders, with an average MSE reduction of about 3%. Patch-based models (PatchTST, TTM, etc.) were encoder-dependent: on average, text fusion made performance 4.6% worse with GPT-2 but 3.0% better with BERT. This suggests that the match between architecture and fusion strategy should be considered when designing multimodal models.

Section 06

Ablation Experiment 3: Placebo Test to Verify Semantic Value

To check whether the benefit of text fusion comes from real semantics, a placebo test replaced the real weather news with random noise text. Of the 7 models, only 3 showed a genuine semantic benefit (real text clearly better than noise), 2 showed a placebo effect (difference below 1%), and 2 performed worse with real text than with noise. Multimodal fusion therefore depends on dataset characteristics and model architecture, and it does not always help.
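
The placebo comparison can be sketched as below. Everything here is illustrative: the noise generator (random lowercase pseudo-words) is only one possible reading of "random noise text", the function names are hypothetical, and the 1% band is reused as the cut-off on both sides, which the write-up states only for the placebo case.

```python
import random
import string

def random_noise_text(n_words, seed=0):
    """Build a string of random lowercase pseudo-words with no meaning."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        length = rng.randint(3, 10)
        words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return " ".join(words)

def classify_placebo(mse_real, mse_noise, tol_pct=1.0):
    """Label a model by comparing MSE with real text against MSE with noise text."""
    diff_pct = 100.0 * (mse_real - mse_noise) / mse_noise
    if abs(diff_pct) < tol_pct:
        return "placebo"               # real and noise text are indistinguishable
    if diff_pct < 0:
        return "semantic benefit"      # real text gives lower error than noise
    return "worse with real text"

# Hypothetical MSE values, used only to exercise the three branches.
labels = [
    classify_placebo(0.90, 1.00),
    classify_placebo(1.005, 1.00),
    classify_placebo(1.10, 1.00),
]
```

Keeping the noise text the same length as the summary it replaces would keep the encoder's input size comparable, so that any gap is attributable to content rather than to sequence length.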

Section 07

Technical Implementation and Research Limitations

Technical Implementation: Experiments ran on an A100 GPU under Google Colab Pro. Because of memory constraints, GPT-2 (768-dimensional) was used in place of the DeepSeek encoder. The TTF/MMF modules used the TTF_RecAvg + MMF_GR_Add combination. Data preprocessing started from the raw Kaggle data, with text embeddings cached on Google Drive. Training hyperparameters were the same for all models: history window 7 days, prediction range 7 days, step size 7 days, batch size 8, learning rate 1e-3, at most 50 epochs, and early stopping with a patience of 10.
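
The shared hyperparameters and the early-stopping rule can be written down as a small, framework-free sketch. The values come from the paragraph above; the class and function names are hypothetical, and the stopping rule (stop after 10 consecutive epochs without a new best validation loss) is the usual reading of "patience", not something confirmed by the project code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    context_days: int = 7        # history window
    horizon_days: int = 7        # prediction range
    stride_days: int = 7         # step size between windows
    batch_size: int = 8
    learning_rate: float = 1e-3
    max_epochs: int = 50
    patience: int = 10           # early-stopping patience

class EarlyStopping:
    """Track the best validation loss and count epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def run(val_losses, cfg=TrainConfig()):
    """Replay a list of validation losses; return (last_epoch, best_loss)."""
    stopper = EarlyStopping(cfg.patience)
    for epoch, loss in enumerate(val_losses[:cfg.max_epochs], start=1):
        if stopper.step(loss):
            return epoch, stopper.best
    return min(len(val_losses), cfg.max_epochs), stopper.best

# Synthetic loss curve: improves twice, then plateaus.
stopped_at, best_loss = run([1.0, 0.9] + [0.95] * 20)
```

With a patience of 10 and a cap of 50 epochs, a run that plateaus after epoch 2 stops at epoch 12, while a run that keeps improving uses all 50 epochs.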

Limitations: The original paper's DeepSeek encoder was not used, and not all TTF/MMF combinations were tested; experiments were only conducted on the EPA-Air dataset, so the generalizability of the conclusions needs further verification.

Section 08

Practical Application Value and Open Source Contributions

Application Value: The findings are relevant to irregular time series scenarios such as IoT monitoring, healthcare, and financial transactions. They can help practitioners choose encoders and architectures to suit the scenario instead of adding text fusion blindly.

Open Source Contributions: The project code and experimental records are open-sourced, including Jupyter Notebooks, extended experiments, and complete logs, providing a foundation for community reproduction and further research.