# TIME-IMM Time Series Prediction Framework Reproduction: A Practice of Multimodal Fusion for Multi-source Asynchronous Data

> This project reproduces and extends the IMM-TSF benchmark framework from NeurIPS 2025 through time series prediction experiments on EPA-Air, a multi-source asynchronous dataset. It reproduces 7 baseline models, completes three ablation experiments (text encoder selection, architecture family effect, and placebo test), and evaluates how much multimodal fusion actually helps in irregular time series prediction.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-12T17:39:03.000Z
- Last activity: 2026-05-12T17:52:37.735Z
- Popularity: 163.8
- Keywords: TIME-IMM, time series prediction, multimodal fusion, irregular data, EPA-Air, IMM-TSF, NeurIPS, RAG, vector embeddings, ablation experiments
- Page link: https://www.zingnex.cn/en/forum/thread/time-imm
- Canonical: https://www.zingnex.cn/forum/thread/time-imm
- Markdown source: floors_fallback

---

## Introduction: Core Summary of the TIME-IMM Reproduction and Extension Study

This project reproduces and extends the IMM-TSF benchmark framework (NeurIPS 2025) on the EPA-Air multi-source asynchronous dataset. Seven baseline models were reproduced and three ablation experiments were run: text encoder selection, architecture family effect, and a placebo test. Together they measure the effect of multimodal fusion in irregular time series prediction and show that encoder choice and the match between architecture and fusion strategy strongly influence whether fusion helps.

## Research Background and Characteristics of the EPA-Air Dataset

**Research Background and Challenges**: Real-world time series are often irregular, for example because of sensor failures. Multi-source asynchrony, where several data sources are sampled independently of one another, is a typical case. The TIME-IMM dataset and the IMM-TSF framework are designed to address such problems.

**Characteristics of the EPA-Air Dataset**: EPA-Air is derived from the TIME-IMM benchmark. It covers 4 environmental indicators (AQI, ozone, PM2.5, temperature) from 8 U.S. counties, with approximately 49,552 observations spread across 6,587 timestamps, an average interval of 1.02 hours, and a feature observability entropy of 0.3777 (high sparsity). It also includes 1,244 weather news summaries. The context window and the prediction horizon are both 7 days, and the data is split chronologically into training/validation/test sets at 60%/20%/20%.
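The chronological 60%/20%/20% split described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `chronological_split` is hypothetical, and it assumes the fractions are applied to the count of unique timestamps (the project may instead cut on calendar boundaries).

```python
from datetime import datetime, timedelta


def chronological_split(timestamps, train_frac=0.6, val_frac=0.2):
    """Split timestamps into train/val/test segments in time order.

    Assumption: fractions apply to the number of unique timestamps.
    Later segments never contain timestamps earlier than previous ones,
    so no future information leaks into training.
    """
    ordered = sorted(set(timestamps))
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical hourly timestamps, used only to exercise the function.
start = datetime(2024, 1, 1)
stamps = [start + timedelta(hours=h) for h in range(1000)]
train, val, test = chronological_split(stamps)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random matters here because the 7-day context and prediction windows would otherwise overlap across the three sets.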

## Reproduction Results of Baseline Models

The project reproduced 7 baseline models from the original paper. On EPA-Air, multimodal fusion reduced the average MSE by approximately 2.1%, less than the 6.7% cross-dataset average reported in the original paper. Some models benefited clearly: DLinear (-2.6%), Informer (-8.3%), and TimeMixer (-7.3%). Others got worse; TimesNet's MSE, for example, increased by 5.8%, possibly because the text is only weakly aligned semantically with the sensor signals. Attempts to reproduce 4 further models ran into technical obstacles; for example, TimeLLM required a `patch_size` adjustment, and CRU failed due to numerical instability of the ODE solver.
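The percentages above are relative MSE changes between a unimodal run and its multimodal counterpart. A minimal sketch of that calculation, assuming the change is measured against the unimodal MSE (the helper name and the MSE values are placeholders, not results from the experiments):

```python
def relative_mse_change(mse_unimodal, mse_multimodal):
    """Percent change in MSE after adding text; negative means improvement."""
    if mse_unimodal <= 0:
        raise ValueError("mse_unimodal must be positive")
    return 100.0 * (mse_multimodal - mse_unimodal) / mse_unimodal


# Placeholder MSE values, chosen only to illustrate the sign convention.
example_improved = relative_mse_change(1.00, 0.95)  # about -5: fusion helped
example_worse = relative_mse_change(1.00, 1.10)     # about +10: fusion hurt
print(round(example_improved, 2), round(example_worse, 2))
```

Note that an "average reduction" can mean either the mean of per-model percentages or the percentage change of the mean MSE. The two generally differ, and the summary does not say which one the 2.1% figure uses.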

## Ablation Experiment 1: Impact of Text Encoder Selection

The original paper claims that the choice of text encoder has a negligible effect. Comparing GPT-2 and BERT on EPA-Air shows otherwise: the fusion effect for a single model shifted by up to 13.3 percentage points depending on the encoder. For the TTM model, text fusion reduced performance by 6.0% with GPT-2 but improved it by 7.3% with BERT. This challenges the original paper's conclusion and suggests that encoder selection can be a key factor on specific datasets.
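The 13.3 percentage point figure follows directly from the two TTM numbers. A short sketch of the arithmetic, using a hypothetical helper `encoder_swing` and the sign convention that positive means fusion helped:

```python
# Text-fusion effect on TTM as reported above, in percent
# (positive = fusion helped, negative = fusion hurt).
ttm_fusion_effect = {"GPT-2": -6.0, "BERT": 7.3}


def encoder_swing(effects):
    """Gap between the best and worst encoder, in percentage points."""
    return max(effects.values()) - min(effects.values())


swing = encoder_swing(ttm_fusion_effect)
print(f"{swing:.1f} percentage points")  # prints "13.3 percentage points"
```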

## Ablation Experiment 2: Correlation Between Architecture Family and Text Fusion Benefits

Text fusion benefits follow a pattern across model architecture families. Non-patch models (DLinear, Informer, etc.) benefited with both encoders, with an average MSE reduction of about 3%. Patch-based models (PatchTST, TTM, etc.) were less consistent: average performance decreased by 4.6% with GPT-2 but increased by 3.0% with BERT. This suggests that the match between architecture and fusion strategy should be considered when designing multimodal models.

## Ablation Experiment 3: Placebo Test to Verify Semantic Value

To check whether the text fusion benefits come from real semantics, a placebo test replaced the real weather news with random noise text. Of the 7 models, only 3 showed a real semantic benefit (real text was significantly better than noise), 2 showed a placebo effect (difference <1%), and 2 performed worse with real text. The effect of multimodal fusion therefore depends on dataset characteristics and model architecture, and is not guaranteed.
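The placebo protocol can be sketched in two parts: generating noise text to stand in for a news summary, and labeling each model by comparing its MSE with real text against its MSE with noise. Everything here is an assumption made for illustration: the function names, the word-count-matched noise generator, and the use of the same 1% threshold for both "placebo" and "significant" (the summary only states the <1% rule for the placebo group).

```python
import random
import string


def make_noise_text(real_text, seed=0):
    """Return random lowercase 'words' with the same word count as real_text."""
    rng = random.Random(seed)
    n_words = len(real_text.split())
    words = []
    for _ in range(n_words):
        length = rng.randint(3, 8)
        words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return " ".join(words)


def classify_placebo(mse_real, mse_noise, tol_pct=1.0):
    """Label a model by how its real-text MSE compares with its noise-text MSE."""
    diff_pct = 100.0 * (mse_noise - mse_real) / mse_noise
    if abs(diff_pct) < tol_pct:
        return "placebo"  # real and noise text are within tol_pct of each other
    return "semantic benefit" if diff_pct > 0 else "worse with real text"


# Placeholder inputs, not values from the experiments.
noise = make_noise_text("Heavy rain expected across the county")
labels = [
    classify_placebo(0.90, 1.00),
    classify_placebo(1.000, 1.005),
    classify_placebo(1.10, 1.00),
]
print(labels)
```

Matching the word count keeps the noise input roughly comparable in length to the real summary, so any difference is less likely to come from sequence length alone.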

## Technical Implementation and Research Limitations

**Technical Implementation**: Experiments were run on an A100 GPU on Google Colab Pro. Due to memory constraints, GPT-2 (768-dimensional) was used in place of the DeepSeek encoder. The TTF/MMF modules used the TTF_RecAvg + MMF_GR_Add combination. Data preprocessing started from the Kaggle raw data, with text embeddings cached on Google Drive. All models shared the same training hyperparameters: history window 7 days, prediction horizon 7 days, step size 7 days, batch size 8, learning rate 1e-3, maximum 50 epochs, early stopping patience 10.
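The shared hyperparameters and the early stopping rule can be written down as a small sketch. The class and function names are hypothetical, and the list of validation losses stands in for a real training loop; only the numeric values come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the hyperparameters listed above.
    history_days: int = 7
    horizon_days: int = 7
    stride_days: int = 7
    batch_size: int = 8
    learning_rate: float = 1e-3
    max_epochs: int = 50
    patience: int = 10


def run_with_early_stopping(val_losses, cfg):
    """Replay per-epoch validation losses and return (best_epoch, stop_epoch).

    Epochs are 1-indexed. Training stops once the validation loss has failed
    to improve for cfg.patience consecutive epochs, or at cfg.max_epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    stale = 0
    epoch = 0
    for epoch, loss in enumerate(val_losses[: cfg.max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= cfg.patience:
                break
    return best_epoch, epoch


cfg = TrainConfig()
# Placeholder loss curve: improves for 3 epochs, then plateaus.
result = run_with_early_stopping([1.0, 0.9, 0.8] + [0.85] * 20, cfg)
print(result)
```

With a patience of 10 and a cap of 50 epochs, a run whose validation loss stops improving early ends well before the cap, which keeps the per-model compute budget comparable across the baselines.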

**Limitations**: The DeepSeek encoder from the original paper was not used, and not all TTF/MMF combinations were tested. Experiments were conducted only on the EPA-Air dataset, so the generalizability of the conclusions needs further verification.

## Practical Application Value and Open Source Contributions

**Application Value**: The findings are relevant to irregular time series scenarios such as IoT monitoring, healthcare, and financial transactions. They can help practitioners choose encoders and architectures to suit the scenario rather than adding text fusion blindly.

**Open Source Contributions**: The project code and experimental records are open-sourced, including Jupyter Notebooks, extended experiments, and complete logs, providing a foundation for community reproduction and further research.
