TS-LLM: A Time Series Data Construction and Multi-Dimensional Evaluation System for Large Language Models

A complete pipeline from time series data to large language model reasoning, covering multi-source time series dataset collection, description generation, time series encoder training, and a three-dimensional evaluation framework based on BLEU, ROUGE, BERTScore, and LLM-as-judge.

Tags: time series · large language models · multimodal learning · time series encoders · PatchTST · Qwen · data description generation · evaluation framework
Published 2026/04/29 12:38 · Last activity 2026/04/29 12:59 · Estimated reading time: 11 minutes

Section 01

TS-LLM Project Overview (Main Guide)

TS-LLM is a complete pipeline from time series data to large language model (LLM) reasoning. It covers multi-source time series dataset collection, description generation, time series encoder training, and a three-dimensional evaluation framework based on BLEU, ROUGE, BERTScore, and LLM-as-judge. The project aims to bridge time series data and LLMs, bringing the semantic understanding and reasoning capabilities of LLMs to intelligent time series analysis.

Section 02

Research Background and Project Purpose

Time series data is ubiquitous in finance, energy, traffic, meteorology, and other domains. Traditional methods are limited to numerical prediction and statistical modeling and lack deep semantic understanding. The strong text understanding and reasoning of LLMs open new possibilities. TS-LLM builds a "time series → text description → LLM reasoning" pipeline: it automates description generation for multi-source time series, trains time series encoders that map numerical sequences into the LLM embedding space, and establishes a three-dimensional evaluation system to assess description quality.

Section 03

System Architecture and Core Components

Data Layer: Collects 7 public time series datasets from different fields (ETT-small, Electricity (ECL), Exchange_Rate, Monash Time Series, NAB, Traffic, Weather). Each dataset has description-generation scripts, which analyze statistical features such as mean, variance, trend, seasonality, and anomalies to produce structured natural-language descriptions (as sketched below), plus visualization scripts. Data screening runs multiple iterations via run_analysis.py, filtering on integrity, sequence length, change amplitude, and similar criteria.
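
The description-generation scripts themselves aren't reproduced in this post; the following is a minimal sketch of the idea, where the function name, thresholds, and phrasing are illustrative stand-ins for what the generate_descriptions scripts actually do:

```python
import numpy as np

def describe_series(values: np.ndarray, name: str = "series") -> str:
    """Turn basic statistical features of a 1-D series into a short
    structured description. Thresholds and wording are illustrative."""
    mean, std = float(values.mean()), float(values.std())
    # Fit a straight line to estimate the overall trend direction.
    slope = np.polyfit(np.arange(len(values)), values, deg=1)[0]
    trend = "upward" if slope > 0 else "downward"
    # Flag points more than 3 standard deviations from the mean.
    n_anomalies = int((np.abs(values - mean) > 3 * std).sum())
    parts = [
        f"The {name} has mean {mean:.2f} and standard deviation {std:.2f}.",
        f"It shows an overall {trend} trend.",
    ]
    if n_anomalies:
        parts.append(f"{n_anomalies} points deviate by more than 3 sigma.")
    return " ".join(parts)

# Example: a noisy upward ramp.
rng = np.random.default_rng(0)
print(describe_series(np.linspace(0, 10, 200) + rng.normal(0, 0.5, 200), "load"))
```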

Model Layer: Implements three time series encoders:

  • CNN 1D: Captures local patterns with sliding convolution kernels.
  • MLP: Maps inputs to the embedding space via fully connected layers; simple but competitive.
  • PatchTST: Splits the time series into patches and uses a Transformer to capture long-range dependencies.

The encoders integrate with Qwen-series models (Qwen2.5-3B-Instruct, Qwen3-0.6B-Instruct-2512, Qwen3-4B-Instruct-2507) under two training modes: frozen (train only the encoder) and full (end-to-end fine-tuning). Multi-GPU parallel training is supported for comparing encoders; a minimal encoder sketch follows.
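
None of the encoder code appears in this post, so the following is only a minimal PatchTST-style sketch: patch the series, embed each patch, apply self-attention, and project into the LLM's hidden size. All names and dimensions (including llm_dim) are illustrative, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Illustrative PatchTST-style encoder: fixed-size patches are embedded
    and fed through a Transformer so self-attention can capture long-range
    dependencies; outputs are projected into the LLM embedding space."""
    def __init__(self, patch_len: int = 16, d_model: int = 128,
                 llm_dim: int = 2048, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) -> (batch, n_patches, patch_len)
        b, t = x.shape
        patches = x[:, : t - t % self.patch_len].reshape(b, -1, self.patch_len)
        h = self.encoder(self.embed(patches))
        # One "soft token" per patch, in the LLM's embedding space.
        return self.proj(h)

# "frozen" mode would train only this encoder while the LLM stays fixed:
# for p in llm.parameters(): p.requires_grad = False
```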

Section 04

Three-Dimensional Evaluation Framework

  1. Reference-based Evaluation: Uses BLEU-4 (n-gram precision), ROUGE-L (longest common subsequence), and BERTScore-F1 (semantic similarity) to measure similarity to reference descriptions generated by strong LLMs. Pros: simple and interpretable; cons: depends on reference quality and may penalize legitimately diverse phrasings. (A metric sketch follows this list.)

  2. LLM-as-judge Evaluation: Uses an LLM to score each description on a 1-5 Likert scale for:

    • Faithfulness: Whether the description accurately reflects original data features (no hallucinations).
    • Completeness: Whether key information is covered (no omissions).
  3. Downstream QA Evaluation: Designs multiple-choice QA tasks with three conditions:

    • meta_only: Answer using only metadata (dataset name, time range).
    • caption: Answer using the generated description.
    • wrong_caption: Answer using a deliberately wrong description, as a control.

Accuracy across the three conditions is compared to quantify the information gain contributed by the generated descriptions.
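
The ts-caption-eval internals aren't shown in this post; here is a sketch of how the reference-based scores could be computed, assuming the widely used sacrebleu, rouge-score, and bert-score packages:

```python
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidate = "The load rises steadily with a clear daily cycle."
reference = "Electricity load shows an upward trend and daily seasonality."

# BLEU-4: modified n-gram precision up to 4-grams (0-100).
bleu = sacrebleu.sentence_bleu(candidate, [reference]).score

# ROUGE-L: longest-common-subsequence F-measure (0-1).
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
    reference, candidate)["rougeL"].fmeasure

# BERTScore-F1: semantic similarity from contextual embeddings (0-1).
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU-4={bleu:.1f}  ROUGE-L={rouge_l:.3f}  BERTScore-F1={f1.item():.3f}")
```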

Section 05

Technical Implementation and Experiment Flow

Key Implementations:

  • Adaptive prompt construction: prompting.py adjusts prompts dynamically based on time series features (e.g., emphasizing the trend for trending data); see the sketch after this list.
  • Multi-API concurrent evaluation: Supports multiple APIs (GLM, Silicon Flow, Doubao) for robust judgments, with logs in API_Test/.
  • Iterative sample optimization: Multi-round filtering of low-quality samples after each training run, saved in Sample/iteration_1~4/; the final run trains on 300k samples in run_300k_20260413.
  • Visualization: Each sample has visualizations (line charts, trend decomposition, anomaly annotations) for manual verification.
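
prompting.py itself isn't quoted in the post; the following is a minimal sketch of the adaptive idea, with invented feature flags, thresholds, and wording:

```python
def build_prompt(features: dict) -> str:
    """Assemble an instruction that emphasizes whichever characteristics
    the series actually exhibits (flags and thresholds are hypothetical)."""
    lines = ["Describe this time series in natural language."]
    if abs(features.get("trend_slope", 0.0)) > 0.01:
        lines.append("Emphasize the direction and strength of the trend.")
    if features.get("seasonal_strength", 0.0) > 0.5:
        lines.append("Mention the dominant seasonal period.")
    if features.get("n_anomalies", 0) > 0:
        lines.append("Point out anomalous segments and when they occur.")
    return " ".join(lines)

print(build_prompt({"trend_slope": 0.08, "n_anomalies": 2}))
```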

Experiment Steps:

  1. Environment setup: Activate virtual env, install dependencies (torch, transformers, etc.).
  2. Generate descriptions: Run scripts like generate_descriptions_ETT.py and viz_ETT_samples_v2.py.
  3. Filter samples: Use generate_filtered_samples.py to get qualified samples.
  4. Train models: Single-GPU (e.g., the CNN encoder) or multi-GPU parallel training for the three encoders.
  5. Generate inference files: infer_for_tscapeval.py (for evaluation) and infer_for_qa.py (for QA tasks).
  6. Run evaluation: Use ts-caption-eval with config files for reference-based, LLM-as-judge, and QA evaluations; a sketch of the QA accuracy comparison follows.
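
Once the outputs of infer_for_qa.py are in hand, the three-condition comparison from Section 04 reduces to an accuracy tally. A sketch under an assumed record schema (the file name and field names here are hypothetical):

```python
import json
from collections import defaultdict

# Assumed JSONL record shape (hypothetical):
# {"condition": "meta_only" | "caption" | "wrong_caption",
#  "predicted": "B", "answer": "B"}
correct, total = defaultdict(int), defaultdict(int)
with open("qa_results.jsonl") as f:
    for line in f:
        r = json.loads(line)
        total[r["condition"]] += 1
        correct[r["condition"]] += r["predicted"] == r["answer"]

acc = {c: correct[c] / total[c] for c in total}
# A useful caption should beat metadata alone, and a wrong caption
# should hurt: acc["caption"] > acc["meta_only"] > acc["wrong_caption"].
print(acc, f'information gain: {acc["caption"] - acc["meta_only"]:+.3f}')
```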

Section 06

Innovation Points and Application Scenarios

Innovation Points:

  • Multi-dimensional complementary evaluation: Combines traditional metrics, LLM-as-judge, and downstream QA to cover different aspects of description quality.
  • Diverse time series encoders: Explores CNN, MLP, PatchTST to adapt to different time series characteristics.
  • Iterative sample optimization: Data-centric approach to improve model performance by enhancing data quality.

Application Scenarios:

  • Intelligent report generation: Auto-generate natural language reports for non-technical users (e.g., power transformer status, exchange rate trends).
  • Multi-modal time series QA system: Answer questions about time series data (useful in finance, equipment monitoring, medical diagnosis).
  • Time series data augmentation: Generate synthetic time series with specific features (trends, seasonality).

Section 07

Limitations and Future Directions

Current Limitations:

  • Description granularity: Focuses on high-level statistical features, lacks fine-grained local pattern descriptions.
  • Cross-domain generalization: Evaluation is mainly on training datasets, cross-domain performance needs verification.
  • Long time series processing: Challenges in encoding and describing ultra-long time series (e.g., years of data).
  • Causal reasoning: Focuses on correlation, lacks causal relationship modeling.

Future Directions:

  • Introduce more encoders (TimesNet, N-BEATS).
  • Multi-scale description generation: Generate both high-level overview and fine-grained local analysis.
  • Interactive exploration: Support natural language queries for time series data.
  • Integrate causal discovery: Generate descriptions with causal explanations.

Section 08

Project Summary

TS-LLM is a systematic project combining time series data and LLMs, covering data collection, description generation, model training, and multi-dimensional evaluation. Its core values include multi-source dataset integration, encoder architecture comparison, a three-dimensional evaluation system, and open-source reproducibility. It provides a valuable reference and experimental platform for researchers and engineers working on time series analysis, multi-modal LLMs, and natural language generation applications.