Zing Forum

TS-LLM: A Time-Series Data Construction and Multi-Dimensional Evaluation System for Large Language Models

A complete pipeline that carries time-series data into large language model (LLM) reasoning, comprising multi-source time-series dataset collection, description generation, time-series encoder training, and a three-dimensional evaluation framework based on BLEU, ROUGE, BERTScore, and LLM-as-judge.

Tags: time series · large language models · multimodal learning · time-series encoders · PatchTST · Qwen · data description generation · evaluation framework
Published 2026-04-29 12:38 · Recent activity 2026-04-29 12:59 · Estimated read 11 min

Section 01

TS-LLM Project Overview (Main Guide)

TS-LLM is a complete pipeline that carries time-series data into large language model (LLM) reasoning. It includes multi-source time-series dataset collection, description generation, time-series encoder training, and a three-dimensional evaluation framework based on BLEU, ROUGE, BERTScore, and LLM-as-judge. The project aims to bridge time-series data and LLMs, enabling intelligent analysis of time series with the semantic understanding and reasoning capabilities of LLMs.


Section 02

Research Background and Project Purpose

Time-series data is ubiquitous in finance, energy, traffic, meteorology, and other fields. Traditional methods are limited to numerical prediction and statistical modeling and fail to capture deep semantics. The strong text understanding and reasoning abilities of LLMs open new possibilities. TS-LLM builds a "time series → text description → LLM reasoning" pipeline: it automates description generation for multi-source time series, trains time-series encoders that map numerical sequences into the LLM embedding space, and establishes a three-dimensional evaluation system to assess description quality.
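The three-stage pipeline shape can be sketched as composable placeholders. This is a minimal illustration only; all function names and bodies here are hypothetical, not the project's code:

```python
def encode(series):
    """Stage 1 placeholder: map raw numbers to features/embeddings."""
    return {"mean": sum(series) / len(series), "n": len(series)}

def to_text(features):
    """Stage 2 placeholder: turn features into a textual description."""
    return f"A series of {features['n']} points with mean {features['mean']:.1f}."

def llm_reason(description, question):
    """Stage 3 placeholder: an LLM would consume description + question;
    here we just assemble the prompt that would be sent."""
    return f"Context: {description}\nQuestion: {question}"

prompt = llm_reason(to_text(encode([1.0, 2.0, 3.0])), "Is the series trending?")
print(prompt)
```

In the real system, stage 1 is a trained encoder, stage 2 a generation script per dataset, and stage 3 a Qwen-series model.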


Section 03

System Architecture and Core Components

Data Layer: Collects 7 public time-series datasets from different fields (ETT-small, Electricity (ECL), Exchange_Rate, Monash Time Series, NAB, Traffic, Weather). Each dataset has description-generation scripts (which analyze statistical features such as mean, variance, trend, seasonality, and anomalies to produce structured natural-language descriptions) and visualization scripts. Data screening runs as multi-round iteration via run_analysis.py, based on criteria such as integrity, sequence length, and change amplitude.
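The feature-to-description step can be sketched as follows; thresholds and wording are illustrative assumptions, the project's per-dataset scripts are more elaborate:

```python
import numpy as np

def describe_series(values, name="series"):
    """Summarize basic statistical features of a 1-D time series and
    render them as a short structured description (illustrative sketch)."""
    x = np.asarray(values, dtype=float)
    mean, var = x.mean(), x.var()
    # Trend: sign of the slope of a least-squares line fit.
    slope = np.polyfit(np.arange(len(x)), x, 1)[0]
    trend = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
    # Anomalies: points more than 3 standard deviations from the mean.
    std = x.std()
    n_anom = int(np.sum(np.abs(x - mean) > 3 * std)) if std > 0 else 0
    return (f"{name}: mean={mean:.2f}, variance={var:.2f}, "
            f"{trend} trend, {n_anom} anomalous point(s).")

print(describe_series([1, 2, 3, 4, 50], name="demo"))
# → demo: mean=12.00, variance=362.00, upward trend, 0 anomalous point(s).
```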

Model Layer: Implements three time series encoders:

  • CNN 1D: Captures local patterns with sliding convolution kernels.
  • MLP: Maps to embedding space via fully connected layers, simple but competitive.
  • PatchTST: Splits time series into patches and uses Transformer to capture long-distance dependencies.

Integrates with Qwen-series models (Qwen2.5-3B-Instruct, Qwen3-0.6B-Instruct-2512, Qwen3-4B-Instruct-2507) in two training modes: frozen (train only the encoder) and full (end-to-end fine-tuning). Multi-GPU parallel training is supported for encoder comparison.
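The first stage of the patch-based encoding idea can be sketched like this. Shapes and the random projection are purely illustrative; a PatchTST-style encoder learns the projection and stacks Transformer layers on top:

```python
import numpy as np

def patch_embed(series, patch_len=16, d_model=64, rng=None):
    """Split a 1-D series into non-overlapping patches and linearly
    project each patch into a d_model-dimensional embedding space.
    The projection matrix is random here for illustration; in
    training it would be learned."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(series, dtype=float)
    n_patches = len(x) // patch_len
    patches = x[: n_patches * patch_len].reshape(n_patches, patch_len)
    W = rng.standard_normal((patch_len, d_model)) / np.sqrt(patch_len)
    return patches @ W  # shape: (n_patches, d_model)

tokens = patch_embed(np.sin(np.linspace(0, 10, 128)))
print(tokens.shape)  # → (8, 64)
```

Each resulting patch token can then be fed to the LLM alongside text tokens, which is the mapping into the LLM embedding space described above.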


Section 04

Three-Dimensional Evaluation Framework

  1. Reference-based Evaluation: Uses BLEU-4 (n-gram precision), ROUGE-L (longest common subsequence), and BERTScore-F1 (semantic similarity) to measure similarity with reference descriptions (generated by strong LLMs). Pros: simple and interpretable; Cons: depends on reference quality and may penalize legitimately diverse phrasings.

  2. LLM-as-judge Evaluation: Uses an LLM to score each description on a 1-5 Likert scale along two axes:

    • Faithfulness: Whether the description accurately reflects the original data's features (no hallucinations).
    • Completeness: Whether all key information is covered (no omissions).

  3. Downstream QA Evaluation: Designs multiple-choice QA tasks with three conditions:

    • meta_only: Use metadata (dataset name, time range) to answer.
    • caption: Use generated description to answer.
    • wrong_caption: Use wrong description as control.

Compares accuracy to quantify information gain of generated descriptions.
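The information gain can be read off directly from per-condition accuracies; a minimal sketch, with hypothetical correctness flags:

```python
def qa_information_gain(results):
    """Given {condition: [correctness flags]}, compute accuracy per
    condition and the caption's information gain over metadata alone.
    Conditions follow the three-way setup: meta_only, caption, wrong_caption."""
    acc = {c: sum(v) / len(v) for c, v in results.items()}
    gain = acc["caption"] - acc["meta_only"]
    return acc, gain

# Hypothetical per-question correctness under each condition.
results = {
    "meta_only":     [1, 0, 0, 1, 0],
    "caption":       [1, 1, 1, 1, 0],
    "wrong_caption": [0, 0, 1, 0, 0],
}
acc, gain = qa_information_gain(results)
print(acc, gain)  # gain = 0.8 - 0.4 = 0.4
```

A positive gain means the generated description carries information beyond the metadata; the wrong_caption condition should score at or below meta_only, confirming the model actually uses the description.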


Section 05

Technical Implementation and Experiment Flow

Key Implementations:

  • Adaptive prompt construction: prompting.py adjusts prompts dynamically based on time series features (e.g., emphasize trend for trending data).
  • Multi-API concurrent evaluation: Supports multiple APIs (GLM, Silicon Flow, Doubao) for robust judgment, logs in API_Test/.
  • Iterative sample optimization: Multi-round filtering removes low-quality samples after each training run; iterations are saved in Sample/iteration_1~4/, and the final training uses 300k samples in run_300k_20260413.
  • Visualization: Each sample has visualizations (line charts, trend decomposition, anomaly annotation) for manual verification.
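The adaptive-prompt idea can be sketched as a rule table over detected features; thresholds, feature names, and wording here are illustrative assumptions, and prompting.py's actual rules are richer:

```python
def build_prompt(features):
    """Assemble a description prompt whose emphasis depends on the
    series' detected features (illustrative rules and thresholds)."""
    parts = ["Describe this time series in natural language."]
    if abs(features.get("trend_slope", 0.0)) > 0.1:
        parts.append("Pay special attention to the overall trend.")
    if features.get("seasonal_strength", 0.0) > 0.5:
        parts.append("Mention the seasonal pattern and its period.")
    if features.get("n_anomalies", 0) > 0:
        parts.append("Point out the anomalous segments explicitly.")
    return " ".join(parts)

print(build_prompt({"trend_slope": 0.3, "n_anomalies": 2}))
```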

Experiment Steps:

  1. Environment setup: Activate virtual env, install dependencies (torch, transformers, etc.).
  2. Generate descriptions: Run scripts like generate_descriptions_ETT.py and viz_ETT_samples_v2.py.
  3. Filter samples: Use generate_filtered_samples.py to get qualified samples.
  4. Train models: Single-card (e.g., CNN encoder) or multi-card parallel training for three encoders.
  5. Generate inference files: infer_for_tscapeval.py (for evaluation) and infer_for_qa.py (for QA tasks).
  6. Run evaluation: Use ts-caption-eval with config files for reference-based, LLM-as-judge, and QA evaluations.

Section 06

Innovation Points and Application Scenarios

Innovation Points:

  • Multi-dimensional complementary evaluation: Combines traditional metrics, LLM-as-judge, and downstream QA to cover different aspects of description quality.
  • Diverse time series encoders: Explores CNN, MLP, PatchTST to adapt to different time series characteristics.
  • Iterative sample optimization: Data-centric approach to improve model performance by enhancing data quality.

Application Scenarios:

  • Intelligent report generation: Auto-generate natural language reports for non-technical users (e.g., power transformer status, exchange rate trends).
  • Multi-modal time series QA system: Answer questions about time series data (useful in finance, equipment monitoring, medical diagnosis).
  • Time series data augmentation: Generate synthetic time series with specific features (trends, seasonality).

Section 07

Limitations and Future Directions

Current Limitations:

  • Description granularity: Focuses on high-level statistical features, lacks fine-grained local pattern descriptions.
  • Cross-domain generalization: Evaluation is mainly on training datasets, cross-domain performance needs verification.
  • Long time series processing: Challenges in encoding and describing ultra-long time series (e.g., years of data).
  • Causal reasoning: Focuses on correlation, lacks causal relationship modeling.

Future Directions:

  • Introduce more encoders (TimesNet, N-BEATS).
  • Multi-scale description generation: Generate both high-level overview and fine-grained local analysis.
  • Interactive exploration: Support natural language queries for time series data.
  • Integrate causal discovery: Generate descriptions with causal explanations.

Section 08

Project Summary

TS-LLM is a systematic project combining time-series data and LLMs, covering data collection, description generation, model training, and multi-dimensional evaluation. Its core contributions are multi-source dataset integration, encoder-architecture comparison, a three-dimensional evaluation system, and open-source reproducibility. It provides a valuable reference and experimental platform for researchers and engineers working on time-series analysis, multimodal LLMs, and natural-language generation.