# TS-LLM: A Time-Series Data Construction and Multi-Dimensional Evaluation System for Large Language Models

> A complete pipeline from time-series data to large language model (LLM) reasoning, covering multi-source time-series dataset collection, description generation, time-series encoder training, and a three-dimensional evaluation framework based on BLEU, ROUGE, BERTScore, and LLM-as-judge.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-29T04:38:49.000Z
- Last activity: 2026-04-29T04:59:12.695Z
- Popularity: 159.7
- Keywords: time series, large language models, multimodal learning, time-series encoders, PatchTST, Qwen, data description generation, evaluation frameworks
- Page link: https://www.zingnex.cn/en/forum/thread/ts-llm
- Canonical: https://www.zingnex.cn/forum/thread/ts-llm
- Markdown source: floors_fallback

---

## TS-LLM Project Overview (Main Guide)

TS-LLM is a complete pipeline from time series data to large language model (LLM) reasoning. It includes multi-source time series dataset collection, description generation, time series encoder training, and a three-dimensional evaluation framework based on BLEU, ROUGE, BERTScore, and LLM-as-judge. The project aims to bridge time series data and LLMs, bringing the semantic understanding and reasoning capabilities of LLMs to intelligent time series analysis.

## Research Background and Project Purpose

Time series data is ubiquitous in finance, energy, transportation, meteorology, and other domains. Traditional methods focus on numerical prediction and statistical modeling and struggle to capture deep semantics, while the strong text understanding and reasoning of LLMs opens new possibilities. TS-LLM builds a "time series → text description → LLM reasoning" pipeline: it automates description generation for multi-source time series, trains time series encoders that map numerical sequences into the LLM embedding space, and establishes a three-dimensional evaluation system to assess description quality.

## System Architecture and Core Components

**Data Layer**: Collects 7 public time series datasets from different fields (ETT-small, ElectricityECL, Exchange_Rate, Monash Time Series, NAB, Traffic, Weather). Each dataset has description generation scripts (which analyze statistical features such as mean, variance, trend, seasonality, and anomalies to produce structured natural language descriptions) and visualization scripts. Samples are screened over multiple rounds via `run_analysis.py`, filtering on completeness, sequence length, change amplitude, and other criteria.
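The core of each description generation script is turning statistical features into a sentence. A minimal sketch of that idea (the feature set, thresholds, and wording here are assumptions, not the project's actual scripts):

```python
import numpy as np

def describe_series(values, name="series"):
    """Summarize basic statistical features of a 1-D time series as text.

    Hypothetical sketch of the analysis the per-dataset description
    scripts perform; feature set and phrasing are assumptions.
    """
    x = np.asarray(values, dtype=float)
    mean, std = x.mean(), x.std()
    # Linear trend via the least-squares slope over the index.
    slope = np.polyfit(np.arange(len(x)), x, 1)[0]
    trend = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
    # Flag points more than 3 standard deviations from the mean as anomalies.
    anomalies = int(np.sum(np.abs(x - mean) > 3 * std)) if std > 0 else 0
    return (
        f"{name}: {len(x)} points, mean {mean:.2f}, std {std:.2f}, "
        f"{trend} trend, {anomalies} anomalous point(s)."
    )
```

Richer scripts would add seasonality detection and change-point analysis on top of this skeleton.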

**Model Layer**: Implements three time series encoders:
- CNN 1D: Captures local patterns with sliding convolution kernels.
- MLP: Maps to embedding space via fully connected layers, simple but competitive.
- PatchTST: Splits the time series into patches and uses a Transformer to capture long-range dependencies.
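The first step of the PatchTST-style encoder, splitting the series into overlapping patches that then serve as Transformer tokens, can be sketched as follows (the patch length and stride are assumed defaults, not values stated in the writeup):

```python
import numpy as np

def make_patches(series, patch_len=16, stride=8):
    """Split a 1-D series into overlapping patches (PatchTST-style).

    Each row of the result is one patch; downstream, each patch is
    projected to a token embedding and fed to a Transformer.
    Sketch under assumed hyperparameters.
    """
    x = np.asarray(series, dtype=float)
    n_patches = (len(x) - patch_len) // stride + 1
    return np.stack(
        [x[i * stride : i * stride + patch_len] for i in range(n_patches)]
    )
```

For a length-96 window with these defaults this yields 11 patches of 16 points each; patching shortens the Transformer's input sequence while preserving local structure inside each token.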

The encoders are integrated with Qwen series models (Qwen2.5-3B-Instruct, Qwen3-0.6B-Instruct-2512, Qwen3-4B-Instruct-2507) under two training modes: frozen (train only the encoder; LLM weights stay fixed) and full (end-to-end fine-tuning). Multi-GPU parallel training is supported for comparing the three encoders.
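The difference between the two training modes comes down to which parameters receive gradients. A minimal sketch of how that switch might be implemented (an assumed setup, not the project's actual training script):

```python
import torch.nn as nn

def set_training_mode(encoder, llm, mode="frozen"):
    """Configure which parameters receive gradients.

    'frozen' trains only the time-series encoder while the LLM weights
    stay fixed; 'full' fine-tunes the LLM end-to-end as well.
    Returns the trainable parameters to hand to the optimizer.
    """
    for p in encoder.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (mode == "full")
    return [p for m in (encoder, llm)
            for p in m.parameters() if p.requires_grad]
```

In frozen mode the optimizer sees only the encoder's parameters, which keeps memory and compute low; full mode trades that efficiency for end-to-end adaptation of the LLM.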

## Three-Dimensional Evaluation Framework

1. **Reference-based Evaluation**: Uses BLEU-4 (n-gram precision), ROUGE-L (longest common subsequence), BERTScore-F1 (semantic similarity) to measure similarity with reference descriptions (generated by strong LLMs). Pros: simple, interpretable; Cons: depends on reference quality, may penalize diverse expressions.

2. **LLM-as-judge Evaluation**: Uses LLMs to score on 1-5 Likert scale for:
   - Faithfulness: Whether the description accurately reflects original data features (no hallucinations).
   - Completeness: Whether key information is covered (no omissions).

3. **Downstream QA Evaluation**: Designs multiple-choice QA tasks with three conditions:
   - meta_only: Use metadata (dataset name, time range) to answer.
   - caption: Use generated description to answer.
   - wrong_caption: Use wrong description as control.

Accuracy is compared across the three conditions to quantify the information gain contributed by the generated descriptions.
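Of the reference-based metrics, ROUGE-L is the easiest to write out from first principles. A minimal reference implementation via longest common subsequence (for illustration only; the project presumably uses a standard evaluation library):

```python
def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two token lists via longest common subsequence."""
    c, r = candidate, reference
    # Dynamic-programming LCS table: dp[i][j] = LCS length of c[:i], r[:j].
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
            )
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

Because LCS rewards in-order overlap rather than exact n-gram matches, ROUGE-L is somewhat more tolerant of rephrasing than BLEU-4, though both still penalize legitimate expression diversity, which is exactly why the framework adds LLM-as-judge and downstream QA.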

## Technical Implementation and Experiment Flow

**Key Implementations**: 
- Adaptive prompt construction: `prompting.py` adjusts prompts dynamically based on time series features (e.g., emphasize trend for trending data).
- Multi-API concurrent evaluation: Supports multiple APIs (GLM, Silicon Flow, Doubao) for robust judgment, logs in `API_Test/`.
- Iterative sample optimization: Multi-round filtering of low-quality samples after each training, saved in `Sample/iteration_1~4/`, final training uses 300k samples in `run_300k_20260413`.
- Visualization: Each sample has visualizations (line charts, trend decomposition, anomaly annotation) for manual verification.
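The adaptive prompt construction in the first point can be sketched as a simple feature-conditioned template (the feature keys, thresholds, and wording are hypothetical; `prompting.py` itself is not shown in this writeup):

```python
def build_prompt(features):
    """Assemble an instruction prompt adapted to detected series features.

    Hypothetical sketch of the idea behind `prompting.py`: the base
    instruction is extended with emphasis clauses only for features
    the series actually exhibits.
    """
    parts = ["Describe this time series in natural language."]
    if abs(features.get("trend_slope", 0.0)) > 0.1:
        parts.append("Emphasize the overall trend and its direction.")
    if features.get("seasonal", False):
        parts.append("Mention the seasonal pattern and its period.")
    if features.get("n_anomalies", 0) > 0:
        parts.append("Point out the anomalous segments explicitly.")
    return " ".join(parts)
```

Conditioning the prompt on detected features keeps the instruction short for plain series while steering the model toward the salient behavior of trending, seasonal, or anomalous ones.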

**Experiment Steps**: 
1. Environment setup: Activate virtual env, install dependencies (torch, transformers, etc.).
2. Generate descriptions: Run scripts like `generate_descriptions_ETT.py` and `viz_ETT_samples_v2.py`.
3. Filter samples: Use `generate_filtered_samples.py` to get qualified samples.
4. Train models: Single-GPU training (e.g., the CNN encoder) or multi-GPU parallel training for all three encoders.
5. Generate inference files: `infer_for_tscapeval.py` (for evaluation) and `infer_for_qa.py` (for QA tasks).
6. Run evaluation: Use `ts-caption-eval` with config files for reference-based, LLM-as-judge, and QA evaluations.
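The QA evaluation in step 6 reduces to comparing answer accuracy across the three conditions. A minimal sketch of that comparison (the project's exact definition of information gain is not given, so defining it as caption accuracy minus meta_only accuracy is an assumption):

```python
def qa_information_gain(results):
    """Per-condition accuracy and the caption's information gain.

    `results` maps a condition name ("meta_only", "caption",
    "wrong_caption") to a list of (predicted, gold) answer pairs.
    Gain is defined here as caption accuracy minus meta_only accuracy.
    """
    acc = {
        cond: sum(p == g for p, g in pairs) / len(pairs)
        for cond, pairs in results.items()
    }
    return acc, acc["caption"] - acc["meta_only"]
```

A large positive gain indicates the generated description carries information beyond the metadata, while wrong_caption accuracy near or below meta_only confirms the model actually reads the caption rather than guessing from metadata alone.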

## Innovation Points and Application Scenarios

**Innovation Points**: 
- Multi-dimensional complementary evaluation: Combines traditional metrics, LLM-as-judge, and downstream QA to cover different aspects of description quality.
- Diverse time series encoders: Explores CNN, MLP, PatchTST to adapt to different time series characteristics.
- Iterative sample optimization: Data-centric approach to improve model performance by enhancing data quality.

**Application Scenarios**: 
- Intelligent report generation: Auto-generate natural language reports for non-technical users (e.g., power transformer status, exchange rate trends).
- Multi-modal time series QA system: Answer questions about time series data (useful in finance, equipment monitoring, medical diagnosis).
- Time series data augmentation: Generate synthetic time series with specific features (trends, seasonality).

## Limitations and Future Directions

**Current Limitations**: 
- Description granularity: Focuses on high-level statistical features, lacks fine-grained local pattern descriptions.
- Cross-domain generalization: Evaluation is mainly on training datasets, cross-domain performance needs verification.
- Long time series processing: Challenges in encoding and describing ultra-long time series (e.g., years of data).
- Causal reasoning: Focuses on correlation, lacks causal relationship modeling.

**Future Directions**: 
- Introduce more encoders (TimesNet, N-BEATS).
- Multi-scale description generation: Generate both high-level overview and fine-grained local analysis.
- Interactive exploration: Support natural language queries for time series data.
- Integrate causal discovery: Generate descriptions with causal explanations.

## Project Summary

TS-LLM is a systematic project combining time series data and LLMs, covering data collection, description generation, model training, and multi-dimensional evaluation. Its core values include multi-source dataset integration, encoder architecture comparison, three-dimensional evaluation system, and open-source reproducibility. It provides a valuable reference and experimental platform for researchers and engineers in time series analysis, multi-modal LLMs, and natural language generation applications.
