Zing 论坛

正文

ChatSR:面向符号回归的多模态大语言模型

ChatSR是符号回归领域的首个多模态大语言模型,通过Set Transformer编码科学数据,生成描述数据规律的数学表达式先序遍历,支持BFGS优化常数项并计算拟合度R²。

symbolic regressionmultimodal LLMscientific discoverySet TransformerBFGS optimizationmathematical expressionQwen开源
发布时间 2026/06/12 14:46最近活动 2026/06/12 14:54预计阅读 9 分钟
ChatSR:面向符号回归的多模态大语言模型
1

章节 01

ChatSR: The First Multimodal LLM for Symbolic Regression (Core Overview)

ChatSR: A Scientific Multimodal Large Language Model for Discovering Formulas from Scientific Data

ChatSR is the first multimodal large language model in the symbolic regression field. It encodes scientific data using Set Transformer, generates preorder traversal of mathematical expressions describing data patterns, supports BFGS optimization for constant terms, and calculates the fitting degree R².

This project innovatively applies multimodal large language models to symbolic regression, leveraging their sequence generation capability to directly output structured representations of mathematical expressions, providing a new AI-driven tool for scientific discovery.

2

章节 02

Background: Symbolic Regression & The Need for ChatSR

Background: Symbolic Regression & The Need for ChatSR

Symbolic regression is a fundamental task in machine learning, aiming to discover mathematical expressions describing relationships between variables from data. Unlike black-box neural networks, it produces interpretable white-box models (human-understandable formulas). Traditional methods like genetic programming are effective but have high computational costs and struggle with complex data.

ChatSR fills this gap by introducing multimodal LLMs into symbolic regression, offering a new approach for scientific discovery.

3

章节 03

Technical Architecture of ChatSR

Technical Architecture of ChatSR

Set Transformer Data Encoding

ChatSR uses Set Transformer to encode numerical data points, which can handle unordered data sets—critical for scientific experimental data (often unordered).

Math Special Token System

It introduces special math tokens: operator tokens (e.g., <|math_add|>, <|math_sin|>), variable tokens (e.g., <|math_x1|>), and constant tokens (e.g., <|math_C|>) for structured expression generation.

Preorder Traversal Representation

Mathematical expressions are output as preorder traversal sequences (e.g., x1 + x2 becomes <|math_add|>,<|math_x1|>,<|math_x2|>), which are unambiguous, easy to parse, and compatible with autoregressive generation of LLMs.

4

章节 04

Key Functional Features of ChatSR

Key Functional Features

Multimodal Data Processing

ChatSR handles both numerical data (format [x1, x2, ..., y]) and text prompts, learning relationships between input features and target variables to generate expressions.

Distributed Training Support

It supports distributed training via HuggingFace Trainer with FSDP, enabling large-scale dataset training (configurable for single-machine multi-card or cluster).

Interactive Tools

  • interactive_inference_json_AAAA.py: Debugging script for loss, first token probability, and generation behavior.
  • interactive_inference_json_bfgs.py: Full inference flow (expression recovery, BFGS optimization, R² calculation).

BFGS Constant Optimization

Constants in generated expressions are optimized using BFGS to minimize MSE, with R² as the fitting metric.

5

章节 05

Data Preparation & Training Process

Data Preparation & Training

Data Format

Training samples are JSON-formatted, including fields like id, conversations (human-machine dialogue), data_points, expression_tokens, and standard_tokens.

Data Generation

The data_gen_vary.py script generates synthetic data with parameters like num_samples, max_length, max_vars, and max_dims, splitting into train/val/test sets.

Training Setup

  • Dependencies: Python3.10, PyTorch, Transformers, Accelerate, SciPy.
  • Token Extension: expend_tokens.py extends the vocabulary with math tokens (required before training).
  • Config: Disables word embedding sharing (tie_word_embeddings=False) for independent training of math token weights.
  • Distributed Training: Uses train_symbolic_regression_distributed_fixed.py with FSDP for multi-card training.
6

章节 06

Inference Flow & Evaluation Metrics

Inference & Evaluation

Inference Flow

  1. Model generates preorder traversal sequence from input data.
  2. Parse the sequence into an expression tree.
  3. Recover the mathematical expression.
  4. Optimize constants via BFGS.
  5. Evaluate using MSE and R².

Metrics

  • MSE: Measures average squared difference between predictions and true values.
  • : Measures the proportion of variance explained by the model (0 to 1; closer to 1 means better fit).
7

章节 07

Application Scenarios of ChatSR

Application Scenarios

Physics Law Discovery

Automatically discover physical laws from experimental data (e.g., Kepler's laws from planetary orbit data).

Engineering Modeling

Build empirical formulas from observational data in fields like materials science, fluid dynamics, and chemical kinetics (where theoretical models are lacking).

Data-Driven Scientific Discovery

Handle high-dimensional data to find hidden variable relationships, leveraging multimodal input (numerical + text).

Education Assistance

Help students understand data-math expression relationships as a tool for science computing courses.

8

章节 08

Limitations & Future Directions of ChatSR

Limitations & Future Directions

Current Limitations

  • Limited set of supported mathematical operators.
  • Need to improve handling of complex nested expressions.
  • Constant optimization may get stuck in local optima.

Future Plans

  • Expand supported math functions.
  • Introduce constraint-guided expression generation.
  • Combine with neural networks for hybrid modeling.
  • Enhance multimodal input (e.g., images, time-series data).