Zing Forum

ChatSR: A Multimodal Large Language Model for Symbolic Regression

ChatSR is described as the first multimodal large language model for symbolic regression. It encodes scientific data with a Set Transformer, generates the preorder traversal of a mathematical expression that describes the pattern in the data, optimizes constant terms with BFGS, and reports the goodness of fit as R².

Tags: symbolic regression, multimodal LLM, scientific discovery, Set Transformer, BFGS optimization, mathematical expression, Qwen, open source
Published 2026-06-12 14:46 | Recent activity 2026-06-12 14:54 | Estimated read 9 min

Section 01

ChatSR: The First Multimodal LLM for Symbolic Regression (Core Overview)

ChatSR: A Scientific Multimodal Large Language Model for Discovering Formulas from Scientific Data

The project applies a multimodal large language model to symbolic regression, using its sequence-generation ability to emit a structured representation of a mathematical expression directly. This gives scientific discovery a new AI-driven tool.

Section 02

Background: Symbolic Regression & The Need for ChatSR

Symbolic regression is a fundamental task in machine learning, aiming to discover mathematical expressions describing relationships between variables from data. Unlike black-box neural networks, it produces interpretable white-box models (human-understandable formulas). Traditional methods like genetic programming are effective but have high computational costs and struggle with complex data.

ChatSR targets this gap by bringing multimodal LLMs to symbolic regression, offering an alternative route to scientific discovery.

Section 03

Technical Architecture of ChatSR

Set Transformer Data Encoding

ChatSR encodes the numerical data points with a Set Transformer, an architecture whose output does not depend on the order of its inputs. This matters for scientific experimental data, which often has no inherent ordering.
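
The order-independence property can be illustrated with a much simpler stand-in. The sketch below is not the Set Transformer itself: it is a single attention-pooling step against one fixed query vector, written in plain Python, and the data points and query values are made up for illustration. It only shows why an attention-based pooling of a set gives the same result however the points are ordered.

```python
import math

def attention_pool(points, query):
    """Pool a set of points into one vector using softmax attention against a fixed query."""
    scores = [sum(q * p for q, p in zip(query, point)) for point in points]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * point[d] for w, point in zip(weights, points)) / total
            for d in range(dim)]

# Hypothetical data points in (x1, y) form, sampled from y = 2 * x1 + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
query = (0.1, -0.2)  # stands in for a learned query vector

pooled_original = attention_pool(points, query)
pooled_reordered = attention_pool(list(reversed(points)), query)
```

Both calls return the same pooled vector up to floating-point rounding, because the softmax weights and the weighted sum are computed over the set as a whole.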

Math Special Token System

It introduces special math tokens: operator tokens (e.g., <|math_add|>, <|math_sin|>), variable tokens (e.g., <|math_x1|>), and constant tokens (e.g., <|math_C|>) for structured expression generation.

Preorder Traversal Representation

Mathematical expressions are output as preorder traversal sequences (e.g., x1 + x2 becomes <|math_add|>,<|math_x1|>,<|math_x2|>). Because every operator has a fixed number of arguments, such a sequence is unambiguous without parentheses, easy to parse, and well suited to the autoregressive generation of LLMs.
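
A minimal sketch of the encoding direction, assuming an expression tree is held as nested tuples of the form (operator, child, ...). That tuple layout and the helper name to_preorder are illustrative choices, not taken from the project; only the <|math_...|> token style comes from the description above.

```python
def to_preorder(node):
    """Return the preorder token sequence for a nested-tuple expression tree."""
    if isinstance(node, str):  # leaf: a variable or the constant placeholder
        return [f"<|math_{node}|>"]
    op, *children = node
    tokens = [f"<|math_{op}|>"]  # visit the operator first, then each child in order
    for child in children:
        tokens.extend(to_preorder(child))
    return tokens

simple = to_preorder(("add", "x1", "x2"))            # x1 + x2
nested = to_preorder(("add", ("sin", "x1"), "x2"))   # sin(x1) + x2
```

Here simple reproduces the three-token sequence from the example above, and nested shows that the operand of sin follows its operator token immediately.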

Section 04

Key Functional Features of ChatSR

Multimodal Data Processing

ChatSR handles both numerical data (format [x1, x2, ..., y]) and text prompts, learning relationships between input features and target variables to generate expressions.

Distributed Training Support

It supports distributed training via the HuggingFace Trainer with FSDP, which allows training on large datasets and can be configured for a single machine with multiple GPUs or for a cluster.

Interactive Tools

  • interactive_inference_json_AAAA.py: Debugging script for loss, first token probability, and generation behavior.
  • interactive_inference_json_bfgs.py: Full inference flow (expression recovery, BFGS optimization, R² calculation).

BFGS Constant Optimization

Constants in generated expressions are optimized using BFGS to minimize MSE, with R² as the fitting metric.
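
A minimal sketch of this step using scipy.optimize.minimize with method="BFGS". The expression skeleton C0 * sin(x1) + C1, the synthetic data, and the starting values are assumptions chosen for illustration; the project's own script may set this up differently.

```python
import numpy as np
from scipy.optimize import minimize

def skeleton(x, c):
    # Hypothetical recovered expression with two constant placeholders: C0 * sin(x1) + C1
    return c[0] * np.sin(x) + c[1]

def mse(c, x, y):
    """Mean squared error of the skeleton for a given constant vector c."""
    return float(np.mean((skeleton(x, c) - y) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic, noise-free data generated from known constants (illustration only).
x = np.linspace(-3.0, 3.0, 200)
y = 2.5 * np.sin(x) - 1.0

# BFGS searches over the constants only; the expression structure stays fixed.
result = minimize(mse, x0=np.ones(2), args=(x, y), method="BFGS")
fitted = result.x
r2 = r_squared(y, skeleton(x, fitted))
```

This toy objective is quadratic in the constants, so a single start suffices. For expressions where constants sit inside nonlinear functions, the result can depend on x0, and trying several starting points is a common precaution.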

Section 05

Data Preparation & Training Process

Data Format

Training samples are JSON-formatted, including fields like id, conversations (human-machine dialogue), data_points, expression_tokens, and standard_tokens.

Data Generation

The data_gen_vary.py script generates synthetic data, controlled by parameters such as num_samples, max_length, max_vars, and max_dims, and splits the result into train/val/test sets.
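
The general idea of synthetic data generation for symbolic regression can be sketched as follows. This is not a reconstruction of data_gen_vary.py: the sampling range, the 80/10/10 split, and the function names are all assumptions, and only the row layout [x1, ..., y] and the num_samples parameter name come from the text.

```python
import math
import random

def generate_rows(expr, num_vars, num_samples, rng):
    """Sample inputs uniformly and evaluate expr, giving rows of [x1, ..., xn, y]."""
    rows = []
    for _ in range(num_samples):
        xs = [rng.uniform(-5.0, 5.0) for _ in range(num_vars)]  # assumed sampling range
        rows.append(xs + [expr(*xs)])
    return rows

def split(rows, train=0.8, val=0.1):
    """Split rows into train/val/test by position; the ratios are illustrative."""
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

rng = random.Random(0)  # fixed seed so the generated data is reproducible
rows = generate_rows(lambda x1, x2: math.sin(x1) + x2,
                     num_vars=2, num_samples=100, rng=rng)
train_rows, val_rows, test_rows = split(rows)
```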

Training Setup

  • Dependencies: Python 3.10, PyTorch, Transformers, Accelerate, SciPy.
  • Token Extension: expend_tokens.py extends the vocabulary with the math tokens and must be run before training.
  • Config: Sets tie_word_embeddings=False so that the input embeddings and the output head are not shared, which lets the math token weights train independently.
  • Distributed Training: Uses train_symbolic_regression_distributed_fixed.py with FSDP for multi-GPU training.
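
The token-extension and untied-embedding steps might look roughly like the sketch below, which uses the Transformers calls add_tokens and resize_token_embeddings. It is an assumption about the general approach, not the contents of expend_tokens.py. The helper names, the operator list, and the model_name argument are hypothetical.

```python
def build_math_tokens(operators, num_vars):
    """Return operator, variable, and constant tokens in the <|math_...|> style."""
    tokens = [f"<|math_{op}|>" for op in operators]
    tokens += [f"<|math_x{i}|>" for i in range(1, num_vars + 1)]
    tokens.append("<|math_C|>")
    return tokens

def extend_vocabulary(model_name, math_tokens):
    """Add the math tokens to a tokenizer and resize an untied causal LM to match."""
    # Imported lazily so the token list can be inspected without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.add_tokens(math_tokens, special_tokens=True)
    # tie_word_embeddings=False keeps the input embeddings and output head separate.
    model = AutoModelForCausalLM.from_pretrained(model_name, tie_word_embeddings=False)
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model

math_tokens = build_math_tokens(["add", "sin"], num_vars=2)
```

If the base checkpoint was saved with tied weights, the untied output head may need to be initialized from the input embeddings after loading; that detail is left out here.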

Section 06

Inference Flow & Evaluation Metrics

Inference Flow

  1. Generate a preorder traversal sequence from the input data.
  2. Parse the sequence into an expression tree.
  3. Recover the mathematical expression from the tree.
  4. Optimize the constants via BFGS.
  5. Evaluate the fitted expression using MSE and R².
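
Steps 2 and 3 can be sketched with a small recursive parser. The arity table, the mul operator, and the infix formatting are assumptions made for illustration (only add and sin appear as operator tokens in the description), and the token sequence is written by hand rather than produced by the model.

```python
ARITY = {"add": 2, "mul": 2, "sin": 1}  # assumed operator set and argument counts
INFIX = {"add": "+", "mul": "*"}

def name_of(token):
    """Strip the <|math_ ... |> wrapper from a token."""
    return token[len("<|math_"):-len("|>")]

def parse(tokens, pos=0):
    """Parse the subtree starting at tokens[pos]; return (tree, next_pos)."""
    name = name_of(tokens[pos])
    pos += 1
    if name not in ARITY:  # leaf: variable or constant placeholder
        return name, pos
    children = []
    for _ in range(ARITY[name]):
        child, pos = parse(tokens, pos)
        children.append(child)
    return (name, *children), pos

def to_infix(tree):
    """Render a parsed tree as a fully parenthesized infix string."""
    if isinstance(tree, str):
        return tree
    op, *args = tree
    if len(args) == 1:
        return f"{op}({to_infix(args[0])})"
    return f"({to_infix(args[0])} {INFIX[op]} {to_infix(args[1])})"

tokens = ["<|math_add|>", "<|math_mul|>", "<|math_C|>",
          "<|math_sin|>", "<|math_x1|>", "<|math_x2|>"]
tree, end = parse(tokens)
expression = to_infix(tree)
```

The parser consumes exactly as many tokens as the arities demand, so end equals len(tokens) for a well-formed sequence. A shorter or longer result is a cheap way to flag malformed model output before constant optimization.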

Metrics

  • MSE: Measures the average squared difference between predictions and true values.
  • R²: Measures the proportion of variance explained by the model. It is at most 1, closer to 1 means a better fit, and it can drop below 0 when the model fits worse than predicting the mean.
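
For reference, the standard definitions of the two metrics are given below, assuming n data points with observed values y_i, predictions \hat{y}_i, and mean observed value \bar{y}. These are the textbook forms; the source does not spell out the exact formulas the project uses.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

The numerator of the fraction equals n times the MSE, so on a fixed dataset minimizing MSE and maximizing R² select the same constants.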

Section 07

Application Scenarios of ChatSR

Physics Law Discovery

Automatically discover physical laws from experimental data (e.g., Kepler's laws from planetary orbit data).

Engineering Modeling

Build empirical formulas from observational data in fields like materials science, fluid dynamics, and chemical kinetics (where theoretical models are lacking).

Data-Driven Scientific Discovery

Handle high-dimensional data to find hidden variable relationships, leveraging multimodal input (numerical + text).

Education Assistance

Serve as a teaching tool in scientific computing courses, helping students see how data relate to mathematical expressions.

Section 08

Limitations & Future Directions of ChatSR

Current Limitations

  • Limited set of supported mathematical operators.
  • Weak handling of complex nested expressions.
  • Constant optimization may get stuck in local optima.

Future Plans

  • Expand supported math functions.
  • Introduce constraint-guided expression generation.
  • Combine with neural networks for hybrid modeling.
  • Enhance multimodal input (e.g., images, time-series data).