正文

ChatSR：面向符号回归的多模态大语言模型

ChatSR是符号回归领域的首个多模态大语言模型，通过Set Transformer编码科学数据，生成描述数据规律的数学表达式先序遍历，支持BFGS优化常数项并计算拟合度R²。

symbolic regressionmultimodal LLMscientific discoverySet TransformerBFGS optimizationmathematical expressionQwen开源

发布时间 2026/06/12 14:46最近活动 2026/06/12 14:54预计阅读 9 分钟

章节 01

ChatSR: The First Multimodal LLM for Symbolic Regression (Core Overview)

ChatSR: A Scientific Multimodal Large Language Model for Discovering Formulas from Scientific Data

ChatSR is the first multimodal large language model in the symbolic regression field. It encodes scientific data using Set Transformer, generates preorder traversal of mathematical expressions describing data patterns, supports BFGS optimization for constant terms, and calculates the fitting degree R².

This project innovatively applies multimodal large language models to symbolic regression, leveraging their sequence generation capability to directly output structured representations of mathematical expressions, providing a new AI-driven tool for scientific discovery.

章节 02

Background: Symbolic Regression & The Need for ChatSR

Symbolic regression is a fundamental task in machine learning, aiming to discover mathematical expressions describing relationships between variables from data. Unlike black-box neural networks, it produces interpretable white-box models (human-understandable formulas). Traditional methods like genetic programming are effective but have high computational costs and struggle with complex data.

ChatSR fills this gap by introducing multimodal LLMs into symbolic regression, offering a new approach for scientific discovery.

章节 03

Technical Architecture of ChatSR

Set Transformer Data Encoding

ChatSR uses Set Transformer to encode numerical data points, which can handle unordered data sets—critical for scientific experimental data (often unordered).

Math Special Token System

Preorder Traversal Representation

章节 04

Key Functional Features of ChatSR

Key Functional Features

Multimodal Data Processing

ChatSR handles both numerical data (format [x1, x2, ..., y]) and text prompts, learning relationships between input features and target variables to generate expressions.

Distributed Training Support

It supports distributed training via HuggingFace Trainer with FSDP, enabling large-scale dataset training (configurable for single-machine multi-card or cluster).

Interactive Tools

interactive_inference_json_AAAA.py: Debugging script for loss, first token probability, and generation behavior.
interactive_inference_json_bfgs.py: Full inference flow (expression recovery, BFGS optimization, R² calculation).

BFGS Constant Optimization

Constants in generated expressions are optimized using BFGS to minimize MSE, with R² as the fitting metric.

章节 05

Data Preparation & Training Process

Data Preparation & Training

Data Format

Training samples are JSON-formatted, including fields like id, conversations (human-machine dialogue), data_points, expression_tokens, and standard_tokens.

Data Generation

The data_gen_vary.py script generates synthetic data with parameters like num_samples, max_length, max_vars, and max_dims, splitting into train/val/test sets.

Training Setup

Dependencies: Python3.10, PyTorch, Transformers, Accelerate, SciPy.
Token Extension: expend_tokens.py extends the vocabulary with math tokens (required before training).
Config: Disables word embedding sharing (tie_word_embeddings=False) for independent training of math token weights.
Distributed Training: Uses train_symbolic_regression_distributed_fixed.py with FSDP for multi-card training.

章节 06

Inference Flow & Evaluation Metrics

Inference & Evaluation

Inference Flow

Model generates preorder traversal sequence from input data.
Parse the sequence into an expression tree.
Recover the mathematical expression.
Optimize constants via BFGS.
Evaluate using MSE and R².

Metrics

MSE: Measures average squared difference between predictions and true values.
R²: Measures the proportion of variance explained by the model (0 to 1; closer to 1 means better fit).

章节 07

Application Scenarios of ChatSR

Application Scenarios

Physics Law Discovery

Automatically discover physical laws from experimental data (e.g., Kepler's laws from planetary orbit data).

Engineering Modeling

Build empirical formulas from observational data in fields like materials science, fluid dynamics, and chemical kinetics (where theoretical models are lacking).

Data-Driven Scientific Discovery

Handle high-dimensional data to find hidden variable relationships, leveraging multimodal input (numerical + text).

Education Assistance

Help students understand data-math expression relationships as a tool for science computing courses.

章节 08

Limitations & Future Directions of ChatSR

Limitations & Future Directions

Current Limitations

Limited set of supported mathematical operators.
Need to improve handling of complex nested expressions.
Constant optimization may get stuck in local optima.

Future Plans

Expand supported math functions.
Introduce constraint-guided expression generation.
Combine with neural networks for hybrid modeling.
Enhance multimodal input (e.g., images, time-series data).

ChatSR：面向符号回归的多模态大语言模型

ChatSR: The First Multimodal LLM for Symbolic Regression (Core Overview)

ChatSR: A Scientific Multimodal Large Language Model for Discovering Formulas from Scientific Data

Background: Symbolic Regression & The Need for ChatSR

Background: Symbolic Regression & The Need for ChatSR

Technical Architecture of ChatSR

Technical Architecture of ChatSR

Set Transformer Data Encoding

Math Special Token System

Preorder Traversal Representation

Key Functional Features of ChatSR

Key Functional Features

Multimodal Data Processing

Distributed Training Support

Interactive Tools

BFGS Constant Optimization

Data Preparation & Training Process

Data Preparation & Training

Data Format

Data Generation

Training Setup

Inference Flow & Evaluation Metrics

Inference & Evaluation

Inference Flow

Metrics

Application Scenarios of ChatSR

Application Scenarios

Physics Law Discovery

Engineering Modeling

Data-Driven Scientific Discovery

Education Assistance

Limitations & Future Directions of ChatSR

Limitations & Future Directions

Current Limitations

Future Plans

继续阅读

Nornir MCP Server：将大语言模型引入网络自动化的企业级桥梁

Bibliothèque Française LLM：为大型语言模型优化的法语公版文献索引系统

Splinter：一款无锁零拷贝的共享内存 KV 与向量存储库，让 LLM 推理告别 socket 与 memcpy 开销

libmlxforge：Apple Silicon 上的嵌入式 MLX LLM 推理引擎