# ChatSR: A Multimodal Large Language Model for Symbolic Regression

> ChatSR is the first multimodal large language model in the field of symbolic regression. It encodes scientific data using Set Transformer, generates preorder traversal of mathematical expressions describing data patterns, supports BFGS optimization for constant terms, and calculates the fitting degree R².

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-12T06:46:05.000Z
- 最近活动: 2026-06-12T06:54:43.691Z
- 热度: 159.9
- 关键词: symbolic regression, multimodal LLM, scientific discovery, Set Transformer, BFGS optimization, mathematical expression, Qwen, 开源
- 页面链接: https://www.zingnex.cn/en/forum/thread/chatsr
- Canonical: https://www.zingnex.cn/forum/thread/chatsr
- Markdown 来源: floors_fallback

---

## ChatSR: The First Multimodal LLM for Symbolic Regression (Core Overview)

## ChatSR: A Scientific Multimodal Large Language Model for Discovering Formulas from Scientific Data

ChatSR is the first multimodal large language model in the symbolic regression field. It encodes scientific data using Set Transformer, generates preorder traversal of mathematical expressions describing data patterns, supports BFGS optimization for constant terms, and calculates the fitting degree R². 

This project innovatively applies multimodal large language models to symbolic regression, leveraging their sequence generation capability to directly output structured representations of mathematical expressions, providing a new AI-driven tool for scientific discovery.

## Background: Symbolic Regression & The Need for ChatSR

## Background: Symbolic Regression & The Need for ChatSR

Symbolic regression is a fundamental task in machine learning, aiming to discover mathematical expressions describing relationships between variables from data. Unlike black-box neural networks, it produces interpretable white-box models (human-understandable formulas). Traditional methods like genetic programming are effective but have high computational costs and struggle with complex data. 

ChatSR fills this gap by introducing multimodal LLMs into symbolic regression, offering a new approach for scientific discovery.

## Technical Architecture of ChatSR

## Technical Architecture of ChatSR

### Set Transformer Data Encoding
ChatSR uses Set Transformer to encode numerical data points, which can handle unordered data sets—critical for scientific experimental data (often unordered). 

### Math Special Token System
It introduces special math tokens: operator tokens (e.g., `<|math_add|>`, `<|math_sin|>`), variable tokens (e.g., `<|math_x1|>`), and constant tokens (e.g., `<|math_C|>`) for structured expression generation. 

### Preorder Traversal Representation
Mathematical expressions are output as preorder traversal sequences (e.g., `x1 + x2` becomes `<|math_add|>,<|math_x1|>,<|math_x2|>`), which are unambiguous, easy to parse, and compatible with autoregressive generation of LLMs.

## Key Functional Features of ChatSR

## Key Functional Features

### Multimodal Data Processing
ChatSR handles both numerical data (format `[x1, x2, ..., y]`) and text prompts, learning relationships between input features and target variables to generate expressions. 

### Distributed Training Support
It supports distributed training via HuggingFace Trainer with FSDP, enabling large-scale dataset training (configurable for single-machine multi-card or cluster). 

### Interactive Tools
- `interactive_inference_json_AAAA.py`: Debugging script for loss, first token probability, and generation behavior.
- `interactive_inference_json_bfgs.py`: Full inference flow (expression recovery, BFGS optimization, R² calculation). 

### BFGS Constant Optimization
Constants in generated expressions are optimized using BFGS to minimize MSE, with R² as the fitting metric.

## Data Preparation & Training Process

## Data Preparation & Training

### Data Format
Training samples are JSON-formatted, including fields like `id`, `conversations` (human-machine dialogue), `data_points`, `expression_tokens`, and `standard_tokens`. 

### Data Generation
The `data_gen_vary.py` script generates synthetic data with parameters like `num_samples`, `max_length`, `max_vars`, and `max_dims`, splitting into train/val/test sets. 

### Training Setup
- Dependencies: Python3.10, PyTorch, Transformers, Accelerate, SciPy.
- Token Extension: `expend_tokens.py` extends the vocabulary with math tokens (required before training).
- Config: Disables word embedding sharing (`tie_word_embeddings=False`) for independent training of math token weights.
- Distributed Training: Uses `train_symbolic_regression_distributed_fixed.py` with FSDP for multi-card training.

## Inference Flow & Evaluation Metrics

## Inference & Evaluation

### Inference Flow
1. Model generates preorder traversal sequence from input data.
2. Parse the sequence into an expression tree.
3. Recover the mathematical expression.
4. Optimize constants via BFGS.
5. Evaluate using MSE and R². 

### Metrics
- **MSE**: Measures average squared difference between predictions and true values.
- **R²**: Measures the proportion of variance explained by the model (0 to 1; closer to 1 means better fit).

## Application Scenarios of ChatSR

## Application Scenarios

### Physics Law Discovery
Automatically discover physical laws from experimental data (e.g., Kepler's laws from planetary orbit data). 

### Engineering Modeling
Build empirical formulas from observational data in fields like materials science, fluid dynamics, and chemical kinetics (where theoretical models are lacking). 

### Data-Driven Scientific Discovery
Handle high-dimensional data to find hidden variable relationships, leveraging multimodal input (numerical + text). 

### Education Assistance
Help students understand data-math expression relationships as a tool for science computing courses.

## Limitations & Future Directions of ChatSR

## Limitations & Future Directions

### Current Limitations
- Limited set of supported mathematical operators.
- Need to improve handling of complex nested expressions.
- Constant optimization may get stuck in local optima. 

### Future Plans
- Expand supported math functions.
- Introduce constraint-guided expression generation.
- Combine with neural networks for hybrid modeling.
- Enhance multimodal input (e.g., images, time-series data).
