Reading

ChatSR: A Multimodal Large Language Model for Symbolic Regression

ChatSR is the first multimodal large language model in the field of symbolic regression. It encodes scientific data using Set Transformer, generates preorder traversal of mathematical expressions describing data patterns, supports BFGS optimization for constant terms, and calculates the fitting degree R².

symbolic regressionmultimodal LLMscientific discoverySet TransformerBFGS optimizationmathematical expressionQwen开源

Published 2026-06-12 14:46Recent activity 2026-06-12 14:54Estimated read 9 min

ChatSR: A Multimodal Large Language Model for Symbolic Regression

Section 01

ChatSR: The First Multimodal LLM for Symbolic Regression (Core Overview)

ChatSR: A Scientific Multimodal Large Language Model for Discovering Formulas from Scientific Data

ChatSR is the first multimodal large language model in the symbolic regression field. It encodes scientific data using Set Transformer, generates preorder traversal of mathematical expressions describing data patterns, supports BFGS optimization for constant terms, and calculates the fitting degree R².

This project innovatively applies multimodal large language models to symbolic regression, leveraging their sequence generation capability to directly output structured representations of mathematical expressions, providing a new AI-driven tool for scientific discovery.

Section 02

Background: Symbolic Regression & The Need for ChatSR

Symbolic regression is a fundamental task in machine learning, aiming to discover mathematical expressions describing relationships between variables from data. Unlike black-box neural networks, it produces interpretable white-box models (human-understandable formulas). Traditional methods like genetic programming are effective but have high computational costs and struggle with complex data.

ChatSR fills this gap by introducing multimodal LLMs into symbolic regression, offering a new approach for scientific discovery.

Section 03

Technical Architecture of ChatSR

Set Transformer Data Encoding

ChatSR uses Set Transformer to encode numerical data points, which can handle unordered data sets—critical for scientific experimental data (often unordered).

Math Special Token System

Preorder Traversal Representation

Section 04

Key Functional Features of ChatSR

Key Functional Features

Multimodal Data Processing

ChatSR handles both numerical data (format [x1, x2, ..., y]) and text prompts, learning relationships between input features and target variables to generate expressions.

Distributed Training Support

It supports distributed training via HuggingFace Trainer with FSDP, enabling large-scale dataset training (configurable for single-machine multi-card or cluster).

Interactive Tools

interactive_inference_json_AAAA.py: Debugging script for loss, first token probability, and generation behavior.
interactive_inference_json_bfgs.py: Full inference flow (expression recovery, BFGS optimization, R² calculation).

BFGS Constant Optimization

Constants in generated expressions are optimized using BFGS to minimize MSE, with R² as the fitting metric.

Section 05

Data Preparation & Training Process

Data Preparation & Training

Data Format

Training samples are JSON-formatted, including fields like id, conversations (human-machine dialogue), data_points, expression_tokens, and standard_tokens.

Data Generation

The data_gen_vary.py script generates synthetic data with parameters like num_samples, max_length, max_vars, and max_dims, splitting into train/val/test sets.

Training Setup

Dependencies: Python3.10, PyTorch, Transformers, Accelerate, SciPy.
Token Extension: expend_tokens.py extends the vocabulary with math tokens (required before training).
Config: Disables word embedding sharing (tie_word_embeddings=False) for independent training of math token weights.
Distributed Training: Uses train_symbolic_regression_distributed_fixed.py with FSDP for multi-card training.

Section 06

Inference Flow & Evaluation Metrics

Inference & Evaluation

Inference Flow

Model generates preorder traversal sequence from input data.
Parse the sequence into an expression tree.
Recover the mathematical expression.
Optimize constants via BFGS.
Evaluate using MSE and R².

Metrics

MSE: Measures average squared difference between predictions and true values.
R²: Measures the proportion of variance explained by the model (0 to 1; closer to 1 means better fit).

Section 07

Application Scenarios of ChatSR

Application Scenarios

Physics Law Discovery

Automatically discover physical laws from experimental data (e.g., Kepler's laws from planetary orbit data).

Engineering Modeling

Build empirical formulas from observational data in fields like materials science, fluid dynamics, and chemical kinetics (where theoretical models are lacking).

Data-Driven Scientific Discovery

Handle high-dimensional data to find hidden variable relationships, leveraging multimodal input (numerical + text).

Education Assistance

Help students understand data-math expression relationships as a tool for science computing courses.

Section 08

Limitations & Future Directions of ChatSR

Limitations & Future Directions

Current Limitations

Limited set of supported mathematical operators.
Need to improve handling of complex nested expressions.
Constant optimization may get stuck in local optima.

Future Plans

Expand supported math functions.
Introduce constraint-guided expression generation.
Combine with neural networks for hybrid modeling.
Enhance multimodal input (e.g., images, time-series data).

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

libmlxforge: An Embedded MLX LLM Inference Engine for Apple Silicon

libmlxforge is an embeddable MLX large language model (LLM) inference engine designed specifically for Apple Silicon. It provides a unified C ABI interface, supports calls from Node.js, Swift, and Rust, and features continuous batching, streaming output, JSON-constrained structured output, and embedding vector generation.

Recent activity 2026-06-09 17:23