# S1-VL: A Scientific Multimodal Reasoning Model with "Thinking-with-Images" Capability

> S1-VL is a multimodal reasoning model for the scientific domain, supporting two paradigms: structured scientific reasoning and "Thinking-with-Images". The latter enables the model to generate and execute image processing code during reasoning, making it particularly suitable for high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-23T08:23:25.000Z
- Last activity: 2026-04-24T04:27:56.932Z
- Popularity: 92.9
- Keywords: multimodal reasoning, scientific AI, thinking with images, visual reasoning, code generation, scientific charts, AI for Science
- Page link: https://www.zingnex.cn/en/forum/thread/s1-vl
- Canonical: https://www.zingnex.cn/forum/thread/s1-vl
- Markdown source: floors_fallback

---

## Unique Challenges in Scientific Reasoning

Large language models have made remarkable progress in text reasoning tasks, from mathematical proof to code generation, from logical reasoning to creative writing. However, reasoning in the scientific domain often involves another critical dimension: visual information.

Imagine a physicist analyzing a complex particle collision trajectory diagram, or a biologist observing cell division under a microscope. Their reasoning process is not purely symbolic manipulation, but a constant cycle of "looking at images", "annotating", "measuring", and "comparing". This "Thinking-with-Images" ability is at the core of scientific discovery, yet it has long been overlooked by existing AI systems.

S1-VL was born to fill this gap. It is a multimodal reasoning model for the scientific domain, natively supporting two complementary reasoning paradigms: traditional structured scientific reasoning and the innovative "Thinking-with-Images" mode.

## Dual-Paradigm Architecture: Scientific Reasoning and Thinking-with-Images

The design philosophy of S1-VL is: different scientific problems require different reasoning methods. Some problems are suitable for pure symbolic reasoning, while others need deep interaction with visual information.

### Paradigm 1: Scientific Reasoning

This is the traditional Chain-of-Thought method, where the model solves problems step-by-step through structured text reasoning. It is suitable for:

- Formula derivation and mathematical proof
- Concept analysis based on text descriptions
- Logically rigorous hypothesis testing

In this mode, S1-VL acts like a rigorous scientist, recording each step of reasoning in text to ensure the integrity of the logical chain.

### Paradigm 2: Thinking-with-Images

This is the core innovation of S1-VL. In this mode, the model does not just "look at" images; it actively "manipulates" them: it generates and executes image processing code, obtains intermediate visual results, and continues reasoning based on those results. The process iterates over multiple rounds.

The specific process is as follows:

1. **Initial Observation**: The model receives input images and questions
2. **Code Generation**: The model generates Python image processing code (e.g., cropping, scaling, filtering, edge detection)
3. **Sandbox Execution**: The code is executed in an isolated sandbox environment, generating processed images or extracted values
4. **Result Observation**: The model "sees" the execution results
5. **Continue Reasoning**: Based on new visual information, the model generates the next step of code or draws preliminary conclusions
6. **Iterative Cycle**: Repeat steps 2-5 until the problem is solved

This cycle of "thinking-operation-observation-rethinking" simulates the real working style of human scientists in front of a microscope or in a laboratory.
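The six steps above can be sketched as a small control loop. This is only an illustrative sketch, not S1-VL's implementation: the `Step` and `Trace` structures, the `think_with_images` function, the `max_rounds` budget, and the toy model and sandbox stand-ins are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One model turn: either code to run or a final answer (hypothetical structure)."""
    code: Optional[str] = None
    answer: Optional[str] = None


@dataclass
class Trace:
    """Everything the loop accumulated while working on one question."""
    observations: List[str] = field(default_factory=list)
    answer: Optional[str] = None
    rounds: int = 0


def think_with_images(
    question: str,
    model: Callable[[str, List[str]], Step],
    sandbox: Callable[[str], str],
    max_rounds: int = 8,  # assumed budget; the source does not state one
) -> Trace:
    """Alternate code generation and execution until the model answers."""
    trace = Trace()
    for _ in range(max_rounds):
        # Steps 2 and 5: the model sees all earlier observations and decides
        # whether to emit more code or a final answer.
        step = model(question, trace.observations)
        if step.answer is not None:
            trace.answer = step.answer
            return trace
        # Steps 3 and 4: run the code and feed the result back as an observation.
        trace.observations.append(sandbox(step.code or ""))
        trace.rounds += 1
    return trace  # round budget exhausted without an answer


# Toy stand-ins so the loop can be exercised without a real model or sandbox.
def toy_model(question: str, observations: List[str]) -> Step:
    if not observations:
        return Step(code="crop(0, 0, 100, 100)")
    return Step(answer=f"answer based on {len(observations)} observation(s)")


def toy_sandbox(code: str) -> str:
    return f"executed: {code}"


result = think_with_images("What is the peak value?", toy_model, toy_sandbox)
```

Passing the model and sandbox in as callables keeps the loop itself independent of any particular model API or execution backend.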

## Application Scenarios: Thinking-with-Images Shines

The Thinking-with-Images mode shows unique advantages in the following scenarios:

### High-Resolution Scientific Chart Interpretation

Charts in modern scientific papers often contain massive amounts of information. A heatmap of genomics data may have thousands of data points, and a spectral chart in astrophysics may span multiple orders of magnitude in dynamic range.

Traditional multimodal models usually scale images uniformly to a fixed resolution (e.g., 224x224 or 336x336), which loses key details. The Thinking-with-Images mode of S1-VL can:

- First generate code to split the chart into blocks and check each area in detail
- Zoom in on regions of interest to observe detailed features
- Extract specific values for quantitative analysis
- Compare patterns in different regions to find anomalies or regularities
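To make the resolution argument concrete, the sketch below computes how much detail uniform downscaling discards and how a chart could be split into tiles for block-by-block inspection. The function names and the 4096x4096 example size are assumptions for illustration; the boxes follow the `(left, upper, right, lower)` convention that PIL's `Image.crop` accepts.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower)


def downscale_factor(width: int, height: int, target: int) -> Tuple[float, float]:
    """Source pixels per output pixel along each axis after resizing to target x target."""
    return width / target, height / target


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Cover the image with non-overlapping tiles; edge tiles are clipped to the image."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


# A 4096x4096 chart squeezed to 336x336 maps roughly 12 source pixels per axis
# onto each output pixel, so fine tick labels and thin lines are averaged away.
fx, fy = downscale_factor(4096, 4096, 336)

# Splitting the same chart into 1024-pixel tiles gives 16 crops that can each be
# inspected at native resolution.
boxes = tile_boxes(4096, 4096, 1024)
```

With PIL available, each box could be passed to `Image.crop` to obtain the corresponding region.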

### Microscopic Image Understanding

The world under the microscope is full of fine structures: the morphology of organelles, the localization of proteins, the texture of tissues. Understanding these images requires:

- Adjusting contrast and brightness to highlight specific structures
- Applying edge detection or morphological operations to separate regions of interest
- Measuring geometric parameters (size, shape, distribution)
- Comparing with standard atlases for identification

S1-VL can perform these operations autonomously, just like an experienced microscope operator.
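As a rough illustration of the first three operations, the following sketch stretches contrast, thresholds the result, and measures the area and bounding box of the bright region. It uses plain nested lists instead of an imaging library so that it stays self-contained; the function names and the toy 4x4 image are assumptions, and real code would more likely use OpenCV or NumPy.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale rows with values in 0..255
Mask = List[List[bool]]


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale intensities so the darkest pixel is 0 and the brightest is 255."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [[0 for _ in row] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def threshold(img: Image, cutoff: int) -> Mask:
    """Mark pixels at or above the cutoff as foreground."""
    return [[p >= cutoff for p in row] for row in img]


def measure(mask: Mask) -> Tuple[int, Optional[Tuple[int, int, int, int]]]:
    """Return foreground area in pixels and its bounding box (min_x, min_y, max_x, max_y)."""
    coords = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not coords:
        return 0, None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return len(coords), (min(xs), min(ys), max(xs), max(ys))


# A low-contrast image: a slightly brighter 2x2 structure on a dim background.
dim = [
    [100, 100, 100, 100],
    [100, 120, 120, 100],
    [100, 120, 120, 100],
    [100, 100, 100, 100],
]
stretched = stretch_contrast(dim)
area, bbox = measure(threshold(stretched, 128))
```

After stretching, the faint structure separates cleanly from the background, which is what makes the simple threshold sufficient here.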

### Geometry-Assisted Reasoning

Geometric problems naturally require visual reasoning. Proving a geometric theorem often requires:

- Adding auxiliary lines to the diagram
- Measuring angles and lengths
- Verifying congruence or similarity relationships
- Validating conjectures through construction

Thinking-with-Images allows S1-VL to "solve" these problems hands-on, rather than relying solely on pre-trained geometric knowledge.
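The measuring step can be illustrated with a few lines of coordinate geometry. This is a minimal sketch that assumes point coordinates have already been extracted from the diagram; the `angle_at` helper and the 3-4-5 triangle are illustrative choices, not taken from the source.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def angle_at(vertex: Point, p: Point, q: Point) -> float:
    """Angle p-vertex-q in degrees, computed from the two edge vectors."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot is numerically stabler than acos near 0 and 180 degrees.
    return math.degrees(math.atan2(abs(cross), dot))


# Triangle with legs 4 and 3 meeting at A.
A, B, C = (0.0, 0.0), (4.0, 0.0), (0.0, 3.0)

angle_a = angle_at(A, B, C)
angle_b = angle_at(B, A, C)
angle_c = angle_at(C, A, B)
hypotenuse = math.dist(B, C)  # Euclidean distance, available since Python 3.8
```

Checking that the three angles sum to 180 degrees, or that the measured hypotenuse matches the Pythagorean prediction, is the kind of numerical validation a conjecture can be tested against before attempting a proof.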

## Six-Dimensional Quality Filtering Framework

A key challenge in training S1-VL is data quality. Scientific multimodal data is extremely diverse, ranging from mathematical formulas to biological specimens and from astronomical images to chemical structures. How can the quality of training data be ensured?

The research team developed a six-dimensional quality filtering framework to evaluate each sample from the following six dimensions:

### Dimension 1: Visual Information Gain

Evaluate whether image operations truly bring new visual information. If the model performs a series of operations but the results are almost the same as the original image, the visual information gain of this sample is very low.

### Dimension 2: Reasoning Coherence

Check whether the logical relationship between reasoning steps is reasonable. Each step should be based on the results of the previous step and lead to the final answer.

### Dimension 3: Code Correctness

Verify whether the generated image processing code can be executed correctly and produce the expected output.

### Dimension 4: Scientific Accuracy

Ensure that the reasoning content and conclusions conform to scientific facts. This is particularly important for scientific domain models.

### Dimension 5: Multimodal Alignment

Check whether text reasoning is consistent with image content. The model should not "hallucinate" features that do not exist in the image.

### Dimension 6: Educational Value

Evaluate whether the sample demonstrates valuable reasoning patterns and helps the model learn general scientific reasoning strategies.
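One way to picture the framework is as a record with six scores and a per-dimension cutoff. The sketch below is an assumption-laden illustration: the source does not say how each dimension is scored, so the `[0, 1]` score range, the `0.6` threshold, and the names `QualityScores`, `failing_dimensions`, and `passes` are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class QualityScores:
    """Six per-sample scores, each assumed to lie in [0, 1]."""
    visual_information_gain: float
    reasoning_coherence: float
    code_correctness: float
    scientific_accuracy: float
    multimodal_alignment: float
    educational_value: float


def failing_dimensions(scores: QualityScores, threshold: float = 0.6) -> List[str]:
    """Names of the dimensions whose score falls below the (assumed) threshold."""
    return [f.name for f in fields(scores) if getattr(scores, f.name) < threshold]


def passes(scores: QualityScores, threshold: float = 0.6) -> bool:
    """A sample is kept only if every dimension clears the threshold."""
    return not failing_dimensions(scores, threshold)


good = QualityScores(0.9, 0.9, 0.9, 0.9, 0.9, 0.9)
# The text describes image features that are not actually present.
hallucinated = QualityScores(0.9, 0.8, 1.0, 0.9, 0.2, 0.7)

keep_good = passes(good)
why_rejected = failing_dimensions(hallucinated)
```

Reporting which dimensions failed, rather than a single pass/fail flag, makes it easier to audit why a sample was dropped.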

## Adaptive Data Routing Strategy

Based on the six-dimensional evaluation, the research team further proposed an adaptive data routing strategy. The core insight is: not all samples are suitable for the Thinking-with-Images mode.

For samples with low visual information gain (e.g., the image is only decorative, and the problem can be solved through pure text reasoning), the system converts them into data for the pure scientific reasoning mode. This allows the model to learn to "judge" when image operations are needed and when direct text reasoning is sufficient.

This adaptive routing brings two benefits:

1. **Efficiency Improvement**: Avoid performing expensive image operations when unnecessary
2. **Capability Differentiation**: Allow the model to clearly distinguish between the two reasoning paradigms and avoid confusion
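The routing decision described above can be sketched as a single threshold on visual information gain. The `Mode` enum, the `Sample` record, and the `0.3` cutoff are hypothetical; the source only states that low-gain samples are converted into pure scientific reasoning data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Mode(Enum):
    SCIENTIFIC_REASONING = "scientific_reasoning"
    THINKING_WITH_IMAGES = "thinking_with_images"


@dataclass(frozen=True)
class Sample:
    sample_id: str
    visual_information_gain: float  # assumed to lie in [0, 1]


def route(sample: Sample, min_gain: float = 0.3) -> Mode:
    """Send low-gain samples to text-only reasoning; keep the rest for image operations."""
    if sample.visual_information_gain < min_gain:
        return Mode.SCIENTIFIC_REASONING
    return Mode.THINKING_WITH_IMAGES


def split_by_mode(samples: List[Sample], min_gain: float = 0.3) -> Dict[Mode, List[str]]:
    """Group sample ids by the mode they are routed to."""
    buckets: Dict[Mode, List[str]] = {mode: [] for mode in Mode}
    for s in samples:
        buckets[route(s, min_gain)].append(s.sample_id)
    return buckets


batch = [
    Sample("decorative-figure", 0.05),  # image adds nothing beyond the text
    Sample("dense-heatmap", 0.85),      # zooming reveals values unreadable at full view
]
buckets = split_by_mode(batch)
```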

## Four-Stage Progressive Training Process

The training of S1-VL is a carefully designed four-stage process:

### Stage 1: Scientific Multimodal Supervised Fine-tuning (SFT)

First, basic training is conducted on a wide range of scientific multimodal data. Data sources cover six disciplines:
- Mathematics: Geometry, algebra, calculus problems
- Physics: Mechanics, electromagnetism, optics problems
- Chemistry: Molecular structure, reaction mechanisms, experimental analysis
- Astronomy: Star map recognition, spectral analysis, astrometry
- Geography: Map interpretation, geological profiles, meteorological charts
- Biology: Cell images, anatomical atlases, ecological data

The goal of this stage is to establish basic multimodal understanding capabilities.

### Stage 2: Thinking-with-Images Cold-Start SFT

On top of basic capabilities, the Thinking-with-Images mode is specifically trained. The model learns:
- When to trigger Thinking-with-Images (vs. pure text reasoning)
- How to write effective image processing code
- How to interpret code execution results
- How to plan multi-round image operation sequences

### Stage 3: Reinforcement Learning Based on SAPO (First Round)

SAPO (Self-Adaptive Policy Optimization) is a reinforcement learning method for reasoning tasks. In this stage, the model improves its reasoning strategy through trial and error. Reward signals are based on:
- Correctness of the final answer
- Efficiency of the reasoning process (number of steps, number of code executions)
- Quality of intermediate results
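A scalar reward combining these three signals might look like the sketch below. It illustrates the reward signal only, not the SAPO optimization itself, and every weight, cost, and name in it (`Rollout`, `reward`, `w_correct`, `w_quality`, `step_cost`, `run_cost`) is an assumption, since the source does not give the actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Summary of one sampled reasoning trajectory (hypothetical structure)."""
    correct: bool
    num_steps: int
    num_code_runs: int
    intermediate_quality: float  # assumed to lie in [0, 1]


def reward(
    r: Rollout,
    w_correct: float = 1.0,   # assumed weight on final-answer correctness
    w_quality: float = 0.2,   # assumed weight on intermediate-result quality
    step_cost: float = 0.01,  # assumed penalty per reasoning step
    run_cost: float = 0.02,   # assumed penalty per code execution
) -> float:
    """Correctness plus a quality bonus, minus a cost for long or tool-heavy rollouts."""
    efficiency_penalty = step_cost * r.num_steps + run_cost * r.num_code_runs
    return w_correct * float(r.correct) + w_quality * r.intermediate_quality - efficiency_penalty


concise = Rollout(correct=True, num_steps=5, num_code_runs=2, intermediate_quality=0.5)
wasteful = Rollout(correct=True, num_steps=20, num_code_runs=10, intermediate_quality=0.5)

r_concise = reward(concise)
r_wasteful = reward(wasteful)
```

Under these assumed weights, two rollouts that reach the same correct answer are ranked by how economically they got there.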

### Stage 4: Reinforcement Learning Based on SAPO (Second Round)

This stage applies further reinforcement learning using more complex samples and stricter evaluation criteria. It aims to refine and consolidate the learned capabilities, improving the model's robustness and generalization.

## Benchmark Tests and Performance

S1-VL-32B (built on Qwen3-VL-32B-Thinking) was evaluated on 13 benchmarks:

### Thinking-with-Images Benchmarks

On five specialized Thinking-with-Images benchmarks, S1-VL-32B achieved state-of-the-art performance:

- **HRBench-4K/8K**: High-resolution image understanding benchmark
- **MME-RealWorld-CN/Lite**: Real-world multimodal evaluation
- **V\***: Visual reasoning benchmark

These benchmarks test the model's ability to process high-resolution images, perform complex visual reasoning, and interact with real-world images. S1-VL's lead across them supports the effectiveness of the Thinking-with-Images paradigm.

### Scientific Reasoning Benchmarks

On scientific reasoning benchmarks (e.g., Physics, VRSBench), S1-VL also outperformed comparison systems. This suggests that the two paradigms are complementary: training for Thinking-with-Images does not weaken pure text reasoning, and visual verification may improve overall performance.

## Technical Implementation Details

### Base Model Selection

S1-VL-32B is built on Qwen3-VL-32B-Thinking. The reasons for choosing this base include:
- Strong visual understanding ability
- Excellent text reasoning foundation
- Support for long contexts (critical for multi-round Thinking-with-Images)
- Open weights and good scalability

### Sandbox Environment Design

Code execution for Thinking-with-Images requires a safely isolated sandbox environment. Key design considerations:

- **Security**: Restrict executable Python operations to prevent malicious code
- **Efficiency**: Fast sandbox startup and destruction to support high-throughput training
- **Rich Functionality**: Pre-install common image processing libraries (PIL, OpenCV, NumPy, Matplotlib, etc.)
- **Resource Limitation**: Control CPU/memory usage to prevent resource exhaustion
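The efficiency and resource-limitation points can be illustrated with a child process and a wall-clock timeout. This is a minimal sketch and not a real sandbox: a subprocess with a timeout is not a security boundary, and a production setup would add container or system-call isolation plus memory limits. The `RunResult` type and `run_in_subprocess` function are names introduced here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    stdout: str
    stderr: str
    timed_out: bool


def run_in_subprocess(code: str, timeout_s: float = 10.0) -> RunResult:
    """Run untrusted-looking code in a separate interpreter with a wall-clock limit."""
    try:
        proc = subprocess.run(
            # -I starts Python in isolated mode: it ignores PYTHON* environment
            # variables and keeps the user site directory off sys.path.
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left running.
        return RunResult(stdout="", stderr="", timed_out=True)
    return RunResult(stdout=proc.stdout, stderr=proc.stderr, timed_out=False)


ok = run_in_subprocess("print(1 + 1)")
runaway = run_in_subprocess("while True: pass", timeout_s=0.5)
```

Starting a fresh interpreter per execution is simple but slow; the high-throughput requirement mentioned above would likely call for pooled or pre-warmed workers.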

### Multi-Round Interaction Protocol

The interaction between the model and the sandbox requires a clear protocol:

1. The model generates a special token sequence containing code
2. The system extracts the code and sends it to the sandbox for execution
3. The sandbox returns execution results (output images or values)
4. The results are encoded and inserted into the model's context
5. The model continues generation based on the updated context

This protocol needs to be clearly annotated in the training data to allow the model to learn the correct interaction mode.
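Steps 2 and 4 of the protocol amount to parsing code out of the generation and appending the result to the context. The sketch below shows one possible shape; the delimiter strings are hypothetical placeholders, since the source does not name the special tokens S1-VL actually uses, and image results are represented here as plain text for simplicity.

```python
import re
from typing import Optional

# Hypothetical delimiters standing in for the model's real special tokens.
CODE_START, CODE_END = "<|code_start|>", "<|code_end|>"
RESULT_START, RESULT_END = "<|result_start|>", "<|result_end|>"

_CODE_RE = re.compile(
    re.escape(CODE_START) + r"(.*?)" + re.escape(CODE_END),
    re.DOTALL,  # code usually spans several lines
)


def extract_code(generation: str) -> Optional[str]:
    """Return the first delimited code block, or None if the model emitted no code."""
    match = _CODE_RE.search(generation)
    return match.group(1).strip() if match else None


def append_result(context: str, result: str) -> str:
    """Insert the sandbox output into the context so generation can continue from it."""
    return f"{context}\n{RESULT_START}\n{result}\n{RESULT_END}\n"


generation = (
    "The legend is too small to read, so I will zoom in.\n"
    f"{CODE_START}\nimg.crop((0, 0, 64, 64))\n{CODE_END}"
)
code = extract_code(generation)
context = append_result(generation, "saved crop_0.png (64x64)")
```

When `extract_code` returns `None`, the turn contains no tool call and can be treated as the final answer.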

## Limitations and Future Directions

### Current Limitations

- **Computational Cost**: The Thinking-with-Images mode requires multiple code executions, so its inference cost is higher than that of pure text reasoning
- **Sandbox Dependence**: The approach requires maintaining a complex sandbox infrastructure
- **Error Accumulation**: Early errors in multi-round interactions may affect subsequent reasoning

### Future Directions

- **Smarter Routing**: Develop more refined heuristic methods to more accurately judge when Thinking-with-Images is needed
- **Tool Expansion**: Integrate more scientific tools (e.g., symbolic computation, data analysis libraries) in addition to image processing
- **Real-Time Interaction**: Support user intervention to collaborate with the model to complete complex scientific reasoning
- **Domain Specialization**: Develop specialized versions for specific scientific fields (e.g., medical imaging, materials science)

## Broader Impact: A New Paradigm for AI for Science

S1-VL represents an important development direction for "AI for Science": a shift from passive information processing to active experimental operation. Where earlier AI systems were largely limited to "reading" scientific literature, S1-VL shows that AI can also "do" scientific experiments, at least in the digital domain.

The ability of this "digital experimenter" has far-reaching significance:

- **Accelerate Scientific Discovery**: Automatically perform routine data analysis tasks, allowing scientists to focus on innovation
- **Lower Thresholds**: Enable non-experts to conduct complex scientific image analysis
- **Educational Innovation**: Serve as an interactive learning tool to demonstrate the complete process of scientific reasoning
- **Reproducibility**: Automatically record all operation steps to improve the reproducibility of scientific research

## Conclusion: When AI Learns to "Think" with Hands

The "Thinking-with-Images" ability of S1-VL is essentially a form of "embodied cognition"—AI is no longer a passive information processor, but an agent that can assist thinking by manipulating the environment (here, digital images). This echoes the "extended mind" theory in human cognitive science: thinking does not only occur in the brain (or neural network) but also in interaction with the environment.

From a broader perspective, S1-VL is an important step toward a "general scientific agent". Future AI scientists may not only read papers and write code but also operate microscopes, adjust experimental parameters, and analyze observation data—becoming true partners of human scientists.

The future of science may be a future where humans and AI "think by looking at images" together.
