OR-LLM-Agent: An Automatic Solving Framework for Operations Research Optimization Problems Based on Reasoning Large Language Models


Tags: OR-LLM-Agent · Operations research optimization · DeepSeek-R1 · Reasoning models · Mathematical modeling · Gurobi · Shanghai Jiao Tong University
Published 2026-05-11 20:18 · Last activity 2026-05-11 21:21 · Estimated read: 8 min

Section 01

Introduction / Main Floor

The OR-LLM-Agent framework, jointly open-sourced by Shanghai Jiao Tong University and Nanyang Technological University, decomposes the solving of operations research optimization problems into three stages (mathematical modeling, code generation, and debugging) and automates the full pipeline with reasoning models such as DeepSeek-R1.


Section 02

Research Background and Challenges

Operations Research (OR) optimization problems arise throughout key business scenarios such as logistics scheduling, production planning, and resource allocation. Traditionally, solving them requires domain experts to build mathematical models by hand and then compute solutions with professional solvers such as Gurobi and CPLEX. This process is costly and time-consuming, and it demands deep expertise with the solvers themselves.

In recent years, with the rise of Large Language Models (LLMs), researchers have begun to explore automating this process with AI. However, most existing methods are built on non-reasoning LLMs and improve performance through prompt engineering or fine-tuning, approaches that remain bounded by the base model's own reasoning capability.

The research team from Shanghai Jiao Tong University and Nanyang Technological University proposed the OR-LLM-Agent framework, which for the first time systematically applies large reasoning models to the automatic solving of OR optimization problems, achieving significant gains on multiple benchmarks.


Section 03

Framework Design Philosophy

The core innovation of OR-LLM-Agent lies in its task decomposition strategy. The research team observed that splitting the solving of complex OR problems into multiple specialized subtasks, each handled by its own sub-agent, significantly improves overall performance. The process is divided into three sequentially executed stages, described in the sections below.
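Before walking through the stages, here is a minimal sketch of how such a three-stage pipeline might be wired together. The prompts and function signatures are illustrative assumptions, not the authors' actual interfaces; the `llm` and `run` callables are injected stand-ins for a reasoning model and a sandboxed code runner.

```python
from typing import Callable, Optional, Tuple

def solve_or_problem(
    problem: str,
    llm: Callable[[str], str],               # text-in, text-out reasoning model
    run: Callable[[str], Tuple[bool, str]],  # executes code -> (ok, output or error)
    max_debug_rounds: int = 3,
) -> Optional[str]:
    """Three sequential sub-agents: modeling -> code generation -> debugging."""
    # Stage 1: natural-language problem -> formal optimization model
    model = llm("Identify decision variables, objective, and constraints, then "
                f"write a standard mathematical model for:\n{problem}")
    # Stage 2: formal model -> executable Python/gurobipy code
    code = llm(f"Write Python code using gurobipy that solves this model:\n{model}")
    # Stage 3: run the code, feeding errors back until it succeeds
    for _ in range(max_debug_rounds):
        ok, output = run(code)
        if ok:
            return output                    # solver output from a valid run
        code = llm(f"Fix this solver code.\nCode:\n{code}\nError:\n{output}")
    return None                              # give up after max attempts
```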


Section 04

Stage 1: Mathematical Modeling

This stage converts a problem described in natural language into a standard mathematical optimization model. The sub-agent must identify the decision variables, objective function, and constraints, and output a well-formed mathematical formulation. This is the foundation for all subsequent steps: the accuracy of the model directly determines the quality of the final solution.
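To make this concrete, consider a toy problem invented here for illustration (it is not from the paper): a workshop makes products A and B; each unit of A earns 3 and uses 1 machine-hour and 3 labor-hours, each unit of B earns 5 and uses 2 machine-hours and 1 labor-hour, and 14 machine-hours and 18 labor-hours are available. The modeling sub-agent would be expected to emit something like:

```latex
\begin{aligned}
\max_{x,\,y}\quad & 3x + 5y && \text{(total profit)} \\
\text{s.t.}\quad  & x + 2y \le 14 && \text{(machine-hours)} \\
                  & 3x + y \le 18 && \text{(labor-hours)} \\
                  & x,\ y \ge 0
\end{aligned}
```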


Section 05

Stage 2: Code Generation

Based on the mathematical model from the previous stage, this stage generates executable solver code. The framework mainly uses Python and the Gurobi Optimizer, and the generated code must correctly implement the variable definitions, objective function, and constraints of the model.
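Continuing the toy example above, the generated code might look like the following. This is an illustrative sketch of what the code-generation sub-agent would be expected to produce, not output from the actual framework; running it requires gurobipy and a Gurobi license.

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("toy_production_plan")

# Decision variables: units of products A and B (continuous, nonnegative)
x = m.addVar(lb=0, name="x")
y = m.addVar(lb=0, name="y")

# Objective: maximize total profit
m.setObjective(3 * x + 5 * y, GRB.MAXIMIZE)

# Constraints: limited machine-hours and labor-hours
m.addConstr(x + 2 * y <= 14, name="machine_hours")
m.addConstr(3 * x + y <= 18, name="labor_hours")

m.optimize()

if m.Status == GRB.OPTIMAL:
    print(f"x = {x.X:.2f}, y = {y.X:.2f}, profit = {m.ObjVal:.2f}")
    # Expected: x = 4.40, y = 4.80, profit = 37.20
```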


Section 06

Stage 3: Debugging and Optimization

Generated code inevitably contains occasional syntax errors or logical flaws. The debugging sub-agent analyzes error messages from execution, locates the root cause, and produces repaired code. This iterative process continues until a valid solution is obtained or the maximum number of attempts is reached.
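A minimal sketch of this execute-and-repair loop is below, assuming a subprocess-based runner; `llm_fix` is a hypothetical stand-in for the debugging sub-agent, not the authors' actual interface.

```python
import os
import subprocess
import tempfile

MAX_ATTEMPTS = 3  # cap on repair iterations, per the stage description

def run_code(code: str, timeout: int = 60) -> tuple[bool, str]:
    """Execute generated solver code in a subprocess; return (ok, output or error)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
        if proc.returncode == 0:
            return True, proc.stdout
        return False, proc.stderr  # traceback for the debugging sub-agent to analyze
    except subprocess.TimeoutExpired:
        return False, f"execution timed out after {timeout}s"
    finally:
        os.unlink(path)

def debug_until_valid(code: str, llm_fix) -> str | None:
    """llm_fix(code, error) -> repaired code; stands in for the debugging sub-agent."""
    for _ in range(MAX_ATTEMPTS):
        ok, output = run_code(code)
        if ok:
            return output
        code = llm_fix(code, output)
    return None  # maximum attempts exhausted
```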


Section 07

BWOR Benchmark Dataset

The research team found that existing OR benchmarks (such as NL4OPT, MAMO, and IndustryOR) evaluate reasoning models inconsistently: on some of them, reasoning models score worse than non-reasoning models from the same series. To address this, they constructed the BWOR (Benchmark for Operations Research) dataset.

BWOR is designed to evaluate model capability more consistently and with greater discriminative power. The dataset covers diverse types of OR problems, each constructed to test modeling accuracy, code correctness, and solving efficiency together.

This dataset has been publicly released on Hugging Face and Zenodo, providing a standardized evaluation benchmark for subsequent research.
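Retrieving the dataset should be straightforward with the Hugging Face `datasets` library. The dataset ID below is a placeholder, since the post does not give the exact ID; check the project's Hugging Face page for the real one.

```python
from datasets import load_dataset

# Placeholder ID: replace "ORG_NAME/BWOR" with the actual dataset ID
# published on the project's Hugging Face page.
bwor = load_dataset("ORG_NAME/BWOR")
print(bwor)  # inspect available splits and fields
```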


Section 08

Experimental Results and Performance Analysis

The experimental results are striking: on the BWOR benchmark, OR-LLM-Agent built on DeepSeek-R1 outperformed all comparison methods, including GPT-o3, Gemini 2.5 Pro, the standalone DeepSeek-R1 model, and specialized ORLM models, improving accuracy by at least 7%.

These results support the effectiveness of the task decomposition strategy: compared with end-to-end single-stage methods, staged specialized processing lets the model focus on the core challenge of each subtask and avoids cognitive overload on complex problems.

Notably, the research team used DeepSeek-R1, an open-source reasoning model, rather than the closed-source GPT-o3. This means enterprises can deploy the complete solution locally without relying on external APIs, ensuring data privacy and reducing long-term usage costs.