OR-LLM-Agent: An Automatic Solving Framework for Operations Research Optimization Problems Based on Reasoning Large Language Models


Tags: OR-LLM-Agent · Operations research optimization · DeepSeek-R1 · Reasoning models · Mathematical modeling · Gurobi · Shanghai Jiao Tong University
Published 2026-05-11 20:18 · Last activity 2026-05-11 21:21 · Estimated read: 8 min

Section 01

Introduction / Main Floor

The OR-LLM-Agent framework, jointly open-sourced by Shanghai Jiao Tong University and Nanyang Technological University, decomposes the solving of operations research optimization problems into three stages (mathematical modeling, code generation, and debugging) and automates the full pipeline with reasoning models such as DeepSeek-R1.


Section 02

Research Background and Challenges

Operations Research (OR) optimization problems arise throughout key business scenarios such as logistics scheduling, production planning, and resource allocation. Traditionally, solving them requires domain experts to build mathematical models by hand and then compute solutions with professional solvers such as Gurobi and CPLEX. This process is costly and time-consuming, and it demands deep expertise with the solvers themselves.

In recent years, with the rise of Large Language Models (LLMs), researchers have begun to explore automating this process with AI. However, most existing methods are built on non-reasoning LLMs and improve performance through prompt engineering or fine-tuning, approaches that remain bounded by the base model's own reasoning capability.

The research team from Shanghai Jiao Tong University and Nanyang Technological University proposed the OR-LLM-Agent framework, which for the first time systematically applies large reasoning models to the automatic solving of OR optimization problems, achieving significant gains on multiple benchmarks.


Section 03

Framework Design Philosophy

The core innovation of OR-LLM-Agent lies in its task decomposition strategy. The research team observed that splitting the solving of complex OR problems into multiple specialized subtasks, each handled by its own sub-agent, significantly improves overall performance. The process is divided into three sequentially executed stages, described in the sections below.
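Before walking through the stages, here is a minimal sketch of how such a three-stage pipeline might be wired together. The prompts and function signatures are illustrative assumptions, not the authors' actual interfaces; the `llm` and `run` callables are injected stand-ins for a reasoning model and a sandboxed code runner.

```python
from typing import Callable, Optional, Tuple

def solve_or_problem(
    problem: str,
    llm: Callable[[str], str],               # text-in, text-out reasoning model
    run: Callable[[str], Tuple[bool, str]],  # executes code -> (ok, output or error)
    max_debug_rounds: int = 3,
) -> Optional[str]:
    """Three sequential sub-agents: modeling -> code generation -> debugging."""
    # Stage 1: natural-language problem -> formal optimization model
    model = llm("Identify decision variables, objective, and constraints, then "
                f"write a standard mathematical model for:\n{problem}")
    # Stage 2: formal model -> executable Python/gurobipy code
    code = llm(f"Write Python code using gurobipy that solves this model:\n{model}")
    # Stage 3: run the code, feeding errors back until it succeeds
    for _ in range(max_debug_rounds):
        ok, output = run(code)
        if ok:
            return output                    # solver output from a valid run
        code = llm(f"Fix this solver code.\nCode:\n{code}\nError:\n{output}")
    return None                              # give up after max attempts
```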


Section 04

Stage 1: Mathematical Modeling

This stage converts a problem described in natural language into a standard mathematical optimization model. The sub-agent must identify the decision variables, objective function, and constraints, and output a well-formed mathematical formulation. This is the foundation for all subsequent steps: the accuracy of the model directly determines the quality of the final solution.
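To make this concrete, consider a toy problem invented here for illustration (it is not from the paper): a workshop makes products A and B; each unit of A earns 3 and uses 1 machine-hour and 3 labor-hours, each unit of B earns 5 and uses 2 machine-hours and 1 labor-hour, and 14 machine-hours and 18 labor-hours are available. The modeling sub-agent would be expected to emit something like:

```latex
\begin{aligned}
\max_{x,\,y}\quad & 3x + 5y && \text{(total profit)} \\
\text{s.t.}\quad  & x + 2y \le 14 && \text{(machine-hours)} \\
                  & 3x + y \le 18 && \text{(labor-hours)} \\
                  & x,\ y \ge 0
\end{aligned}
```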


Section 05

Stage 2: Code Generation

Based on the mathematical model from the previous stage, this stage generates executable solver code. The framework mainly uses Python and the Gurobi Optimizer, and the generated code must correctly implement the variable definitions, objective function, and constraints of the model.
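Continuing the toy example above, the generated code might look like the following. This is an illustrative sketch of what the code-generation sub-agent would be expected to produce, not output from the actual framework; running it requires gurobipy and a Gurobi license.

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("toy_production_plan")

# Decision variables: units of products A and B (continuous, nonnegative)
x = m.addVar(lb=0, name="x")
y = m.addVar(lb=0, name="y")

# Objective: maximize total profit
m.setObjective(3 * x + 5 * y, GRB.MAXIMIZE)

# Constraints: limited machine-hours and labor-hours
m.addConstr(x + 2 * y <= 14, name="machine_hours")
m.addConstr(3 * x + y <= 18, name="labor_hours")

m.optimize()

if m.Status == GRB.OPTIMAL:
    print(f"x = {x.X:.2f}, y = {y.X:.2f}, profit = {m.ObjVal:.2f}")
    # Expected: x = 4.40, y = 4.80, profit = 37.20
```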


Section 06

Stage 3: Debugging and Optimization

Generated code inevitably contains occasional syntax errors or logical flaws. The debugging sub-agent analyzes error messages from execution, locates the root cause, and produces repaired code. This iterative process continues until a valid solution is obtained or the maximum number of attempts is reached.
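A minimal sketch of this execute-and-repair loop is below, assuming a subprocess-based runner; `llm_fix` is a hypothetical stand-in for the debugging sub-agent, not the authors' actual interface.

```python
import os
import subprocess
import tempfile

MAX_ATTEMPTS = 3  # cap on repair iterations, per the stage description

def run_code(code: str, timeout: int = 60) -> tuple[bool, str]:
    """Execute generated solver code in a subprocess; return (ok, output or error)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
        if proc.returncode == 0:
            return True, proc.stdout
        return False, proc.stderr  # traceback for the debugging sub-agent to analyze
    except subprocess.TimeoutExpired:
        return False, f"execution timed out after {timeout}s"
    finally:
        os.unlink(path)

def debug_until_valid(code: str, llm_fix) -> str | None:
    """llm_fix(code, error) -> repaired code; stands in for the debugging sub-agent."""
    for _ in range(MAX_ATTEMPTS):
        ok, output = run_code(code)
        if ok:
            return output
        code = llm_fix(code, output)
    return None  # maximum attempts exhausted
```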


Section 07

BWOR Benchmark Dataset

The research team found that existing OR benchmarks (such as NL4OPT, MAMO, and IndustryOR) evaluate reasoning models inconsistently: on some of them, reasoning models score worse than non-reasoning models from the same series. To address this, they constructed the BWOR (Benchmark for Operations Research) dataset.

BWOR is designed to evaluate model capability more consistently and with greater discriminative power. The dataset covers diverse types of OR problems, each constructed to test modeling accuracy, code correctness, and solving efficiency together.

This dataset has been publicly released on Hugging Face and Zenodo, providing a standardized evaluation benchmark for subsequent research.
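Retrieving the dataset should be straightforward with the Hugging Face `datasets` library. The dataset ID below is a placeholder, since the post does not give the exact ID; check the project's Hugging Face page for the real one.

```python
from datasets import load_dataset

# Placeholder ID: replace "ORG_NAME/BWOR" with the actual dataset ID
# published on the project's Hugging Face page.
bwor = load_dataset("ORG_NAME/BWOR")
print(bwor)  # inspect available splits and fields
```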


Section 08

Experimental Results and Performance Analysis

The experimental results are striking: on the BWOR benchmark, OR-LLM-Agent built on DeepSeek-R1 outperformed all comparison methods, including GPT-o3, Gemini 2.5 Pro, the standalone DeepSeek-R1 model, and specialized ORLM models, improving accuracy by at least 7%.

These results support the effectiveness of the task decomposition strategy: compared with end-to-end single-stage methods, staged specialized processing lets the model focus on the core challenge of each subtask and avoids cognitive overload on complex problems.

Notably, the research team used DeepSeek-R1, an open-source reasoning model, rather than the closed-source GPT-o3. This means enterprises can deploy the complete solution locally without relying on external APIs, ensuring data privacy and reducing long-term usage costs.