LongTraceRL: Long-Context Reasoning Learning Based on Search Agent Trajectories and Scoring Rewards

LongTraceRL tackles two challenges in long-context reasoning, distracting information and the lack of process supervision, by constructing hierarchical distractor documents and using entity-level scoring rewards. It reports consistent gains over baselines across five long-context reasoning benchmarks.

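Tags: long-context reasoning, reinforcement learning, process supervision, knowledge graph, search agent, reward design, multi-hop reasoning, RLVR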

Published 2026-05-30 01:51 | Recent activity 2026-06-01 10:57 | Estimated read 14 min

Section 01

LongTraceRL: Long-Context Reasoning Learning Based on Search Agent Trajectories and Scoring Rewards (Introduction)

Keywords: long-context reasoning, reinforcement learning, process supervision, knowledge graph, search agent, reward design, multi-hop reasoning, RLVR

Core Insights: LongTraceRL targets failure modes such as attention dispersion and information omission in long-context reasoning. It uses search agent trajectories to construct hierarchical distractors and applies entity-level scoring rewards for fine-grained process supervision, which the reported experiments associate with better reasoning in complex, distractor-heavy settings.

Section 02

Problem Background: Core Challenges of Long-Context Reasoning

Long-context reasoning is one of the core challenges faced by large language models. Although some modern LLMs have expanded their context windows to as many as millions of tokens, their ability to locate key information and integrate scattered evidence remains limited. Typical failure modes include:

Attention Dispersion: Distracted by irrelevant information, unable to focus on key paragraphs
Information Omission: Failing to notice details critical to the answer
Spurious Association: Incorrectly linking irrelevant information to the question
Reasoning Chain Breakage: Losing logical connections between intermediate steps in multi-hop reasoning

These failures arise in information-integration steps that humans find 'obvious' but that models must learn explicitly as a complex skill.

Section 03

Limitations of Existing Methods: Shortcomings of RLVR in Long-Context Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has great potential in reasoning tasks, but existing methods have two key limitations:

Limitation 1: Low-Confusion Distractors

Existing training data often uses random sampling or a single search to build distracting documents, resulting in low-confusion distractors that models can easily identify as irrelevant. Real-world distractors are more deceptive: some look relevant on the surface but are not, and others contain partially relevant information that is still insufficient to answer the question.

Limitation 2: Sparse Outcome-Oriented Rewards

Using only final answer correctness as the reward signal leads to:

No supervision for intermediate steps
Reward hacking (models get correct answers through wrong reasoning)
Inability to distinguish differences in reasoning quality among correct answers

Analogy: A teacher only tells students they 'passed' without pointing out specific mistakes or improvement directions.

Section 04

Core Innovation 1: Hierarchical Distractor Construction Based on Search Agent Trajectories

Multi-hop Question Generation via Knowledge Graph Random Walk

Select a starting entity from the knowledge graph
Walk randomly along relation edges for several hops to reach a target entity
Convert the path into a natural language question
Record intermediate entities in the reasoning chain (gold entities)
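
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy graph `KG`, the helper names `random_walk`, `to_question`, and `build_sample`, and the template-based question wording are all hypothetical, and a real system would likely use a large knowledge graph and a language model to phrase the question.

```python
import random

# Hypothetical toy knowledge graph: subject -> list of (relation, object).
KG = {
    "Marie Curie": [("birthplace", "Warsaw"), ("spouse", "Pierre Curie")],
    "Warsaw": [("country", "Poland")],
    "Pierre Curie": [("birthplace", "Paris")],
    "Paris": [("country", "France")],
    "Poland": [],
    "France": [],
}

def random_walk(kg, start, hops, rng):
    """Follow up to `hops` random edges from `start` without revisiting entities."""
    path, relations = [start], []
    current = start
    for _ in range(hops):
        candidates = [(r, e) for r, e in kg.get(current, []) if e not in path]
        if not candidates:
            break  # dead end: stop early with a shorter path
        rel, nxt = rng.choice(candidates)
        relations.append(rel)
        path.append(nxt)
        current = nxt
    return path, relations

def to_question(start, relations):
    """Crude template (an assumption): nest one 'the R of' phrase per hop."""
    phrase = start
    for rel in relations:
        phrase = f"the {rel} of {phrase}"
    return f"What is {phrase}?"

def build_sample(kg, start, hops, seed=0):
    rng = random.Random(seed)  # seeded for reproducible samples
    path, relations = random_walk(kg, start, hops, rng)
    return {
        "question": to_question(start, relations),
        "relations": relations,
        "gold_entities": path[1:-1],  # intermediate entities on the chain
        "answer": path[-1],           # final entity reached by the walk
    }

sample = build_sample(KG, "Marie Curie", hops=2, seed=0)
print(sample["question"], "->", sample["answer"], sample["gold_entities"])
```

Keeping the intermediate entities alongside the question is what later allows a reward to check the reasoning chain rather than only the final answer.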

Search Agent Trajectory Collection

Deploy a search agent to attempt answering multi-hop questions, and record its complete behavior trajectory (multiple searches, reading documents, citing evidence, generating answers) for constructing hierarchical distractors.

Hierarchical Distractor Design

High-Confusion Distractors: Documents read but not cited by the agent (surface-relevant but insufficient to support the answer, highly deceptive)
Low-Confusion Distractors: Documents in search results not opened by the agent (surface-relevant but not worth reading, easy to identify)

This design makes training data more challenging and simulates real-world complexity.
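
As a rough sketch of the tiering described above, the following Python splits the documents in one agent trajectory into evidence, high-confusion distractors, and low-confusion distractors. The `Trajectory` record and its field names are assumptions made for illustration; the source does not specify how trajectories are stored.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: List[str]  # ids of all documents returned by the agent's searches
    opened: Set[str]      # ids of documents the agent actually read
    cited: Set[str]       # ids of documents the agent cited as evidence

def partition_documents(traj: Trajectory) -> Dict[str, List[str]]:
    """Assign every retrieved document to exactly one tier."""
    buckets = {"evidence": [], "high_confusion": [], "low_confusion": []}
    for doc_id in dict.fromkeys(traj.retrieved):  # de-duplicate, keep order
        if doc_id in traj.cited:
            buckets["evidence"].append(doc_id)
        elif doc_id in traj.opened:
            # Read but not cited: looked promising, did not support the answer.
            buckets["high_confusion"].append(doc_id)
        else:
            # Returned by search but never opened: easy to dismiss.
            buckets["low_confusion"].append(doc_id)
    return buckets

traj = Trajectory(
    retrieved=["d1", "d2", "d3", "d4", "d2"],
    opened={"d1", "d2"},
    cited={"d1"},
)
print(partition_documents(traj))
```

A long training context could then be assembled by mixing the evidence documents with distractors drawn from both tiers; the mixing ratio is not given in the source.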

Section 05

Core Innovation 2: Entity-Level Scoring Rewards and Process Supervision

Scoring Reward Design

Core Idea: Use gold entities in the reasoning chain as checkpoints to evaluate whether the model cites correct evidence at each step:

The gold answer for a multi-hop question contains a sequence of key entities
Parse the model's answer to extract cited entities
Calculate the degree of match between the cited entities and the gold entities, which serves as fine-grained feedback
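
One simple way to picture the entity matching degree is as the fraction of gold entities that appear in the model's response. The sketch below makes that assumption explicit: the function names `normalize` and `entity_score` are hypothetical, and normalized string matching stands in for the entity extraction and alignment step, whose exact form the source does not describe.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

def entity_score(response: str, gold_entities: list) -> float:
    """Fraction of gold entities mentioned in the response (0.0 to 1.0)."""
    if not gold_entities:
        return 0.0
    padded = f" {normalize(response)} "  # pad so matches align to word boundaries
    hits = sum(1 for e in gold_entities if f" {normalize(e)} " in padded)
    return hits / len(gold_entities)

response = "Marie Curie was born in Warsaw, the capital of Poland."
print(entity_score(response, ["Warsaw", "Pierre Curie"]))
```

Exact string matching misses aliases and paraphrases, which is one reason entity recognition accuracy matters for this kind of reward.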

Positive-Only Strategy

Scoring rewards are only applied to responses with correct final answers
Responses with wrong answers only receive sparse correctness rewards (negative feedback)
Scoring rewards are used to distinguish reasoning quality among correct answers

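Because a wrong answer cannot earn credit merely by mentioning gold entities, this design guards against reward hacking; among correct answers, the scoring reward favors those with better-supported reasoning.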
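
The positive-only rule can be written as a small gating function. This is a sketch under assumptions: the function name `total_reward`, the additive form, the `weight` of 0.5, and the value given to wrong answers (`wrong_reward`, here 0.0) are illustrative choices, not values reported in the source.

```python
def total_reward(is_correct: bool, entity_match: float,
                 weight: float = 0.5, wrong_reward: float = 0.0) -> float:
    """Outcome reward plus an entity-level bonus that only correct answers receive.

    `weight` and `wrong_reward` are hypothetical hyperparameters.
    """
    if not is_correct:
        # Wrong answers get the sparse outcome signal only, no entity bonus.
        return wrong_reward
    return 1.0 + weight * entity_match

# Two correct answers are separated by how well they cover the gold entities,
# while a wrong answer that name-drops every gold entity gains nothing.
print(total_reward(True, 1.0), total_reward(True, 0.5), total_reward(False, 1.0))
```

With this gating, the entity bonus only ever ranks responses that already answered correctly.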

Advantages of Process Supervision

Compared to sparse outcome rewards, it provides:

Intermediate step feedback
Evidence quality evaluation
Encouragement for reasoning completeness
Interpretability (analysis of reasoning behavior)

Section 06

Experimental Results: Consistent Improvement in Long-Context Reasoning Ability

Experimental Setup

Model Scales: 4B, 7B, and 30B parameter reasoning LLMs
Benchmarks: Five long-context reasoning benchmarks
Baselines: Strong baselines like standard RLVR and supervised fine-tuning (SFT)

Core Results

Consistent Performance Improvement: Reported to outperform strong baselines across all model scales and benchmarks, with significant average improvements
Improved Reasoning Quality: More comprehensive, evidence-based reasoning, fewer omissions of key information, and less susceptibility to high-confusion distractors
Scale Generalization: Advantages are maintained across the 4B, 7B, and 30B models, suggesting the method is not tied to a particular model size

Ablation Experiments

Value of Hierarchical Distractors: Training with hierarchical distractors improves robustness to realistic distractors compared to random distractors; without high-confusion distractors, performance drops on hard samples
Value of Scoring Rewards: Scoring rewards improve reasoning quality compared to sparse rewards; the positive-only strategy effectively prevents reward hacking
Synergy Effect: Combining the data construction and the reward design gives better results than either alone

Section 07

Application Value and Insights: Significance for Long-Context Applications and RLVR Research

Long-Context Applications

Directly applicable to:

Document QA systems (key information localization in legal, medical, and scientific documents)
Multi-hop search (complex queries integrating multiple information sources)
Evidence chain construction (scenarios requiring clear reasoning basis)

Insights for RLVR Research

Data quality is crucial: The difficulty and realism of training data affect the upper limit of model capabilities
Value of process supervision: In these experiments, fine-grained intermediate feedback is more effective than sparse outcome rewards
Prevent reward hacking: Strategies like positive-only rewards keep the model from being rewarded for reasoning that does not support a correct answer

AI Safety Implications

Fine-grained supervision via scoring rewards helps:

Improve interpretability (analyze reasoning paths)
Detect error patterns (identify common error types)
Verify alignment (check consistency between reasoning processes and expectations)

Section 08

Limitations, Future Directions, and Conclusion

Limitations

Search Agent Limitations: Current basic agents are not optimal; stronger agents may improve trajectory quality
Entity Recognition Accuracy: Scoring rewards rely on accurate entity recognition and alignment, which may fail in complex texts
Domain Generalization: Experiments are mainly on general knowledge QA; specific domains (medical, legal) need adaptation
Computational Cost: Trajectory collection and distractor construction require large computational resources

Future Directions

Optimize search agents, improve entity recognition accuracy, expand domain applications, and reduce computational cost

Conclusion

LongTraceRL improves long-context reasoning through its data construction and fine-grained process rewards, underscoring the role of training data design and reward engineering in RLVR. As LLMs are increasingly used in knowledge-intensive tasks, methods of this kind may help models locate and integrate the relevant evidence within large volumes of text.
