# LongTraceRL: Long-Context Reasoning Learning Based on Search Agent Trajectories and Scoring Rewards

> LongTraceRL addresses the challenges of handling distracting information and process supervision in long-context reasoning by constructing hierarchical distracting documents and using entity-level scoring rewards, achieving excellent performance across multiple benchmarks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-29T17:51:40.000Z
- Last activity: 2026-06-01T02:57:54.388Z
- Popularity: 102.9
- Keywords: long-context reasoning, reinforcement learning, process supervision, knowledge graph, search agent, reward design, multi-hop reasoning, RLVR
- Page link: https://www.zingnex.cn/en/forum/thread/longtracerl
- Canonical: https://www.zingnex.cn/forum/thread/longtracerl
- Markdown source: floors_fallback

---

## LongTraceRL: Long-Context Reasoning Learning Based on Search Agent Trajectories and Scoring Rewards (Introduction)

**Core Insights**: LongTraceRL targets issues like model attention dispersion and information omission in long-context reasoning. It innovatively uses search agent trajectories to construct hierarchical distractors and designs entity-level scoring rewards to achieve fine-grained process supervision, significantly enhancing the model's reasoning ability in complex scenarios.

## Problem Background: Core Challenges of Long-Context Reasoning

Long-context reasoning is one of the core challenges faced by large language models. Although modern LLMs have expanded their context windows to as many as millions of tokens, their ability to locate key information and integrate scattered evidence remains limited. Typical failure modes include:
1. **Attention Dispersion**: Distracted by irrelevant information, unable to focus on key paragraphs
2. **Information Omission**: Failing to notice details critical to the answer
3. **Spurious Association**: Incorrectly linking irrelevant information to the question
4. **Reasoning Chain Breakage**: Losing logical connections between intermediate steps in multi-hop reasoning

Humans tend to regard this kind of information integration as 'obvious', but for models it is a complex skill that must be learned explicitly.

## Limitations of Existing Methods: Shortcomings of RLVR in Long-Context Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has great potential in reasoning tasks, but existing methods have two key limitations:
### Limitation 1: Low-Confusion Distractors
Existing training data typically builds distracting documents by random sampling or a single search. The resulting distractors are low-confusion: models can easily dismiss them as irrelevant. Real-world distractors are more deceptive, for example documents that look relevant on the surface but are not, or that contain partially relevant information that is insufficient to answer the question.

### Limitation 2: Sparse Outcome-Oriented Rewards
Using only final answer correctness as the reward signal leads to:
- No supervision for intermediate steps
- Reward hacking (models get correct answers through wrong reasoning)
- Inability to distinguish differences in reasoning quality among correct answers

Analogy: a teacher who only tells students whether they passed, without pointing out specific mistakes or directions for improvement.

## Core Innovation 1: Hierarchical Distractor Construction Based on Search Agent Trajectories

### Multi-hop Question Generation via Knowledge Graph Random Walk
1. Select a starting entity from the knowledge graph
2. Perform a random walk along relation edges until reaching a target entity
3. Convert the path into a natural language question
4. Record intermediate entities in the reasoning chain (gold entities)
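
The four steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy graph, the function names `random_walk` and `path_to_example`, and the template used to phrase the question are all assumptions made for the example. A real system would presumably use a much larger knowledge graph and a more natural question generator.

```python
import random

# Toy knowledge graph (hypothetical data): head entity -> [(relation, tail entity)].
TOY_KG = {
    "Marie Curie": [("spouse", "Pierre Curie"), ("birthplace", "Warsaw")],
    "Pierre Curie": [("birthplace", "Paris")],
    "Warsaw": [("country", "Poland")],
    "Paris": [("country", "France")],
}


def random_walk(kg, start, hops, rng):
    """Walk up to `hops` edges from `start`, never revisiting an entity."""
    path, current, visited = [], start, {start}
    for _ in range(hops):
        edges = [(r, t) for r, t in kg.get(current, []) if t not in visited]
        if not edges:
            break  # dead end: keep the shorter path
        relation, tail = rng.choice(edges)
        path.append((current, relation, tail))
        visited.add(tail)
        current = tail
    return path


def path_to_example(path):
    """Turn a non-empty path into a templated question, gold entities, and answer."""
    phrase = path[0][0]
    for _, relation, _ in path:
        phrase = f"the {relation} of {phrase}"
    return {
        "question": f"What is {phrase}?",
        # Intermediate entities on the chain serve as gold entities (assumption:
        # the final entity is stored separately as the answer).
        "gold_entities": [tail for _, _, tail in path[:-1]],
        "answer": path[-1][2],
    }


rng = random.Random(0)
path = random_walk(TOY_KG, "Marie Curie", hops=3, rng=rng)
print(path_to_example(path))
```

Excluding already visited entities keeps the walk from looping back, so every hop adds a new gold entity.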

### Search Agent Trajectory Collection
Deploy a search agent to attempt answering multi-hop questions, and record its complete behavior trajectory (multiple searches, reading documents, citing evidence, generating answers) for constructing hierarchical distractors.

### Hierarchical Distractor Design
- **High-Confusion Distractors**: Documents read but not cited by the agent (surface-relevant but insufficient to support the answer, highly deceptive)
- **Low-Confusion Distractors**: Documents in search results not opened by the agent (surface-relevant but not worth reading, easy to identify)

This design makes training data more challenging and simulates real-world complexity.
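
The two distractor tiers can be derived from a recorded trajectory with a simple partition. The sketch below assumes a hypothetical `Trajectory` record with `retrieved`, `opened`, and `cited` fields; the source does not specify how trajectories are stored, so these names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: list  # doc ids returned by any search call, in order
    opened: set      # doc ids the agent actually read
    cited: set       # doc ids the agent cited as evidence


def split_documents(traj):
    """Partition retrieved documents into evidence and two distractor tiers."""
    seen = set()
    tiers = {"evidence": [], "high_confusion": [], "low_confusion": []}
    for doc_id in traj.retrieved:
        if doc_id in seen:
            continue  # the same document may come back from several searches
        seen.add(doc_id)
        if doc_id in traj.cited:
            tiers["evidence"].append(doc_id)
        elif doc_id in traj.opened:
            tiers["high_confusion"].append(doc_id)  # read but not cited
        else:
            tiers["low_confusion"].append(doc_id)   # returned but never opened
    return tiers


traj = Trajectory(
    retrieved=["d1", "d2", "d3", "d2", "d4"],
    opened={"d1", "d2"},
    cited={"d1"},
)
print(split_documents(traj))
```

Here `d2` was read but not cited, so it lands in the high-confusion tier, while `d3` and `d4` were never opened and become low-confusion distractors.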

## Core Innovation 2: Entity-Level Scoring Rewards and Process Supervision

### Scoring Reward Design
Core Idea: Use gold entities in the reasoning chain as checkpoints to evaluate whether the model cites correct evidence at each step:
1. The gold answer for a multi-hop question contains a sequence of key entities
2. Parse the model's answer to extract cited entities
3. Calculate the entity matching degree (fine-grained feedback)
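
One simple way to compute an entity matching degree is the fraction of gold entities that appear in the response. The sketch below uses normalized whole-word string matching, which is an assumption: the source does not describe the actual extraction or alignment method, and a real system would likely need alias handling and non-ASCII support that this example lacks.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace (ASCII-only sketch)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def entity_score(response, gold_entities):
    """Return the fraction of gold entities mentioned in the response (0.0 to 1.0)."""
    if not gold_entities:
        return 0.0
    # Pad with spaces so that only whole-word matches count.
    haystack = f" {normalize(response)} "
    hits = sum(1 for e in gold_entities if f" {normalize(e)} " in haystack)
    return hits / len(gold_entities)


gold = ["Pierre Curie", "Paris"]
print(entity_score("Her spouse was Pierre Curie, who was born in Paris.", gold))
print(entity_score("He was born in Paris.", gold))
```

The first response mentions both checkpoints on the chain and the second only one, so the second receives a lower score even if both reach the same final answer.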

### Positive-Only Strategy
- Scoring rewards are only applied to responses with correct final answers
- Responses with wrong answers only receive sparse correctness rewards (negative feedback)
- Scoring rewards are used to distinguish reasoning quality among correct answers

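This helps prevent reward hacking, because a response cannot earn process credit merely by mentioning gold entities while answering incorrectly, and it encourages quality competition among correct answers.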
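
The positive-only rule can be written as a small gating function. In this sketch the base reward of 1.0 for a correct answer, the reward of 0.0 for a wrong answer, and the weight `alpha` are all hypothetical values chosen for illustration; the source does not give the actual reward magnitudes or how the two signals are combined.

```python
def positive_only_reward(is_correct, score, alpha=0.5):
    """Combine outcome and entity-level rewards under a positive-only gate.

    `score` is the entity matching degree in [0, 1]; `alpha` is an assumed weight.
    """
    if not is_correct:
        # Wrong answers receive only the sparse outcome signal, regardless of
        # how many gold entities they mention.
        return 0.0
    return 1.0 + alpha * score


# (is_correct, entity score) for three sampled responses to the same question.
group = [(True, 1.0), (True, 0.5), (False, 1.0)]
rewards = [positive_only_reward(c, s) for c, s in group]
print(rewards)
```

The third response mentions every gold entity but answers incorrectly, so it earns nothing from the scoring term, while the two correct responses are ranked by how well their reasoning is grounded.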

### Advantages of Process Supervision
Compared to sparse outcome rewards, it provides:
1. Intermediate step feedback
2. Evidence quality evaluation
3. Encouragement for reasoning completeness
4. Interpretability (analysis of reasoning behavior)

## Experimental Results: Consistent Improvement in Long-Context Reasoning Ability

### Experimental Setup
- Model Scales: 4B, 7B, and 30B parameter reasoning LLMs
- Benchmarks: Five long-context reasoning benchmarks
- Baselines: Strong baselines like standard RLVR and supervised fine-tuning (SFT)

### Core Results
- **Consistent Performance Improvement**: Outperforms strong baselines across all model scales and benchmarks, with significant average improvements
- **Improved Reasoning Quality**: Reasoning is more comprehensive and evidence-based, omits less key information, and is less often misled by high-confusion distractors
- **Scale Generalization**: Advantages are maintained across different model scales, with strong universality

### Ablation Experiments
- **Value of Hierarchical Distractors**: Compared to random distractors, hierarchical distractors improve robustness to realistic distractors; without high-confusion distractors, performance drops on hard samples
- **Value of Scoring Rewards**: Compared to sparse rewards, scoring rewards improve reasoning quality; the positive-only strategy effectively prevents reward hacking
- **Synergy Effect**: Better results when combining data construction and reward design

## Application Value and Insights: Significance for Long-Context Applications and RLVR Research

### Long-Context Applications
Directly applicable to:
- Document QA systems (key information localization in legal, medical, and scientific documents)
- Multi-hop search (complex queries integrating multiple information sources)
- Evidence chain construction (scenarios requiring clear reasoning basis)

### Insights for RLVR Research
1. Data quality is crucial: The difficulty and authenticity of training data affect the upper limit of model capabilities
2. Value of process supervision: Fine-grained intermediate feedback is more effective than sparse outcome rewards
3. Prevent reward hacking: Strategies like positive-only maintain reasoning honesty

### AI Safety Implications
Fine-grained supervision via scoring rewards helps:
- Improve interpretability (analyze reasoning paths)
- Detect error patterns (identify common error types)
- Alignment verification (verify consistency between reasoning processes and expectations)

## Limitations, Future Directions, and Conclusion

### Limitations
1. Search Agent Limitations: Current basic agents are not optimal; stronger agents may improve trajectory quality
2. Entity Recognition Accuracy: Scoring rewards rely on accurate entity recognition and alignment, which may fail in complex texts
3. Domain Generalization: Experiments are mainly on general knowledge QA; specific domains (medical, legal) need adaptation
4. Computational Cost: Trajectory collection and distractor construction require large computational resources

### Future Directions
Future work includes optimizing the search agent, improving entity recognition accuracy, extending to more domains, and reducing computational cost.

### Conclusion
LongTraceRL significantly enhances long-context reasoning ability through innovative data construction and fine-grained process rewards, demonstrating the key role of training data design and reward engineering in RLVR. As LLMs are increasingly used in knowledge-intensive tasks, such methods can serve as a reliable technical foundation. We look forward to further progress that helps AI find reliable knowledge within massive amounts of information.