Zing Forum

RIEQE: Enhancing Translation Quality Estimation Capabilities of Large Reasoning Models via Synergistic Evolution of Implicit and Explicit Reasoning

The research team proposes the RIEQE two-stage training framework, which achieves the synergistic evolution of implicit and explicit reasoning through NonThinking-SFT and Thinking-RLVR training, and outperforms all baseline models on the WMT test set.

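Tags: Translation Quality Estimation, Large Reasoning Models, Implicit Reasoning, Explicit Reasoning, Reinforcement Learning, Machine Translation, Qwen, WMT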
Published 2026-05-29 22:47 | Recent activity 2026-06-01 12:01 | Estimated read 10 min

Section 01

[Introduction] RIEQE Framework: Enhancing Translation Quality Estimation Capabilities of Large Models via Synergistic Evolution of Implicit and Explicit Reasoning

Core Information

  • Research Outcome: The team proposes RIEQE, a two-stage training framework that achieves the synergistic evolution of implicit and explicit reasoning through NonThinking-SFT and Thinking-RLVR training, and that outperforms all baseline models on the WMT test set
  • Original Author/Source: arXiv submission (published on May 29, 2026), titled "Unlocking Fine-Grained Translation Quality Estimation in LRMs through Synergistically Evolving Implicit and Explicit Reasoning", link: http://arxiv.org/abs/2605.31378v1
  • Keywords: Translation Quality Estimation, Large Reasoning Models, Implicit Reasoning, Explicit Reasoning, Reinforcement Learning, Machine Translation, Qwen, WMT

The framework targets the performance bottleneck of Large Reasoning Models (LRMs) on fine-grained Translation Quality Estimation (QE) tasks, and strengthens the model by training the two reasoning modes to reinforce each other.


Section 02

Dilemmas and Problem Diagnosis of Translation Quality Estimation

Dilemmas

LRMs excel at reasoning tasks such as mathematical problem-solving and code generation, but they still underperform on fine-grained QE, even when they produce long reasoning chains. Fine-grained QE requires a model to evaluate translation quality without a reference translation, locate each error, and identify its type (lexical, grammatical, or semantic). This is crucial for post-editing and quality control.
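To make the task output concrete, here is a minimal sketch of what a fine-grained QE annotation could look like. The class names, the character-offset convention, and the example sentence pair are all assumptions made for illustration; the article does not specify the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ErrorSpan:
    # Hypothetical layout: character offsets into the translation.
    start: int        # inclusive
    end: int          # exclusive
    error_type: str   # e.g. "lexical", "grammatical", "semantic"


@dataclass
class QEAnnotation:
    source: str
    translation: str
    errors: List[ErrorSpan] = field(default_factory=list)

    def flagged_text(self) -> List[str]:
        """Return the substrings of the translation marked as errors."""
        return [self.translation[e.start:e.end] for e in self.errors]


# Illustrative example: "Hund" (dog) rendered as "cat"; the label is a guess.
example = QEAnnotation(
    source="Der Hund schläft.",
    translation="The cat sleeps.",
    errors=[ErrorSpan(start=4, end=7, error_type="lexical")],
)
print(example.flagged_text())
```

Note that no reference translation appears anywhere in the structure: the model only sees the source and the translation, which is what makes the task reference-free.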

Problem Diagnosis

The research team found that LRMs already have strong multilingual capabilities, so language ability is not the bottleneck. The core issue is the inherent complexity of the QE task: the model must handle the source language, the target language, and error analysis at the same time, which is difficult to learn directly. The proposed direction is therefore to reduce task complexity while making full use of the reasoning capabilities of LRMs.


Section 03

RIEQE Framework: Synergistic Evolution of Implicit and Explicit Reasoning

Core Innovations

The RIEQE framework cultivates the model's implicit and explicit reasoning capabilities and promotes their synergistic evolution through two-stage training:

  • Implicit Reasoning: Intuitive responses produced by the model's internal layers with no readable reasoning chain; efficient, but lacking interpretability
  • Explicit Reasoning: A readable, token-level reasoning chain; transparent and verifiable

Two-Stage Training Strategy

  1. NonThinking-SFT Stage: Complex QE tasks are decomposed into simple subtasks (e.g., error detection, position localization, type judgment). The model learns the input-output mapping directly, without reasoning chains, which strengthens its implicit reasoning capabilities
  2. Thinking-RLVR Stage: Reinforcement Learning with Verifiable Rewards (RLVR) encourages the model to generate detailed reasoning chains and to organize its thinking on top of the implicit foundation from the first stage. Both correct answers and the quality of the reasoning chain are rewarded
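
The difference between the two stages can be sketched at the data level. The snippet below is an illustration under stated assumptions, not the paper's pipeline: the function names, prompt wording, and `(start, end, error_type)` span layout are hypothetical. Stage one yields direct prompt/target pairs for each subtask; stage two only supplies a prompt, because the reasoning chain is generated by the model and scored by a reward.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, error_type): hypothetical layout


def nonthinking_sft_samples(source: str, translation: str,
                            spans: List[Span]) -> List[Dict[str, str]]:
    """Build direct input/output pairs (no reasoning chain) per subtask."""
    base = f"Source: {source}\nTranslation: {translation}\n"
    quoted = [translation[start:end] for start, end, _ in spans]
    samples = [
        # Subtask 1: error detection.
        {"prompt": base + "Does the translation contain an error? Answer yes or no.",
         "target": "yes" if spans else "no"},
        # Subtask 2: position localization.
        {"prompt": base + "Quote every erroneous span, separated by ' | '.",
         "target": " | ".join(quoted) if quoted else "none"},
    ]
    # Subtask 3: type judgment, one sample per annotated span.
    for text, (_, _, error_type) in zip(quoted, spans):
        samples.append({"prompt": base + f'What type of error is "{text}"?',
                        "target": error_type})
    return samples


def thinking_prompt(source: str, translation: str) -> str:
    """Stage-two prompt: the model writes its own chain, so no target is given."""
    return (f"Source: {source}\nTranslation: {translation}\n"
            "Reason step by step about the translation, then list each "
            "erroneous span with its error type.")


samples = nonthinking_sft_samples(
    "Der Hund schläft.", "The cat sleeps.", [(4, 7, "lexical")])
for sample in samples:
    print(sample["target"])
```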

Section 04

Empirical Evidence of Synergistic Evolution

Mutual Promotion Mechanism

  • Implicit reasoning provides a knowledge foundation for explicit reasoning, helping the model naturally convert intuition into reasoning chains
  • Explicit reasoning training strengthens implicit capabilities, making the model's understanding of QE task structure clearer

Experimental Verification

The RIEQE model, built on Qwen3-4B-Thinking-2507, achieves the following on the WMT test set:

  • Explicit reasoning performance surpasses all baseline models
  • Implicit reasoning capabilities are comparable to the current best encoder models

These results support the effectiveness of training the two reasoning modes together.

Section 05

Technical Details and Implementation Considerations

Task Decomposition Strategy

The work explores several decomposition methods:

  • Error type decomposition (lexical/syntactic/semantic-level evaluation)
  • Position decomposition (evaluate different parts of the translation)
  • Binary to multi-class decomposition (transition from good/bad classification to fine-grained scoring)
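
The three decomposition axes listed above could be sketched roughly as follows. This is an assumption-laden illustration: the helper names, the `(start, end, error_type)` span layout, the equal-width segmentation, and the example sentence are all hypothetical, and fine-grained scoring is left out because the article does not describe the scoring scheme.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, error_type): hypothetical layout


def by_error_type(spans: List[Span],
                  types=("lexical", "syntactic", "semantic")) -> Dict[str, List[Span]]:
    """Error type decomposition: one subtask per error category."""
    return {t: [s for s in spans if s[2] == t] for t in types}


def by_position(translation: str, spans: List[Span],
                n_segments: int = 2) -> List[List[Span]]:
    """Position decomposition: assign each span to the segment where it starts."""
    seg_len = max(1, -(-len(translation) // n_segments))  # ceiling division
    buckets: List[List[Span]] = [[] for _ in range(n_segments)]
    for span in spans:
        buckets[min(span[0] // seg_len, n_segments - 1)].append(span)
    return buckets


def binary_label(spans: List[Span]) -> str:
    """Coarsest level of the binary-to-multi-class progression."""
    return "bad" if spans else "good"


translation = "The cat sleep on sofa."
spans = [(4, 7, "lexical"), (17, 21, "syntactic")]
print(by_error_type(spans))
print(by_position(translation, spans))
print(binary_label(spans))
```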

Reward Design

The reward function in the RLVR stage considers:

  • Correctness of the final answer
  • Quality of the reasoning chain (logical coherence, step completeness, redundancy)
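
A reward combining these two signals might look like the sketch below. The article gives no formula, so everything here is an assumption: exact-match span F1 stands in for answer correctness, and a step-count and length heuristic stands in for chain quality. Logical coherence is not measured at all, since a simple heuristic cannot check it. The weights and thresholds are arbitrary placeholders.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, error_type): hypothetical layout


def answer_f1(predicted: Set[Span], gold: Set[Span]) -> float:
    """Exact-match F1 between predicted and gold error spans."""
    if not predicted and not gold:
        return 1.0  # correctly predicting "no errors"
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def chain_quality(reasoning: str, min_steps: int = 2, max_words: int = 200) -> float:
    """Crude proxy: enough non-empty steps (completeness), not too long (redundancy)."""
    steps = [line for line in reasoning.splitlines() if line.strip()]
    completeness = min(1.0, len(steps) / min_steps)
    n_words = len(reasoning.split())
    brevity = 1.0 if n_words <= max_words else max_words / n_words
    return 0.5 * completeness + 0.5 * brevity


def reward(predicted: Set[Span], gold: Set[Span], reasoning: str,
           chain_weight: float = 0.2) -> float:
    """Weighted sum of answer correctness and chain quality (weights assumed)."""
    return ((1 - chain_weight) * answer_f1(predicted, gold)
            + chain_weight * chain_quality(reasoning))


gold = {(4, 7, "lexical")}
chain = "Step 1: compare the nouns.\nStep 2: 'cat' does not match 'Hund'."
print(reward(gold, gold, chain))
```

Keeping the answer term dominant reflects the "verifiable" part of RLVR: the final answer can be checked against gold annotations, whereas any chain-quality score is only a heuristic.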

Training Efficiency

The two-stage method is more efficient than end-to-end long reasoning chain training: the first stage (supervised learning) converges quickly, and the second stage (RLVR) is easier to train stably due to good initialization.


Section 06

New Insights into the Capability Boundaries of LRMs

Key Insights

  1. Impact of Task Complexity: LRMs may underperform on inherently complex tasks, so model evaluation should account for the structure of the task
  2. Complementarity of Reasoning Modes: Implicit and explicit reasoning each have their own value, and future LRMs will need to switch between the two modes flexibly
  3. Refined Training Strategies: Training tailored to a specific task can be more effective than simply scaling up model size

Research Conclusion

The RIEQE framework successfully unlocks the potential of LRMs in fine-grained QE tasks, deepens the understanding of LRM capability characteristics and training methods, and provides insights for model performance improvement.


Section 07

Application Prospects and Expansion Directions

Cross-Domain Applications

  • NLP Tasks: Complex multi-dimensional tasks such as summarization quality evaluation, dialogue system evaluation, and code review
  • Multimodal Tasks: Evaluation that integrates visual and language information
  • Educational Applications: Intelligent teaching assistants that quickly judge whether an answer is correct and then provide a detailed explanation

The methodology may transfer to other scenarios that require complex reasoning.


Section 08

Limitations and Future Work

Limitations

The current task decomposition relies on manual design, which limits its generality.

Future Directions

  1. Explore automated task decomposition methods
  2. Integrate more reasoning modes
  3. Improve cross-language transfer capabilities

The research team will continue to optimize the framework and expand its application scope.