MeasHalu: A Framework to Mitigate Scientific Measurement Hallucinations in Large Language Models via Enhanced Reasoning

The MeasHalu framework, developed by the team at the Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences, mitigates hallucinations when large language models extract measurement information from scientific text. It does so through a fine-grained hallucination taxonomy, reasoning-aware fine-tuning, and progressive reward curriculum optimization. On the MeasEval benchmark, MeasHalu-7B reaches an F1 score of 0.512, close to the competition champion's 0.519.

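Tags: AI for Science, Large Language Models, Hallucination Mitigation, Scientific Literature Understanding, Measurement Data Extraction, ACL 2026, Reinforcement Learning, Reasoning Optimization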
Published 2026-06-12 00:45 | Recent activity 2026-06-12 00:53 | Estimated read 6 min

Section 01

Introduction: The MeasHalu Framework, a New Approach to Mitigating Scientific Measurement Hallucinations in Large Language Models

The team at the Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences, has released MeasHalu, a framework for reducing hallucinations when large language models extract measurement information from scientific literature. It combines a fine-grained hallucination taxonomy, reasoning-aware fine-tuning, and progressive reward curriculum optimization. A 7B model trained this way performs close to the competition champion on the MeasEval benchmark, which makes it a notable result for AI for Science.

Section 02

Background: Challenges and Impacts of Scientific Measurement Hallucinations

In AI for Science, extracting measurement data from scientific literature is a core requirement. Large language models, however, often hallucinate during extraction, producing incorrect quantities, units, modifiers, or relationships, which undermines the reliability of automated literature understanding. The problem affects basic research and can also create safety risks, such as failed chemical experiments and errors in drug development, so it is an urgent challenge for the field.

Section 03

Core Innovative Methods of the MeasHalu Framework

The MeasHalu framework has three core innovations:

  1. Fine-grained Hallucination Taxonomy: Classifies measurement hallucinations into four categories (quantity errors, unit errors, modifier errors, and relationship errors) for targeted correction;
  2. Two-stage Reasoning-aware Fine-tuning: The first stage uses supervised fine-tuning to learn correct extraction patterns, while the second stage applies reinforcement learning to optimize complex reasoning decisions;
  3. Progressive Reward Curriculum Optimization: Type-specific penalties increase with training difficulty to enhance reasoning stability.
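
The taxonomy and the progressive penalties described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the base penalty values, the linear ramp, the use of training progress as a stand-in for difficulty, and the names `BASE_PENALTY`, `penalty_scale`, and `reward` are all assumptions made for the example.

```python
from enum import Enum


class HallucinationType(Enum):
    """The four error categories named in the taxonomy."""
    QUANTITY = "quantity"
    UNIT = "unit"
    MODIFIER = "modifier"
    RELATIONSHIP = "relationship"


# Hypothetical base penalties per error type (illustrative values only).
BASE_PENALTY = {
    HallucinationType.QUANTITY: 1.0,
    HallucinationType.UNIT: 0.8,
    HallucinationType.MODIFIER: 0.5,
    HallucinationType.RELATIONSHIP: 0.7,
}


def penalty_scale(step: int, total_steps: int,
                  start: float = 0.25, end: float = 1.0) -> float:
    """Linearly ramp the penalty multiplier from `start` to `end`.

    Assumption: training progress stands in for curriculum difficulty.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def reward(errors: list[HallucinationType], step: int, total_steps: int) -> float:
    """Start from 1.0 and subtract the scaled penalty for each detected error."""
    scale = penalty_scale(step, total_steps)
    return 1.0 - scale * sum(BASE_PENALTY[e] for e in errors)


# The same unit error costs more late in training than early on.
early = reward([HallucinationType.UNIT], step=0, total_steps=100)
late = reward([HallucinationType.UNIT], step=100, total_steps=100)
```

Under these assumed values, a single unit error is penalized lightly at the start of training and at full weight by the end, which is one simple way to realize penalties that grow as the curriculum advances.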

Section 04

Experimental Results: Performance Validation of MeasHalu

MeasEval Benchmark Performance

| Model | F1 Score |
| --- | --- |
| MeasHalu-7B | 0.512 |
| LIORI (Competition Champion) | 0.519 |
| GPT-5 (Optimized Prompt) | 0.406 |
| Gemini-2.5-Pro (Optimized Prompt) | 0.440 |
| CONNER | 0.473 |

MeasHalu-7B trails the competition champion by only 0.007 F1 and outperforms GPT-5 with an optimized prompt by 0.106 F1, that is, by more than 10 points.
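
For readers unfamiliar with the metric, the sketch below shows how an F1 score is computed for an extraction task. It assumes exact-match comparison between predicted and gold annotation tuples; the official MeasEval scorer may apply different matching rules, and the function name `f1_score` and the sample annotations are hypothetical.

```python
def f1_score(predicted: set, gold: set) -> float:
    """Exact-match F1 between predicted and gold annotation tuples."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one sentence; the last prediction has a unit error.
gold = {("quantity", "100 mg"), ("unit", "mg"), ("quantity", "80 °C"), ("unit", "°C")}
predicted = {("quantity", "100 mg"), ("unit", "mg"), ("quantity", "80 °C"), ("unit", "K")}

score = f1_score(predicted, gold)  # precision = recall = 3/4
```

A single hallucinated unit lowers both precision and recall here, which is why measurement hallucinations show up directly in the benchmark score.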

Fine-grained Entropy Analysis

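| Semantic Role | Entropy Reduction | Peak Ratio Reduction |
| --- | --- | --- |
| Quantity | ↓52.1% | Minimal fluctuation |
| Relationship | ↓42.7% | ↓56.8% |

Entropy falls substantially for both semantic roles, which indicates that the model's reasoning is considerably more stable.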
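
As a rough illustration of what an entropy reduction measures, the sketch below computes Shannon entropy for two token distributions and the relative drop between them. The source does not specify how entropy or the peak ratio is calculated, so the distributions and the helper names `shannon_entropy` and `relative_reduction` are assumptions for the example.

```python
import math


def shannon_entropy(probs: list[float]) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.5 means a 50% reduction)."""
    return (before - after) / before


# Hypothetical next-token distributions over four candidate values.
uncertain = [0.25, 0.25, 0.25, 0.25]   # no preference among candidates
confident = [0.5, 0.25, 0.125, 0.125]  # probability mass is more concentrated

drop = relative_reduction(shannon_entropy(uncertain), shannon_entropy(confident))
```

Lower entropy means the model spreads less probability over competing answers, so a large reduction is consistent with more decisive and stable extraction.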

Section 05

Application Scenarios and Academic Contributions

Embodied Intelligence Applications

MeasHalu can generate execution sequences from experimental text:

Input: "Heat 100mg sample to 80°C"
Output: ADD(100 mg), HEAT(80°C)

This capability supports automated laboratories and intelligent research assistants.
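
To make the input and output format concrete, here is a toy rule-based converter that reproduces the example above. It is not how MeasHalu works (the framework uses a fine-tuned language model); the regular expressions, the `ACTIONS` mapping, and the function `to_actions` are hypothetical and cover only this one sentence pattern.

```python
import re

# Hypothetical verb-to-action mapping; only "heat" is covered here.
ACTIONS = {"heat": "HEAT"}

# Simple patterns for a mass such as "100mg" and a temperature such as "80°C".
MASS = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|kg)\b")
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*°C")


def to_actions(instruction: str) -> list[str]:
    """Turn one instruction into an ordered list of action strings."""
    steps = []
    mass = MASS.search(instruction)
    if mass:
        # Normalize "100mg" to "100 mg" in the emitted action.
        steps.append(f"ADD({mass.group(1)} {mass.group(2)})")
    words = instruction.split()
    verb = words[0].lower() if words else ""
    temperature = TEMPERATURE.search(instruction)
    if verb in ACTIONS and temperature:
        steps.append(f"{ACTIONS[verb]}({temperature.group(1)}°C)")
    return steps


sequence = to_actions("Heat 100mg sample to 80°C")
```

A rule-based version like this breaks as soon as the phrasing changes, which illustrates why reliable quantity and unit extraction by a model matters for driving laboratory automation.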

Academic Recognition and Open Source

The work has been accepted to the Findings of ACL 2026. The code, model, and dataset are open source (GitHub: https://github.com/CAS-SIAT-XinHai/MeasHalu). MeasHalu will serve as a core component of the MeasureMine framework, and the team plans to release the MeasBench benchmark later.

Section 06

Technical Insights and Future Outlook

Technical Insights

  1. Value of problem decomposition: A fine-grained classification of errors makes correction more targeted;
  2. Importance of process supervision: Supervising the reasoning process, not only the final answer, improves stability;
  3. Necessity of domain optimization: General-purpose models need adaptation to scientific fields.

Future Outlook

Specialized frameworks like MeasHalu will advance AI for Science. The team plans to follow up with the comprehensive MeasBench benchmark, with the goal of building more reliable scientific AI systems.