Detecting Right-Answer Wrong-Reason: Identifying the 'Correct Answer but Wrong Reason' Behavior in Open-Source Reasoning Models

This is a complete research framework for detecting 'shortcut-driven reasoning' in open-weight reasoning models. It combines behavioral testing with mechanistic interpretability methods to evaluate whether a model reaches a correct answer through genuine reasoning or through superficial shortcuts, and it offers a systematic tool for understanding and improving the reasoning capabilities of small models.

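Tags: Large Language Models, Reasoning Models, Interpretability, Open-Source Models, Cognitive Bias, Mechanistic Interpretation, Model Evaluation, Chain-of-Thought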
Published 2026-05-31 20:36 | Recent activity 2026-05-31 20:53 | Estimated read 6 min

Section 01

[Introduction] Analysis of the Research Framework for the 'Correct Answer but Wrong Reason' Phenomenon in Open-Source Reasoning Models

This study builds a complete framework for detecting 'shortcut-driven reasoning' (a correct answer reached for the wrong reason) in open-weight reasoning models. The framework combines behavioral testing with mechanistic interpretability methods to evaluate whether a model obtains correct answers through genuine reasoning or through superficial shortcuts. Key finding: in small models with fewer than 2 billion parameters, reasoning failures stem mainly from 'confused reasoning' rather than 'shortcut dependence'. The framework therefore serves as a systematic tool for understanding and improving the reasoning capabilities of small models.

Section 02

Research Background and Core Issues

As large language models become more capable, the community has focused on a key question: when a model gives a correct answer, did it get there through valid reasoning or by relying on a shortcut? The 'correct answer but wrong reason' phenomenon covers cases where the model outputs the correct answer while its reasoning process is fundamentally flawed, for example because it ignores key information or relies on superficial statistical correlations. It is more common in small open-source models. This project aims to build a pipeline that detects and quantifies the phenomenon systematically.

Section 03

Research Methods and Framework Design

The project has a modular design with three main parts:

  • Project Architecture: a data layer (raw, processed, and labeled data), a source code layer (model tools plus evaluation, analysis, and interpretability modules), and a results layer (scores, reports, and charts).
  • Benchmark Dataset: 19 cognitive questions × 3 conditions, 57 items in total. The conditions are Clean (no interference), Hinted (a hint that points to the correct answer), and Misleading (a hint that points to a wrong answer). Comparing performance across the three conditions indicates how much a model depends on shortcuts.
  • Audit Scoring System: a weighted score over four dimensions: Clean Accuracy (0.2), Misleading Resistance (0.3), Reasoning Faithfulness (0.3), and Mechanistic Consistency (0.2).
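The four-dimensional weighted scoring can be sketched as a plain weighted sum. This is a minimal illustration, not the project's actual code: the weights come from the text, while the function name `audit_score`, the dictionary keys, the assumed 0 to 100 scale for each dimension, and the example values are all hypothetical.

```python
# Weights as stated in the text; they sum to 1.0.
WEIGHTS = {
    "clean_accuracy": 0.2,
    "misleading_resistance": 0.3,
    "reasoning_faithfulness": 0.3,
    "mechanistic_consistency": 0.2,
}


def audit_score(components):
    """Return the weighted sum of the four dimension scores.

    Assumption: every dimension is already expressed on a 0-100 scale.
    The source does not say how each dimension is measured.
    """
    missing = sorted(set(WEIGHTS) - set(components))
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {components[name]}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical dimension scores, invented purely for illustration.
example = {
    "clean_accuracy": 50.0,
    "misleading_resistance": 40.0,
    "reasoning_faithfulness": 60.0,
    "mechanistic_consistency": 30.0,
}
print(round(audit_score(example), 1))  # 46.0
```

Because Misleading Resistance and Reasoning Faithfulness carry 0.3 each, a model with low raw accuracy can still earn a mid-range score if it resists bad hints and its stated reasoning matches its answers.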

Section 04

Model Test Results and Key Findings

Four small open-source models were tested: Qwen2.5-1.5B (47.4 points), Qwen2.5-0.5B (43.3), SmolLM-135M (43.3), and TinyLlama-1.1B (37.6). Key findings:

  1. Qwen2.5-1.5B reaches only 15.8% accuracy under the Clean condition, and the other models score lower;
  2. Even on questions they answer correctly, the models are vulnerable to misleading prompts in 100% of cases;
  3. 81-82% of failure cases are due to 'confusion' rather than shortcut dependence, which challenges the assumption that small models 'cheat'.
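The clean-accuracy and misleading-vulnerability figures above can be computed from per-question outcomes roughly as follows. This is a sketch under stated assumptions: the `ItemResult` record, both function names, and the reading of 'vulnerability' as the share of Clean-correct questions that turn wrong under the Misleading condition are hypothetical, and the data is synthetic rather than the project's results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ItemResult:
    """Outcome for one question under two of the three conditions (hypothetical schema)."""
    question_id: str
    clean_correct: bool
    misleading_correct: bool


def clean_accuracy(results: List[ItemResult]) -> float:
    """Fraction of questions answered correctly under the Clean condition."""
    return sum(r.clean_correct for r in results) / len(results)


def misleading_vulnerability(results: List[ItemResult]) -> Optional[float]:
    """Among Clean-correct questions, the fraction answered wrongly once misled.

    Returns None when no question was answered correctly under Clean,
    because the rate is undefined in that case.
    """
    solved = [r for r in results if r.clean_correct]
    if not solved:
        return None
    flipped = sum(not r.misleading_correct for r in solved)
    return flipped / len(solved)


# Synthetic data: 19 questions, 3 correct under Clean, all wrong once misled.
results = [
    ItemResult(f"q{i:02d}", clean_correct=(i < 3), misleading_correct=False)
    for i in range(19)
]
print(round(100 * clean_accuracy(results), 1))  # 15.8
print(misleading_vulnerability(results))        # 1.0
```

With only 19 questions per condition, each question moves clean accuracy by about 5 percentage points, so small differences between models should be read with caution.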

Section 05

Mechanistic Interpretability Analysis

The framework examines model internals with three methods:

  • Activation Extraction: compares activation patterns at different layers to identify differences in neural activity between correct and incorrect reasoning;
  • Sparse Autoencoder Analysis: extracts interpretable features that describe the structure of the model's internal representations;
  • Activation Patching: a causal intervention that tests how activations at a specific layer affect the output, which helps locate the components that matter for reasoning.
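The idea behind activation patching can be shown on a toy network. This is a pure-Python sketch, not the project's implementation: the three-layer model, its weights, and the names `forward` and `patching_effect` are invented for illustration. In practice the same procedure would be applied to the hidden states of a real language model. The steps are to cache activations from a 'clean' run, rerun a 'corrupted' input with one cached activation swapped in, and measure how much of the clean output is restored.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def matvec(weights: List[Vector], x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def make_toy_model() -> List[Layer]:
    """A hypothetical 3-layer network standing in for a real model."""
    w1 = [[1.0, 0.0], [0.0, 1.0]]
    w2 = [[1.0, -1.0], [0.5, 0.5]]
    w3 = [[1.0, 0.0]]
    return [
        lambda x: relu(matvec(w1, x)),
        lambda x: relu(matvec(w2, x)),
        lambda x: matvec(w3, x),
    ]


def forward(
    layers: List[Layer],
    x: Vector,
    patch: Optional[Tuple[int, int, float]] = None,
) -> Tuple[Vector, List[Vector]]:
    """Run the model and cache every layer's activations.

    If patch = (layer_idx, unit_idx, value), overwrite that single unit
    before passing the activations on to the next layer.
    """
    cache: List[Vector] = []
    h = x
    for i, layer in enumerate(layers):
        h = list(layer(h))
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]
        cache.append(h)
    return h, cache


def patching_effect(layers, clean_x, corrupted_x, layer_idx, unit_idx) -> float:
    """Fraction of the clean-vs-corrupted output gap restored by one patch."""
    clean_out, clean_cache = forward(layers, clean_x)
    corrupted_out, _ = forward(layers, corrupted_x)
    patch = (layer_idx, unit_idx, clean_cache[layer_idx][unit_idx])
    patched_out, _ = forward(layers, corrupted_x, patch)
    gap = clean_out[0] - corrupted_out[0]
    return 0.0 if gap == 0 else (patched_out[0] - corrupted_out[0]) / gap


model = make_toy_model()
clean, corrupted = [2.0, 0.0], [0.0, 2.0]
for layer_idx, unit_idx in [(0, 0), (1, 0), (1, 1)]:
    print(layer_idx, unit_idx, patching_effect(model, clean, corrupted, layer_idx, unit_idx))
```

In this toy setup only unit 0 of layer 1 fully restores the clean output (effect 1.0), while the other patched units restore nothing (effect 0.0). Localization of this kind is what activation patching is meant to provide.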

Section 06

Application Value, Limitations, and Future Directions

  • Application Value: gives researchers and developers guidance on model selection, directions for improvement, and a safety assessment tool; the open-source community can rerun the tests on new models.
  • Limitations: the test set is small (57 entries), the questions are in English only, and automatic annotation may misjudge some cases.
  • Future Work: expand the dataset to cover more reasoning types, manually review and calibrate the annotations, and explore training methods designed for small models that improve reasoning faithfulness.