Nine Mainstream Large Models Give Nine Different Answers to the Same Working Hours Calculation Problem: AI Logical Consistency Benchmark Test Reveals Striking Disparities

A benchmark test on nine mainstream large models including GPT, Claude, Gemini, and Qwen shows that the same simple working hours calculation problem yielded completely opposite conclusions—ranging from "the company owes the employee 160 hours" to "the employee owes the company 48 hours"—exposing the severe inconsistency of current large models in logical reasoning and arithmetic calculation.

Tags: large models, logical reasoning, benchmark testing, AI consistency, GPT, Claude, Gemini, Qwen, DeepSeek, arithmetic calculation

Published 2026-04-23 23:13 | Recent activity 2026-04-23 23:55 | Estimated read 9 min

Section 01

Introduction: Nine Mainstream Large Models Show Striking Disparities in Working Hours Calculation Results; Severe Defects in Logical Consistency

A benchmark test of nine mainstream large models, including GPT, Claude, Gemini, and Qwen, found that the same simple working hours calculation problem produced opposite conclusions, ranging from "the employee owes the company 48 hours" to "the company owes the employee 160 hours". The result exposes severe inconsistency in how current large models handle logical reasoning and arithmetic. The test was based on a real work scenario, and its results are a warning for enterprises and individuals who rely on AI for critical decisions.

Section 02

Background: AI Applications in Critical Scenarios Are Increasing, but Simple Calculations Expose Consistency Issues

Large language models are increasingly used for calculation, legal reasoning, and human resources consulting. However, the latest benchmark test shows that, when given a working hours calculation problem involving only basic arithmetic and clear rules, nine mainstream large models gave nine different answers. Even the direction of the result (who owes whom) was inconsistent. The test question was not a trick question; it was based on a real work scenario, and the results indicate that large models have significant defects in logical consistency.

Section 03

Test Design and Methodology: Covering Nine Mainstream Models, Based on Real Working Hours Scenarios

The test used a standard working hours calculation problem involving parameters such as annual standard working days, monthly salary benchmark (21.75 days/month), actual working days, and unused paid leave, requiring calculation of the working hours settlement result when an employee resigns. The nine models covered include:

OpenAI: GPT 5.4 (Deep Thinking)
Anthropic: Claude Opus 4.7, Claude Sonnet 4.6
Google: Gemini 3.1 Pro
Alibaba: Qwen3 Max (Thinking), Qianwen3.5
ByteDance: Doubao (Super Mode / Regular Mode)
DeepSeek: DeepSeek (Expert Mode)

All tests were conducted in April 2026 so that the results reflect current model versions.

Section 04

Test Results: Nine Answers with Huge Disparities, Obvious Directional Divisions

The nine models gave nine different answers, with values ranging from -48 hours (employee owes company) to +160 hours (company owes employee), a span of 208 hours. Comparison of conclusions from each model:

Model	Conclusion	Result (hours)
GPT 5.4 (Deep Thinking)	Employee owes company	8
Claude Opus 4.7	Company owes employee	32
Claude Sonnet 4.6	Employee owes company	8
Gemini 3.1 Pro	Company owes employee	80
Qwen3 Max (Thinking)	Company owes employee	48
Qianwen3.5	Company owes employee	160*
Doubao (Super Mode)	Company owes employee	40
Doubao (Regular Mode)	Employee owes company	48
DeepSeek (Expert Mode)	Company owes employee	40

*Note: Qianwen3.5 initially gave 160 hours, then self-corrected to 96 hours within the same response.
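The summary figures can be re-derived from the table. The sketch below transcribes the results using the sign convention from the text (positive means the company owes the employee, negative means the employee owes the company) and recomputes the span and the directional split. The variable names are illustrative and not part of the benchmark project.

```python
from collections import Counter

# Signed results transcribed from the table:
# positive = company owes employee, negative = employee owes company.
results = {
    "GPT 5.4 (Deep Thinking)": -8,
    "Claude Opus 4.7": 32,
    "Claude Sonnet 4.6": -8,
    "Gemini 3.1 Pro": 80,
    "Qwen3 Max (Thinking)": 48,
    "Qianwen3.5": 160,  # first figure given; later self-corrected to 96
    "Doubao (Super Mode)": 40,
    "Doubao (Regular Mode)": -48,
    "DeepSeek (Expert Mode)": 40,
}

values = list(results.values())

# Distance between the most favorable and least favorable result.
span = max(values) - min(values)

# How many models landed on each side of the "who owes whom" question.
directions = Counter(
    "company owes" if v > 0 else "employee owes" for v in values
)

# Unique signed figures appearing in the table.
distinct = sorted(set(values))

print(span)        # 208
print(directions)  # 6 "company owes", 3 "employee owes"
print(distinct)    # 7 unique signed values
```

Going by the headline figures alone, the table as transcribed contains seven distinct signed values: 8 hours owed by the employee and 40 hours owed by the company each appear twice. The table does not show how those matching responses differed in their reasoning.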

Section 05

Core Problem Analysis: Directional Divisions and Differences in Basic Assumptions Are the Main Causes

Directional Divisions: Six models concluded that the company owes the employee, while three concluded that the employee owes the company. The models disagreed not only on the amount but on the basic logical relationship in the problem.
Differences in Basic Assumptions: The models used different annual standard working days (248, 250, or 261 days), different calculation benchmarks (21.75 days/month or actual working days), and different treatments of unused paid leave. These choices directly produced the discrepancies in the results.
Self-Contradiction Phenomenon: Qianwen3.5 first gave 160 hours and then corrected it to 96 hours in the same response, exposing instability in the model's reasoning process.
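To see how far these assumptions alone can move the answer, consider a minimal sketch. The benchmark's actual question and settlement rule are not reproduced in this article, so the formula and the input numbers below are hypothetical: expected days are prorated from the annual standard, unused leave is either credited or ignored, and a working day is assumed to be 8 hours. Note that 261 / 12 = 21.75, so the 21.75 days/month benchmark corresponds to the 261-day year.

```python
HOURS_PER_DAY = 8  # assumption: standard 8-hour working day


def settlement_hours(days_worked, unused_leave_days, months_employed,
                     annual_standard_days, credit_unused_leave=True):
    """Signed balance in hours; positive means the company owes the employee.

    Hypothetical rule for illustration only, not the benchmark's own.
    """
    # Days the employee was expected to work, prorated from the annual figure.
    expected_days = annual_standard_days * months_employed / 12
    # Days counted in the employee's favor.
    credited_days = days_worked
    if credit_unused_leave:
        credited_days += unused_leave_days
    return (credited_days - expected_days) * HOURS_PER_DAY


# Hypothetical employee: NOT the inputs used in the benchmark.
case = dict(days_worked=125, unused_leave_days=3, months_employed=6)

for annual_days in (248, 250, 261):
    for credit in (True, False):
        hours = settlement_hours(
            **case,
            annual_standard_days=annual_days,
            credit_unused_leave=credit,
        )
        print(f"{annual_days} days/year, leave credited={credit}: {hours:+.0f} h")
```

With leave credited, these made-up inputs give +32 hours under a 248-day year, +24 hours under 250 days, and -20 hours under 261 days. The direction of the debt flips purely because of the chosen baseline, which is the pattern the test observed across models.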

Section 06

Underlying Causes: Jointly Caused by Training Data, Model Architecture, and Prompt Sensitivity

Limitations of Training Data: Training data comes largely from the internet and contains a large amount of unstructured and contradictory information, so models are affected by noise when handling precise calculations and logical reasoning.
Differences in Reasoning Ability: Different model architectures and training methods have different focuses. Some emphasize pattern matching, while others have stronger symbolic reasoning capabilities, and these differences are amplified in complex reasoning scenarios.
Prompt Engineering Sensitivity: Minor differences in wording may lead a model down a different reasoning path, which undermines reliability in practical applications.

Section 07

Industry Insights and Recommendations: Establish Manual Review, Choose Models Rationally

Enterprise Risk Warning: Large models are not yet sufficient to handle precise logical reasoning tasks independently; strict manual review is required in critical business scenarios.
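Model Selection Reference: In this test, Claude Opus 4.7, DeepSeek (Expert Mode), and Doubao (Super Mode) were the most consistent with one another: all three concluded that the company owes the employee, with results of 32, 40, and 40 hours. Verification against the specific scenario is still needed.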
Future Improvement Direction: The test project is open-source (MIT license), and the community is welcome to contribute test cases to promote model improvement.
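One way to put the manual review recommendation into practice is to gate model outputs behind an automatic disagreement check. The sketch below is an assumption about how such a gate might look, not part of the test project: `needs_manual_review`, its parameters, and the model labels are hypothetical names. It escalates to a human whenever the answers disagree on direction, differ by more than a tolerance, or deviate from a deterministic reference calculation when one is available.

```python
def needs_manual_review(model_answers, reference_hours=None, tolerance=0.0):
    """Return True when signed model answers should go to a human reviewer.

    model_answers: mapping of model label -> signed hours
                   (positive = company owes employee).
    reference_hours: optional result from a deterministic calculation.
    tolerance: maximum acceptable difference in hours.
    """
    values = list(model_answers.values())
    if not values:
        return True  # nothing to compare, so a human must decide

    # Disagreement on direction (who owes whom) always escalates.
    signs = {(v > 0) - (v < 0) for v in values}
    if len(signs) > 1:
        return True

    # Same direction, but the amounts are too far apart.
    if max(values) - min(values) > tolerance:
        return True

    # All models agree with each other but not with the reference.
    if reference_hours is not None:
        if any(abs(v - reference_hours) > tolerance for v in values):
            return True

    return False


print(needs_manual_review({"model_a": 40, "model_b": 40}))
print(needs_manual_review({"model_a": -8, "model_b": 32}))
print(needs_manual_review({"model_a": 40, "model_b": 40}, reference_hours=32))
```

Agreement between models is treated here as necessary but not sufficient. Two models can share the same mistaken assumption, which is why the reference calculation is checked separately.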

Section 08

Conclusion: Treat AI Boundaries Rationally, Critical Decisions Require Verification

Large models have made significant progress in natural language understanding and generation, but they still have shortcomings in logical reasoning and precise calculation. It is unrealistic to treat AI as a one-size-fits-all solution; we need to understand its capability boundaries and establish usage norms. Before relying on AI for critical decisions, developers and users must fully verify and test, and not blindly trust the output.
