# Nine Mainstream Large Models Give Nine Different Answers to the Same Working Hours Calculation Problem: AI Logical Consistency Benchmark Test Reveals Striking Disparities

> A benchmark test on nine mainstream large models including GPT, Claude, Gemini, and Qwen shows that the same simple working hours calculation problem yielded completely opposite conclusions—ranging from "the company owes the employee 160 hours" to "the employee owes the company 48 hours"—exposing the severe inconsistency of current large models in logical reasoning and arithmetic calculation.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-23T15:13:09.000Z
- Last activity: 2026-04-23T15:55:19.891Z
- Heat score: 163.3
- Keywords: large models, logical reasoning, benchmark testing, AI consistency, GPT, Claude, Gemini, Qwen, DeepSeek, arithmetic calculation
- Page URL: https://www.zingnex.cn/en/forum/thread/llm-ai-4692eb09
- Canonical: https://www.zingnex.cn/forum/thread/llm-ai-4692eb09
- Markdown source: floors_fallback

---

## 【Introduction】Nine Mainstream Large Models Show Striking Disparities in Working Hours Calculation Results; Severe Defects in Logical Consistency

A benchmark test on nine mainstream large models including GPT, Claude, Gemini, and Qwen shows that the same simple working hours calculation problem yielded completely opposite conclusions—ranging from "the employee owes the company 48 hours" to "the company owes the employee 160 hours"—exposing the severe inconsistency of current large models in logical reasoning and arithmetic calculation. This test was based on a real work scenario, and its results serve as an important warning for enterprises and individuals relying on AI for critical decision-making.

## Background: AI Applications in Critical Scenarios Are Increasing, but Simple Calculations Expose Consistency Issues

Large language models are increasingly used in scenarios such as calculation, legal reasoning, and human resources consulting. However, the latest benchmark test reveals that when faced with a working hours calculation problem involving basic arithmetic and clear rules, nine mainstream large models gave nine different answers—even the direction of the result (who owes whom) was inconsistent. The test question was not a trick question but based on a real work scenario, and the results indicate that large models have significant defects in logical consistency.

## Test Design and Methodology: Covering Nine Mainstream Models, Based on Real Working Hours Scenarios

The test used a standard working hours calculation problem involving parameters such as annual standard working days, monthly salary benchmark (21.75 days/month), actual working days, and unused paid leave, requiring calculation of the working hours settlement result when an employee resigns. The nine models covered include:
- OpenAI: GPT 5.4 (Deep Thinking)
- Anthropic: Claude Opus 4.7, Claude Sonnet 4.6
- Google: Gemini 3.1 Pro
- Alibaba: Qwen3 Max (Thinking), Qianwen3.5
- ByteDance: Doubao (Super Mode/Regular Mode)
- DeepSeek: DeepSeek (Expert Mode)
All tests were conducted in April 2026 to ensure the timeliness of model versions.
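To make the source of disagreement concrete, here is a minimal sketch of the kind of settlement arithmetic the test question involves. The exact benchmark prompt is not reproduced in this post, so the formula, the 8-hour day, and the sample figures below are illustrative assumptions, not the test's answer key:

```python
# A sketch of a working-hours settlement at resignation. Sign convention:
# positive result = company owes the employee; negative = employee owes.
# The formula and figures are assumptions for illustration only.

HOURS_PER_DAY = 8          # assumed standard working day
MONTHLY_BENCHMARK = 21.75  # paid-days-per-month benchmark from the test parameters

def settlement_hours(actual_working_days: float,
                     expected_working_days: float,
                     unused_leave_days: float) -> float:
    """Hours owed at resignation under the assumed rule above.

    expected_working_days can come from the annual standard (e.g. 250 days
    prorated to the resignation date) or from MONTHLY_BENCHMARK * months --
    exactly the kind of choice on which the nine models diverged.
    """
    surplus_days = actual_working_days - expected_working_days + unused_leave_days
    return surplus_days * HOURS_PER_DAY

# Two plausible benchmarks for a hypothetical 4-month tenure give different answers:
expected_from_monthly = MONTHLY_BENCHMARK * 4   # 87.0 expected days
expected_from_annual = 250 / 12 * 4             # ~83.3 expected days
print(settlement_hours(85, expected_from_monthly, 5))  # 24.0 (company owes)
print(settlement_hours(85, expected_from_annual, 5))   # ~53.3 (company owes)
```

Even with identical inputs, the choice of benchmark alone moves the answer by roughly 29 hours here, which previews the divergence documented in the results below.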

## Test Results: Nine Answers with Huge Disparities, Obvious Directional Divisions

Answers from the nine model configurations diverged widely, with signed values ranging from -48 hours (employee owes the company) to +160 hours (company owes the employee), a span of 208 hours. Comparison of conclusions from each model:
| Model | Conclusion | Calculation Result |
|---|---|---|
| GPT 5.4 (Deep Thinking) | Employee owes company | 8 hours |
| Claude Opus 4.7 | Company owes employee | 32 hours |
| Claude Sonnet 4.6 | Employee owes company | 8 hours |
| Gemini 3.1 Pro | Company owes employee | 80 hours |
| Qwen3 Max (Thinking) | Company owes employee | 48 hours |
| Qianwen3.5 | Company owes employee | 160 hours* |
| Doubao (Super Mode) | Company owes employee | 40 hours |
| Doubao (Regular Mode) | Employee owes company | 48 hours |
| DeepSeek (Expert Mode) | Company owes employee | 40 hours |
*Note: Qianwen3.5 initially gave 160 hours and self-corrected to 96 hours in the same response.
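The headline statistics follow directly from the table. A short sketch reproducing them, using signed hours (positive = company owes the employee, negative = employee owes) and recording Qianwen3.5 at its initial answer of +160 as in the table:

```python
# Summary statistics over the signed results from the table above.
results = {
    "GPT 5.4 (Deep Thinking)": -8,
    "Claude Opus 4.7": 32,
    "Claude Sonnet 4.6": -8,
    "Gemini 3.1 Pro": 80,
    "Qwen3 Max (Thinking)": 48,
    "Qianwen3.5": 160,          # initial answer, before its self-correction to 96
    "Doubao (Super Mode)": 40,
    "Doubao (Regular Mode)": -48,
    "DeepSeek (Expert Mode)": 40,
}

span = max(results.values()) - min(results.values())
company_owes = sum(1 for v in results.values() if v > 0)
employee_owes = sum(1 for v in results.values() if v < 0)
print(span)           # 208-hour span
print(company_owes)   # 6 models: company owes employee
print(employee_owes)  # 3 models: employee owes company
```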

## Core Problem Analysis: Directional Divisions and Differences in Basic Assumptions Are the Main Causes

1. **Directional Divisions**: 6 models concluded the company owes the employee, while 3 concluded the employee owes the company—indicating that models have serious problems understanding the basic logical relationships of the problem.
2. **Differences in Basic Assumptions**: Different models used different annual standard working days (248/250/261 days), calculation benchmarks (21.75 days/month or actual working days), and handling methods for unused paid leave, directly leading to result discrepancies.
3. **Self-Contradiction Phenomenon**: Qianwen3.5 first gave 160 hours and then corrected it to 96 hours in the same response, exposing the instability of the model's reasoning process.
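Point 2 can be quantified in isolation. Holding everything else fixed and swapping only the annual-standard-working-days assumption among the three values the models reportedly used (248, 250, 261), the yearly expected workload already shifts by over a hundred hours. The 8-hour day below is an assumed standard:

```python
# How much the annual-days assumption alone moves the expected yearly workload.
HOURS_PER_DAY = 8
annual_day_assumptions = (248, 250, 261)  # values used by different models

expected_hours = {d: d * HOURS_PER_DAY for d in annual_day_assumptions}
spread = max(expected_hours.values()) - min(expected_hours.values())
print(expected_hours)  # {248: 1984, 250: 2000, 261: 2088}
print(spread)          # 104-hour spread from this one assumption
```

That 104-hour spread arises before the models even disagree about leave handling or the direction of settlement.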

## Underlying Causes: Jointly Caused by Training Data, Model Architecture, and Prompt Sensitivity

1. **Limitations of Training Data**: Training data is drawn from the internet and contains large amounts of unstructured and mutually contradictory information, so models carry this noise into tasks that demand precise calculation and logical reasoning.
2. **Differences in Reasoning Ability**: Different model architectures and training methods have different focuses—some emphasize pattern matching, others have strong symbolic reasoning capabilities—and these differences are amplified in complex reasoning scenarios.
3. **Prompt Engineering Sensitivity**: Minor differences in wording may lead models to take different reasoning paths, affecting the reliability of practical applications.

## Industry Insights and Recommendations: Establish Manual Review, Choose Models Rationally

1. **Enterprise Risk Warning**: Large models are not yet sufficient to handle precise logical reasoning tasks independently; strict manual review is required in critical business scenarios.
2. **Model Selection Reference**: In this test, Claude Opus 4.7, DeepSeek (Expert Mode), and Doubao (Super Mode) produced relatively more consistent results, but verification against specific scenarios is still needed.
3. **Future Improvement Direction**: The test project is open-source (MIT license), and the community is welcome to contribute test cases to promote model improvement.
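One way to operationalize the manual-review recommendation above is a simple cross-model agreement gate: accept an AI answer automatically only when several independently queried models agree on both direction and magnitude, and escalate to a human otherwise. A minimal sketch, in which the function name, model labels, and one-hour tolerance are all hypothetical choices:

```python
# Flag multi-model answers for human review when they disagree.
# Answers use the signed convention: positive = company owes the employee.

def needs_human_review(answers: dict[str, float],
                       tolerance_hours: float = 1.0) -> bool:
    """True if any two answers differ by more than the tolerance,
    or if they disagree on direction (sign)."""
    values = list(answers.values())
    same_direction = all(v >= 0 for v in values) or all(v <= 0 for v in values)
    within_tolerance = max(values) - min(values) <= tolerance_hours
    return not (same_direction and within_tolerance)

print(needs_human_review({"model_a": 40, "model_b": 40}))   # False: auto-accept
print(needs_human_review({"model_a": 40, "model_b": -48}))  # True: escalate
print(needs_human_review({"model_a": 40, "model_b": 80}))   # True: same sign, too far apart
```

Applied to the signed results in this benchmark, every pairing of disagreeing models would have been escalated, which is exactly the safety property the recommendation calls for.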

## Conclusion: Treat AI Boundaries Rationally, Critical Decisions Require Verification

Large models have made significant progress in natural language understanding and generation, but they still have shortcomings in logical reasoning and precise calculation. It is unrealistic to treat AI as a one-size-fits-all solution; we need to understand its capability boundaries and establish usage norms. Before relying on AI for critical decisions, developers and users must fully verify and test, and not blindly trust the output.
