Section 01
【Introduction】Nine Mainstream Large Models Give Widely Divergent Answers to the Same Working-Hours Calculation, Revealing Serious Flaws in Logical Consistency
A benchmark of nine mainstream large models, including GPT, Claude, Gemini, and Qwen, found that the same simple working-hours calculation problem produced opposite conclusions, ranging from "the employee owes the company 48 hours" to "the company owes the employee 160 hours." The spread exposes serious inconsistency in the logical reasoning and arithmetic of current large models. The test was based on a real workplace scenario, and its results are a warning for enterprises and individuals who rely on AI for critical decisions.