Key Findings: Asymmetric Evidence Use and Adaptive Rigidity
Asymmetry of Win-Stay and Lose-Shift
The experimental results show a striking pattern: across all tested models, "win-stay" behavior (choosing the same option again after a reward) is near ceiling, while "lose-shift" behavior (switching to another option after no reward) is markedly weaker.
This asymmetry indicates a systematic bias in how LLMs use positive versus negative evidence: the models exploit successful outcomes well but respond slowly to failures. This contrasts with human behavior. Humans are usually more sensitive to losses, and this loss aversion is thought to have adaptive value in evolution.
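To make the two measures concrete, here is a minimal sketch of how win-stay and lose-shift rates could be computed from one session of choices and rewards. The function name, the 0/1 reward coding, and the example data are illustrative assumptions, not the study's actual analysis code.

```python
from typing import Sequence, Tuple


def win_stay_lose_shift(
    choices: Sequence[int], rewards: Sequence[int]
) -> Tuple[float, float]:
    """Return (win_stay_rate, lose_shift_rate) for one session.

    choices[t] is the option picked on trial t; rewards[t] is 1 if that
    trial was rewarded and 0 otherwise (an assumed coding).
    """
    if len(choices) != len(rewards):
        raise ValueError("choices and rewards must have the same length")

    wins = stays_after_win = 0
    losses = shifts_after_loss = 0

    # Each trial except the last is paired with the choice that follows it.
    for t in range(len(choices) - 1):
        stayed = choices[t + 1] == choices[t]
        if rewards[t]:
            wins += 1
            stays_after_win += int(stayed)
        else:
            losses += 1
            shifts_after_loss += int(not stayed)

    # A rate is undefined (NaN) if its conditioning event never occurred.
    win_stay = stays_after_win / wins if wins else float("nan")
    lose_shift = shifts_after_loss / losses if losses else float("nan")
    return win_stay, lose_shift


# Made-up session: the agent always repeats after a win,
# but switches after only one of its two losses.
ws, ls = win_stay_lose_shift(
    choices=[0, 0, 0, 0, 1, 1],
    rewards=[1, 1, 0, 0, 1, 1],
)
print(ws, ls)  # 1.0 0.5
```

In this toy session the win-stay rate is at ceiling while the lose-shift rate is only one half, which is the kind of asymmetry described above.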
Inter-Model Differences: From Extreme Stubbornness to Relative Flexibility
Among the three models, DeepSeek-V3.2 showed the most extreme pattern: after a reversal it perseverated severely, continuing to choose the previously rewarded option, and its overall acquisition learning was also weak. In contrast, Gemini-3 and GPT-5.2 adapted faster, although their sensitivity to losses was still lower than that of humans.
This finding suggests that different architectures and training methods may produce fundamentally different behavioral profiles in dynamic environments.
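Perseveration after a reversal can be quantified in several ways. The sketch below uses one simple, assumed definition: the number of consecutive trials immediately after each reversal on which the agent still picks the previously best option. The function name and the example sequences are hypothetical and do not come from the study.

```python
from typing import List, Sequence


def perseverative_errors(
    choices: Sequence[int], best_option: Sequence[int]
) -> List[int]:
    """Count perseverative choices following each reversal.

    best_option[t] is the option that is currently better on trial t.
    A reversal is any trial where best_option differs from the trial before.
    """
    if len(choices) != len(best_option):
        raise ValueError("choices and best_option must have the same length")

    counts: List[int] = []
    for t in range(1, len(best_option)):
        if best_option[t] == best_option[t - 1]:
            continue  # no reversal on this trial
        old_best = best_option[t - 1]
        n = 0
        k = t
        # Keep counting while the agent sticks with the old option and the
        # old option has not become the best one again.
        while (
            k < len(choices)
            and choices[k] == old_best
            and best_option[k] != old_best
        ):
            n += 1
            k += 1
        counts.append(n)
    return counts


# Made-up session with reversals at trials 3 and 7.
best = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
picked = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(perseverative_errors(picked, best))  # [2, 1]
```

Under this definition, a model with severe perseveration would show large counts after every reversal, while a faster-adapting model would show counts near zero.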
Coexistence of High Returns and Rigid Adaptation
An interesting finding is that random transitions increased LLM perseveration after reversals but did not consistently reduce the total number of wins. High aggregate returns and rigid adaptation can therefore coexist: the models may sustain overall performance through other strategies (such as exploiting short-term fluctuations) rather than by learning to adapt flexibly to environmental change.
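As a purely arithmetic illustration of why aggregate wins and post-reversal rigidity can come apart, the sketch below simulates a hypothetical "sticky" agent in a deterministic two-option reversal task. The task structure, the parameter names, and the fixed switching lag are all assumptions made for illustration. This is not the study's task, and it is not the mechanism the study proposes. It only shows that the two metrics measure different things.

```python
from typing import Tuple


def simulate_sticky_agent(
    block_length: int, n_blocks: int, lag: int
) -> Tuple[float, float]:
    """Return (win_rate, mean_perseverative_trials_per_reversal).

    Assumed toy task: the best option alternates every `block_length`
    trials and always pays off. After each reversal the agent keeps
    choosing the old option for `lag` trials, then switches.
    Assumes 0 <= lag <= block_length.
    """
    total_wins = 0
    perseverative_trials = 0

    for block in range(n_blocks):
        best = block % 2  # best option alternates between 0 and 1
        for trial in range(block_length):
            if block > 0 and trial < lag:
                choice = 1 - best  # still on the previously best option
                perseverative_trials += 1
            else:
                choice = best
            total_wins += int(choice == best)

    n_trials = block_length * n_blocks
    n_reversals = max(n_blocks - 1, 1)
    return total_wins / n_trials, perseverative_trials / n_reversals


win_rate, mean_lag = simulate_sticky_agent(block_length=50, n_blocks=4, lag=5)
print(win_rate, mean_lag)  # 0.925 5.0
```

Here the agent is rigid after every reversal (five wasted trials each time) yet still wins on 92.5% of trials, because the cost of each lapse is small relative to the block length. Total wins alone would hide the rigidity, which is why the two measures are reported separately.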