1. GSM8K Score Drop is an Evaluation Artifact
SFT changed the model's output format: it now generates a <think> reasoning chain before giving the answer, while the GSM8K parser in lm-evaluation-harness is calibrated for the original Instruct model's direct-answer style. The score drop therefore reflects a parsing mismatch, not a regression in reasoning ability.
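To make the mismatch concrete, here is a minimal sketch. The regex approximates the harness's strict-match GSM8K filter (the exact pattern lives in the task config), and the two sample outputs are invented:

```python
import re

# Illustrative stand-in for the strict GSM8K answer filter in
# lm-evaluation-harness: it expects the Instruct-style trailing "#### <number>".
STRICT = re.compile(r"#### (-?[0-9.,]+)")

direct = "She has 12 + 30 = 42 apples.\n#### 42"            # pre-SFT format
sft    = "<think>12 + 30 = 42</think>\nThe answer is 42."   # post-SFT format

for out in (direct, sft):
    m = STRICT.search(out)
    print(m.group(1) if m else "PARSE FAILURE")
# Prints "42", then "PARSE FAILURE": the same correct answer scores 0
# once the output format changes.
```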
2. MATH Benchmark +3.6% is a Real Ability Improvement
The model was never trained on MATH problems (the training data contains only GSM8K and NuminaMath), yet the score rose from 20.60% to 24.20%. This suggests SFT instilled a generalizable reasoning format rather than simple pattern matching.
3. Why GRPO Gains Were Limited: Reward Saturation
The project authors discovered an important technical phenomenon: because the SFT cold start was highly effective (most GSM8K rollouts were already correct), all 4 rollouts in a group often received identical rewards. GRPO normalizes each reward against the group mean and standard deviation, so identical rewards yield an advantage of exactly zero and thus no gradient.
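A minimal sketch of the mechanism, assuming a binary correctness reward and the standard GRPO group normalization:

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Saturated group: all 4 rollouts correct, identical rewards, zero advantage.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0. 0. 0. 0.] -> no learning signal

# Mixed group: correct rollouts get positive advantage, wrong ones negative.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [ 1. -1.  1. -1.]
```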
The logged metric frac_reward_zero_std averaged 0.63, meaning that on average 63% of rollout groups had zero reward standard deviation and therefore contributed no gradient signal. This is the problem that the curriculum filtering mentioned in the DeepSeek-R1 paper aims to solve: select medium-difficulty problems where only 1-2 of the rollouts are correct for the model, rather than easy problems where 80% are correct (a sketch of such a filter follows).
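A sketch of what such filtering could look like. The helper names (`generate`, `is_correct`) and the exact keep band are illustrative assumptions, not the project's implementation:

```python
# Hypothetical difficulty pre-filter in the spirit of the curriculum idea:
# keep prompts where the current policy gets 1-2 of 4 rollouts right,
# so group rewards have nonzero variance.

def filter_medium_difficulty(dataset, generate, is_correct,
                             n_rollouts=4, min_correct=1, max_correct=2):
    kept = []
    for prompt, gold in dataset:
        rollouts = [generate(prompt) for _ in range(n_rollouts)]
        n_ok = sum(is_correct(r, gold) for r in rollouts)
        # Mixed outcomes give nonzero reward std, hence a usable advantage.
        if min_correct <= n_ok <= max_correct:
            kept.append((prompt, gold))
    return kept
```

The trade-off is cost: this spends extra rollouts on grading before training, in exchange for batches where far fewer groups collapse to zero advantage.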