Section 01
[Introduction] WISE: Enabling Multimodal Models to 'Learn Thick First, Then Thin'—Achieves SOTA Even With 5x Reasoning Compression
WISE uses a training sequence of 'Concise Reason → Answer → Detailed Explanation' together with a self-distillation objective that guides the model to compress its detailed reasoning into a compact form. On the ReasonSeg benchmark, WISE-S achieves a SOTA result of 58.3 cIoU while cutting reasoning tokens from 112 to 23 (a compression ratio of nearly 5x), improving quality and efficiency at the same time.
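The sequence layout and the quoted compression figure can be sketched as follows. This is a minimal illustration only: the tag names (`<REASON>`, `<ANSWER>`, `<EXPLAIN>`) and function names are hypothetical, not WISE's actual special tokens or API.

```python
# Hypothetical sketch of the 'Concise Reason -> Answer -> Detailed Explanation'
# training-sequence layout described above. Tag names are illustrative
# assumptions, not the paper's actual tokens.

def build_training_sequence(concise_reason: str, answer: str, detailed_explanation: str) -> str:
    """Arrange one sample as: concise reason, then answer, then detailed explanation.

    Placing the detailed explanation AFTER the answer means the model can emit
    a short reasoning trace at inference time and stop at the answer, while
    training still exposes it to the full rationale ('learn thick, then thin')."""
    return (
        f"<REASON>{concise_reason}</REASON>"
        f"<ANSWER>{answer}</ANSWER>"
        f"<EXPLAIN>{detailed_explanation}</EXPLAIN>"
    )

def compression_ratio(full_tokens: int, compact_tokens: int) -> float:
    """Ratio of full reasoning length to compressed reasoning length."""
    return full_tokens / compact_tokens

# Figures quoted in the text: 112 reasoning tokens compressed to 23.
print(round(compression_ratio(112, 23), 2))  # -> 4.87, i.e. "nearly 5x"
```

At inference, generation would stop after the answer span, so only the concise reason and answer are paid for, which is where the token savings come from.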