Section 01
[Introduction] Latent State RL: VAE-based Implicit World Model for Optimizing Post-Inference Training
This project makes two core contributions. First, it uses a Variational Autoencoder (VAE) to learn a compact, implicit Markov state representation from inference trajectories, which replaces the traditional token history as the state. Second, it introduces an uncertainty-driven exploration mechanism. Together, these offer a new approach to state modeling for post-training reinforcement learning methods such as GRPO, and are expected to ease two problems in training on long inference chains: high computational overhead and key information being buried.
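To make the two ideas concrete, the sketch below encodes a pooled history embedding into a Gaussian latent state and derives an exploration bonus from the posterior variance. It is a minimal illustration under stated assumptions, not the project's implementation: the linear encoder, the dimensions, the untrained random weights, and the helper names (`encode`, `sample_latent`, `kl_to_standard_normal`, `uncertainty_bonus`) are all hypothetical, and using mean posterior variance as the uncertainty signal is one plausible choice among several.

```python
import math
import random

HIDDEN_DIM = 8   # hypothetical size of a pooled token-history embedding
LATENT_DIM = 3   # hypothetical size of the compact latent state


def make_linear(rows, cols, rng):
    """Create a small random weight matrix (untrained, illustration only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(x, w):
    """Compute x @ w for a vector x and a row-major matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def encode(history, w_mu, w_logvar):
    """Map a history embedding to the mean and log-variance of q(z | history)."""
    return matvec(history, w_mu), matvec(history, w_logvar)


def sample_latent(mu, logvar, rng):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(q(z | history) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def uncertainty_bonus(logvar):
    """Assumed exploration signal: mean posterior variance of the latent state."""
    return sum(math.exp(lv) for lv in logvar) / len(logvar)


rng = random.Random(0)
w_mu = make_linear(HIDDEN_DIM, LATENT_DIM, rng)
w_logvar = make_linear(HIDDEN_DIM, LATENT_DIM, rng)

# A random vector stands in for the embedding of a real inference trajectory.
history = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]

mu, logvar = encode(history, w_mu, w_logvar)
z = sample_latent(mu, logvar, rng)           # compact state replacing the token history
kl = kl_to_standard_normal(mu, logvar)       # regulariser in the usual VAE objective
bonus = uncertainty_bonus(logvar)            # could be added to the RL reward
print(len(z), kl, bonus)
```

In a full system the latent state `z` would be passed to the policy in place of the raw token history, and the bonus would be weighted against the task reward; how the project actually combines them is not specified in this section.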