PPO CartPole Demo

Interactive demonstration of PPO learning to balance a pole. Click "Run Episode" to collect data, then "PPO Update" to train.

Episode: 0
Avg Reward: 0
KL: 0.000
Clip Frac: 0%
How it works: The agent uses a simple linear policy to decide whether to push the cart left or right. "Run Episode" collects a trajectory using the current policy. "PPO Update" performs K=4 epochs of minibatch gradient ascent on the clipped surrogate objective (epsilon=0.2). As you alternate the two buttons, the average reward should rise as the policy improves, the KL divergence between the old and updated policy should stay bounded, and the clip fraction reports the share of samples whose probability ratio fell outside [1 - epsilon, 1 + epsilon] and was therefore clipped.
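
The update step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the demo's actual code: it assumes the linear policy is a logistic model over the 4-dimensional CartPole state (probability of pushing right = sigmoid(w · s)), that advantages have already been computed, and that KL is estimated as the sample mean of log pi_old - log pi_new. The names `log_prob` and `ppo_update`, the learning rate, and the batch size are all hypothetical.

```python
import numpy as np

def log_prob(w, states, actions):
    """Log-probability of each action under an assumed logistic linear policy."""
    z = states @ w
    # action 1 (push right) has probability sigmoid(z); action 0 has sigmoid(-z)
    signed = np.where(actions == 1, z, -z)
    return -np.logaddexp(0.0, -signed)  # numerically stable log(sigmoid(signed))

def ppo_update(w, states, actions, advantages,
               epsilon=0.2, epochs=4, batch_size=32, lr=0.01, seed=0):
    """K epochs of minibatch gradient ascent on the clipped surrogate objective."""
    rng = np.random.default_rng(seed)
    logp_old = log_prob(w, states, actions)  # frozen behaviour-policy log-probs
    w = w.copy()
    n = len(states)
    for _ in range(epochs):
        batches = np.array_split(rng.permutation(n), max(1, n // batch_size))
        for idx in batches:
            s, a, adv = states[idx], actions[idx], advantages[idx]
            ratio = np.exp(log_prob(w, s, a) - logp_old[idx])
            # The clipped term is the active minimum (zero gradient) when the
            # ratio has already moved past the clip range in the favoured direction.
            clipped = ((adv > 0) & (ratio > 1 + epsilon)) | \
                      ((adv < 0) & (ratio < 1 - epsilon))
            p_right = 1.0 / (1.0 + np.exp(-(s @ w)))
            grad_logp = (a - p_right)[:, None] * s  # d log pi(a|s) / dw
            grad = ((~clipped * ratio * adv)[:, None] * grad_logp).mean(axis=0)
            w += lr * grad  # ascent step on the surrogate
    # Diagnostics over the full batch, analogous to the KL and Clip Frac readouts
    logp_new = log_prob(w, states, actions)
    ratio = np.exp(logp_new - logp_old)
    approx_kl = float(np.mean(logp_old - logp_new))
    clip_frac = float(np.mean(np.abs(ratio - 1.0) > epsilon))
    return w, approx_kl, clip_frac
```

Calling `ppo_update(w, states, actions, advantages)` on one collected trajectory returns the updated weights together with the two diagnostics. Because samples that are already past the clip range contribute no gradient, repeated epochs on the same data cannot push the policy arbitrarily far from the one that collected it, which is why the KL readout tends to stay small.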