Interactive demonstration of PPO learning to balance a pole.
Click "Run Episode" to collect data, then "PPO Update" to train.
Live stats: Episode: 0 · Avg Reward: 0 · KL: 0.000 · Clip Frac: 0%
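The KL and clip-fraction readouts above are standard PPO diagnostics. A minimal sketch of how they are typically computed from per-sample log-probabilities (the helper name `ppo_diagnostics` is hypothetical, not part of the demo):

```python
import numpy as np

def ppo_diagnostics(old_logp, new_logp, eps=0.2):
    """Hypothetical helper computing the two stats shown above.

    old_logp, new_logp: log pi(a|s) for each sampled action under the
    old and current policy. Returns the approximate KL divergence
    KL(old || new) and the fraction of samples whose probability
    ratio falls outside the clip range [1 - eps, 1 + eps].
    """
    ratio = np.exp(new_logp - old_logp)
    approx_kl = np.mean(old_logp - new_logp)        # E[log(old/new)]
    clip_frac = np.mean(np.abs(ratio - 1.0) > eps)  # share of clipped samples
    return approx_kl, clip_frac
```

A small KL and a moderate clip fraction together indicate that updates stay close to the data-collecting policy, which is the point of the clipped objective.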
How it works: The agent uses a simple linear policy to decide
whether to push the cart left or right. "Run Episode" collects a trajectory
using the current policy; "PPO Update" then performs K=4 epochs of minibatch gradient
ascent on the clipped surrogate objective (epsilon=0.2). Watch the average reward improve,
the KL divergence stay bounded, and the clip fraction show how often the probability ratio is clipped.
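The update step described above can be sketched in NumPy. This is a minimal illustration of the clipped surrogate objective for a linear Bernoulli policy, not the demo's actual implementation; the function name, learning rate, and batch handling (full-batch rather than minibatches, for brevity) are assumptions:

```python
import numpy as np

def ppo_update(w, states, actions, advantages, old_logp,
               epochs=4, eps=0.2, lr=0.05):
    """Sketch of K epochs of gradient ascent on the clipped surrogate.

    w          : (d,) policy weights; p(right|s) = sigmoid(w . s)
    states     : (n, d) states from the collected trajectory
    actions    : (n,) actions taken, 0 = left, 1 = right
    advantages : (n,) advantage estimates
    old_logp   : (n,) log pi_old(a|s) recorded at collection time
    """
    for _ in range(epochs):
        p_right = 1.0 / (1.0 + np.exp(-(states @ w)))
        # log-prob of the action actually taken under the current policy
        logp = np.where(actions == 1, np.log(p_right), np.log(1.0 - p_right))
        ratio = np.exp(logp - old_logp)
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        # min(ratio*A, clipped*A): samples where the clipped term is
        # strictly smaller contribute zero gradient
        use = ratio * advantages <= clipped * advantages
        # d/dw log pi(a|s) for a sigmoid policy is (a - p) * s
        grad_logp = (actions - p_right)[:, None] * states
        grad = np.mean((use * ratio * advantages)[:, None] * grad_logp, axis=0)
        w = w + lr * grad  # gradient ascent on the surrogate
    return w
```

Because samples whose ratio has drifted past 1±epsilon stop contributing gradient, each update stays close to the old policy, which is what keeps the KL readout bounded.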