Environment Selection
Simulation Panel
Algorithm Selector
Hyperparameters
Training Controls
Policy Visualization
Performance Metrics
Network Architecture
Education Mode
Current Concept: Policy Gradient
Policy Gradient methods directly optimize the policy by ascending the gradient of expected return with respect to policy parameters.
Quick Start Examples
Frequently Asked Questions
What are policy gradient methods?
Policy gradient methods are a class of reinforcement learning algorithms that directly optimize the policy (the agent's behavior strategy) by following the gradient of expected return with respect to the policy parameters. Unlike value-based methods, which learn a value function and derive a policy from it, policy gradient methods parameterize the policy itself and update it to maximize reward.
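The update described above can be sketched for a small tabular softmax policy. This is a minimal illustration, not the simulator's implementation; the tabular parameterization and the `(state, action, reward)` episode format are assumptions for the example.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=1e-3, gamma=0.99):
    """One REINFORCE step: ascend the gradient of expected return.

    theta:   (n_states, n_actions) logits of a tabular softmax policy
    episode: list of (state, action, reward) tuples from one rollout
    """
    grad = np.zeros_like(theta)
    G = 0.0
    # Walk the episode backwards to accumulate discounted returns
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax(theta[s])
        # Gradient of log pi(a|s) for a softmax policy: one-hot(a) - probs
        dlog = -probs
        dlog[a] += 1.0
        grad[s] += dlog * G
    return theta + lr * grad  # gradient *ascent* on expected return
```

After an update, actions that led to positive returns become more probable in the states where they were taken; that is the whole mechanism, which is also why raw REINFORCE has high variance.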
What is the difference between REINFORCE and PPO?
REINFORCE is the simplest policy gradient algorithm: it estimates returns from full Monte Carlo rollouts. PPO (Proximal Policy Optimization) is a more advanced algorithm that:
- Uses multiple epochs of minibatch updates on the same data
- Implements a clipped objective to prevent overly large policy updates
- Typically has lower variance and more stable training than REINFORCE
- Often achieves better sample efficiency
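The clipped objective mentioned above can be sketched as a standalone function. This is a hedged sketch of the surrogate loss only (the function name and array inputs are assumptions); a full PPO implementation also needs a value loss, advantage estimation, and minibatch epochs.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    The probability ratio is clipped so that updates moving the policy
    too far from the data-collecting policy earn no extra credit.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # min() makes the bound pessimistic: large ratios can only hurt, never help
    return np.mean(np.minimum(unclipped, clipped))
```

Taking the elementwise minimum is what keeps updates conservative: a ratio far above `1 + clip_eps` with a positive advantage is capped, while a ratio that makes things worse is passed through unclipped.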
How should I choose hyperparameters?
Hyperparameter selection depends on your environment and algorithm. Some general guidelines:
- Learning rate: Start with 1e-3 to 1e-4, lower for more stable training
- Discount factor (γ): 0.9 to 0.99 for most episodic tasks
- Entropy coefficient: 0.01 to 0.1 to balance exploration
- Network size: 64-256 neurons per layer works for many problems
- Use our Quick Start examples as baselines and adjust from there
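The guidelines above can be collected into a starting configuration. The key names here are illustrative, not this tool's actual settings schema; treat the values as a baseline to adjust from.

```python
# Illustrative baseline following the guidelines above;
# key names are hypothetical, not the tool's actual schema.
baseline_config = {
    "learning_rate": 3e-4,    # within the suggested 1e-3 to 1e-4 range
    "gamma": 0.99,            # discount factor for episodic tasks
    "entropy_coef": 0.01,     # mild exploration bonus
    "hidden_sizes": (64, 64), # two hidden layers of 64 neurons
}
```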
Why is my agent not learning?
Common reasons for poor learning include:
- Learning rate too high (divergence) or too low (slow progress)
- Insufficient exploration (try increasing entropy coefficient)
- Network too small to represent the policy
- Not enough training episodes
- Reward structure not properly shaped for the task
Enable Education Mode to get real-time feedback about your training process.
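One symptom from the list above, insufficient exploration, can be diagnosed by tracking policy entropy over training. This helper is a generic sketch, not a built-in metric of the tool: if entropy collapses toward zero early, raising the entropy coefficient is a reasonable first fix.

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy of an action distribution, in nats.

    Near-zero entropy means the policy has collapsed onto one action;
    increasing the entropy coefficient pushes this value back up.
    """
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # 0 * log 0 is taken to be 0
    return float(-np.sum(p * np.log(p)))
```

For reference, a uniform policy over n actions has entropy log(n), the maximum possible, while a deterministic policy has entropy 0.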
Can I create custom environments and rewards?
Yes! You can:
- Upload custom environments in JSON format (see documentation)
- Define your own reward functions
- Create custom state and action spaces
- Save and load your trained models for later use
For advanced integration, you can connect to our API for server-side training.