name: reinforcement-learning
description: This skill should be used when the user asks about "reinforcement learning", "RL", "reward function", "policy gradient", "PPO", "SAC", "DQN", "Q-learning", "actor-critic", "MDP", "Markov decision process", "environment design", "reward shaping", "exploration strategy", "experience replay", "multi-agent RL", "RLHF", "reward hacking", or when training an agent to interact with an environment to maximize cumulative reward.
version: 1.0.0
Reinforcement Learning — Full-Stack RL Engine
Provides the complete framework for designing, training, and evaluating reinforcement learning agents, from MDP formalization through policy optimization to deployment in production control systems.
MDP Framework
Every RL problem is formalized as a Markov Decision Process (S, A, P, R, γ):
| Component | Description | Design Considerations |
|---|---|---|
| S (State Space) | Observation the agent receives | Ensure Markov property; normalize to [0,1] or z-score |
| A (Action Space) | Controls available to the agent | Discrete → DQN/PPO; Continuous → SAC/PPO |
| P (Transition) | Dynamics of the environment | Usually unknown, learned through interaction |
| R (Reward) | Signal encoding the objective | Most critical design decision; see Reward Shaping below |
| γ (Discount) | Future reward discount factor | 0.99 for long-horizon tasks; 0.95 for shorter episodes |
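To make the role of γ concrete, here is a minimal sketch in plain Python. The function names `discounted_return` and `effective_horizon` are illustrative, not part of any library. The rule of thumb 1 / (1 − γ) is a common approximation of how many future steps meaningfully influence the return; it is a heuristic rather than an exact quantity.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma); assumes 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)
```

Under this heuristic, γ = 0.99 corresponds to roughly 100 steps of lookahead and γ = 0.95 to roughly 20, which matches the long-horizon versus shorter-episode guidance in the table.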
Algorithm Selection Guide
| Scenario | Algorithm | Key Advantage |
|---|---|---|
| Discrete actions, off-policy, sample efficient | DQN (+ PER + Dueling) | Replay buffer enables high sample efficiency |
| Continuous control, off-policy | SAC (Soft Actor-Critic) | Maximum entropy RL; stable training, automatic α tuning |
| Discrete or continuous, on-policy, stable | PPO | Clipped objective prevents destructive policy updates |
| Partial observability, memory required | PPO + LSTM backbone | Recurrent policy handles non-Markovian observations |
| Human preference alignment | RLHF (PPO + reward model) | Reward model is learned from human preference comparisons rather than hand-engineered |
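The "clipped objective" credited to PPO in the table can be sketched as follows. This is a simplified, framework-free illustration of the per-sample surrogate, not a full PPO loss: it omits the value and entropy terms, and the name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions made for the example.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for sample i
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the direction that helps.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)
```

With a positive advantage, a ratio of 1.5 contributes as if it were 1.2, so large policy shifts stop earning extra objective value. In a training loop the negative of this quantity would typically be minimized.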
Reward Shaping
The reward function is the most consequential RL design decision:
Potential-based shaping (safe — guarantees policy invariance):
- F(s, a, s') = γ·Φ(s') − Φ(s) where Φ is a potential function
- Does not change the optimal policy of the original MDP (Ng et al., 1999)
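A small sketch of the shaping term above, assuming a hypothetical one-dimensional task where the state is an integer position and the potential is the negative distance to a goal at position 10. The names `potential`, `shaping_term`, and `discounted_shaping_sum` are illustrative. The second function shows why the shaping is safe: along any trajectory the discounted shaping terms telescope to γ^T·Φ(s_T) − Φ(s_0), which depends only on the endpoints and not on the actions taken in between.

```python
def potential(state, goal=10):
    """Hypothetical potential: negative distance to the goal position."""
    return -float(abs(goal - state))


def shaping_term(state, next_state, gamma):
    """F(s, a, s') = gamma * Phi(s') - Phi(s); the action does not appear."""
    return gamma * potential(next_state) - potential(state)


def discounted_shaping_sum(states, gamma):
    """Discounted sum of shaping terms along a trajectory of visited states."""
    total = 0.0
    for t in range(len(states) - 1):
        total += gamma ** t * shaping_term(states[t], states[t + 1], gamma)
    return total
```

In use, the shaped reward at each step would be the environment reward plus `shaping_term(s, s_next, gamma)`, with the same γ the agent is trained with.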
Common shaping components:
- Progress reward: proportional to movement toward the goal state
- Constraint penalty: large negative reward for safety violations
- Smoothness reward: penalize jerky or oscillatory actions (robotics)
- Curiosity bonus: intrinsic motivation for exploration in sparse-reward tasks
Red flags (reward hacking indicators):
- Reward that can be maximized by a degenerate policy (e.g., agent finds a bug in the simulator)
- Reward achievable without learning the intended behavior
See references/policy-gradient.md for PPO/SAC mathematical derivations and implementation details.
See references/value-based-methods.md for DQN, Double DQN, Dueling DQN, and Rainbow implementations.
See references/mdp-framework.md for environment design patterns and Gym interface specifications.
Training Stability Guidelines
Apply the following to every RL training run, wherever the technique fits the chosen algorithm:
- Gradient clipping (max_grad_norm = 0.5): prevents catastrophic policy updates
- Observation normalization: running mean/variance normalization across episodes
- Reward normalization: divide rewards by a running standard deviation; do not subtract the mean, since shifting rewards can change which policy is optimal in episodic tasks
- GAE (Generalized Advantage Estimation): λ = 0.95 for PPO — balances bias-variance in advantage estimates
- Entropy regularization: encourages exploration; adjust coefficient during training
- Multiple seeds: always report results across 5+ random seeds as mean ± standard error
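Two of the items above, running normalization and GAE, can be sketched in plain Python. This is a scalar, framework-free illustration under stated assumptions: `RunningMeanStd` and `compute_gae` are names chosen for the example, the normalizer uses Welford's online update, and `dones[t]` is assumed to mark a true terminal transition (time-limit truncation would need separate handling).

```python
import math


class RunningMeanStd:
    """Online mean and standard deviation of a scalar stream (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there is enough data for a meaningful estimate.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 1.0

    def normalize(self, x, eps=1e-8):
        """Observation-style normalization: center and scale."""
        return (x - self.mean) / (self.std + eps)

    def scale(self, x, eps=1e-8):
        """Reward-style normalization: scale only, no mean subtraction."""
        return x / (self.std + eps)


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); last_value is V of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```

Setting `lam=1.0` recovers discounted-return-minus-baseline advantages (high variance, low bias), while `lam=0.0` reduces to one-step TD errors. Many implementations track the standard deviation of discounted returns rather than of raw rewards for reward scaling; the `scale` method here is the simpler variant.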
Evaluation Protocol
- Run 100 deterministic evaluation episodes (greedy actions with ε = 0 for DQN; the mean or mode action instead of sampling for SAC/PPO).
- Report: mean reward, std, min/max, success rate (if applicable), episode length.
- Compare against baselines: random policy, hand-coded heuristic, previous best agent.
- Learning curve: episode reward vs. total environment steps (not wall-clock time).
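The reporting step of the protocol above could look like the following sketch, which uses only the standard library. The function names and the dictionary keys are assumptions made for illustration; `returns`, `lengths`, and `successes` are expected to come from the evaluation episodes, and `per_seed_scores` from one aggregate score per training seed.

```python
import math
import statistics


def summarize_episodes(returns, lengths, successes=None):
    """Summary statistics for one batch of evaluation episodes."""
    summary = {
        "mean_return": statistics.fmean(returns),
        "std_return": statistics.stdev(returns) if len(returns) > 1 else 0.0,
        "min_return": min(returns),
        "max_return": max(returns),
        "mean_length": statistics.fmean(lengths),
    }
    if successes is not None:
        summary["success_rate"] = sum(successes) / len(successes)
    return summary


def mean_and_std_error(per_seed_scores):
    """Mean and standard error of the mean across seeds (needs at least 2 seeds)."""
    n = len(per_seed_scores)
    mean = statistics.fmean(per_seed_scores)
    std_error = statistics.stdev(per_seed_scores) / math.sqrt(n)
    return mean, std_error
```

Applying `summarize_episodes` to the same episodes collected from each baseline (random policy, heuristic, previous best agent) keeps the comparison on equal footing.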