reinforcement-learning


This skill should be used when the user asks about "reinforcement learning", "RL", "reward function", "policy gradient", "PPO", "SAC", "DQN", "Q-learning", "actor-critic", "MDP", "Markov decision process", "environment design", "reward shaping", "exploration strategy", "experience replay", "multi-agent RL", "RLHF", "reward hacking", or when training an agent to interact with an environment to maximize cumulative reward.

By harsh040506. Updated 3/7/2026

name: reinforcement-learning
description: This skill should be used when the user asks about "reinforcement learning", "RL", "reward function", "policy gradient", "PPO", "SAC", "DQN", "Q-learning", "actor-critic", "MDP", "Markov decision process", "environment design", "reward shaping", "exploration strategy", "experience replay", "multi-agent RL", "RLHF", "reward hacking", or when training an agent to interact with an environment to maximize cumulative reward.
version: 1.0.0

Reinforcement Learning — Full-Stack RL Engine

Provides the complete framework for designing, training, and evaluating reinforcement learning agents, from MDP formalization through policy optimization to deployment in production control systems.

MDP Framework

Every RL problem is formalized as a Markov Decision Process (S, A, P, R, γ):

| Component | Description | Design Considerations |
| --- | --- | --- |
| S (State Space) | Observation the agent receives | Ensure Markov property; normalize to [0,1] or z-score |
| A (Action Space) | Controls available to the agent | Discrete → DQN/PPO; Continuous → SAC/PPO |
| P (Transition) | Dynamics of the environment | Usually unknown, learned through interaction |
| R (Reward) | Signal encoding the objective | Most critical design decision; see Reward Shaping below |
| γ (Discount) | Future reward discount factor | 0.99 for long-horizon tasks; 0.95 for shorter episodes |
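To make the role of γ concrete, here is a minimal sketch of the discounted return and the common rule-of-thumb "effective horizon" 1 / (1 − γ). The function names are illustrative and not part of this skill's reference files.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma):
    """Rule-of-thumb number of steps that matter for a given gamma < 1."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(effective_horizon(0.99), effective_horizon(0.95))
```

Under this rule of thumb, γ = 0.99 weights roughly the next 100 steps and γ = 0.95 roughly the next 20, which is consistent with the long-horizon versus shorter-episode guidance in the table.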

Algorithm Selection Guide

| Scenario | Algorithm | Key Advantage |
| --- | --- | --- |
| Discrete actions, off-policy, sample efficient | DQN (+ PER + Dueling) | Replay buffer enables high sample efficiency |
| Continuous control, off-policy | SAC (Soft Actor-Critic) | Maximum entropy RL; stable training, automatic α tuning |
| Discrete or continuous, on-policy, stable | PPO | Clipped objective prevents destructive policy updates |
| Partial observability, memory required | PPO + LSTM backbone | Recurrent policy handles non-Markovian observations |
| Human preference alignment | RLHF (PPO + reward model) | Reward from human comparisons, not engineering |
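As an illustration of why PPO's clipped objective limits destructive updates, the per-sample surrogate can be sketched in plain Python. The function name and the clip range of 0.2 are assumptions for this example (0.2 is a commonly used default, not a value stated in this skill); a real implementation would operate on batched tensors.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
    clip_eps=0.2 is an assumed value for illustration.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound, so the policy
    # gains nothing from moving the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Ratio 1.5 with a positive advantage is capped at 1.2 * A.
    print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))
```

The objective is maximized during training, so a loss for gradient descent would be its negative averaged over a batch.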

Reward Shaping

The reward function is the most consequential RL design decision:

Potential-based shaping (safe — guarantees policy invariance):

  • F(s, a, s') = γ·Φ(s') − Φ(s) where Φ is a potential function
  • Does not change the optimal policy of the original MDP (Ng et al., 1999)
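A minimal sketch of the shaping term above, assuming a hypothetical 1-D task where the potential Φ is the negative distance to the goal. Both the potential and the function names are illustrative choices, not something this skill prescribes.

```python
def negative_distance_potential(position, goal):
    """Hypothetical potential: closer to the goal means higher potential."""
    return -abs(goal - position)


def shaping_term(phi_s, phi_next, gamma):
    """Potential-based shaping F(s, a, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_s


def shaped_reward(reward, phi_s, phi_next, gamma):
    """Environment reward plus the shaping term."""
    return reward + shaping_term(phi_s, phi_next, gamma)


if __name__ == "__main__":
    gamma, goal = 0.9, 2
    path = [0, 1, 2]  # positions visited along one trajectory
    phis = [negative_distance_potential(p, goal) for p in path]
    total = sum(
        gamma ** t * shaping_term(phis[t], phis[t + 1], gamma)
        for t in range(len(path) - 1)
    )
    # The discounted shaping terms telescope to gamma**T * Phi(s_T) - Phi(s_0).
    print(total, gamma ** 2 * phis[-1] - phis[0])
```

The telescoping sum depends only on the start and end states rather than the path taken, which is the intuition behind the policy-invariance result.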

Common shaping components:

  • Progress reward: proportional to movement toward the goal state
  • Constraint penalty: large negative reward for safety violations
  • Smoothness reward: penalize jerky or oscillatory actions (robotics)
  • Curiosity bonus: intrinsic motivation for exploration in sparse-reward tasks
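The first three components can be combined as in the sketch below. All weights and the penalty value are hypothetical placeholders that would need tuning per task, and the curiosity bonus is omitted because it requires a learned model.

```python
def composite_reward(prev_dist, dist, action, prev_action, violated,
                     w_progress=1.0, w_smooth=0.1, violation_penalty=-100.0):
    """Sum of progress, smoothness, and constraint terms (illustrative weights)."""
    # Progress: positive when the distance to the goal decreased this step.
    progress = w_progress * (prev_dist - dist)
    # Smoothness: penalize the squared change between consecutive actions.
    smoothness = -w_smooth * sum((a - b) ** 2 for a, b in zip(action, prev_action))
    # Constraint: large negative reward on a safety violation.
    penalty = violation_penalty if violated else 0.0
    return progress + smoothness + penalty


if __name__ == "__main__":
    print(composite_reward(5.0, 4.0, [1.0, 0.0], [0.0, 0.0], violated=False))
```

Logging each term separately during training makes it easier to spot a single component dominating the total, which is one way the red flags listed next tend to show up.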

Red flags (reward hacking indicators):

  • Reward that can be maximized by a degenerate policy (e.g., agent finds a bug in the simulator)
  • Reward achievable without learning the intended behavior

See references/policy-gradient.md for PPO/SAC mathematical derivations and implementation details. See references/value-based-methods.md for DQN, Double DQN, Dueling DQN, and Rainbow implementations. See references/mdp-framework.md for environment design patterns and Gym interface specifications.

Training Stability Guidelines

Apply the following to every RL training run (item 4 applies to PPO and other methods that use advantage estimates):

  1. Gradient clipping (max_grad_norm = 0.5): prevents catastrophic policy updates
  2. Observation normalization: running mean/variance normalization across episodes
  3. Reward normalization: divide rewards by a running standard deviation (scale only; do not subtract the mean)
  4. GAE (Generalized Advantage Estimation): λ = 0.95 for PPO — balances bias-variance in advantage estimates
  5. Entropy regularization: encourages exploration; adjust coefficient during training
  6. Multiple seeds: always report results across 5+ random seeds as mean ± standard error
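Items 2 to 4 can be sketched in plain Python as follows. This is a minimal scalar version under stated assumptions: the class and function names are illustrative, the running statistics use Welford's online algorithm, and a practical implementation would vectorize over observation dimensions and batches.

```python
import math


class RunningStat:
    """Welford's online mean and variance for a stream of scalars."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are enough samples to estimate spread.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)


def normalize_observation(x, stat, eps=1e-8):
    """Center and scale an observation with running statistics."""
    return (x - stat.mean) / (stat.std + eps)


def scale_reward(r, stat, eps=1e-8):
    """Scale a reward by the running std without subtracting the mean."""
    return r / (stat.std + eps)


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); last_value is V of the state after the final step.
    dones[t] is True when step t ended the episode.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces the estimate to the one-step TD error (lower variance, more bias), while `lam=1` recovers the Monte Carlo return minus the value baseline (less bias, more variance).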

Evaluation Protocol

  • Run 100 deterministic evaluation episodes (ε = 0 for DQN, deterministic policy for SAC/PPO).
  • Report: mean reward, std, min/max, success rate (if applicable), episode length.
  • Compare against baselines: random policy, hand-coded heuristic, previous best agent.
  • Learning curve: episode reward vs. total environment steps (not wall-clock time).
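The reporting steps above can be sketched with the standard library alone. The function names and the dictionary keys are illustrative assumptions; the per-episode returns, lengths, and success flags are expected to come from your own evaluation loop.

```python
import math
import statistics


def summarize_episodes(returns, lengths, successes=None):
    """Summary statistics for one batch of evaluation episodes."""
    summary = {
        "mean_reward": statistics.fmean(returns),
        "std_reward": statistics.stdev(returns) if len(returns) > 1 else 0.0,
        "min_reward": min(returns),
        "max_reward": max(returns),
        "mean_length": statistics.fmean(lengths),
    }
    if successes is not None:
        summary["success_rate"] = sum(successes) / len(successes)
    return summary


def mean_and_stderr(per_seed_scores):
    """Mean and standard error of the mean across independent seeds."""
    n = len(per_seed_scores)
    mean = statistics.fmean(per_seed_scores)
    stderr = statistics.stdev(per_seed_scores) / math.sqrt(n) if n > 1 else 0.0
    return mean, stderr


if __name__ == "__main__":
    print(summarize_episodes([1.0, 2.0, 3.0], [10, 20, 30], [True, False, True]))
    print(mean_and_stderr([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Running the same summary on the random-policy and heuristic baselines gives directly comparable numbers.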
Install via CLI
npx skills add https://github.com/harsh040506/Claude-Code-Unified-Skill-Plugin-Library --skill reinforcement-learning
Repository Details
Stars: 2
Forks: 1
Branch: main
Path: SKILL.md