name: rl-methodology description: > Comprehensive RL methodology — mathematical analysis, convergence proofs, algorithm design patterns, and implementation templates grounded in Zhao's "Mathematical Foundations of Reinforcement Learning." layer: L0 domain: [general-rl] source-project: Book-Mathematical-Foundation-of-RL depends-on: [] tags: [theory, convergence, algorithm-design, implementation, stochastic-approximation]
RL Methodology
Purpose & When to Use
Invoke this skill when you need to:
- Mathematically analyze an RL algorithm's update rule, classify it, or identify failure modes
- Prove or disprove convergence using contraction mapping, Dvoretzky, or Robbins-Monro theorems
- Design a new RL algorithm by composing known design patterns
- Implement an RL algorithm from pseudocode with correct mathematical grounding
- Debug an RL implementation by checking mathematical correctness
- Look up theoretical foundations (Bellman equations, SA conditions, policy gradient theorem)
Knowledge Base Navigation
The knowledge-base/zhao-mathematical-foundations/ directory contains the full mathematical reference. Use this lookup table to find the right file:
| Topic | KB File | Key Content |
|---|---|---|
| MDPs, states, actions, rewards | 01-basic-concepts.md |
Definitions, notation |
| Bellman equation, v_pi, q_pi | 02-bellman-equation.md |
Matrix-vector form, closed-form solution |
| Bellman optimality equation, v* | 03-bellman-optimality-equation.md |
BOE, contraction mapping |
| Value iteration, policy iteration | 04-value-iteration-policy-iteration.md |
VI, PI, truncated PI, GPI |
| Monte Carlo methods | 05-monte-carlo-methods.md |
MC prediction, MC control, exploring starts |
| Stochastic approximation theory | 06-stochastic-approximation.md |
Robbins-Monro, Dvoretzky theorems |
| TD(0), Sarsa, Q-learning | 07-temporal-difference-methods.md |
TD methods, convergence proofs |
| Function approximation, DQN | 08-value-function-methods.md |
Linear FA, semi-gradient TD, deadly triad |
| Policy gradient, REINFORCE | 09-policy-gradient-methods.md |
PG theorem, baseline subtraction |
| Actor-critic (A2C, DDPG, SAC) | 10-actor-critic-methods.md |
QAC, A2C, off-policy AC, DPG |
| Math prerequisites | 11-appendix.md |
Norms, probability, linear algebra |
| Algorithm comparison & index | 12-cross-reference-index.md |
Convergence table, theorem index |
Analysis Procedure
Step 1: Identify the Update Rule
Extract the core update equation. Express it in the standard SA form:
w_{k+1} = w_k - alpha_k * [w_k - target_k]
or equivalently:
w_{k+1} = (1 - alpha_k) * w_k + alpha_k * target_k
Step 2: Identify the Fixed Point Equation
Determine what equation the algorithm solves at convergence:
- Bellman equation (BE):
v_pi = r_pi + gamma * P_pi * v_pi— evaluates a given policy - Bellman optimality equation (BOE):
v* = max_pi (r_pi + gamma * P_pi * v*)— finds optimal policy - Projected Bellman equation (PBE):
Phi * w = Pi_D * T_pi(Phi * w)— FA setting
Step 3: Classify the Algorithm
Model-free vs Model-based:
- Model-free: Does NOT require p(s'|s,a) or p(r|s,a) — uses samples instead
- Model-based: Requires the transition/reward model
On-policy vs Off-policy:
- On-policy: Behavior policy = target policy (Sarsa, REINFORCE)
- Off-policy: Behavior policy ≠ target policy (Q-learning, DPG)
- Key test: Does the update use actions from the behavior policy that differ from what the target policy would choose?
Value-based vs Policy-based vs Actor-Critic:
- Value-based: Learns v(s) or q(s,a), derives policy via argmax (Q-learning, DQN)
- Policy-based: Directly parameterizes and optimizes pi(a|s,theta) (REINFORCE)
- Actor-Critic: Maintains both policy (actor) and value (critic) networks (A2C, DDPG)
Step 4: Formulate as Stochastic Approximation
Rewrite the update as a Robbins-Monro or Dvoretzky problem:
Robbins-Monro form: w_{k+1} = w_k - alpha_k * g_tilde(w_k, eta_k)
where g_tilde(w, eta) = g(w) + eta, E[eta|history] = 0
Dvoretzky form: Delta_{k+1} = (1 - alpha_k) * Delta_k + beta_k * eta_k
where Delta_k = w_k - w* is the error
For Q-learning specifically, use the Extended Dvoretzky (Theorem 6.3):
- The expectation condition allows bias:
||E[eta_k | H_k]||_inf <= gamma * ||Delta_k||_inf - This is critical because Q-learning's target contains the max operator
Step 5: Check for Failure Modes
The Deadly Triad (divergence risk when ALL THREE present):
- Function approximation (not tabular)
- Bootstrapping (using current estimates in targets)
- Off-policy learning (behavior ≠ target policy)
Common failure modes:
- Constant learning rate with stochastic updates → oscillation, no convergence
- Learning rate too large → divergence
- No exploration → stuck at suboptimal policy
- Linear FA + off-policy + bootstrapping → potential divergence (Baird's counterexample)
- Nonlinear FA → no convergence guarantees in general
Step 6: Analyze Convergence Properties
Reference the appropriate convergence theorem:
- Tabular value-based: Contraction mapping theorem (gamma < 1 guarantees convergence)
- Tabular TD/Q-learning: Dvoretzky's theorem or Extended Dvoretzky
- Linear FA + on-policy: TD converges to w* = A^{-1}b where A is positive definite
- Policy gradient: Converges to local optimum under standard conditions
- Nonlinear FA + off-policy: No general guarantees
Output format — when analyzing an algorithm, produce:
- Algorithm Classification: (model-free/based, on/off-policy, value/policy/AC)
- Update Rule: (in standard form)
- Fixed Point: (what equation it solves)
- SA Formulation: (Robbins-Monro or Dvoretzky form)
- Convergence: (theorem applicable, conditions required)
- Failure Modes: (potential issues and mitigations)
Convergence Proof Procedure
Step 1: Identify the Algorithm Type
| Type | Primary Tool |
|---|---|
| Value iteration (exact) | Contraction mapping theorem |
| Policy iteration (exact) | Monotonic improvement + bounded values |
| TD methods (tabular) | Dvoretzky's theorem |
| Q-learning (tabular) | Extended Dvoretzky's theorem |
| TD with linear FA | Matrix A positive definiteness |
| Policy gradient | Standard gradient ascent convergence |
| SGD-based | Robbins-Monro theorem |
Step 2: Formulate as SA Problem
Rewrite the update rule in one of these forms:
- Error form: Delta_{k+1} = (1-alpha_k)Delta_k + beta_keta_k
- RM form: w_{k+1} = w_k - alpha_k*(g(w_k) + noise_k)
- Operator form: v_{k+1} = T(v_k) [for contraction mapping]
Step 3: Verify Conditions
Check each condition of the applicable theorem:
Learning rate conditions (almost always needed):
- sum(alpha_k) = infinity (ensures we can reach any point)
- sum(alpha_k^2) < infinity (ensures noise averages out)
- Common valid schedules: alpha_k = 1/k, alpha_k = c/(c+k)
Noise conditions:
- E[eta_k | H_k] = 0 (for Dvoretzky) or ||E[eta_k | H_k]||_inf <= gamma*||Delta||_inf (for Extended Dvoretzky)
- Second moment bounded: E[eta_k^2 | H_k] <= C
Contraction conditions:
- gamma < 1 (discount factor)
- Operator satisfies ||T(v1) - T(v2)|| <= gamma * ||v1 - v2||
Step 4: Handle Common Difficulties
Problem: Biased noise (e.g., Q-learning)
- Solution: Use Extended Dvoretzky instead of standard Dvoretzky
- Show bias is bounded by gamma * ||Delta_k||_inf
Problem: Function approximation
- Linear FA + on-policy: Show matrix A is positive definite
- Linear FA + off-policy: Check for potential divergence (deadly triad)
- Nonlinear FA: Generally no convergence guarantees; use empirical validation
Step 5: State the Result
Clearly state:
- The fixed point the algorithm converges to
- The conditions required for convergence
- The convergence rate (if available)
- Any limitations or caveats
Output format — when proving convergence, produce:
- Algorithm: (name and update rule)
- Theorem Applied: (which convergence theorem)
- SA Formulation: (error form or RM form)
- Condition Verification: (check each condition with justification)
- Conclusion: (converges to what, under what conditions, at what rate)
- Caveats: (limitations, failure modes)
Quick Reference: Convergence Results
| Algorithm | Converges? | Conditions | Fixed Point |
|---|---|---|---|
| Value Iteration | Yes | gamma < 1 | v* (optimal value) |
| Policy Iteration | Yes | gamma < 1 | v* (optimal value) |
| TD(0) tabular | Yes | SA learning rates | v_pi |
| Sarsa tabular | Yes | SA learning rates + GLIE | q* |
| Q-learning tabular | Yes | SA learning rates + exploration | q* |
| TD linear FA, on-policy | Yes | SA learning rates | w* = A^{-1}b |
| Q-learning linear FA | May diverge | Deadly triad risk | — |
| Q-learning nonlinear FA | No guarantee | Empirical only | — |
| REINFORCE | Yes (local) | SA learning rates | Local optimum of J |
| A2C | Yes (local) | SA learning rates, two timescales | Local optimum |
Core Convergence Theorems
Theorem 1: Contraction Mapping — Let T: X → X be a contraction with modulus gamma in [0,1). Then T has a unique fixed point x*, and x_{k+1} = T(x_k) converges with rate gamma^k.
Theorem 2: Dvoretzky's (Thm 6.2) — For Delta_{k+1} = (1-alpha_k)Delta_k + beta_keta_k, if SA learning rates hold, sum(beta_k^2) < inf, E[eta_k|H_k] = 0, and E[eta_k^2|H_k] <= C, then Delta_k → 0 a.s.
Theorem 3: Extended Dvoretzky (Thm 6.3) — Same as above but allows biased noise: ||E[eta_k|H_k]||_inf <= gamma*||Delta_k||_inf. Critical for Q-learning convergence.
Theorem 4: Robbins-Monro — For w_{k+1} = w_k - alpha_kg_tilde(w_k, eta_k), if g is monotone, SA learning rates hold, and noise is zero-mean, then w_k → w a.s.
Theorem 5: Linear TD Convergence — For TD(0) with linear FA v_hat(s,w) = phi(s)^Tw: A = Phi^TD*(I-gammaP_pi)Phi is positive definite, so w = A^{-1}b exists and is unique. Error bound: ||Phiw - v_pi||_D <= (1/sqrt(1-gamma^2))min_w||Phiw - v_pi||_D.
Algorithm Design Procedure
Step 1: Define the Problem Setting
- State space: discrete or continuous?
- Action space: discrete or continuous?
- Model availability: known or unknown?
- Episode structure: episodic or continuing?
- Data regime: online or batch?
Step 2: Choose the Core Approach
| Setting | Recommended |
|---|---|
| Discrete S, A, known model | Dynamic Programming (VI/PI) |
| Discrete S, A, unknown model, episodic | MC or TD methods |
| Large/continuous S, discrete A | DQN-style (value + FA) |
| Continuous S and A | Actor-Critic (A2C, DDPG, SAC) |
| Need sample efficiency | Off-policy + replay buffer |
Step 3: Select Design Patterns
Compose from the 10 core patterns:
Pattern 1: Generalized Policy Iteration (GPI) — Alternating evaluation and improvement. VI (j=1), PI (j=∞), truncated PI (j steps).
Pattern 2: Bootstrapping — Using current estimates in targets: r + gamma*v(s'). Trades bias for variance. Risk: part of deadly triad.
Pattern 2.5: Monte Carlo Estimation — Complete episode returns as unbiased estimates. High variance but no bootstrapping bias.
Pattern 3: Stochastic Approximation — Replace expectations with samples. Use diminishing step sizes: sum(alpha_k)=inf, sum(alpha_k^2)<inf.
Pattern 4: Epsilon-Greedy Exploration — Greedy with prob 1-eps, random with prob eps. Variants: decaying epsilon, Boltzmann, UCB.
Pattern 5: Experience Replay — Store transitions in buffer, sample for training. Breaks correlation, reuses data. Used in DQN, DDPG, SAC.
Pattern 6: Target Networks — Separate slowly-updated network for targets. Prevents moving target problem. Used in DQN, DDPG, TD3.
Pattern 7: Importance Sampling — Correct for distribution mismatch in off-policy learning. Weight by rho = pi(a|s)/beta(a|s).
Pattern 8: Baseline Subtraction — Subtract b(s) from returns in PG. Key: E[grad ln pi * b(S)] = 0. Best baseline: b(s) = v_pi(s).
Pattern 9: Function Approximation — Parameterize value/policy: v_hat(s,w) = phi(s)^T*w or neural net. Enables generalization.
Pattern 10: Actor-Critic Decomposition — Separate actor (policy) and critic (value). Lower variance than pure PG, handles continuous actions.
Step 4: Derive the Update Rule
Value-based: w_{t+1} = w_t + alpha * [target - v_hat(s_t, w_t)] * grad_w v_hat(s_t, w_t)
Policy gradient: theta_{t+1} = theta_t + alpha * grad_theta ln pi(a_t|s_t, theta) * [G_t - b(s_t)]
Actor-critic:
Critic: w_{t+1} = w_t + alpha_w * delta_t * grad_w v_hat(s_t, w_t)
Actor: theta_{t+1} = theta_t + alpha_theta * delta_t * grad_theta ln pi(a_t|s_t, theta)
where delta_t = r_{t+1} + gamma * v_hat(s_{t+1}, w) - v_hat(s_t, w)
Step 5: Verify Mathematical Soundness
- Check convergence conditions (reference
06-stochastic-approximation.md) - Verify no deadly triad combination
- Ensure learning rate schedule satisfies SA conditions
- Check that the update targets the correct fixed point
Step 6: Specify Hyperparameters
- Learning rate schedule: alpha_k = 1/k or alpha_k = alpha_0 / (1 + k/tau)
- Discount factor: gamma in [0.9, 0.999] typical
- Exploration: epsilon schedule, entropy coefficient
- Target network: update frequency C or soft update tau
- Replay buffer: size, mini-batch size, prioritization
Output format — when designing an algorithm, produce:
- Problem Setting: (state/action spaces, model availability, etc.)
- Algorithm Name: (descriptive name)
- Design Patterns Used: (list of patterns composed)
- Update Rules: (complete mathematical specification)
- Pseudocode: (step-by-step algorithm)
- Hyperparameters: (with recommended defaults)
- Convergence Notes: (conditions and guarantees)
Implementation Templates
Environment Interfaces
Templates below are written against three interfaces, from most to least specific:
Tabular MDP interface (VI, PI, MC, Sarsa, Q-learning, REINFORCE, A2C):
n_states: int # |S|, states are integers 0..n_states-1
n_actions: int # |A|, actions are integers 0..n_actions-1
step(state, action) -> (next_state, reward) # transition (deterministic or stochastic)
is_terminal(state) -> bool # episode termination check
sample_start() -> state # sample initial state
Known-model extension (VI, PI only — adds the transition/reward model):
P: ndarray[n_states, n_actions, n_states] # p(s'|s,a) transition probabilities
R: ndarray[n_states, n_actions] # E[r|s,a] expected rewards
Gymnasium interface (DQN — stateful, continuous observations):
reset() -> (observation, info) # start episode
step(action) -> (observation, reward, terminated, truncated, info)
Any environment that implements the appropriate interface works with the corresponding templates.
Model-Based: Value Iteration (Algorithm 4.1)
import numpy as np
def value_iteration(P, R, n_states, n_actions, gamma=0.9, theta=1e-6,
max_iter=1000):
"""v_{k+1}(s) = max_a [R(s,a) + gamma * sum_{s'} P(s,a,s') * v_k(s')]
Args:
P: Transition probabilities, shape (n_states, n_actions, n_states).
R: Expected rewards, shape (n_states, n_actions).
Returns:
V: Optimal value function, shape (n_states,).
policy: Greedy policy, shape (n_states,).
"""
V = np.zeros(n_states)
for _ in range(max_iter):
# Q(s,a) = R(s,a) + gamma * P(s,a,:) @ V
Q = R + gamma * np.einsum('ijk,k->ij', P, V)
V_new = np.max(Q, axis=1)
if np.max(np.abs(V_new - V)) < theta:
V = V_new
break
V = V_new
Q = R + gamma * np.einsum('ijk,k->ij', P, V)
policy = np.argmax(Q, axis=1)
return V, policy
Model-Based: Policy Iteration (Algorithm 4.2)
def policy_iteration(P, R, n_states, n_actions, gamma=0.9, theta=1e-6,
max_iter=1000):
"""Evaluate pi exactly, then improve greedily.
Args:
P: Transition probabilities, shape (n_states, n_actions, n_states).
R: Expected rewards, shape (n_states, n_actions).
Returns:
V: Optimal value function, shape (n_states,).
policy: Optimal policy, shape (n_states,).
"""
policy = np.zeros(n_states, dtype=int)
V = np.zeros(n_states)
for _ in range(max_iter):
# Policy evaluation: solve v_pi = r_pi + gamma * P_pi @ v_pi
P_pi = P[np.arange(n_states), policy] # (n_states, n_states)
r_pi = R[np.arange(n_states), policy] # (n_states,)
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
# Policy improvement
Q = R + gamma * np.einsum('ijk,k->ij', P, V)
new_policy = np.argmax(Q, axis=1)
if np.array_equal(new_policy, policy):
break
policy = new_policy
return V, policy
Model-Free: MC Epsilon-Greedy (Algorithm 5.3)
def mc_epsilon_greedy(env, n_episodes=5000, gamma=0.9, epsilon=0.1,
max_steps=200):
"""First-visit MC with epsilon-greedy. Uses complete episode returns.
env must provide: n_states, n_actions, step(s, a), is_terminal(s),
sample_start().
"""
Q = np.zeros((env.n_states, env.n_actions))
returns_count = np.zeros((env.n_states, env.n_actions))
for _ in range(n_episodes):
trajectory = []
state = env.sample_start()
for _ in range(max_steps):
if np.random.random() < epsilon:
action = np.random.randint(env.n_actions)
else:
action = np.argmax(Q[state])
next_state, reward = env.step(state, action)
trajectory.append((state, action, reward))
state = next_state
if env.is_terminal(state):
break
# Backward return computation (first-visit)
G = 0.0
visited = set()
for t in reversed(range(len(trajectory))):
s_t, a_t, r_t = trajectory[t]
G = r_t + gamma * G
if (s_t, a_t) not in visited:
visited.add((s_t, a_t))
returns_count[s_t, a_t] += 1
Q[s_t, a_t] += (G - Q[s_t, a_t]) / returns_count[s_t, a_t]
return Q, np.argmax(Q, axis=1)
Model-Free: Sarsa (Algorithm 7.1)
def sarsa(env, n_episodes=5000, gamma=0.9, alpha_0=0.5, epsilon_0=0.1,
max_steps=200):
"""On-policy TD control: q(s,a) += alpha * [r + gamma*q(s',a') - q(s,a)]
env must provide: n_states, n_actions, step(s, a), is_terminal(s),
sample_start().
"""
Q = np.zeros((env.n_states, env.n_actions))
visit_count = np.zeros((env.n_states, env.n_actions))
def eps_greedy(state, epsilon):
if np.random.random() < epsilon:
return np.random.randint(env.n_actions)
return np.argmax(Q[state])
for episode in range(n_episodes):
state = env.sample_start()
epsilon = epsilon_0 / (1 + episode / 1000)
action = eps_greedy(state, epsilon)
for _ in range(max_steps):
next_state, reward = env.step(state, action)
visit_count[state, action] += 1
next_action = eps_greedy(next_state, epsilon)
alpha = alpha_0 / visit_count[state, action]
Q[state, action] += alpha * (
reward + gamma * Q[next_state, next_action] - Q[state, action])
state, action = next_state, next_action
if env.is_terminal(state):
break
return Q, np.argmax(Q, axis=1)
Model-Free: Q-Learning (Algorithm 7.x)
def q_learning(env, n_episodes=5000, gamma=0.9, alpha_0=0.5, epsilon_0=0.1,
max_steps=200):
"""Off-policy TD control: q(s,a) += alpha * [r + gamma*max q(s',.) - q(s,a)]
env must provide: n_states, n_actions, step(s, a), is_terminal(s),
sample_start().
"""
Q = np.zeros((env.n_states, env.n_actions))
visit_count = np.zeros((env.n_states, env.n_actions))
for episode in range(n_episodes):
state = env.sample_start()
for _ in range(max_steps):
epsilon = epsilon_0 / (1 + episode / 1000)
if np.random.random() < epsilon:
action = np.random.randint(env.n_actions)
else:
action = np.argmax(Q[state])
next_state, reward = env.step(state, action)
visit_count[state, action] += 1
alpha = alpha_0 / visit_count[state, action]
Q[state, action] += alpha * (
reward + gamma * np.max(Q[next_state]) - Q[state, action])
state = next_state
if env.is_terminal(state):
break
return Q, np.argmax(Q, axis=1)
Policy Gradient: REINFORCE (Algorithm 9.1)
def reinforce(env, n_episodes=5000, gamma=0.99, alpha=0.01, max_steps=200):
"""theta += alpha * gamma^t * G_t * grad ln pi(a|s,theta)
Tabular softmax policy. env must provide: n_states, n_actions,
step(s, a), is_terminal(s), sample_start().
"""
theta = np.zeros((env.n_states, env.n_actions))
def softmax_policy(state):
logits = theta[state] - np.max(theta[state])
exp_logits = np.exp(logits)
return exp_logits / np.sum(exp_logits)
for episode in range(n_episodes):
trajectory = []
state = env.sample_start()
for _ in range(max_steps):
probs = softmax_policy(state)
action = np.random.choice(env.n_actions, p=probs)
next_state, reward = env.step(state, action)
trajectory.append((state, action, reward))
state = next_state
if env.is_terminal(state):
break
G = 0.0
for t in reversed(range(len(trajectory))):
s_t, a_t, r_t = trajectory[t]
G = r_t + gamma * G
probs = softmax_policy(s_t)
grad_ln_pi = -probs.copy()
grad_ln_pi[a_t] += 1.0
theta[s_t] += alpha * (gamma ** t) * G * grad_ln_pi
return theta, softmax_policy
Actor-Critic: A2C (Algorithm 10.2)
def a2c(env, n_episodes=5000, gamma=0.99, alpha_theta=0.01, alpha_w=0.05,
max_steps=200):
"""Actor (theta) + critic (w) with TD error as advantage estimate.
Tabular softmax actor, tabular state-value critic. env must provide:
n_states, n_actions, step(s, a), is_terminal(s), sample_start().
"""
theta = np.zeros((env.n_states, env.n_actions))
w = np.zeros(env.n_states)
def softmax_policy(state):
logits = theta[state] - np.max(theta[state])
exp_logits = np.exp(logits)
return exp_logits / np.sum(exp_logits)
for episode in range(n_episodes):
state = env.sample_start()
for _ in range(max_steps):
probs = softmax_policy(state)
action = np.random.choice(env.n_actions, p=probs)
next_state, reward = env.step(state, action)
# TD error = advantage estimate: delta = r + gamma*V(s') - V(s)
delta = reward + gamma * w[next_state] - w[state]
w[state] += alpha_w * delta
grad_ln_pi = -probs.copy()
grad_ln_pi[action] += 1.0
theta[state] += alpha_theta * delta * grad_ln_pi
state = next_state
if env.is_terminal(state):
break
return theta, w, softmax_policy
Function Approximation: TD(0) with Linear FA (Ch 8)
def td_linear(env, feature_fn, n_episodes=5000, gamma=0.9, alpha_0=0.01,
max_steps=200):
"""Semi-gradient TD(0): w += alpha * [r + gamma*phi(s')^T w - phi(s)^T w] * phi(s)
env must provide: n_actions, step(s, a), is_terminal(s), sample_start().
feature_fn(state) -> ndarray of shape (d,).
"""
d = feature_fn(0).shape[0]
w = np.zeros(d)
for episode in range(n_episodes):
state = env.sample_start()
alpha = alpha_0 / (1 + episode / 1000)
for _ in range(max_steps):
action = np.random.randint(env.n_actions) # uniform random behavior
next_state, reward = env.step(state, action)
phi_s = feature_fn(state)
phi_s_next = feature_fn(next_state)
td_error = reward + gamma * (phi_s_next @ w) - (phi_s @ w)
w += alpha * td_error * phi_s
state = next_state
if env.is_terminal(state):
break
return w
Deep RL: DQN (Algorithm 8.3 extended)
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random
class DQN(nn.Module):
def __init__(self, obs_dim, n_actions, hidden=64):
super().__init__()
self.net = nn.Sequential(
nn.Linear(obs_dim, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
nn.Linear(hidden, n_actions)
)
def forward(self, x):
return self.net(x)
def train_dqn(env, obs_dim, n_actions, n_episodes=10000, gamma=0.99,
lr=1e-3, buffer_size=10000, batch_size=64, target_update=100):
"""DQN with experience replay and target network.
env must follow Gymnasium interface: reset() -> (obs, info),
step(action) -> (obs, reward, terminated, truncated, info).
"""
q_net = DQN(obs_dim, n_actions)
target_net = DQN(obs_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())
optimizer = optim.Adam(q_net.parameters(), lr=lr)
replay_buffer = deque(maxlen=buffer_size)
for episode in range(n_episodes):
obs, _ = env.reset()
epsilon = max(0.01, 1.0 - episode / (n_episodes * 0.5))
for _ in range(200):
if random.random() < epsilon:
action = random.randint(0, n_actions - 1)
else:
with torch.no_grad():
action = q_net(torch.FloatTensor(obs)).argmax().item()
next_obs, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
replay_buffer.append((obs, action, reward, next_obs, done))
obs = next_obs
if len(replay_buffer) >= batch_size:
batch = random.sample(replay_buffer, batch_size)
s, a, r, s2, d = zip(*batch)
s_t = torch.FloatTensor(np.array(s))
a_t = torch.LongTensor(a)
r_t = torch.FloatTensor(r)
s2_t = torch.FloatTensor(np.array(s2))
d_t = torch.FloatTensor(d)
q_vals = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze()
with torch.no_grad():
targets = r_t + gamma * target_net(s2_t).max(1)[0] * (1 - d_t)
loss = nn.MSELoss()(q_vals, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if done:
break
if episode % target_update == 0:
target_net.load_state_dict(q_net.state_dict())
return q_net
Reference Environment: Zhao's 3x3 Grid World
Concrete implementation of the tabular MDP interface, matching the book's examples:
class GridWorld:
"""3x3 grid world from Zhao's 'Mathematical Foundations of RL'.
Implements the tabular MDP interface: n_states, n_actions,
step(s, a), is_terminal(s), sample_start().
"""
def __init__(self, size=3, gamma=0.9):
self.size = size
self.n_states = size * size
self.n_actions = 5 # up, down, left, right, stay
self.gamma = gamma
self.action_effects = {
0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)
}
self.forbidden_states = set()
self.terminal_states = set() # e.g., {8} for goal at bottom-right
self.rewards = {} # (s, a, s') -> r; default is -1
def step(self, state, action):
r, c = state // self.size, state % self.size
dr, dc = self.action_effects[action]
nr, nc = r + dr, c + dc
if 0 <= nr < self.size and 0 <= nc < self.size:
next_state = nr * self.size + nc
if next_state not in self.forbidden_states:
reward = self.rewards.get((state, action, next_state), -1)
return next_state, reward
return state, self.rewards.get((state, action, state), -1)
def is_terminal(self, state):
return state in self.terminal_states
def sample_start(self):
return np.random.randint(self.n_states)
def get_model(self):
"""Returns (P, R) for model-based algorithms.
P: shape (n_states, n_actions, n_states) — transition probabilities.
R: shape (n_states, n_actions) — expected rewards.
"""
P = np.zeros((self.n_states, self.n_actions, self.n_states))
R = np.full((self.n_states, self.n_actions), -1.0)
for s in range(self.n_states):
for a in range(self.n_actions):
s_next, reward = self.step(s, a)
P[s, a, s_next] = 1.0
R[s, a] = reward
return P, R
Learning Rate Schedules
SA-Compliant Schedules (sum=inf, sum-sq<inf)
# Schedule 1: Harmonic (1/k)
alpha_k = 1.0 / (visit_count + 1)
# Schedule 2: Polynomial decay
alpha_k = alpha_0 / (1 + k / tau) # tau controls decay speed
# Schedule 3: Per-state-action (recommended for tabular)
alpha_k_sa = 1.0 / n_visits[s, a]
Practical Schedules (constant or slow decay)
# Constant (does NOT satisfy SA conditions, but works with target networks)
alpha = 1e-3 # for DQN, DDPG
# Linear decay
alpha_k = alpha_0 * (1 - k / n_total)
# Exponential decay
alpha_k = alpha_0 * decay_rate ** k
Feature Engineering
Tabular (one-hot) Features
def tabular_features(state, n_states):
"""Exact representation — linear FA with one-hot recovers tabular methods."""
phi = np.zeros(n_states)
phi[state] = 1.0
return phi
Polynomial Features (2D grid)
def polynomial_features(coords, degree=3):
"""Polynomial basis for 2D state spaces.
Args:
coords: (x, y) normalized to [0, 1].
degree: Maximum polynomial degree.
Returns:
Feature vector of dimension (degree+1)(degree+2)/2.
"""
x, y = coords
features = []
for i in range(degree + 1):
for j in range(degree + 1 - i):
features.append(x**i * y**j)
return np.array(features)
Fourier Basis Features (2D grid)
def fourier_features(coords, order=3):
"""Fourier cosine basis for 2D state spaces.
Args:
coords: (x, y) normalized to [0, 1].
order: Maximum frequency order.
Returns:
Feature vector of dimension (order+1)^2.
"""
x, y = coords
features = []
for i in range(order + 1):
for j in range(order + 1):
features.append(np.cos(np.pi * (i * x + j * y)))
return np.array(features)
Adapting to any state representation
# For a grid world, convert state index to normalized coords:
def grid_to_coords(state, size):
r, c = state // size, state % size
return r / (size - 1), c / (size - 1)
# Then compose: feature_fn = lambda s: polynomial_features(grid_to_coords(s, 3))
Debugging Guide
| Symptom | Likely Cause | Fix |
|---|---|---|
| Values diverge | Learning rate too high | Reduce alpha or use SA schedule |
| No learning | Learning rate too low or no exploration | Increase alpha_0 or epsilon |
| Oscillating values | Constant alpha with noise | Use decreasing alpha_k |
| Wrong policy | Reward function incorrect | Verify r(s,a,s') |
| Slow convergence | Poor exploration | Increase epsilon, use exploring starts |
| FA divergence | Deadly triad | Use target networks, reduce bootstrap |
Key Equations Reference
Bellman Equation (matrix-vector form)
v_pi = r_pi + gamma * P_pi * v_pi
Closed-form: v_pi = (I - gamma * P_pi)^{-1} * r_pi
Bellman Optimality Equation
v*(s) = max_a [sum_r p(r|s,a)*r + gamma * sum_{s'} p(s'|s,a) * v*(s')]
Action Value
q_pi(s,a) = sum_r p(r|s,a) * r + gamma * sum_{s'} p(s'|s,a) * v_pi(s')
Relationship: v_pi(s) = sum_a pi(a|s) * q_pi(s,a)
TD(0) Update
v_{t+1}(s_t) = v_t(s_t) + alpha_t * [r_{t+1} + gamma * v_t(s_{t+1}) - v_t(s_t)]
Q-learning Update
q_{t+1}(s_t,a_t) = q_t(s_t,a_t) + alpha_t * [r_{t+1} + gamma * max_{a'} q_t(s_{t+1},a') - q_t(s_t,a_t)]
Policy Gradient
grad_theta J(theta) = E_{S~d_pi, A~pi}[grad_theta ln pi(A|S,theta) * q_pi(S,A)]