pg-dpo-non-exponential-discounting


Pontryagin-Guided Direct Policy Optimization (PG-DPO): a variational RL framework that replaces Bellman recursions with the Pontryagin Maximum Principle for non-exponential discounting (hyperbolic, survival-discount). Handles settings where standard value-based and actor-critic methods fail. Use when: RL with non-exponential discounting, hyperbolic discounting RL, human-like time preferences, survival processes, Pontryagin-based RL. Activation: PG-DPO, non-exponential discounting, Pontryagin RL, hyperbolic discounting, Bellman breakdown, Adjoint-MC projection.

By hiyenwong. Updated 6/4/2026


Beyond the Bellman Recursion: Pontryagin-Guided Direct Policy Optimization

Source paper: arXiv:2605.20996
Authors: Hojin Ko, Jeonggyu Huh

Core Problem

Most value-based and actor-critic RL methods rely on Bellman-style recursions, which break down under non-exponential discounting, such as the hyperbolic discounting common in human preferences or the discounting induced by survival processes. This work:

  • Shows the breakdown is structural: exponential discounting sits at a fragile intersection of multiplicativity and time homogeneity
  • Proposes Pontryagin-Guided Direct Policy Optimization (PG-DPO), a variational framework that abandons recursion entirely

Key Contributions

1. Structural Breakdown Analysis

  • Proves why Bellman recursions fail for non-exponential discounting
  • Identifies the cause: violating either multiplicativity or time homogeneity breaks standard dynamic programming
  • Explains why existing approaches diverge under hyperbolic/survival discounting
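
The two properties named above can be checked numerically. The sketch below is illustrative only: the function names, the exponential factor 0.95, and the hyperbolic rate k = 0.1 are assumptions chosen for the example, not values from the paper. It tests multiplicativity, d(t+s) = d(t)·d(s), and whether the one-step ratio d(t+1)/d(t) is a single constant, which is what a Bellman backup with a fixed γ implicitly requires.

```python
import numpy as np

def exponential(t, gamma=0.95):
    """Exponential discount d(t) = gamma**t (gamma is an illustrative value)."""
    return gamma ** np.asarray(t, dtype=float)

def hyperbolic(t, k=0.1):
    """Hyperbolic discount d(t) = 1 / (1 + k*t) (k is an illustrative value)."""
    return 1.0 / (1.0 + k * np.asarray(t, dtype=float))

def one_step_ratios(d, horizon=50):
    """Ratios d(t+1)/d(t) for t = 0..horizon-1."""
    t = np.arange(horizon)
    return d(t + 1) / d(t)

def is_multiplicative(d, horizon=20, tol=1e-12):
    """Check d(t+s) == d(t)*d(s) on an integer grid of (t, s) pairs."""
    t = np.arange(horizon)
    lhs = d(t[:, None] + t[None, :])
    rhs = d(t[:, None]) * d(t[None, :])
    return bool(np.max(np.abs(lhs - rhs)) < tol)

exp_ratios = one_step_ratios(exponential)
hyp_ratios = one_step_ratios(hyperbolic)

# Spread of the one-step ratio: ~0 for exponential, clearly nonzero for hyperbolic.
print("ratio spread:", np.ptp(exp_ratios), np.ptp(hyp_ratios))
print("multiplicative:", is_multiplicative(exponential), is_multiplicative(hyperbolic))
```

Only the exponential discount passes both checks, which matches the claim that it sits at the intersection of the two properties.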

2. PG-DPO Framework

  • Abandons recursion: no Bellman backup, no value function
  • Couples the Pontryagin Maximum Principle (PMP) with Monte Carlo rollouts
  • Uses Adjoint-MC projection enforcing pointwise Hamiltonian maximization
  • A variational framework that optimizes policy via the necessary conditions of optimal control

3. Performance

  • Improves accuracy and stability where equation-driven solvers and critic-based baselines diverge
  • Handles multi-dimensional hyperbolic and survival-discount benchmarks

Algorithm Design

PG-DPO Framework

1. Parameterize policy π_θ(a|s)
2. For each rollout:
   a. Collect trajectory τ = (s_0, a_0, r_0, ..., s_T)
   b. Compute Hamiltonian H(t) = γ(t)·r(s(t), a(t)) + p(t)·f(s(t), a(t))
      where f is the state dynamics, p(t) is the adjoint (costate) variable,
      and γ(t) is the non-exponential discount function
   c. Adjoint-MC projection: enforce pointwise Hamiltonian maximization
      by projecting policy gradients onto the Pontryagin conditions
3. Update θ via policy gradient with Hamiltonian-consistent advantage
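
The forward-rollout and backward-costate structure of the steps above can be sketched on a toy problem. This is not the authors' implementation: it assumes a deterministic scalar system with linear dynamics and quadratic reward, uses an open-loop action sequence in place of a neural policy π_θ, replaces Monte Carlo rollouts with a single deterministic rollout, and omits the Adjoint-MC projection. All constants (`ALPHA`, `BETA`, `COST`, the hyperbolic rate `k`) are hypothetical. It uses the reward-maximization convention H = γ(t)·r + p·f with terminal condition p(T) = 0.

```python
import numpy as np

ALPHA, BETA, COST = -0.2, 1.0, 0.1   # hypothetical toy constants
DT, STEPS, S0 = 0.1, 50, 1.0         # Euler step, horizon, initial state

def discount(t, k=0.5):
    """Hyperbolic discount gamma(t) = 1 / (1 + k*t)."""
    return 1.0 / (1.0 + k * t)

def rollout(actions):
    """Forward pass: s_{t+1} = s_t + DT * (ALPHA*s_t + BETA*a_t)."""
    states = np.empty(STEPS + 1)
    states[0] = S0
    for t in range(STEPS):
        states[t + 1] = states[t] + DT * (ALPHA * states[t] + BETA * actions[t])
    return states

def objective(actions):
    """J = sum_t DT * gamma(t) * r(s_t, a_t), with r = -0.5*(s^2 + COST*a^2)."""
    states = rollout(actions)
    d = discount(DT * np.arange(STEPS))
    rewards = -0.5 * (states[:-1] ** 2 + COST * actions ** 2)
    return float(np.sum(DT * d * rewards))

def adjoint_gradient(actions):
    """Backward costate pass; returns dJ/da_t = DT * dH_t/da for every step."""
    states = rollout(actions)
    d = discount(DT * np.arange(STEPS))
    grad = np.empty(STEPS)
    p_next = 0.0  # terminal condition p_T = 0 (no terminal reward)
    for t in reversed(range(STEPS)):
        # dH/da with H_t = gamma_t * r(s_t, a_t) + p_{t+1} * f(s_t, a_t)
        grad[t] = DT * (d[t] * (-COST * actions[t]) + p_next * BETA)
        # Discrete costate recursion: p_t = p_{t+1} + DT * dH/ds
        p_next = p_next + DT * (d[t] * (-states[t]) + p_next * ALPHA)
    return grad

actions = np.zeros(STEPS)
j_start = objective(actions)
for _ in range(200):
    actions += 0.5 * adjoint_gradient(actions)  # plain gradient ascent on J
j_end = objective(actions)
print(j_start, j_end)
```

No value function or Bellman backup appears anywhere: the discount enters only as a per-step weight, so any discount function can be dropped into `discount` without changing the rest of the loop.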

Key Equations

Bellman breakdown condition:

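  • Standard (exponential, constant γ): V(s) = max_a [r(s,a) + γ·V(s')]
  • Non-exponential: V(s) = max_a [r(s,a) + γ(t)·V(s')] is NOT valid when γ(t) is not exponential, because the one-step ratio γ(t+1)/γ(t) changes with t and no single time-independent V satisfies the recursion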
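
A short derivation makes the breakdown concrete. The notation here is an assumption introduced for illustration: d(k) is the weight an agent places on a reward k steps ahead (with d(0) = 1), V_t is the discounted return evaluated at time t along a fixed trajectory, and κ is a hyperbolic rate.

```latex
\begin{align*}
V_t &= \sum_{k \ge 0} d(k)\, r_{t+k}
     = r_t + \sum_{k \ge 0} d(k+1)\, r_{t+1+k}, \\[4pt]
\text{exponential: } d(k) = \gamma^{k}
  &\;\Rightarrow\; d(k+1) = \gamma\, d(k)
   \;\Rightarrow\; V_t = r_t + \gamma\, V_{t+1}, \\[4pt]
\text{hyperbolic: } d(k) = \frac{1}{1+\kappa k}
  &\;\Rightarrow\; \frac{d(k+1)}{d(k)} = \frac{1+\kappa k}{1+\kappa (k+1)}
   \quad\text{(depends on } k\text{)}.
\end{align*}
```

In the hyperbolic case the tail sum is not a constant multiple of V_{t+1}, so the one-step recursion has no counterpart.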

PG-DPO alternative:

  • Hamiltonian: H(s,a,p,t) = γ(t)·r(s,a) + p·f(s,a), where f is the state dynamics ds/dt = f(s,a)
  • Policy optimality: a*(s,p,t) = argmax_a H(s,a,p,t)
  • Adjoint equation: dp/dt = -∂H/∂s
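
For a concrete feel of the pointwise maximization a* = argmax_a H, consider a toy case where the maximizer has a closed form. The sketch assumes the reward-maximization convention H = γ(t)·r + p·f, a quadratic reward r = -½(s² + c·a²), linear dynamics f = α·s + β·a, and a hyperbolic γ(t); all of these, and every constant, are hypothetical choices for illustration. Setting ∂H/∂a = 0 gives a* = p·β / (γ(t)·c), which the code compares with a brute-force grid search.

```python
import numpy as np

def gamma(t, k=0.5):
    """Hyperbolic discount gamma(t) = 1 / (1 + k*t) (k is illustrative)."""
    return 1.0 / (1.0 + k * t)

def hamiltonian(s, a, p, t, alpha=-0.2, beta=1.0, cost=0.1):
    """H = gamma(t) * r(s, a) + p * f(s, a) for the toy reward and dynamics."""
    reward = -0.5 * (s ** 2 + cost * a ** 2)
    drift = alpha * s + beta * a
    return gamma(t) * reward + p * drift

def argmax_action(p, t, beta=1.0, cost=0.1):
    """Closed-form maximizer from dH/da = -gamma(t)*cost*a + p*beta = 0."""
    return p * beta / (gamma(t) * cost)

s, p, t = 1.0, -0.3, 2.0
a_star = argmax_action(p, t)

# Brute-force check: evaluate H on a fine action grid and take the best point.
grid = np.linspace(-20.0, 20.0, 400001)
a_grid = grid[np.argmax(hamiltonian(s, grid, p, t))]
print(a_star, a_grid)
```

Because γ(t) appears in the denominator, the same costate produces a larger action later in the horizon, when the control cost is discounted more heavily.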

Application Scenarios

  • Human preference modeling: Hyperbolic discounting in behavioral economics
  • Survival analysis: Processes with non-constant hazard rates
  • Long-horizon planning: Where standard geometric discounting is inappropriate
  • Robust RL: Settings where discount factor sensitivity causes instability
  • Behavioral cloning with human data: Human decisions are often better described by non-exponential than by exponential discounting
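
The behavioral signature that motivates the human-preference scenarios is preference reversal: a hyperbolic discounter can prefer a larger, later reward when both options are far away, then switch to the smaller, sooner one as it approaches. The example below is a minimal illustration with made-up rewards, delays, and discount parameters; none of the numbers come from the paper.

```python
def hyperbolic(delay, k=1.0):
    """Hyperbolic discount 1 / (1 + k*delay)."""
    return 1.0 / (1.0 + k * delay)

def exponential(delay, gamma=0.8):
    """Exponential discount gamma**delay."""
    return gamma ** delay

def prefers_later(discount, small, t_small, large, t_large, now):
    """True if, judged at time `now`, the larger-later reward is worth more."""
    value_small = small * discount(t_small - now)
    value_large = large * discount(t_large - now)
    return value_large > value_small

# Hypothetical choice: 100 at t=10 versus 150 at t=13.
choice = dict(small=100.0, t_small=10, large=150.0, t_large=13)

hyp_far = prefers_later(hyperbolic, now=0, **choice)    # judged well in advance
hyp_near = prefers_later(hyperbolic, now=10, **choice)  # judged when the small reward is due
exp_far = prefers_later(exponential, now=0, **choice)
exp_near = prefers_later(exponential, now=10, **choice)

print("hyperbolic:", hyp_far, hyp_near)
print("exponential:", exp_far, exp_near)
```

The hyperbolic agent changes its ranking between the two evaluation times, while the exponential agent's ranking is the same at both. That time inconsistency is what a single stationary value function cannot represent.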

Related Skills

  • [[rlhf-from-human-feedback]] - Standard RLHF with exponential discounting
  • [[learning-zone-energy-data-selection]] - Efficient RL post-training
  • [[advantage-collapse-grpo-avspo]] - GRPO improvements

Implementation Considerations

  • Replace standard GAE(λ) with Hamiltonian-consistent advantage estimation
  • No critic network needed (avoids critic convergence issues)
  • Adjoint variables introduce an additional ODE solve per step (computational overhead)
  • Best for: domains with known discount structure or where Bellman methods fail
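
When the discount structure is known, the objective can be estimated directly from rollouts without a critic. The sketch below assumes a survival-style discount built from a hazard rate h(t), with d(t) = exp(-∫ h), discretized on a fixed time step. The function names, hazard profiles, and placeholder reward samples are all hypothetical. A constant hazard recovers ordinary exponential discounting, while a decaying hazard yields a non-exponential, roughly power-law discount.

```python
import numpy as np

def survival_discount(hazard, dt):
    """d_t = exp(-sum_{u<t} hazard_u * dt), i.e. the chance of surviving to step t."""
    hazard = np.asarray(hazard, dtype=float)
    cumulative = np.concatenate(([0.0], np.cumsum(hazard[:-1] * dt)))
    return np.exp(-cumulative)

def mc_objective(reward_batch, discount):
    """Average discounted return; reward_batch has shape (n_rollouts, T)."""
    return float(np.mean(np.sum(reward_batch * discount, axis=1)))

T, dt = 40, 0.25
constant = survival_discount(np.full(T, 0.2), dt)                # constant hazard
decaying = survival_discount(0.5 / (1.0 + np.arange(T) * dt), dt)  # hazard falls over time

# Placeholder rewards standing in for rollouts of some policy.
rng = np.random.default_rng(0)
reward_batch = rng.normal(loc=1.0, scale=0.1, size=(256, T))

print(mc_objective(reward_batch, constant), mc_objective(reward_batch, decaying))
```

The estimator treats the discount as a plain weight vector, so swapping hazard profiles needs no change to the estimation code.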

Activation Keywords

PG-DPO, non-exponential discounting, Pontryagin Maximum Principle, hyperbolic discounting, Bellman breakdown, Adjoint-MC projection, Hamiltonian RL, optimal control RL, survival discount, time-inconsistent preferences

Install via CLI
npx skills add https://github.com/hiyenwong/ai_collection --skill pg-dpo-non-exponential-discounting