tt-opd-medical-agent-training

Turn-level Truncated On-Policy Distillation (TT-OPD) methodology for training multi-turn medical AI agents via reinforcement learning. Addresses multi-turn collapse, response length explosion, and tool-use erosion in clinical dialogue. Use when training medical AI agents, RL-based dialogue systems, or multi-turn agentic systems where sparse terminal rewards cause training instability.

hiyenwong

By hiyenwong, updated 6/3/2026

name: tt-opd-medical-agent-training
description: "Turn-level Truncated On-Policy Distillation (TT-OPD) methodology for training multi-turn medical AI agents via reinforcement learning. Addresses multi-turn collapse, response length explosion, and tool-use erosion in clinical dialogue. Use when training medical AI agents, RL-based dialogue systems, or multi-turn agentic systems where sparse terminal rewards cause training instability."

TT-OPD: Turn-level Truncated On-Policy Distillation

Problem

During RL training, multi-turn agentic behavior tends to collapse into verbose single-turn monologues: response length explodes and tool-use frequency erodes. Root cause: sparse terminal rewards are misaligned with sequential clinical trajectories, so individual turns receive no direct learning signal.

Core Mechanism

Self-Distillation Architecture

- Maintain a teacher model updated via a gradient-free EMA (Exponential Moving Average)
- Apply dense, outcome-aware KL regularization at every conversation turn
- Leverage outcome-privileged information that is unavailable at inference time
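The gradient-free teacher update can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the function name `ema_update`, the `decay` value, and the dict-of-lists parameter layout are all assumptions chosen to keep the example dependency-free.

```python
def ema_update(teacher, student, decay=0.99):
    """Blend student weights into the teacher without any gradient step.

    teacher, student: dicts mapping parameter name -> list of floats
    (a hypothetical stand-in for real model tensors).
    decay: assumed EMA rate; values near 1.0 make the teacher move slowly.
    """
    return {
        name: [decay * t + (1.0 - decay) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Student and teacher start from the same base weights.
base = {"layer.weight": [0.0, 1.0]}
teacher = {name: list(values) for name, values in base.items()}
student = {"layer.weight": [1.0, 1.0]}  # student after some RL updates

# The teacher moves 10% of the way toward the student.
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher only ever averages student weights, it needs no optimizer state and never receives gradients of its own.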

Key Steps

1. Initialize the student and teacher models from the same base model
2. For each turn in the multi-turn dialogue:
   - Compute a turn-level reward signal (not just the terminal reward)
   - Apply KL divergence between the student output and the teacher output
   - Weight the KL term by outcome quality (the privileged signal)
   - Update the student; update the teacher via EMA of the student weights
3. Monitor tool-use frequency as a stability metric
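The per-turn KL step above might look like the following toy sketch. Several things here are assumptions rather than details from the source: the KL direction (student relative to teacher), the scalar `outcome_weight`, the function names, and the simplification that each turn is summarized by one discrete distribution instead of per-token distributions. The source does not specify how the truncation in TT-OPD is applied, so it is not modeled.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def turn_level_kl_loss(student_turns, teacher_turns, outcome_weight):
    """Average per-turn KL(student || teacher), scaled by an outcome weight.

    student_turns, teacher_turns: one distribution per conversation turn.
    outcome_weight: hypothetical scalar derived from the privileged
    outcome signal; how it is computed is not specified in the source.
    """
    per_turn = [kl_divergence(s, t)
                for s, t in zip(student_turns, teacher_turns)]
    loss = outcome_weight * sum(per_turn) / len(per_turn)
    return loss, per_turn


# Two turns: the student matches the teacher on turn 1 and drifts on turn 2.
student_turns = [[0.5, 0.5], [1.0, 0.0]]
teacher_turns = [[0.5, 0.5], [0.5, 0.5]]
loss, per_turn = turn_level_kl_loss(student_turns, teacher_turns,
                                    outcome_weight=1.0)
```

The point of the per-turn list is that every turn contributes its own penalty, so a drifting turn is visible even when the terminal reward is unchanged.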

When to Use

- Training multi-turn dialogue agents with RL
- Clinical/medical AI agent training environments
- Any agentic system where GRPO/vanilla RL causes length explosion
- Systems requiring sustained tool-use across conversation turns

Pitfalls

- Vanilla GRPO achieves strong final accuracy but suffers from training instability
- Monitor response length oscillations as an early warning of collapse
- Outcome-privileged information must be available only during training, never at inference
- The EMA update rate critically affects convergence speed
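One way to reason about the EMA rate is its half-life: if the student were held fixed, the teacher-student gap shrinks by a factor of `decay` per update, so it halves after `log(0.5) / log(decay)` updates. The helper below is an illustrative sketch; the name `ema_half_life` and the example decay values are assumptions, not settings from the source.

```python
import math


def ema_half_life(decay):
    """Number of EMA updates for the teacher to close half its gap
    to a (hypothetically fixed) student."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    return math.log(0.5) / math.log(decay)


# Compare how quickly the teacher tracks the student at two decay rates.
fast = ema_half_life(0.99)    # roughly 69 updates
slow = ema_half_life(0.999)   # roughly 693 updates
```

A tenfold reduction in `1 - decay` lengthens the half-life roughly tenfold, which is why small changes to the rate can noticeably shift how fast the teacher follows the student.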

Verification

- Tool-use frequency should remain stable across training
- Response length should not increase monotonically
- Performance on held-out benchmarks should improve steadily
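The first two checks can be automated over logged metrics. The sketch below assumes one value per checkpoint for mean tool calls per episode and mean response length; the function names and the 25% tolerance are hypothetical choices, not thresholds from the source.

```python
def tool_use_stable(tool_calls, rel_tol=0.25):
    """True if every checkpoint stays within rel_tol of the first one."""
    baseline = tool_calls[0]
    return all(abs(x - baseline) <= rel_tol * baseline for x in tool_calls)


def strictly_increasing(values):
    """True if each value is larger than the one before it."""
    return len(values) > 1 and all(b > a for a, b in zip(values, values[1:]))


def verify_run(tool_calls, response_lengths, rel_tol=0.25):
    """Summarize the stability checks for one training run."""
    return {
        "tool_use_stable": tool_use_stable(tool_calls, rel_tol),
        "length_not_monotonic": not strictly_increasing(response_lengths),
    }


# A healthy-looking run versus one showing collapse symptoms.
healthy = verify_run([3.0, 3.2, 2.9], [200, 210, 205])
collapsed = verify_run([3.0, 1.0, 0.5], [200, 400, 800])
```

Here `healthy` passes both checks, while `collapsed` fails both: tool calls fall far below the baseline and response length grows at every checkpoint.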

Install via CLI

npx skills add https://github.com/hiyenwong/ai_collection --skill tt-opd-medical-agent-training

Repository Details

Stars: 1
Forks: 0
Branch: main
Path: SKILL.md
