sd-search-on-policy-hindsight-distillation - SKILL.md Agent Skill

name: sd-search-on-policy-hindsight-distillation
description: SD-Search methodology for search-augmented reasoning. Derives step-level supervision from the policy itself through on-policy hindsight self-distillation, without external teacher or annotations.
category: ai_collection

SD-Search: On-Policy Hindsight Self-Distillation

Overview

SD-Search addresses the credit assignment problem in search-augmented reasoning agents. Under outcome-reward RL, every search decision shares the same trajectory-level reward, leaving individual queries without step-specific credit. SD-Search derives step-level supervision from the policy itself through on-policy hindsight self-distillation.

Core Methodology

1. Dual-Role Architecture

Student: Sees only context available at inference time
Teacher: Additionally conditions on a compact hindsight block summarizing search queries and final outcomes of a group of rollouts
Same model, different conditioning — no external teacher needed
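
To make the two roles concrete, here is a minimal sketch of how the student and teacher inputs could be assembled. The `Rollout` type, the `[HINDSIGHT]` markers, the text layout, and the placement of the block after the inference-time context are all assumptions made for illustration; the source does not specify the format of the hindsight block.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled trajectory: the search queries it issued and its final outcome."""
    queries: List[str]
    correct: bool


def build_hindsight_block(rollouts: List[Rollout]) -> str:
    """Summarize each rollout's queries and outcome (hypothetical layout)."""
    lines = ["[HINDSIGHT]"]
    for i, rollout in enumerate(rollouts, start=1):
        outcome = "correct" if rollout.correct else "incorrect"
        lines.append(f"Rollout {i} ({outcome}): " + " | ".join(rollout.queries))
    lines.append("[/HINDSIGHT]")
    return "\n".join(lines)


def student_context(question: str, history: str) -> str:
    """Only what the policy would see at inference time."""
    return f"Question: {question}\n{history}"


def teacher_context(question: str, history: str, rollouts: List[Rollout]) -> str:
    """Inference-time context plus the hindsight block; the same model reads both."""
    return student_context(question, history) + "\n" + build_hindsight_block(rollouts)
```

Both contexts would be fed to the same policy weights, so the only difference between teacher and student is the extra conditioning text.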

2. Hindsight Self-Distillation

Sample a group of rollouts for the same question
Teacher knows how each rollout unfolded and which succeeded
Teacher's query distribution implicitly marks valuable decisions
Student trained to recover teacher behavior via token-level Jensen-Shannon divergence
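
The token-level divergence can be sketched in plain Python. For two next-token distributions P (teacher) and Q (student) with mixture M = (P + Q) / 2, the Jensen-Shannon divergence is JS(P, Q) = 0.5 KL(P || M) + 0.5 KL(Q || M). The function names, the list-of-lists logit layout, and the choice to average over query positions are assumptions for illustration, not details confirmed by the source.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two probability vectors."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def token_level_js(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
    query_positions: Sequence[int],
) -> float:
    """Mean JS divergence over the token positions that belong to search queries."""
    if not query_positions:
        return 0.0
    total = 0.0
    for t in query_positions:
        total += js_divergence(softmax(teacher_logits[t]), softmax(student_logits[t]))
    return total / len(query_positions)
```

The JS divergence is symmetric and bounded above by ln 2 (in nats), so the per-token signal stays finite even when teacher and student put their mass on different tokens.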

3. Integration with GRPO

Layers dense step-level signal on top of GRPO's coarse trajectory reward
Signal produced by policy within standard RL training loop
No external model inference, annotation pipeline, or additional training stage
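
The sketch below illustrates why the GRPO reward is coarse and how the dense term could be layered on top. It uses the standard group-relative advantage (reward minus group mean, divided by group standard deviation), which every token in a rollout then shares. The use of the population standard deviation, the `eps` constant, and the `sd_weight` coefficient are assumptions for illustration; the source only states that the two signals are combined.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """One advantage per rollout: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage: float, num_tokens: int) -> List[float]:
    """Every token in a rollout, including every search query, gets the same value."""
    return [advantage] * num_tokens


def combined_loss(grpo_loss: float, sd_loss: float, sd_weight: float = 1.0) -> float:
    """Trajectory-level GRPO loss plus the dense self-distillation term.

    sd_weight is a hypothetical coefficient; 1.0 reduces to a plain sum.
    """
    return grpo_loss + sd_weight * sd_loss
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the successful rollouts receive an advantage close to +1 and the failed ones close to -1, regardless of which individual queries inside each rollout were useful. The self-distillation term is what differentiates those queries.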

4. Training Loop

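def sd_search_training_step(policy, question):
    # Helper functions and rollout attributes used below are placeholders for the
    # surrounding training code; they are not defined in this document.

    # Step 1: Sample a group of rollouts for the same question
    rollouts = sample_group_rollouts(policy, question)

    # Step 2: Build hindsight block (queries + outcomes of the whole group)
    hindsight_block = summarize_rollouts(rollouts)

    sd_loss = 0.0
    for rollout in rollouts:
        context = rollout.context                  # what the policy sees at inference time
        query_positions = rollout.query_positions  # token positions of the search queries

        # Step 3: Teacher forward (with hindsight)
        teacher_logits = policy(context + hindsight_block)

        # Step 4: Student forward (without hindsight)
        student_logits = policy(context)

        # Step 5: Jensen-Shannon divergence at query positions
        sd_loss += js_divergence(
            teacher_logits[query_positions],
            student_logits[query_positions]
        )
    sd_loss = sd_loss / len(rollouts)  # mean over the group (an assumption)

    # Step 6: Combine with the GRPO trajectory-level loss
    grpo_loss = compute_grpo_loss(policy, rollouts)
    total_loss = grpo_loss + sd_loss

    return total_loss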

Advantages Over Alternatives

Method	External Teacher	Annotations	Training Stages
Process supervision	Required	Required	Multiple
SD-Search	None	None	Single

Key Design Principles

Self-supervision: Policy teaches itself via hindsight conditioning
Token-level credit: JS divergence provides fine-grained step-level signal
No external dependency: Everything happens within standard RL loop
Group sampling: Hindsight block summarizes multiple rollouts for contrast

Activation

sd-search, on-policy distillation, hindsight self-distillation, search-augmented reasoning, step-level credit, process supervision alternative

Reference

arXiv: 2605.18299
"SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning"