pope-learning-reason-hard - SKILL.md Agent Skill

name: "pope-learning-reason-hard"
description: "Apply the POPE (Privileged On-Policy Exploration) technique to solve hard reasoning problems by decomposing them with oracle-guided prefixes and transferring learned reasoning back to unguided attempts. Use when: 'help me solve this hard problem step by step', 'I'm stuck on this complex algorithm', 'break down this difficult reasoning task', 'guide me through this math/logic problem', 'use privileged hints to bootstrap a solution', 'scaffold a hard problem with partial solutions'."

POPE: Solving Hard Problems via Privileged Guided Exploration

This skill enables Claude to tackle problems where direct reasoning fails by applying the POPE technique from reinforcement learning research. Instead of attempting a hard problem head-on (where the chance of finding a correct solution is near zero), POPE works by providing partial oracle solutions as prefixes, allowing the model to practice completing solutions from guided starting points. The key insight: reasoning patterns learned during guided completions transfer back to fully unguided attempts through overlapping internal reasoning states. Claude applies this by decomposing intractable problems into guided sub-completions, building up reasoning capability incrementally, and then synthesizing a full unguided solution.

When to Use

When the user presents a complex algorithmic, mathematical, or logic problem that resists direct solution attempts
When a first-pass reasoning attempt produces no viable solution path (the "zero reward" scenario)
When the user asks to debug or solve a problem they are stuck on after multiple failed attempts
When breaking a hard coding challenge (e.g., competitive programming, system design) into solvable pieces
When the user provides reference solutions, test cases, or partial answers that can serve as oracle guidance
When mixing easy and hard sub-tasks causes the easy ones to dominate attention (ray interference)
When the user says things like "I have a solution but need to understand the reasoning" or "here's a hint, help me finish"

Key Technique

The Core Problem: When reasoning about hard problems, direct attempts often fail completely -- producing zero useful signal. Classical strategies like trying harder, exploring more broadly, or mixing in easier related problems don't help. In fact, mixing easy and hard problems causes ray interference: the optimizer preferentially sharpens performance on already-solvable tasks while actively inhibiting progress on harder ones. This means solving easy sub-problems doesn't build toward solving hard ones.

POPE's Solution: Rather than using oracle solutions as direct training targets (which is off-policy and brittle), POPE uses oracle solutions as privileged exploration guides. Concretely: given a hard problem and its oracle solution, identify the shortest prefix of that solution that enables generating at least one correct completion. Prepend that prefix to the problem with an instruction to "study the partial response and complete the solution from where it left off." Train on a 1:1 mixture of these guided variants alongside the original unguided problems.
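
The guided-variant construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Prompt` type, the function names, and the exact prompt layout (problem first, then the partial response, then the instruction) are assumptions made for the example. Only the instruction wording and the 1:1 guided-to-unguided ratio come from the description above.

```python
import random
from dataclasses import dataclass

# Instruction wording taken from the description above.
GUIDE_INSTRUCTION = (
    "Study the partial response and complete the solution "
    "from where it left off."
)


@dataclass(frozen=True)
class Prompt:
    """Hypothetical container for one training prompt."""
    text: str
    guided: bool


def build_guided_prompt(problem: str, oracle_prefix: str) -> Prompt:
    # Assumed layout: problem, then the oracle prefix, then the instruction.
    text = (
        f"{problem}\n\n"
        f"Partial response:\n{oracle_prefix}\n\n"
        f"{GUIDE_INSTRUCTION}"
    )
    return Prompt(text=text, guided=True)


def build_mixture(problems_with_prefixes, seed=0):
    """Return a shuffled 1:1 mixture of unguided and guided prompts."""
    prompts = []
    for problem, prefix in problems_with_prefixes:
        prompts.append(Prompt(text=problem, guided=False))
        prompts.append(build_guided_prompt(problem, prefix))
    random.Random(seed).shuffle(prompts)
    return prompts
```

Pairing every guided prompt with its unguided original keeps the mixture at exactly 1:1 regardless of how many hard problems are supplied.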

Why Transfer Works: The model learns successful continuations from states induced by oracle prefixes. These guided reasoning states overlap with states the unguided policy can plausibly reach on its own. The transfer is amplified by the model's self-verification and backtracking behaviors during guided rollouts -- these behaviors expand the overlap between guided and unguided state spaces. The result: reasoning patterns learned during guided exploration become available during fully unguided problem-solving.

Step-by-Step Workflow

1. Attempt the problem directly. Make an honest first-pass attempt at the full problem. If a viable solution path emerges, follow it -- POPE scaffolding is unnecessary. Record what went wrong if the attempt fails (dead ends, circular reasoning, missing insights).
2. Identify the hard core. Analyze where direct reasoning breaks down. Is it a missing mathematical insight? An algorithmic technique not being applied? A combinatorial explosion? Pinpoint the specific sub-problem or reasoning step that blocks progress.
3. Gather or construct oracle guidance. Use any available privileged information: user-provided hints, known solution patterns, reference implementations, test case reverse-engineering, or analogous solved problems. The oracle need not be a complete solution -- a partial reasoning prefix covering the hardest conceptual leap is sufficient.
4. Find the minimal effective prefix. Starting from the shortest possible prefix of the oracle solution, test whether you can successfully complete the rest. Use uniformly spaced checkpoints (e.g., 25%, 50%, 75% of the oracle). Select the shortest prefix that enables at least one correct completion. If no prefix works, use roughly the first quarter of the oracle reasoning.
5. Execute guided completion. With the chosen prefix established as context, complete the solution from that point forward. Explicitly study the partial reasoning, identify what steps are already provided, and continue the logic chain. Allow self-verification and backtracking -- revisiting earlier prefix steps strengthens transfer.
6. Extract transferable reasoning patterns. From the successful guided completion, identify which reasoning moves, techniques, or structural insights were critical. These are the patterns that must transfer to the unguided attempt.
7. Re-attempt the problem unguided. Using the reasoning patterns extracted in step 6, attempt the original problem from scratch without the oracle prefix. The goal is to independently reconstruct the reasoning path, not to memorize the oracle solution.
8. Verify and iterate. Check the unguided solution against available test cases or constraints. If it fails, return to step 4 with a longer prefix or different oracle decomposition. Each iteration expands the overlap between guided and unguided reasoning states.
9. Synthesize the final solution. Present the clean, fully-reasoned unguided solution with clear explanation of the key insights that made it solvable.
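
The prefix search in the "Find the minimal effective prefix" step can be sketched as below. This is an illustrative sketch under stated assumptions: the oracle is represented as a list of reasoning steps, and `can_complete` is a hypothetical callback that reports whether at least one correct completion was produced from a given prefix. Neither name comes from the paper.

```python
from typing import Callable, Sequence


def find_minimal_prefix(
    oracle_steps: Sequence[str],
    can_complete: Callable[[Sequence[str]], bool],
    fractions: Sequence[float] = (0.25, 0.5, 0.75),
) -> Sequence[str]:
    """Return the shortest checkpoint prefix that enables a correct completion.

    `can_complete` is a hypothetical stand-in for "attempt a guided
    completion and check it".
    """
    n = len(oracle_steps)
    for frac in sorted(fractions):
        cut = max(1, int(n * frac))  # keep at least one step
        prefix = oracle_steps[:cut]
        if can_complete(prefix):
            return prefix
    # No checkpoint worked: fall back to roughly the first quarter.
    return oracle_steps[: max(1, int(n * 0.25))]
```

Checking the checkpoints in increasing order means the first success is also the shortest one, which keeps the guided variant as close as possible to the unguided problem.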

Concrete Examples

Example 1: Hard Dynamic Programming Problem

User: "Find the minimum cost to merge all stones into one pile where you can merge exactly k consecutive piles at a time."

Approach:

Direct attempt: Try standard DP formulations -- interval DP seems relevant but the "exactly k" constraint makes state transitions unclear. First attempt stalls.
Identify hard core: The difficulty is modeling when a merge is valid (only when the number of remaining sub-piles is compatible with k).
Oracle prefix (from known solution pattern): "The key insight is that we can only merge an interval into one pile if (number_of_piles - 1) % (k - 1) == 0. Define dp[i][j] as the minimum cost to merge stones[i..j] into as few piles as possible. The interval's total is added to the cost only when the interval can collapse to a single pile."
Guided completion from prefix: "Building on this, the number of piles left in stones[i..j] is implicit, (j - i) % (k - 1) + 1, so no extra state dimension is needed. Split points advance in steps of (k-1) so that the left part stones[i..mid] always collapses to exactly one pile..."
Extract pattern: The modular arithmetic constraint on valid merges, and the interval DP with step-size (k-1).
Unguided re-attempt:

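def mergeStones(stones, k):
    # Assumes k >= 2, as in the problem statement.
    n = len(stones)
    # Each merge removes k - 1 piles, so one pile is reachable only if this holds.
    if (n - 1) % (k - 1) != 0:
        return -1
    # prefix[i] = sum of stones[:i], giving O(1) interval sums.
    prefix = [0] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] + stones[i]
    # dp[i][j] = min cost to merge stones[i..j] into as few piles as possible.
    dp = [[0] * n for _ in range(n)]
    for length in range(k, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            dp[i][j] = float('inf')
            # Step by k - 1 so stones[i..mid] always collapses to one pile.
            for mid in range(i, j, k - 1):
                dp[i][j] = min(dp[i][j], dp[i][mid] + dp[mid + 1][j])
            # The interval collapses to one pile: pay its total for the final merge.
            if (length - 1) % (k - 1) == 0:
                dp[i][j] += prefix[j + 1] - prefix[i]
    return dp[0][n - 1]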
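
One way to carry out the verification step for this example is to compare the DP against an exhaustive search on small inputs. The brute-force checker below is a hypothetical helper written for this illustration. It assumes k >= 2 and is only practical for a handful of piles, since it explores every merge order.

```python
from functools import lru_cache


def brute_force_merge_cost(stones, k):
    """Try every sequence of k-consecutive merges; return min cost or -1.

    Assumes k >= 2. Exponential in len(stones), so use tiny inputs only.
    """

    @lru_cache(maxsize=None)
    def best(piles):
        if len(piles) == 1:
            return 0
        if len(piles) < k:
            return float("inf")  # stuck: cannot merge fewer than k piles
        result = float("inf")
        for i in range(len(piles) - k + 1):
            merged = sum(piles[i:i + k])
            rest = piles[:i] + (merged,) + piles[i + k:]
            result = min(result, merged + best(rest))
        return result

    answer = best(tuple(stones))
    return -1 if answer == float("inf") else answer
```

Running both `mergeStones` and this checker on small random inputs and comparing the results gives a quick signal on whether the unguided re-attempt reconstructed the pattern correctly.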

Example 2: Complex System Design Reasoning

User: "Design a distributed rate limiter that works across 50 data centers with sub-millisecond local decisions and eventual consistency."

Approach:

Direct attempt: Token bucket per node with synchronization -- but synchronization latency contradicts sub-millisecond requirement. Stalls on the consistency/latency tradeoff.
Identify hard core: How to allow local decisions while preventing global over-admission.
Oracle prefix: "Split the global rate limit into per-datacenter quotas. Each DC gets a static share (global_limit / num_DCs) plus a dynamic borrowing mechanism. Local decisions use only local state."
Guided completion: "Building on the quota split: each DC maintains a local token bucket sized to its quota. A background gossip protocol (CRDT-based) shares utilization data every 100ms. Under-utilized DCs advertise surplus tokens. Over-utilized DCs can borrow from the surplus pool. Local admission uses only the local bucket -- always sub-millisecond."
Extract patterns: Quota partitioning for local autonomy, background gossip for redistribution, separation of fast-path (local) from slow-path (rebalancing).
Unguided synthesis: Full design with local token buckets, CRDT-based gossip rebalancing, quota borrowing protocol, and failure handling when DCs go offline (their quota gets redistributed).
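
The fast-path and slow-path separation in this design can be sketched as below. This is a simplified single-process illustration, not a production design: the class and function names, the caller-supplied clock, and the `reserve` parameter are all assumptions made for the example, and the gossip layer and CRDT state are omitted entirely.

```python
from dataclasses import dataclass


def split_quota(global_limit: float, num_dcs: int) -> float:
    """Static share per data center: global_limit / num_DCs."""
    return global_limit / num_dcs


@dataclass
class LocalQuotaBucket:
    quota_per_sec: float      # this DC's share of the global rate
    capacity: float           # most tokens the bucket may hold
    tokens: float = 0.0
    last_refill: float = 0.0  # caller-supplied clock, in seconds

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.quota_per_sec)
        self.last_refill = now

    def try_admit(self, now: float, cost: float = 1.0) -> bool:
        """Fast path: decide from local state only, no network call."""
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def surplus(self, now: float, reserve: float) -> float:
        """Tokens this DC could advertise while keeping `reserve` for itself."""
        self._refill(now)
        return max(0.0, self.tokens - reserve)


def transfer_surplus(donor, receiver, amount, now, reserve) -> float:
    """Slow path: move up to `amount` tokens from donor to receiver."""
    receiver._refill(now)
    room = receiver.capacity - receiver.tokens
    moved = min(amount, donor.surplus(now, reserve), room)
    donor.tokens -= moved
    receiver.tokens += moved
    return moved
```

Because `try_admit` touches only local fields, its latency is independent of the rebalancing cadence. In a real deployment `transfer_surplus` would run in the background exchange rather than as a direct call between two objects.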

Example 3: Debugging a Failing Algorithm

User: "My A* implementation finds paths but they're not optimal. I've been debugging for hours. Here's my code: [200 lines of A* with a custom heuristic]"

Approach:

Direct attempt: Read the code, check standard A* correctness conditions. The heuristic function is complex and not obviously inadmissible.
Identify hard core: The heuristic may be inadmissible (overestimates) in specific edge cases but appears correct for most inputs.
Oracle prefix (from test case analysis): Run the failing test case manually. At node X, the heuristic returns 15 but the true shortest remaining distance is 12. The heuristic overestimates when paths cross diagonal boundaries in the grid.
Guided completion: "The diagonal crossing case triggers the overestimate because the heuristic uses Euclidean distance but the grid has non-uniform edge weights near boundaries. The fix is to take the minimum of the current heuristic and the octile distance scaled by the minimum edge weight."
Unguided re-synthesis: Present the corrected heuristic with proof of admissibility, explain exactly which cases triggered the overestimate, and provide a test that validates the fix.
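
The fix described in the guided completion can be sketched as a wrapper around the existing heuristic. The sketch assumes an 8-connected grid in which every straight move costs at least `min_edge_weight` and every diagonal move costs at least sqrt(2) times that. Under that assumption the scaled octile distance never overestimates, so taking the minimum with it caps the original heuristic at an admissible value. The function names are hypothetical, since the user's code is not shown.

```python
import math
from typing import Callable, Tuple

Node = Tuple[int, int]


def octile_distance(a: Node, b: Node) -> float:
    """Shortest 8-connected path length with unit straight and sqrt(2) diagonal steps."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return max(dx, dy) + (math.sqrt(2) - 1) * min(dx, dy)


def make_capped_heuristic(
    original: Callable[[Node, Node], float],
    min_edge_weight: float,
) -> Callable[[Node, Node], float]:
    """Wrap `original` so it never exceeds the scaled octile lower bound."""

    def heuristic(node: Node, goal: Node) -> float:
        lower_bound = min_edge_weight * octile_distance(node, goal)
        return min(original(node, goal), lower_bound)

    return heuristic
```

A regression test for the fix could replay the failing case and assert that the capped heuristic at that node no longer exceeds the true remaining distance.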

Best Practices

Do: Always attempt the problem directly first. POPE scaffolding is a tool for genuinely hard problems, not a substitute for careful reasoning.
Do: Find the minimal prefix that enables completion. Over-long prefixes reduce what the model learns independently and weaken transfer.
Do: Allow backtracking and self-verification during guided completions. Revisiting prefix content creates the state-space overlap that enables transfer to unguided attempts.
Do: Maintain a 1:1 ratio of guided and unguided reasoning. Every guided completion should be paired with an unguided re-attempt to ensure real transfer.
Avoid: Mixing easy and hard sub-problems indiscriminately. This causes ray interference -- the easy problems absorb all the optimization effort. Instead, handle hard sub-problems with guided exploration specifically.
Avoid: Using the oracle solution as the direct answer. POPE uses oracle prefixes to enable exploration, not as training targets. The goal is to learn reasoning patterns, not to copy solutions.
Avoid: Suppressing self-referential reasoning during guided completions (e.g., "don't look back at the prefix"). This destroys the overlap mechanism that enables transfer.

Error Handling

No oracle available: When no reference solution, hints, or analogous problems exist, fall back to progressive decomposition: break the problem into sub-problems and attempt each independently. Use successful sub-solutions as pseudo-oracle prefixes for the integrated problem.
Prefix too short (guided completion still fails): Increase the prefix length to the next checkpoint (e.g., from 25% to 50% of the oracle). If no prefix enables completion, the problem may require domain knowledge outside current capability -- communicate this honestly.
Transfer fails (unguided re-attempt still wrong): The guided and unguided state spaces may not overlap enough. Try a different oracle decomposition angle, or use a longer prefix temporarily and shorten it over successive iterations.
Ray interference detected: If solving one part of a multi-component problem causes regression on another part, isolate the components. Apply POPE to each hard component independently before integrating.
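
The failure cases above can be combined into a single escalation loop. The sketch below is illustrative only: `guided_attempt` and `unguided_attempt` are hypothetical callbacks standing in for "attempt a guided completion from this prefix" and "re-attempt from scratch and verify", and the status strings are invented for the example.

```python
from typing import Callable, Sequence


def solve_with_escalation(
    oracle_steps: Sequence[str],
    guided_attempt: Callable[[Sequence[str]], bool],
    unguided_attempt: Callable[[], bool],
    fractions: Sequence[float] = (0.25, 0.5, 0.75),
) -> str:
    """Walk through the error-handling cases and report which one applied."""
    if not oracle_steps:
        # No oracle available: the caller falls back to progressive decomposition.
        return "no_oracle"

    guided_ever_succeeded = False
    for frac in sorted(fractions):
        cut = max(1, int(len(oracle_steps) * frac))
        if not guided_attempt(oracle_steps[:cut]):
            continue  # prefix too short: move to the next checkpoint
        guided_ever_succeeded = True
        if unguided_attempt():
            return "solved_unguided"
        # Transfer failed at this prefix length: try a longer prefix.

    if guided_ever_succeeded:
        return "transfer_failed"  # try a different oracle decomposition
    return "beyond_capability"    # report this honestly to the user
```

Returning a status rather than raising keeps the "communicate this honestly" case explicit, so the caller has to decide what to tell the user.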

Limitations

POPE requires some form of oracle or privileged information. For truly novel problems with no reference solutions, analogies, or decomposable structure, the technique has limited applicability.
Transfer from guided to unguided reasoning depends on state-space overlap. If the oracle solution uses fundamentally different reasoning than what the model can reach independently (e.g., requires specialized domain knowledge), transfer may not occur.
The technique is most effective for problems where the difficulty is exploratory (finding the right path) rather than representational (lacking the basic knowledge to understand the solution). It cannot substitute for missing domain knowledge.
For problems requiring precise numerical computation or formal proof, guided reasoning patterns may not transfer reliably since exact values matter more than reasoning structure.

Reference

Paper: POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration (Qu et al., 2026)
Key finding: Oracle solution prefixes used as exploration guides (not training targets) expand the set of solvable problems by 14-29% on hard reasoning benchmarks, with transfer enabled by overlapping reasoning states between guided and unguided rollouts.
Focus on Section 4 (method) and Section 5 (transfer mechanism analysis).