rlm-debugging - SKILL.md Agent Skill

name: rlm-debugging description: Debug and diagnose RLM execution issues using the Trace system, REPL inspection, and policy analysis. Use when an RLM run produces incorrect results, hits policy limits, loops unexpectedly, or needs performance optimization. license: MIT metadata: author: apenab version: "1.0"

RLM Debugging Guide

Overview

RLM executions can fail or behave unexpectedly in several ways: wrong answers, infinite loops, policy limit exceptions, REPL errors, or excessive token usage. This skill helps you diagnose these issues using the built-in Trace system and execution analysis.

The Trace System

Every RLM.run() call returns a (result, trace) tuple. The Trace object records every step of execution.

TraceStep Fields

Field	Type	Purpose
`step_id`	int	Sequential step number
`kind`	str	Step type (see below)
`depth`	int	Recursion depth (0 = root)
`prompt_summary`	str	Truncated prompt sent to LLM
`code`	str	Python code executed or LLM response
`stdout`	str	REPL output
`error`	str	REPL error or None
`usage`	Usage	Token counts (prompt, completion, total)
`cache_hit`	bool	Whether subcall was served from cache
`input_hash`	str	SHA256 of subcall input
`output_hash`	str	SHA256 of subcall output
`cache_key`	str	Full cache key for subcall

Step Kinds

Kind	Meaning
`root_call`	Root LLM call (the controller)
`repl_exec`	Python code executed in REPL
`subcall`	Standard sub-LLM call
`recursive_subcall`	Subcall that ran its own mini-RLM loop
`sub_root_call`	Root call inside a recursive subcall
`sub_repl_exec`	REPL exec inside a recursive subcall
`sub_subcall`	Nested subcall inside a recursive subcall

Debugging Workflow

Step 1: Capture the Trace

from pyrlm_runtime import RLM, Context, Policy

rlm = RLM(adapter=my_adapter, policy=Policy(max_steps=20))
result, trace = rlm.run("your query", Context.from_text(your_text))

Step 2: Inspect the Trace

# Print all steps
for step in trace.steps:
    print(f"[{step.step_id}] {step.kind} depth={step.depth}")
    if step.code:
        print(f"  code: {step.code[:200]}")
    if step.stdout:
        print(f"  stdout: {step.stdout[:200]}")
    if step.error:
        print(f"  ERROR: {step.error}")
    if step.usage:
        print(f"  tokens: {step.usage.total_tokens}")
    if step.cache_hit:
        print(f"  (cache hit)")

Step 3: Serialize for Later Analysis

# Save trace to JSON
import json
with open("trace.json", "w") as f:
    f.write(trace.to_json())

# Load trace from JSON
with open("trace.json") as f:
    loaded_trace = Trace.from_json(f.read())

Common Issues and Diagnosis

1. MaxStepsExceeded

Symptom: MaxStepsExceeded exception raised.

Diagnosis: The LLM is not converging to a FINAL answer.

Check in trace:

Look at root_call steps: Is the LLM producing valid code?
Look for repl_exec steps with errors: Is REPL code failing repeatedly?
Check if the LLM keeps generating the same code (looping)

Common causes:

LLM doesn't understand the FINAL: / FINAL_VAR: syntax -> check system prompt
REPL errors prevent the LLM from getting useful output -> check error messages
Context is too complex for the model -> try a more capable model or simplify the query
require_repl_before_final=True but LLM keeps trying to answer directly

Fix: Increase Policy(max_steps=...), improve the system prompt, or use a more capable model.

2. MaxSubcallsExceeded

Symptom: MaxSubcallsExceeded exception during subcall execution.

Diagnosis: Too many sub-LLM calls being made.

Check in trace:

Count subcall and recursive_subcall steps
Check chunk sizes: are chunks too small, creating too many subcalls?
Look for subcall_batch calls with large chunk lists

Common causes:

LLM is chunking too aggressively (e.g., 100-char chunks on a 1M-char context)
LLM is calling llm_query in a loop instead of using ask_chunks batch
Recursive subcalls creating nested subcalls

Fix: Increase Policy(max_subcalls=...), adjust prompt to encourage larger chunks, or use parallel_subcalls=True for efficiency.

3. MaxTokensExceeded

Symptom: MaxTokensExceeded exception.

Diagnosis: Total token budget exhausted.

Check in trace:

total = sum(s.usage.total_tokens for s in trace.steps if s.usage)
print(f"Total tokens used: {total}")

# Breakdown by kind
from collections import Counter
by_kind = Counter()
for s in trace.steps:
    if s.usage:
        by_kind[s.kind] += s.usage.total_tokens
print(by_kind)

Common causes:

Too many subcalls (each costs tokens)
Root LLM calls with very large prompts
Recursive subcalls multiplying token usage

Fix: Increase Policy(max_total_tokens=...), reduce subcall count, use caching.

4. Wrong Answer / NO_ANSWER

Symptom: RLM returns incorrect result or "NO_ANSWER".

Diagnosis steps:

Check root_call responses: Is the LLM writing good inspection code?
Check repl_exec stdout: Is the REPL returning useful data?
Check subcall results: Are sub-LLMs extracting correct information?
Check the final step: How was the answer resolved (FINAL vs FINAL_VAR)?

Trace analysis:

# Find the final step
final_steps = [s for s in trace.steps if s.kind == "root_call"]
last_root = final_steps[-1]
print(f"Final LLM output: {last_root.code}")

# Check all subcall results
subcalls = [s for s in trace.steps if s.kind == "subcall"]
for s in subcalls:
    print(f"  input: {s.prompt_summary}")
    print(f"  output_hash: {s.output_hash}")

Common causes:

The answer isn't in the context (verify with manual search)
Sub-LLM prompt is too vague -> check SUBCALL_SYSTEM_PROMPT
Chunking missed the relevant section -> check chunk overlap
extract_after marker not found -> check deterministic extraction

5. REPL Errors

Symptom: repl_exec steps show errors repeatedly.

Check in trace:

errors = [s for s in trace.steps if s.kind == "repl_exec" and s.error]
for s in errors:
    print(f"Step {s.step_id}: {s.error}")
    print(f"  Code: {s.code}")

Common REPL errors:

ImportError: import of 'X' is not allowed -> LLM tried to import a blocked module. Only re, math, json, textwrap are allowed.
NameError -> LLM referenced a function that isn't registered in the REPL
IndexError / KeyError -> Bug in LLM-generated code, usually fixes on next iteration

6. Cache Issues

Symptom: Unexpected results when rerunning, or no caching benefit.

Check:

cache_hits = [s for s in trace.steps if s.cache_hit]
cache_misses = [s for s in trace.steps if s.kind == "subcall" and not s.cache_hit]
print(f"Cache hits: {len(cache_hits)}, misses: {len(cache_misses)}")

Fix: Cache is stored in .rlm_cache/. Delete the directory to reset. Cache keys include model name, max_tokens, and recursive flag, so changing any of these causes cache misses.

Performance Analysis

Token Usage Summary

def analyze_trace(trace):
    total_tokens = 0
    root_tokens = 0
    subcall_tokens = 0
    cache_hits = 0

    for step in trace.steps:
        if step.usage:
            total_tokens += step.usage.total_tokens
            if step.kind == "root_call":
                root_tokens += step.usage.total_tokens
            elif step.kind in ("subcall", "recursive_subcall"):
                subcall_tokens += step.usage.total_tokens
        if step.cache_hit:
            cache_hits += 1

    print(f"Total tokens: {total_tokens}")
    print(f"  Root LLM: {root_tokens}")
    print(f"  Subcalls: {subcall_tokens}")
    print(f"Root steps: {sum(1 for s in trace.steps if s.kind == 'root_call')}")
    print(f"Subcalls: {sum(1 for s in trace.steps if s.kind in ('subcall', 'recursive_subcall'))}")
    print(f"Cache hits: {cache_hits}")
    print(f"REPL errors: {sum(1 for s in trace.steps if s.error)}")

Identifying Bottlenecks

Too many root steps: The LLM is not converging. Check if it's getting useful REPL feedback.
Too many subcalls: Chunking is too fine-grained. Increase chunk sizes.
High token usage in subcalls: Sub-LLM responses are verbose. Check SUBCALL_SYSTEM_PROMPT.
No cache hits: Either first run, or cache keys differ between runs (model/token params changed).

Policy Tuning

Scenario	Recommended Policy
Quick extraction (needle-in-haystack)	`Policy(max_steps=10, max_subcalls=20)`
Multi-document analysis	`Policy(max_steps=30, max_subcalls=200)`
Deep research (100+ docs)	`Policy(max_steps=40, max_subcalls=500, max_total_tokens=500_000)`
Development / testing	`Policy(max_steps=5, max_subcalls=10, max_total_tokens=50_000)`

Using SHOW_TRAJECTORY

Set SHOW_TRAJECTORY=1 env var when running examples to see real-time execution flow. The TraceFormatter in router.py provides formatted trajectory output including ASCII visualization.

Logging

Enable debug logging for detailed runtime output:

import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("pyrlm_runtime")

rlm = RLM(adapter=my_adapter, logger=logger)

This shows: root_call steps, REPL exec results, subcall details, cache hits/misses, and policy state.