name: rlm-debugging
description: Debug and diagnose RLM execution issues using the Trace system, REPL inspection, and policy analysis. Use when an RLM run produces incorrect results, hits policy limits, loops unexpectedly, or needs performance optimization.
license: MIT
metadata:
  author: apenab
  version: "1.0"
RLM Debugging Guide
Overview
RLM executions can fail or behave unexpectedly in several ways: wrong answers, infinite loops, policy limit exceptions, REPL errors, or excessive token usage. This skill helps you diagnose these issues using the built-in Trace system and execution analysis.
The Trace System
Every RLM.run() call returns a (result, trace) tuple. The Trace object records every step of execution.
TraceStep Fields
| Field | Type | Purpose |
|---|---|---|
| step_id | int | Sequential step number |
| kind | str | Step type (see below) |
| depth | int | Recursion depth (0 = root) |
| prompt_summary | str | Truncated prompt sent to LLM |
| code | str | Python code executed or LLM response |
| stdout | str | REPL output |
| error | str | REPL error or None |
| usage | Usage | Token counts (prompt, completion, total) |
| cache_hit | bool | Whether subcall was served from cache |
| input_hash | str | SHA256 of subcall input |
| output_hash | str | SHA256 of subcall output |
| cache_key | str | Full cache key for subcall |
Step Kinds
| Kind | Meaning |
|---|---|
| root_call | Root LLM call (the controller) |
| repl_exec | Python code executed in REPL |
| subcall | Standard sub-LLM call |
| recursive_subcall | Subcall that ran its own mini-RLM loop |
| sub_root_call | Root call inside a recursive subcall |
| sub_repl_exec | REPL exec inside a recursive subcall |
| sub_subcall | Nested subcall inside a recursive subcall |
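Because every step carries both a kind and a depth, a flat trace can be read as an indented outline. The following is a minimal sketch, not part of the runtime: FakeStep is a hypothetical stand-in that mirrors only the step_id, kind, and depth fields from the table above, and the sample depths are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FakeStep:
    """Hypothetical stand-in mirroring three TraceStep fields."""
    step_id: int
    kind: str
    depth: int


def format_outline(steps):
    """Return one line per step, indented two spaces per recursion depth."""
    return [f"{'  ' * s.depth}[{s.step_id}] {s.kind}" for s in steps]


# Illustrative data; real steps come from trace.steps.
steps = [
    FakeStep(0, "root_call", 0),
    FakeStep(1, "repl_exec", 0),
    FakeStep(2, "recursive_subcall", 0),
    FakeStep(3, "sub_root_call", 1),
    FakeStep(4, "sub_repl_exec", 1),
]
print("\n".join(format_outline(steps)))
```

Passing trace.steps instead of the sample list should work as long as the real steps expose the same three attributes.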
Debugging Workflow
Step 1: Capture the Trace
from pyrlm_runtime import RLM, Context, Policy
rlm = RLM(adapter=my_adapter, policy=Policy(max_steps=20))
result, trace = rlm.run("your query", Context.from_text(your_text))
Step 2: Inspect the Trace
# Print all steps
for step in trace.steps:
    print(f"[{step.step_id}] {step.kind} depth={step.depth}")
    if step.code:
        print(f" code: {step.code[:200]}")
    if step.stdout:
        print(f" stdout: {step.stdout[:200]}")
    if step.error:
        print(f" ERROR: {step.error}")
    if step.usage:
        print(f" tokens: {step.usage.total_tokens}")
    if step.cache_hit:
        print(" (cache hit)")
Step 3: Serialize for Later Analysis
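# Assumes Trace is exported from the package root; adjust the import path if not
from pyrlm_runtime import Trace
# Save trace to JSON
with open("trace.json", "w") as f:
    f.write(trace.to_json())
# Load trace from JSON
with open("trace.json") as f:
    loaded_trace = Trace.from_json(f.read())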
Common Issues and Diagnosis
1. MaxStepsExceeded
Symptom: MaxStepsExceeded exception raised.
Diagnosis: The LLM is not converging to a FINAL answer.
Check in trace:
- Look at root_call steps: Is the LLM producing valid code?
- Look for repl_exec steps with errors: Is REPL code failing repeatedly?
- Check if the LLM keeps generating the same code (looping)
Common causes:
- LLM doesn't understand the FINAL:/FINAL_VAR: syntax -> check system prompt
- REPL errors prevent the LLM from getting useful output -> check error messages
- Context is too complex for the model -> try a more capable model or simplify the query
- require_repl_before_final=True but LLM keeps trying to answer directly
Fix: Increase Policy(max_steps=...), improve the system prompt, or use a more capable model.
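The "same code" check above can be automated by counting identical code strings among root_call steps. This is a minimal sketch under stated assumptions: the steps are SimpleNamespace stand-ins for trace.steps, the helper name repeated_code is hypothetical, and the peek call inside the sample strings is invented sample data, not a documented REPL function.

```python
from collections import Counter
from types import SimpleNamespace


def repeated_code(steps, kind="root_call", min_repeats=2):
    """Map whitespace-normalized code to its count when seen min_repeats+ times."""
    counts = Counter(
        " ".join(s.code.split())  # collapse whitespace so trivial edits still match
        for s in steps
        if s.kind == kind and s.code
    )
    return {code: n for code, n in counts.items() if n >= min_repeats}


# Illustrative stand-ins for trace.steps.
steps = [
    SimpleNamespace(kind="root_call", code="print(peek(0, 500))"),
    SimpleNamespace(kind="repl_exec", code="print(peek(0, 500))"),
    SimpleNamespace(kind="root_call", code="print(peek(0,  500))"),
    SimpleNamespace(kind="root_call", code="FINAL: 42"),
]
print(repeated_code(steps))
```

A non-empty result suggests the controller is repeating itself, in which case raising max_steps alone is unlikely to help.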
2. MaxSubcallsExceeded
Symptom: MaxSubcallsExceeded exception during subcall execution.
Diagnosis: Too many sub-LLM calls being made.
Check in trace:
- Count subcall and recursive_subcall steps
- Check chunk sizes: are chunks too small, creating too many subcalls?
- Look for subcall_batch calls with large chunk lists
Common causes:
- LLM is chunking too aggressively (e.g., 100-char chunks on a 1M-char context)
- LLM is calling llm_query in a loop instead of using ask_chunks batch
- Recursive subcalls creating nested subcalls
Fix: Increase Policy(max_subcalls=...), adjust prompt to encourage larger chunks, or use parallel_subcalls=True for efficiency.
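To see why small chunks exhaust the subcall budget, it helps to work through the arithmetic. The sketch below assumes a fixed-size sliding window and one subcall per chunk; estimated_chunks is a hypothetical helper, not a runtime function.

```python
import math


def estimated_chunks(context_chars, chunk_chars, overlap_chars=0):
    """Number of fixed-size chunks a sliding window needs to cover the context."""
    if chunk_chars <= overlap_chars:
        raise ValueError("chunk_chars must exceed overlap_chars")
    if context_chars <= chunk_chars:
        return 1
    stride = chunk_chars - overlap_chars  # how far each new chunk advances
    return 1 + math.ceil((context_chars - chunk_chars) / stride)


print(estimated_chunks(1_000_000, 100))     # 10000
print(estimated_chunks(1_000_000, 20_000))  # 50
```

Comparing the estimate against Policy(max_subcalls=...) before a run shows whether the chunk size is compatible with the budget.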
3. MaxTokensExceeded
Symptom: MaxTokensExceeded exception.
Diagnosis: Total token budget exhausted.
Check in trace:
total = sum(s.usage.total_tokens for s in trace.steps if s.usage)
print(f"Total tokens used: {total}")
# Breakdown by kind
from collections import Counter
by_kind = Counter()
for s in trace.steps:
    if s.usage:
        by_kind[s.kind] += s.usage.total_tokens
print(by_kind)
Common causes:
- Too many subcalls (each costs tokens)
- Root LLM calls with very large prompts
- Recursive subcalls multiplying token usage
Fix: Increase Policy(max_total_tokens=...), reduce subcall count, use caching.
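Beyond the totals, it can help to know which step pushed the run over budget. A minimal sketch, assuming steps expose step_id and usage.total_tokens as in the table above; first_step_over_budget is a hypothetical helper and the sample steps are stand-ins.

```python
from itertools import accumulate
from types import SimpleNamespace


def first_step_over_budget(steps, budget):
    """Return the step_id where cumulative tokens first exceed budget, else None."""
    used = [s for s in steps if s.usage]  # skip steps without token usage
    totals = accumulate(s.usage.total_tokens for s in used)
    for step, running in zip(used, totals):
        if running > budget:
            return step.step_id
    return None


# Illustrative stand-ins for trace.steps.
steps = [
    SimpleNamespace(step_id=0, usage=SimpleNamespace(total_tokens=1000)),
    SimpleNamespace(step_id=1, usage=None),
    SimpleNamespace(step_id=2, usage=SimpleNamespace(total_tokens=4000)),
    SimpleNamespace(step_id=3, usage=SimpleNamespace(total_tokens=2000)),
]
print(first_step_over_budget(steps, budget=5500))
```

The steps leading up to the returned step_id are usually the best place to look for oversized prompts or runaway subcalls.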
4. Wrong Answer / NO_ANSWER
Symptom: RLM returns incorrect result or "NO_ANSWER".
Diagnosis steps:
- Check root_call responses: Is the LLM writing good inspection code?
- Check repl_exec stdout: Is the REPL returning useful data?
- Check subcall results: Are sub-LLMs extracting correct information?
- Check the final step: How was the answer resolved (FINAL vs FINAL_VAR)?
Trace analysis:
# Find the last root call (it carries the final answer, if any)
root_calls = [s for s in trace.steps if s.kind == "root_call"]
if root_calls:
    print(f"Final LLM output: {root_calls[-1].code}")
else:
    print("No root_call steps recorded")
# Check all subcall results
subcalls = [s for s in trace.steps if s.kind == "subcall"]
for s in subcalls:
    print(f" input: {s.prompt_summary}")
    print(f" output_hash: {s.output_hash}")
Common causes:
- The answer isn't in the context (verify with manual search)
- Sub-LLM prompt is too vague -> check SUBCALL_SYSTEM_PROMPT
- Chunking missed the relevant section -> check chunk overlap
- extract_after marker not found -> check deterministic extraction
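The first and third causes can be checked together: search the raw text for the expected answer, then test whether any single chunk contains it whole. This sketch assumes fixed-size character chunks with a constant overlap, which may differ from how a given run actually chunked the context; chunk_spans and needle_report are hypothetical helpers.

```python
def chunk_spans(length, size, overlap=0):
    """Yield (start, end) offsets of fixed-size chunks with the given overlap."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    stride = size - overlap
    start = 0
    while True:
        end = min(start + size, length)
        yield start, end
        if end >= length:
            break
        start += stride


def needle_report(text, needle, size, overlap=0):
    """Locate needle and report whether some chunk contains it without a split."""
    pos = text.find(needle)
    if pos == -1:
        return {"found": False, "intact": False}
    end = pos + len(needle)
    intact = any(s <= pos and end <= e for s, e in chunk_spans(len(text), size, overlap))
    return {"found": True, "position": pos, "intact": intact}


# The needle straddles offset 100, so 100-char chunks without overlap split it.
text = "x" * 95 + "SECRET=42" + "x" * 96
print(needle_report(text, "SECRET=42", size=100))
print(needle_report(text, "SECRET=42", size=100, overlap=20))
```

If found is False, the answer is not in the context at all. If found is True but intact is False, increasing the overlap or the chunk size is the likely fix.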
5. REPL Errors
Symptom: repl_exec steps show errors repeatedly.
Check in trace:
errors = [s for s in trace.steps if s.kind == "repl_exec" and s.error]
for s in errors:
    print(f"Step {s.step_id}: {s.error}")
    print(f" Code: {s.code}")
Common REPL errors:
- ImportError: import of 'X' is not allowed -> LLM tried to import a blocked module. Only re, math, json, textwrap are allowed.
- NameError -> LLM referenced a function that isn't registered in the REPL
- IndexError / KeyError -> Bug in LLM-generated code, usually fixes on next iteration
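To find which blocked modules the LLM keeps reaching for, the code from failing steps can be scanned offline with the standard ast module. This is only a diagnostic sketch and makes no claim about how the runtime enforces its allowlist; disallowed_imports is a hypothetical helper, and the allowed set is copied from the list above.

```python
import ast

# Allowlist taken from the ImportError note above.
ALLOWED_MODULES = {"re", "math", "json", "textwrap"}


def disallowed_imports(code, allowed=ALLOWED_MODULES):
    """Return sorted top-level module names that code imports but are not allowed."""
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return sorted(found - set(allowed))


sample = "import os\nimport re\nfrom collections import Counter\nimport json.decoder\n"
print(disallowed_imports(sample))
```

ast.parse raises SyntaxError on code that does not parse, so wrap the call in a try/except when scanning arbitrary repl_exec steps. Recurring module names are good candidates for an explicit note in the system prompt.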
6. Cache Issues
Symptom: Unexpected results when rerunning, or no caching benefit.
Check:
cache_hits = [s for s in trace.steps if s.cache_hit]
cache_misses = [s for s in trace.steps if s.kind == "subcall" and not s.cache_hit]
print(f"Cache hits: {len(cache_hits)}, misses: {len(cache_misses)}")
Fix: Cache is stored in .rlm_cache/. Delete the directory to reset. Cache keys include model name, max_tokens, and recursive flag, so changing any of these causes cache misses.
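To illustrate why a changed parameter turns every lookup into a miss, the sketch below hashes the request parameters into a key. It is a hypothetical scheme for illustration only: the runtime's actual key derivation is not shown in this guide, and sketch_cache_key, its field names, and the sample model name are assumptions.

```python
import hashlib
import json


def sketch_cache_key(prompt, model, max_tokens, recursive):
    """Hypothetical key: SHA-256 over a canonical JSON encoding of the inputs."""
    payload = json.dumps(
        {
            "prompt": prompt,
            "model": model,
            "max_tokens": max_tokens,
            "recursive": recursive,
        },
        sort_keys=True,  # stable field order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = sketch_cache_key("Summarize chunk 3", "model-a", 256, False)
same = sketch_cache_key("Summarize chunk 3", "model-a", 256, False)
changed = sketch_cache_key("Summarize chunk 3", "model-a", 512, False)
print(base == same)     # True
print(base == changed)  # False
```

When two runs unexpectedly share no cache hits, comparing the cache_key values of matching subcall steps in both traces shows whether a parameter drifted.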
Performance Analysis
Token Usage Summary
def analyze_trace(trace):
    total_tokens = 0
    root_tokens = 0
    subcall_tokens = 0
    cache_hits = 0
    for step in trace.steps:
        if step.usage:
            total_tokens += step.usage.total_tokens
            if step.kind == "root_call":
                root_tokens += step.usage.total_tokens
            elif step.kind in ("subcall", "recursive_subcall"):
                subcall_tokens += step.usage.total_tokens
        if step.cache_hit:
            cache_hits += 1
    print(f"Total tokens: {total_tokens}")
    print(f" Root LLM: {root_tokens}")
    print(f" Subcalls: {subcall_tokens}")
    print(f"Root steps: {sum(1 for s in trace.steps if s.kind == 'root_call')}")
    print(f"Subcalls: {sum(1 for s in trace.steps if s.kind in ('subcall', 'recursive_subcall'))}")
    print(f"Cache hits: {cache_hits}")
    print(f"REPL errors: {sum(1 for s in trace.steps if s.error)}")
Identifying Bottlenecks
- Too many root steps: The LLM is not converging. Check if it's getting useful REPL feedback.
- Too many subcalls: Chunking is too fine-grained. Increase chunk sizes.
- High token usage in subcalls: Sub-LLM responses are verbose. Check SUBCALL_SYSTEM_PROMPT.
- No cache hits: Either first run, or cache keys differ between runs (model/token params changed).
Policy Tuning
| Scenario | Recommended Policy |
|---|---|
| Quick extraction (needle-in-haystack) | Policy(max_steps=10, max_subcalls=20) |
| Multi-document analysis | Policy(max_steps=30, max_subcalls=200) |
| Deep research (100+ docs) | Policy(max_steps=40, max_subcalls=500, max_total_tokens=500_000) |
| Development / testing | Policy(max_steps=5, max_subcalls=10, max_total_tokens=50_000) |
Using SHOW_TRAJECTORY
Set the SHOW_TRAJECTORY=1 environment variable when running the examples to see the execution flow in real time. The TraceFormatter in router.py provides formatted trajectory output, including an ASCII visualization.
Logging
Enable debug logging for detailed runtime output:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("pyrlm_runtime")
rlm = RLM(adapter=my_adapter, logger=logger)
This shows: root_call steps, REPL exec results, subcall details, cache hits/misses, and policy state.