name: rlm
description: >-
Recursive Language Model (RLM) loop for processing a context that is too large
to read into the conversation directly. Loads the context as a variable in a
persistent Python REPL and answers the query by writing code that probes,
chunks, and programmatically sub-queries a cheap LLM (llm_query) over slices
of it, then aggregates. Use this WHENEVER the user points you at a big context
file/log/transcript/codebase/scraped corpus (anything from ~50K chars up to
millions) and asks a question that needs most of the content -- counting,
aggregating, classifying every item, multi-hop lookup, or summarising the whole
thing -- ESPECIALLY when the answer "depends on almost every line" and a single
retrieval/grep won't do. Trigger it even if the user doesn't say "RLM": phrases
like "this file is huge", "go through the whole log", "how many X across all of
these", "label every row", or "it won't fit in context" are all signals to use
this skill. Prefer it over dumping the file into chat.
allowed-tools:
- Bash
- Read
- Write
- Edit
- Grep
- Glob
rlm — Recursive Language Model loop
A faithful instantiation of Recursive Language Models (Zhang, Kraska, Khattab; arXiv:2512.24601), Algorithm 1, on Claude Code's primitives. The paper's insight: an arbitrarily long prompt should not be fed into a model's context window at all. It should live in an environment the model interacts with programmatically, recursively calling a model over slices of it. That's what this skill does.
Mental model
You (the main Claude Code conversation) are the root model. You do not read the big context into this conversation. Instead:
- The context lives as a
contextvariable inside a persistent Python REPL (scripts/rlm_repl.py). You only ever see metadata about it (length, a short prefix) and the truncated stdout of code you run — never the whole thing. This is the one rule that lets the context be far larger than any window. - You answer by writing REPL code that probes the context, decomposes it, and
calls a cheap sub-LM over the pieces:
llm_query(prompt)/llm_query_map(prompts)— a single / a parallel batch of plain sub-LM calls (a nested headlessclaude -p, tools off). This is the leaf: it reads a bounded chunk in its own window and returns text.rlm_query(context, query)— a full recursive RLM over a sub-context, for sub-tasks that are themselves too big for one leaf call (depth > 1). Falls back tollm_queryat the depth limit.
- You build intermediate results into REPL variables/buffers, then return the
answer by setting it in the REPL:
FINAL(answer)orFINAL_VAR(varname).
The division of labour that makes this work: the LLM does the semantics (classify this question, extract this fact, summarise this section); your Python does the bookkeeping (loop over every chunk, count, aggregate, format). Do not ask the LLM to count or do arithmetic over the whole corpus, and do not try to do the semantics yourself in Python with keyword heuristics — that is exactly the failure mode the paper's ablations show. Split the work along that seam.
When to use this
Use it when the context won't fit comfortably in the conversation and the task
needs broad access to it: aggregation/counting over every item, labelling every
row, multi-hop questions across a corpus, whole-document summarisation, or
"the answer depends on almost every line". For a one-off needle lookup in a file
you can just grep, you don't need this.
Inputs ($ARGUMENTS)
context=<path>(required): path to the large context file.query=<question>(required): what to answer about it.- Optional:
sub_model=<alias>(defaulthaiku),max_workers=<int>(default 8),max_depth=<int>(default 1; >1 enables recursiverlm_query).
If the user didn't supply them, ask for the context file path and the query.
Set optional knobs via environment before running, e.g.:
export RLM_SUB_MODEL=haiku RLM_MAX_WORKERS=8 RLM_MAX_DEPTH=1.
The loop (Algorithm 1)
Run these via the Bash tool. State persists between calls in
.claude/rlm_state/state.pkl. By default, init also creates a standalone
audit replay package under .claude/rlm_runs/<run_id>/; every exec saves the
submitted Python as steps/step_XXXX.py.
1. Initialise — load the context, read only its metadata
python .claude/skills/rlm/scripts/rlm_repl.py init <context_path>
This prints the context's type, char/line/token estimate, and a short prefix.
Do not read the context file with the Read tool — that defeats the purpose.
It also prints the audit replay package path. Use --no-audit only when you do
not want standalone step scripts.
2. Probe — understand the format with small, cheap code
Look at the shape of the data before deciding a strategy. Print small slices and structure, not the bulk:
python .claude/skills/rlm/scripts/rlm_repl.py exec <<'PY'
print(peek(0, 1500)) # head
lines = [l for l in content.splitlines() if l.strip()]
print("lines:", len(lines))
print("sample:", lines[1] if len(lines) > 1 else "")
PY
Ask: Is it line-oriented? JSON objects? Markdown sections? Logs with timestamps? The format dictates the chunking.
3. Decompose + sub-query — write code that calls the LLM over chunks
This is the core. Chunk the context, build one prompt per chunk, and fan the
semantic work out to the sub-LM with llm_query_map (parallel). Keep the
chunks fat (a leaf can hold a large slice — batch to minimise call count) but small
enough that the sub-LM stays accurate. Accumulate results in a variable; let Python
do the aggregation.
python .claude/skills/rlm/scripts/rlm_repl.py exec <<'PY'
# Example shape for an aggregation task: derive records from the actual format,
# ask leaf LMs for semantic labels, then count/aggregate in Python.
records = [line.strip() for line in content.splitlines() if line.strip()]
# Fill these from the user's query and what you observed while probing. Do not
# assume the file's delimiter, item marker, or labels before inspecting it.
question = "What should be classified or extracted for each record?"
categories = ["category_a", "category_b", "category_c"]
def build(batch, start):
body = "\n".join(f"{start+i}: {record}" for i, record in enumerate(batch))
return (
f"{question}\n"
f"Use exactly one of these categories: {', '.join(categories)}.\n"
"Output exactly one line per record as 'N: <category>'. No extra text.\n\n"
+ body
)
BATCH = 50
prompts = [build(records[s:s+BATCH], s) for s in range(0, len(records), BATCH)]
outs = llm_query_map(prompts) # parallel sub-LM calls; order preserved
import re
from collections import Counter
labels = {}
for out in outs:
for ln in out.splitlines():
m = re.match(r"\s*(\d+)\s*[:.\)]\s*(.+)", ln)
if m:
labels[int(m.group(1))] = m.group(2).strip().strip("*[]").lower()
missing = [i for i in range(len(records)) if i not in labels]
counts = Counter(labels.values())
print("classified:", len(labels), "/", len(records), "missing:", len(missing))
print("counts:", dict(counts))
PY
Because the REPL is persistent, items, labels, and counts survive into your
next exec. Inspect, sanity-check, and re-run pieces as needed. Save durable
intermediate text with add_buffer(...) (it lives in the buffers list).
4. Aggregate + answer — compute the final answer, set it in the REPL
Do the final arithmetic/formatting in Python, then set the answer. The answer must be a REPL variable or literal — not just something you say in chat (so it can be arbitrarily long and is captured verbatim):
python .claude/skills/rlm/scripts/rlm_repl.py exec <<'PY'
top = counts.most_common(1)[0][0]
answer = f"Label: {top}"
FINAL_VAR("answer") # or: FINAL(f"Label: {top}")
PY
python .claude/skills/rlm/scripts/rlm_repl.py final # prints the stored answer
Then report that final answer to the user, in the exact output format the query asked for.
REPL interface (what's available inside exec)
Injected automatically every exec (you never import or define these):
| name | what it does |
|---|---|
context / content |
the full context, as a str (two names for the same value) |
llm_query(prompt, model=None, timeout=300, system=...) |
one sub-LM leaf call → text |
llm_query_map(prompts, max_workers=8, ...) |
many leaf calls in parallel → list of texts, in order |
rlm_query(context_text, query, ...) |
recursive RLM over a sub-context (depth>1); falls back to llm_query at max depth |
FINAL(answer) / FINAL_VAR(name) |
set the final answer (literal / by variable name) |
peek(start, end) |
a slice of the raw context |
grep(pattern, max_matches, window) |
regex search → matches with surrounding snippets |
chunked(seq, size) |
yield size-length slices of a list (lines, etc.) |
chunk_indices(size, overlap) / write_chunks(dir, ...) |
character chunk spans / write chunks to files |
add_buffer(text) / buffers |
append to / read the persistent list of intermediate results |
Your own variables persist between exec calls (anything pickleable). stdout is
truncated (~8000 chars) before you see it — print summaries and samples, not bulk.
Standalone audit replay
Each audited exec writes:
steps/step_XXXX.py- a normal Python script containing the original REPL code plus a small prelude that recreates the RLM globals.steps/step_XXXX.json- metadata such as hashes, output paths, and final status.steps/step_XXXX.stdout.txt/.stderr.txt- the original captured output.runtime/- a copy of the runtime needed by the generated scripts.replay_all.py- runs all saved steps from a cleanreplay_state.pkl.
Replay with:
python .claude/rlm_runs/<run_id>/replay_all.py
Replay calls llm_query live, so sub-LM text can differ from the original run.
The replay checkpoint is separate from the live REPL state and does not mutate
.claude/rlm_state/state.pkl.
Guardrails — these are where RLMs win or lose
- Never read the whole context into the conversation. No Read tool on the
context file, no
print(content). Work through the REPL and sub-LM calls. If you catch yourself wanting the full text in chat, chunk it andllm_queryit instead. - Split semantics from arithmetic. LLM = meaning (classify/extract/summarise);
Python = counting/aggregation/formatting. Counting with the LLM, or classifying
with
if "keyword" in line, both score badly. - Batch sub-calls; don't make one call per line. Put many items in each
llm_query(e.g. 50–100 short lines per call) and parallelise withllm_query_map. Thousands of one-item calls are slow and costly for no accuracy gain. But keep batches small enough that the sub-LM doesn't drop or miscount items — verifyclassified == totaland re-run any short/garbled batch. - Process the entire context before answering for aggregation tasks — the point is that you can't shortcut it. Check your coverage counts.
- Return the answer from the REPL via
FINAL/FINAL_VAR, then echo it to the user in the requested format. Don't stop at intermediate buffers. - Recursion (
rlm_query) is for sub-tasks too big for one leaf, e.g. "analyse these 500 documents that each need their own chunking". It is slower and costlier; most tasks (including pure aggregation) only needllm_query. Defaultmax_depthis 1.
Notes
- The sub-LM is a nested headless Claude Code (
claude -p) and reuses your existing login — no API key or SDK.llm_queryruns it with tools off (a plain LLM);rlm_queryruns it with bash + this skill on (its own REPL). - Keep all scratch/state under
.claude/rlm_state/.