name: "just-in-time-reinforcement-learning-continual"
description: "Implement JitRL-style continual learning for LLM agents: training-free policy optimization via experience memory, advantage estimation, and logit modulation. Use when asked to 'add experience memory to an agent', 'implement continual learning without fine-tuning', 'build a JitRL agent', 'optimize agent actions from past trajectories', 'add non-parametric RL to an LLM pipeline', or 'make my agent learn from its mistakes at inference time'."
Just-In-Time Reinforcement Learning for LLM Agents
This skill enables Claude to implement JitRL, a training-free continual learning framework that makes LLM agents improve over time without gradient updates or fine-tuning. The core idea: maintain a non-parametric memory of past (state, action, return) experiences, retrieve relevant trajectories at inference time, estimate action advantages via k-NN, and apply an additive logit correction that is the provably optimal solution to the KL-constrained policy optimization objective. This gives agents the ability to learn from successes and failures across episodes while preserving the base LLM's capabilities.
When to Use
- When building an LLM agent (web navigation, game-playing, tool-use) that must improve its decisions over repeated episodes without retraining
- When a user asks to "add memory" or "learning from experience" to an existing agentic pipeline
- When implementing test-time policy optimization that avoids catastrophic forgetting
- When the user wants to reduce API costs compared to fine-tuning approaches like WebRL or online RLHF
- When building an agent that operates in a repeatable environment (web tasks, CLI workflows, interactive fiction) where past trajectories are informative
- When adding exploration-exploitation tradeoffs to an LLM agent's action selection
Key Technique
The Problem. Deployed LLMs have frozen weights. Traditional RL requires expensive gradient updates and risks catastrophic forgetting. JitRL instead targets the KL-constrained policy optimization objective pi* = argmax over pi' of (E_{a ~ pi'}[A(s,a)] - (1/beta) * D_KL(pi' || pi_theta)), which has a closed-form solution and so avoids both issues.
The Solution. The closed-form optimal policy is pi*(a|s) proportional to pi_theta(a|s) * exp(beta * A(s,a)). In logit space, this reduces to a simple additive update: z'(s,a) = z(s,a) + beta * A_hat(s,a), where z are the base LLM's output logits and A_hat is the estimated advantage. No gradients, no backpropagation -- just arithmetic on logits.
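The closed form quoted above follows from a standard Lagrangian argument. A short derivation for a fixed state s and a discrete action set is sketched here; the multiplier \mu for the normalization constraint is notation introduced for this sketch, not taken from the source.

```latex
\begin{aligned}
\mathcal{L}(\pi', \mu)
  &= \sum_a \pi'(a \mid s)\, A(s,a)
   - \frac{1}{\beta} \sum_a \pi'(a \mid s)
     \log \frac{\pi'(a \mid s)}{\pi_\theta(a \mid s)}
   - \mu \Big( \sum_a \pi'(a \mid s) - 1 \Big) \\[4pt]
0 = \frac{\partial \mathcal{L}}{\partial \pi'(a \mid s)}
  &= A(s,a)
   - \frac{1}{\beta} \Big( \log \frac{\pi'(a \mid s)}{\pi_\theta(a \mid s)} + 1 \Big)
   - \mu \\[4pt]
\Rightarrow\quad \pi^*(a \mid s)
  &\propto \pi_\theta(a \mid s)\, \exp\!\big( \beta\, A(s,a) \big) \\[4pt]
\Rightarrow\quad z'(s,a)
  &= z(s,a) + \beta\, A(s,a) + \text{const}
\end{aligned}
```

The constant does not depend on a, so it cancels under softmax. In practice the true advantage A is replaced by the memory-based estimate A_hat.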
Advantage Estimation. Advantages come from a non-parametric experience memory storing (state, action, discounted_return) triplets. Given a current state, retrieve the top-k most similar past states using Jaccard similarity over tokenized representations. Estimate the state value V_hat(s) as the mean return of retrieved neighbors, and action value Q_hat(s,a) as the mean return of neighbors that took action a. The advantage is A_hat(s,a) = Q_hat(s,a) - V_hat(s). For unseen actions, a small exploration bonus alpha/|N(s)| is added with probability lambda to encourage trying new strategies.
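A small worked example can make the estimator concrete. The sketch below assumes the neighbors have already been retrieved and reduced to (action, return) pairs, and it omits the random exploration bonus so the numbers are deterministic; the function name and the action strings are illustrative only.

```python
from collections import defaultdict


def knn_advantages(neighbors, candidates):
    """neighbors: list of (action, G) pairs taken from the retrieved states.

    Returns {action: A_hat}. Unseen actions get Q_hat = V_hat, i.e. A_hat = 0.0
    (the exploration bonus is deliberately left out of this sketch).
    """
    if not neighbors:
        return {a: 0.0 for a in candidates}
    # V_hat(s): mean return over all retrieved neighbors.
    v_hat = sum(g for _, g in neighbors) / len(neighbors)
    returns_by_action = defaultdict(list)
    for action, g in neighbors:
        returns_by_action[action].append(g)
    advantages = {}
    for a in candidates:
        seen = returns_by_action.get(a)
        # Q_hat(s, a): mean return of neighbors that took action a.
        q_hat = sum(seen) / len(seen) if seen else v_hat
        advantages[a] = q_hat - v_hat
    return advantages


neighbors = [
    ("click_submit", 1.0),
    ("click_submit", 0.5),
    ("click_cancel", -0.5),
    ("click_submit", 1.0),
]
adv = knn_advantages(neighbors, ["click_submit", "click_cancel", "press_tab"])
# V_hat = 0.5, so click_submit gets 2.5/3 - 0.5 and click_cancel gets -0.5 - 0.5.
```

Here the mean return over all four neighbors is 0.5, so the frequently successful action ends up with a positive advantage, the failed one with a negative advantage, and the unseen one stays neutral.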
Step-by-Step Workflow
1. Define the state representation for your domain. Encode environment observations into a canonical string form suitable for similarity matching. For web tasks, regularize URLs by replacing dynamic IDs with wildcards (e.g., /user/edit/42 becomes /user/edit/*) and append recent action history. For CLI/tool-use agents, structure as [State: key_nouns] [Action: recent_verbs]. The representation must be stable enough that similar situations produce similar strings.
2. Implement the experience memory store. Create a persistent store (JSON on disk or a lightweight DB) holding (state_repr, action_text, discounted_return) triplets. Organize by episode so you can recompute returns when reward signals change. The store should support append, bulk retrieval, and optional size limits (LRU eviction or return-weighted pruning).
3. Implement Jaccard similarity retrieval. Tokenize state representations into n-gram sets (unigrams or bigrams). For a query state s, compute J(s, s_i) = |T(s) intersect T(s_i)| / |T(s) union T(s_i)| against all stored states. Return the top-k entries exceeding a similarity threshold (default k=10, threshold=0.8). For large memories, use MinHash or locality-sensitive hashing to avoid O(N) scans.
4. Estimate action advantages from retrieved neighbors. Compute V_hat(s) = mean(G_i for i in N(s)) across all retrieved neighbors. For each candidate action a, compute Q_hat(s,a) = mean(G_j for j in N(s) where action_j == a) if the action has been seen before. For unseen actions, set Q_hat(s,a) = V_hat(s) + alpha/|N(s)| with probability lambda (exploration), otherwise Q_hat(s,a) = V_hat(s). Then A_hat(s,a) = Q_hat(s,a) - V_hat(s).
5. Construct the augmented candidate action set. Merge actions proposed by the base LLM with actions seen in retrieved trajectories. This ensures the agent considers both its default policy and historically successful actions. Deduplicate by normalizing action text.
6. Apply the logit modulation. For each candidate action, compute updated logits: z'(a) = z(a) + beta * A_hat(s,a), where z(a) is the base LLM's log-probability for action a and beta controls exploitation strength (start with beta=1.0, tune in [0.5, 5.0]). Pass updated logits through softmax and sample the action. If you lack direct logit access (e.g., API-only models), approximate by constructing a weighted prompt that emphasizes high-advantage actions and de-emphasizes low-advantage ones.
7. Execute the chosen action and collect rewards. Run the selected action in the environment. Record the state-action pair. At episode boundaries (task success/failure or step limit), use an LLM evaluator to assign step-wise reward scores r_t for each action taken.
8. Compute discounted returns and update memory. After an episode ends, compute G_t = sum(gamma^(u-t) * r_u for u in [t, T]) with gamma=0.95 for each timestep. Store all (state_t, action_t, G_t) triplets into the experience memory. Persist to disk.
9. Iterate across episodes. On each new episode, the retrieval pool grows, advantage estimates become more accurate, and the agent's policy improves. Monitor cumulative reward per episode to verify learning. Expect meaningful improvement within 5-20 episodes for moderately complex tasks.
10. Tune hyperparameters based on observed behavior. If the agent is too conservative (always repeating known-good actions), lower beta or raise lambda. If it wastes time on bad exploration, raise beta or the similarity threshold. If retrieval is too sparse, lower the threshold or switch from bigram to unigram tokenization, since unigram sets typically overlap more.
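The state-representation and retrieval steps above can be sketched together. This is a minimal illustration, not the paper's implementation: the regular expressions, the helper names (normalize_url, build_state, ngrams, jaccard), and the choice of whitespace tokens are all assumptions made for the example.

```python
import re


def normalize_url(url):
    """Replace numeric path segments and long hex IDs with '*' (assumed rule set)."""
    url = re.sub(r"/\d+(?=[/?#]|$)", "/*", url)
    url = re.sub(r"/[0-9a-f]{16,}(?=[/?#]|$)", "/*", url)
    return url


def build_state(url, recent_actions, history_len=3):
    """Canonical state string: normalized URL followed by the last few actions."""
    history = " ".join(recent_actions[-history_len:])
    return f"{normalize_url(url)} {history}".strip()


def ngrams(text, n=1):
    """Set of whitespace-token n-grams (n=1 unigrams, n=2 bigrams)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=1):
    """Jaccard similarity of the two n-gram sets; 0.0 when both are empty."""
    ta, tb = ngrams(a, n), ngrams(b, n)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


actions = ["click:profile", "click:edit"]
s1 = build_state("/user/edit/42", actions)
s2 = build_state("/user/edit/97", actions)
raw_similarity = jaccard("/user/edit/42 " + " ".join(actions),
                         "/user/edit/97 " + " ".join(actions))
normalized_similarity = jaccard(s1, s2)
```

Without normalization the two visits differ in one of three tokens and would fall below the default 0.8 threshold; after replacing the ID with a wildcard they match exactly.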
Concrete Examples
Example 1: Web Navigation Agent with Experience Memory
User: "Build a web agent that learns from past browsing sessions to get better at filling out forms on our internal tool."
Approach:
- Define state as the regularized current URL + visible form field names + last 3 actions taken.
- Store experience as JSON files under memory/{task_type}/{timestamp}.json, each entry containing {state, action, return}.
- On each new task, tokenize the current state, retrieve top-10 similar past states via Jaccard similarity.
- Compute advantage for each candidate action (click, type, select) based on returns of retrieved neighbors.
- Modulate the LLM's action probabilities: z'(a) = z(a) + 1.5 * A_hat(s,a).
- After task completion, score each step (1.0 for correct fills, -0.5 for errors, 0.0 for neutral navigation), compute discounted returns with gamma=0.95, and store to memory.
Output structure:
    # experience_memory.py
    import json
    import os
    import random


    def tokenize(text):
        # Assumed tokenizer: lowercase whitespace unigrams.
        return text.lower().split()


    class ExperienceMemory:
        def __init__(self, memory_dir="memory/", gamma=0.95):
            self.entries = []  # List of (state_repr, action, G)
            self.memory_dir = memory_dir
            self.gamma = gamma  # Discount factor used in update()

        def retrieve(self, query_state, top_k=10, threshold=0.8):
            query_tokens = set(tokenize(query_state))
            scored = []
            for state_repr, action, G in self.entries:
                entry_tokens = set(tokenize(state_repr))
                union = query_tokens | entry_tokens
                if not union:
                    continue  # Both states empty: nothing to compare
                jaccard = len(query_tokens & entry_tokens) / len(union)
                if jaccard >= threshold:
                    scored.append((jaccard, state_repr, action, G))
            scored.sort(reverse=True)  # Most similar first
            return scored[:top_k]

        def estimate_advantage(self, query_state, candidate_actions, alpha=0.1, lam=0.3):
            neighbors = self.retrieve(query_state)
            if not neighbors:
                return {a: 0.0 for a in candidate_actions}
            V_hat = sum(G for _, _, _, G in neighbors) / len(neighbors)
            advantages = {}
            for a in candidate_actions:
                matching = [G for _, _, act, G in neighbors if act == a]
                if matching:
                    Q_hat = sum(matching) / len(matching)
                elif random.random() < lam:
                    # Unseen action: exploration bonus with probability lam
                    Q_hat = V_hat + alpha / len(neighbors)
                else:
                    Q_hat = V_hat
                advantages[a] = Q_hat - V_hat
            return advantages

        def update(self, trajectory, step_rewards):
            """trajectory: list of (state, action), step_rewards: list of float"""
            T = len(trajectory)
            for t in range(T):
                G_t = sum(self.gamma ** (u - t) * step_rewards[u] for u in range(t, T))
                state, action = trajectory[t]
                self.entries.append((state, action, G_t))
            self.save()

        def save(self):
            # Simplified persistence (assumption): one JSON file with all triplets.
            os.makedirs(self.memory_dir, exist_ok=True)
            with open(os.path.join(self.memory_dir, "entries.json"), "w") as f:
                json.dump(self.entries, f)
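The return computation in update re-sums the remaining rewards for every timestep, which is quadratic in episode length. As a sketch of an equivalent alternative, the backward recursion G_t = r_t + gamma * G_{t+1} yields the same values in one pass. The function name is illustrative; the reward values follow the scoring scheme in the approach above.

```python
def discounted_returns(step_rewards, gamma=0.95):
    """Return [G_0, ..., G_{T-1}] using G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


# Neutral navigation, then an error, then a correct fill.
rewards = [0.0, -0.5, 1.0]
G = discounted_returns(rewards)
# G[2] = 1.0, G[1] = -0.5 + 0.95 * 1.0, G[0] = 0.0 + 0.95 * G[1]
```

Note that the early neutral step still receives a positive return because the episode ended well, which is what lets the memory credit actions that lead to later success.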
Example 2: CLI Automation Agent That Improves Over Deployments
User: "I have a DevOps agent that runs deployment scripts. It sometimes picks suboptimal rollback strategies. Make it learn from past incidents."
Approach:
- State representation: [Service: {name}] [Error: {error_type}] [Metrics: {cpu/mem/latency_bucket}] [Recent: {last_2_actions}].
- Actions: "rollback-canary", "restart-pods", "scale-up", "revert-config", "escalate-to-human".
- After each incident resolution, score the outcome: 1.0 for fast resolution, 0.5 for slow resolution, -1.0 for escalation or extended downtime.
- On new incidents, retrieve similar past incidents, compute advantage per action, and present the agent with a reranked action list.
Output:
    def select_action(llm, state, memory, beta=2.0):
        # llm.propose_actions, llm.score_action, memory.get_historical_actions,
        # softmax, and sample are assumed helpers that this snippet does not define.
        # Get base LLM action proposals with log-probs
        candidates = llm.propose_actions(state)  # {action: log_prob}
        # Get historically successful actions from memory
        historical_actions = memory.get_historical_actions(state)
        for a in historical_actions:
            if a not in candidates:
                candidates[a] = llm.score_action(state, a)  # base log-prob
        # Compute advantages
        advantages = memory.estimate_advantage(state, list(candidates.keys()))
        # Apply JitRL logit modulation
        modulated = {a: candidates[a] + beta * advantages[a] for a in candidates}
        # Softmax and sample
        probs = softmax(modulated)
        return sample(probs)
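The select_action function relies on softmax and sample helpers that the example leaves undefined. One possible minimal version over {action: score} dictionaries is sketched below; the signatures are assumptions chosen to match how the function calls them, and the log-probabilities and advantages are made-up numbers for illustration.

```python
import math
import random


def softmax(scores):
    """scores: {action: logit}. Returns {action: probability}."""
    m = max(scores.values())  # Subtract the max for numerical stability
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def sample(probs, rng=random):
    """Draw one action according to its probability."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Two actions the base model considers equally likely (hypothetical values).
base = {"rollback-canary": math.log(0.5), "restart-pods": math.log(0.5)}
advantages = {"rollback-canary": 0.5, "restart-pods": -0.5}
beta = 2.0
modulated = {a: base[a] + beta * advantages[a] for a in base}
probs = softmax(modulated)
choice = sample(probs)
```

With equal base probabilities, the modulated distribution depends only on beta times the advantage gap, so a gap of 1.0 at beta=2.0 shifts most of the probability mass toward the historically better action while still leaving the other one reachable.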
Example 3: API-Only Approximation (No Logit Access)
User: "I'm using GPT-4o through the API and can't access logits. Can I still use JitRL?"
Approach:
- Retrieve similar past experiences and compute advantages as usual.
- Instead of modulating logits directly, construct a prompt prefix that encodes advantage information:
  - Prepend a "learned strategy" section listing high-advantage actions with their estimated advantages.
  - Frame low-advantage actions as "previously unsuccessful approaches to avoid."
- Use the LLM's in-context learning to approximate the logit shift.
Output:
    def build_jitrl_prompt(base_prompt, state, memory):
        # memory.get_all_actions is an assumed helper that this snippet does not define.
        neighbors = memory.retrieve(state)
        if not neighbors:
            return base_prompt
        advantages = memory.estimate_advantage(state, memory.get_all_actions(state))
        sorted_actions = sorted(advantages.items(), key=lambda x: -x[1])
        strategy_section = "Based on past experience in similar situations:\n"
        # Up to three best actions, only if their advantage is positive
        for action, adv in sorted_actions[:3]:
            if adv > 0:
                strategy_section += f"- PREFER: '{action}' (historically effective, advantage={adv:.2f})\n"
        # Up to two worst actions, only if their advantage is negative
        for action, adv in sorted_actions[-2:]:
            if adv < 0:
                strategy_section += f"- AVOID: '{action}' (historically poor, advantage={adv:.2f})\n"
        return f"{strategy_section}\n{base_prompt}"
Best Practices
- Do: Normalize state representations aggressively. Replace dynamic IDs, timestamps, and session tokens with wildcards. Similarity matching degrades fast when states contain ephemeral noise.
- Do: Start with a small beta (1.0) and increase gradually. Over-exploitation early (high beta) locks the agent into suboptimal strategies before the memory has enough coverage.
- Do: Use an LLM evaluator for step-wise rewards rather than only sparse episode-level signals. Intermediate rewards produce much more informative advantage estimates, especially in long-horizon tasks.
- Do: Persist memory across sessions. The entire value of JitRL comes from accumulating experience. Losing memory resets the agent to base LLM performance.
- Avoid: Applying JitRL when the environment is highly non-stationary (e.g., the task distribution changes completely between episodes). The advantage estimates from old trajectories will be misleading.
- Avoid: Setting the similarity threshold too low. Retrieving dissimilar states injects noise into advantage estimates. If retrieval returns fewer than 3 neighbors, fall back to the base LLM policy rather than using noisy estimates.
Error Handling
| Issue | Cause | Fix |
|---|---|---|
| Advantages are all zero | Memory is empty or threshold too high | Lower retrieval_threshold to 0.6, or fall back to base LLM until memory has 20+ entries |
| Agent loops on the same action | High beta with limited action diversity in memory | Increase lambda (exploration probability) to 0.5, or decrease beta |
| Retrieval is too slow | Memory grown beyond 10K entries with brute-force Jaccard | Switch to MinHash LSH indexing or prune low-return entries periodically |
| Step-wise rewards are unreliable | LLM evaluator hallucinating reward scores | Add calibration: normalize rewards to [-1, 1], use majority vote across 3 evaluator calls |
| Memory diverges from current environment | Environment changed (new UI, new API version) | Add a recency weight w = decay^(episode_age) to retrieved returns, or flush old entries |
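The recency weight in the last row can be applied when averaging neighbor returns. The sketch below assumes each stored entry also carries the index of the episode that produced it, which the (state, action, return) triplet described earlier does not include; the function name and the decay value are likewise illustrative.

```python
def recency_weighted_value(neighbors, current_episode, decay=0.9):
    """neighbors: list of (G, episode_index) pairs.

    Returns the weighted mean return with w = decay ** episode_age,
    where episode_age = current_episode - episode_index.
    """
    weights = [decay ** (current_episode - ep) for _, ep in neighbors]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * g for w, (g, _) in zip(weights, neighbors)) / total


# A fresh success and a ten-episode-old failure for the same state.
neighbors = [(1.0, 10), (-1.0, 0)]
v = recency_weighted_value(neighbors, current_episode=10, decay=0.5)
```

An unweighted mean of these two returns would be 0.0; with decay=0.5 the stale failure has weight 1/1024, so the estimate is dominated by the recent outcome.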
Limitations
- Requires repeated episodes in similar environments. JitRL is not useful for one-shot tasks or completely novel environments where no prior trajectory data exists.
- Logit access needed for exact method. The provably optimal logit modulation requires access to output log-probabilities. API-only models (without a logprobs parameter) must use the prompt-based approximation, which is weaker.
- Memory scales linearly. Without indexing, retrieval cost grows with memory size. For production systems with 100K+ entries, approximate nearest-neighbor methods are necessary.
- Sensitive to state representation quality. Poor state encoding (too verbose, too sparse, or including irrelevant information) degrades Jaccard similarity and makes retrieval unreliable.
- Does not handle partial observability well. If the true environment state is hidden, the observed state may map to very different optimal actions, causing high-variance advantage estimates.
Reference
Paper: Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates (Li et al., 2026). Key sections: Section 4 for the closed-form derivation of z' = z + beta * A_hat, Section 4.2 for the k-NN advantage estimation procedure, and the appendix theorems for convergence guarantees. Code: github.com/liushiliushi/JitRL.