name: rim-reasoning-in-memory
description: Latent reasoning method that replaces autoregressive generation with memory blocks - working memory capacity for compute-efficient reasoning
version: 1.0.0
author: Hermes Agent (from arXiv 2605.30343)
tags: [LLM, reasoning, working-memory, latent-reasoning, inference-optimization]
activation_keywords: [latent reasoning, working memory, memory blocks, test-time compute, reasoning steps, autoregressive]
RiM: Reasoning in Memory - Working Memory for Latent Reasoning
Overview
RiM (Reasoning in Memory) is a latent reasoning method that replaces the autoregressive generation of reasoning steps with fixed memory blocks. These blocks unlock working-memory capacity in LLMs, so reasoning happens in a single forward pass instead of token by token.
Core Concept
Key Insight
Human cognition uses working memory to hold and manipulate information internally without externalizing intermediate thoughts. RiM applies this principle to LLMs.
Memory Blocks
- Fixed sequences of special tokens that are inserted into the input, not generated
- Processed in a single forward pass
- Unlock working-memory capacity in the model
- Enable latent reasoning without generating reasoning steps autoregressively
Two-Stage Curriculum
Stage 1: Grounding Phase
# Ground memory blocks by predicting explicit reasoning steps.
# decode_explicit_step and step_prediction_loss are placeholder helpers.
def grounding_training(model, prompt, memory_block, ground_truth_step):
    output = model(prompt + memory_block)
    # Predict the explicit reasoning step that follows the memory block
    reasoning_step = decode_explicit_step(output)
    # Supervise on the step-level output
    loss = step_prediction_loss(reasoning_step, ground_truth_step)
    return loss
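One way to picture the step-level supervision is as a loss mask over the training sequence: the prompt and memory-block positions carry no target, and only the explicit reasoning step after the block is scored. The sketch below illustrates that layout under this assumption. The token ids, the `IGNORE` sentinel, and the helper name `build_grounding_example` are hypothetical and not taken from the paper.

```python
IGNORE = -100  # assumed sentinel for unsupervised positions (a common convention, not from the paper)

def build_grounding_example(prompt_ids, memory_block_ids, step_ids):
    """Return (input_ids, labels) for one grounding example.

    Only the explicit reasoning step after the memory block is supervised;
    prompt and memory-block positions are masked out with IGNORE.
    """
    input_ids = list(prompt_ids) + list(memory_block_ids) + list(step_ids)
    masked = len(prompt_ids) + len(memory_block_ids)
    labels = [IGNORE] * masked + list(step_ids)
    return input_ids, labels

# Toy ids: 3 prompt tokens, a 4-token memory block, a 2-token reasoning step
input_ids, labels = build_grounding_example(
    [11, 12, 13], [900, 901, 902, 903], [21, 22]
)
```

In a causal language model the labels would usually be shifted by one position inside the loss function; that detail is left out here to keep the layout easy to read.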
Stage 2: Refinement Phase
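# Discard step-level supervision and train on the final answer only.
# decode_answer and answer_prediction_loss are placeholder helpers.
def refinement_training(model, prompt, memory_blocks, ground_truth):
    losses = []
    # Process the memory blocks one after another
    for memory_block in memory_blocks:
        output = model(prompt + memory_block)
        answer = decode_answer(output)
        # Score the answer predicted after this block
        losses.append(answer_prediction_loss(answer, ground_truth))
    # Assumption: per-block losses are summed into one training loss
    return sum(losses)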
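The helper `answer_prediction_loss` is left undefined above. A common choice for an answer-level objective is cross-entropy over the candidate answers, sketched below as one possible stand-in. The function name `answer_cross_entropy`, the toy logits, and the decision to sum the per-block losses are all assumptions made for illustration, not details from the paper.

```python
import math

def answer_cross_entropy(logits, target_index):
    """Cross-entropy of the target answer under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]

# Hypothetical answer logits after the first and the second memory block
per_block_logits = [[1.0, 1.0, 1.0], [0.0, 3.0, 0.0]]
losses = [answer_cross_entropy(logits, target_index=1) for logits in per_block_logits]
total_loss = sum(losses)  # assumed way of combining the per-block losses
```

In this toy case the second block puts more probability on the correct answer, so its loss is lower than the first block's, which is the behaviour a refinement stage would aim for.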
Implementation Pattern
Memory Block Design
# Define memory block as fixed token sequence
MEMORY_BLOCK_TOKENS = [MEM_START, MEM_TOKEN_1, ..., MEM_TOKEN_N, MEM_END]
def add_memory_block(prompt, position):
    # prompt is a list of tokens; insert the memory block at the given position
    return prompt[:position] + MEMORY_BLOCK_TOKENS + prompt[position:]
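A runnable version of the same idea with concrete values may help. The token ids and the block length of four are arbitrary choices for illustration; the source does not specify them, and `insert_memory_block` is a hypothetical name.

```python
# Hypothetical token ids; a real tokenizer would assign these when the special tokens are added
MEM_START, MEM_END = 1000, 1001
MEM_BODY = [1002, 1003, 1004, 1005]  # block length of 4 chosen arbitrarily

MEMORY_BLOCK = [MEM_START, *MEM_BODY, MEM_END]

def insert_memory_block(prompt_ids, position):
    """Return a new id list with the fixed memory block spliced in at `position`."""
    if not 0 <= position <= len(prompt_ids):
        raise ValueError("position must lie within the prompt")
    return prompt_ids[:position] + MEMORY_BLOCK + prompt_ids[position:]

prompt_ids = [5, 6, 7]
with_block = insert_memory_block(prompt_ids, len(prompt_ids))
```

The function returns a new list and leaves `prompt_ids` unchanged, so the same prompt can be reused with a different number of blocks.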
Single Forward Pass Processing
def rim_forward_pass(model, prompt, num_memory_blocks):
    # Create a prompt with multiple memory blocks appended at the end
    enhanced_prompt = prompt
    for _ in range(num_memory_blocks):
        enhanced_prompt = add_memory_block(enhanced_prompt, len(enhanced_prompt))
    # A single forward pass processes all memory blocks
    output = model(enhanced_prompt)
    return decode_answer(output)
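To make the efficiency argument concrete, the sketch below counts sequential model calls for the two approaches using a stub model. `CountingModel` and both helper functions are hypothetical stand-ins, and the token ids are made up; nothing here measures real compute.

```python
class CountingModel:
    """Stand-in for an LLM that only records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, token_ids):
        self.calls += 1
        return len(token_ids)  # dummy output; a real model would return logits

def autoregressive_passes(model, prompt_ids, num_reasoning_tokens):
    """Generate reasoning tokens one at a time: one forward pass per token."""
    ids = list(prompt_ids)
    for _ in range(num_reasoning_tokens):
        model(ids)
        ids.append(0)  # placeholder for the sampled token
    return model.calls

def memory_block_passes(model, prompt_ids, memory_block, num_memory_blocks):
    """Append all memory blocks up front, then run a single forward pass."""
    ids = list(prompt_ids) + list(memory_block) * num_memory_blocks
    model(ids)
    return model.calls

prompt_ids = [5, 6, 7]
ar_calls = autoregressive_passes(CountingModel(), prompt_ids, num_reasoning_tokens=12)
rim_calls = memory_block_passes(
    CountingModel(), prompt_ids, [1000, 1002, 1003, 1001], num_memory_blocks=3
)
```

The count reflects sequential passes only. With key-value caching each autoregressive pass processes a single new token, so this is a comparison of latency-bound steps, not of total FLOPs.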
Key Results
- Matches or exceeds existing latent reasoning methods
- Avoids autoregressive generation of thoughts
- Works across different LLM families and sizes
- Compute-efficient reasoning
When to Use
- Test-time compute scaling without autoregressive chains
- Latent reasoning applications
- Working memory simulation in LLMs
- When inference efficiency is critical
Pitfalls
- Memory blocks must be trained in the grounding phase before refinement
- The two-stage curriculum cannot be skipped
- Memory block length needs tuning for each model
- The special memory tokens must be added to the vocabulary
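The last pitfall can be illustrated with a toy vocabulary: every new special token needs both an id and a matching row in the embedding table. The token names, the embedding size, and the random initialisation below are assumptions chosen for the example, and `add_special_tokens` is a hypothetical helper, not a library call.

```python
import random

def add_special_tokens(vocab, embeddings, new_tokens, dim, seed=0):
    """Append new tokens to a toy vocabulary and grow the embedding table to match."""
    rng = random.Random(seed)
    for token in new_tokens:
        if token in vocab:
            continue  # already present; keep its existing id and embedding
        vocab[token] = len(vocab)
        # Assumption: small random initialisation for the new rows
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings

vocab = {"<pad>": 0, "hello": 1, "world": 2}
embeddings = [[0.0] * 4 for _ in vocab]
memory_tokens = ["<mem_start>", "<mem_1>", "<mem_2>", "<mem_end>"]  # hypothetical names
vocab, embeddings = add_special_tokens(vocab, embeddings, memory_tokens, dim=4)
```

With Hugging Face Transformers the corresponding steps are typically `tokenizer.add_special_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the paper's implementation does this is not stated in the source.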
References
- arXiv: 2605.30343v1
- Authors: Lukas Aichberger, Sepp Hochreiter
- Published: 2026-05-28