
name: "evermembench-benchmarking-long-term-interactive"
description: "Build and evaluate long-term conversational memory systems for multi-party, multi-topic dialogues. Implements the EverMemBench framework for stress-testing memory architectures against realistic workplace conversation patterns with temporal evolution, cross-topic interleaving, and role-specific personas. Use when: 'build a memory system for multi-user chat', 'evaluate my RAG memory pipeline', 'benchmark long-term conversation recall', 'test memory across multi-party dialogues', 'design a temporal memory store for chat agents', 'audit retrieval quality for conversational AI'."

Long-Term Interactive Memory Benchmarking (EverMemBench)

This skill enables Claude to design, build, and evaluate long-term conversational memory systems that handle the hard cases real-world chat applications face: multiple speakers, evolving facts, interleaved topics, and implicit context that similarity search misses. It applies the EverMemBench framework's three-dimensional evaluation (fine-grained recall, memory awareness, user profile understanding) to diagnose exactly where a memory architecture fails and guide targeted fixes.

When to Use

When the user is building a memory layer for a multi-user chatbot or agent system and needs to handle conversations across groups, channels, or time
When evaluating whether a RAG-based memory pipeline correctly retrieves temporally superseded information (e.g., a budget changed three times across a week)
When designing QA benchmarks for conversational memory that go beyond single-turn retrieval
When debugging why a memory-augmented LLM performs worse than a full-context baseline on certain query types
When implementing user profile extraction from distributed conversation history (communication style, expertise, role inference)
When stress-testing memory systems against multi-hop reasoning across scattered conversation fragments
When choosing between memory architectures (knowledge graphs, persistent memory layers, profile-based stores) for a specific use case

Key Technique

EverMemBench identifies three orthogonal failure modes in conversational memory that existing benchmarks miss. Fine-grained recall tests whether the system can retrieve specific facts, with subcategories for single-hop (direct entity lookup), multi-hop (reasoning across multiple conversation groups or speakers), and temporal (tracking version history of evolving facts). Memory awareness tests whether the system knows what it knows: can it apply implicit constraints from past conversations, proactively surface relevant context for new proposals, and detect when earlier decisions have been superseded? Profile understanding tests whether the system can infer a user's communication style, professional skills, and organizational role from their conversational behavior across interactions.

The critical finding is a hierarchy of difficulty that exposes where architectures break. Single-hop recall is essentially solved (97-99% with oracle evidence), but multi-hop accuracy falls as low as 26% even with perfect retrieval, because reasoning must bridge information scattered across speakers and groups. Temporal reasoning caps at 60% because it requires version semantics -- understanding that "the Q3 budget" mentioned on Wednesday supersedes the one from Monday -- not just timestamp matching. Memory awareness tasks show a 20-70 point gap between full-context and retrieval-augmented systems, which points to retrieval fidelity, not reasoning capability, as the bottleneck. Strong models (Gemini-3-Flash) actually degrade from 72% to 52-62% when memory augmentation is added, because retrieval introduces artifacts and version conflicts that confuse capable reasoners.

The actionable insight: memory architectures must move beyond flat vector stores toward versioned, graph-structured representations that preserve provenance (who said what, when, in which context) and support explicit supersession relationships between facts. Retrieval must bridge the semantic gap between a user's query ("what's the current budget?") and implicitly relevant memories (a side conversation where a constraint change invalidated the previously agreed number).

Step-by-Step Workflow

Characterize the conversation corpus. Count the number of distinct speakers, conversation groups/channels, total token volume, and time span. Identify whether information evolves over time (facts get updated) and whether topics interleave across groups. This determines which EverMemBench dimensions are relevant.
Define the memory schema with provenance fields. Every memory entry must store: content, speaker identity, group/channel, timestamp, topic tags, and a supersession pointer (which earlier memory this updates, if any). Do not flatten conversations into anonymous text chunks -- the metadata is what makes multi-hop and temporal reasoning possible.
Implement tiered memory indexing. Build three retrieval paths: (a) semantic similarity for single-hop recall, (b) graph traversal for multi-hop queries that require connecting facts across speakers/groups, and (c) temporal version chains for queries about the current state of evolving facts. A single embedding-based retriever will fail on (b) and (c).
Generate diagnostic QA pairs across all three dimensions. For fine-grained recall: create single-hop (direct fact lookup), multi-hop (requires joining facts from 2+ conversation fragments), and temporal (requires identifying the latest version of a changed fact) questions. For memory awareness: create constraint-application, proactive-surfacing, and update-detection scenarios. For profile understanding: create style-matching, skill-inference, and role-identification queries.
Run a blind-test filter on generated QA pairs. Present each question to the LLM without any conversation context. If it answers correctly from world knowledge alone, the question is not testing memory -- discard it. This prevents inflated scores from parametric knowledge leakage.
Evaluate with evidence grounding. For each QA pair, verify: (a) the answer is derivable from the ground-truth evidence segment (sufficiency), and (b) the answer is NOT derivable from alternative segments (uniqueness). This ensures questions test retrieval precision, not lucky guessing from noisy context.
Benchmark retrieval isolation. Run the full pipeline, then run the same questions with oracle (ground-truth) evidence substituted for retrieved evidence. The gap between oracle and retrieved performance quantifies the retrieval bottleneck. If oracle performance is also low (as with multi-hop, which can fall to 26%), the problem is reasoning, not retrieval.
Test strong-model degradation. Run the same queries with your strongest available LLM in both full-context mode and memory-augmented mode. If memory augmentation degrades performance, your retrieval is introducing artifacts or version conflicts. This is a common failure in production systems.
Iterate on the weakest dimension. Use the per-dimension scores to prioritize fixes: temporal failures need version chains, multi-hop failures need graph structure, memory awareness failures need better retrieval bridging, profile failures need aggregation across distributed interactions.
Validate with adversarial perturbation. For memory awareness questions, introduce plausible distractors that share surface-level similarity with the correct answer but violate an implicit constraint from an earlier conversation. Systems that rely purely on similarity retrieval will consistently select the distractor.
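
The blind-test filter and the evidence-grounding checks (sufficiency and uniqueness) described above can be sketched as a single gate over generated QA pairs. This is a minimal illustration, not the benchmark's own tooling: `QAPair`, `keep_pair`, and the `answer_fn(question, context)` callable standing in for an LLM call are all hypothetical names, and the substring match in `is_correct` is a placeholder for whatever judge a real harness would use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QAPair:
    question: str
    answer: str
    evidence: str                                          # ground-truth evidence segment
    distractors: list[str] = field(default_factory=list)   # other segments from the corpus


# Hypothetical stand-in for an LLM call: (question, context) -> predicted answer.
AnswerFn = Callable[[str, str], str]


def is_correct(predicted: str, gold: str) -> bool:
    # Naive containment match; assumed placeholder for a stricter judge.
    return gold.strip().lower() in predicted.strip().lower()


def keep_pair(pair: QAPair, answer_fn: AnswerFn) -> bool:
    # Blind test: if the question is answerable with no context,
    # it measures world knowledge rather than memory, so drop it.
    if is_correct(answer_fn(pair.question, ""), pair.answer):
        return False
    # Sufficiency: the ground-truth evidence alone must yield the answer.
    if not is_correct(answer_fn(pair.question, pair.evidence), pair.answer):
        return False
    # Uniqueness: no alternative segment may also yield the answer.
    return not any(
        is_correct(answer_fn(pair.question, segment), pair.answer)
        for segment in pair.distractors
    )
```

Running `keep_pair` over every generated pair before scoring leaves only questions that depend on one specific evidence segment.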

Concrete Examples

Example 1: Evaluating a Slack-bot Memory System

User: "I built a memory layer for our Slack bot that stores conversation summaries in a vector DB. Users are complaining it gives outdated answers about project status. How do I diagnose this?"

Approach:

Identify this as a temporal reasoning failure -- project status evolves over time, and the system retrieves stale summaries.
Generate temporal QA pairs: "What is the current deadline for Project X?" where the deadline was discussed and changed across 3 different messages over 2 weeks.
Run retrieval isolation: check if the vector DB returns the most recent status update or an older one with higher semantic similarity.
Implement version chains: tag each project status memory with a supersession pointer to the previous version, and modify retrieval to always prefer the latest version in a chain.

Output:

Diagnosis Report:
- Dimension: Fine-grained Recall > Temporal
- Root cause: Vector similarity retrieves the most semantically detailed
  status update (the original project plan), not the most recent one
  (a brief "deadline moved to March 15" message).
- Fix: Add a `supersedes` field to memory entries. At retrieval time,
  follow supersession chains to return only the terminal (latest) node.
- Validation: Create 20 temporal QA pairs where facts changed 2-4 times.
  Measure % of queries returning the latest version.
  Target: >90% (current baseline likely <40%).
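
The fix and the validation metric in the report above could look roughly like the sketch below. It assumes each memory has a unique ID and at most one successor (a linear chain). `Memory`, `build_successor_index`, `resolve_latest`, and `latest_version_rate` are hypothetical names, not part of any particular vector DB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    memory_id: str
    content: str
    supersedes: Optional[str] = None   # ID of the earlier memory this one replaces


def build_successor_index(store: dict[str, Memory]) -> dict[str, str]:
    """Map each superseded memory ID to the ID of the memory that replaced it.

    Assumes linear chains: if two memories supersede the same entry,
    the one seen last wins.
    """
    return {m.supersedes: m.memory_id for m in store.values() if m.supersedes}


def resolve_latest(memory_id: str, successors: dict[str, str]) -> str:
    """Follow the chain forward until reaching a memory that nothing supersedes."""
    seen = {memory_id}
    while memory_id in successors:
        memory_id = successors[memory_id]
        if memory_id in seen:          # guard against accidental cycles
            break
        seen.add(memory_id)
    return memory_id


def latest_version_rate(returned_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of temporal queries whose returned memory is the latest version."""
    hits = sum(r == e for r, e in zip(returned_ids, expected_ids))
    return hits / len(expected_ids)
```

Computing `latest_version_rate` once on the raw top hits and once after mapping each hit through `resolve_latest` shows how much of the staleness the chain lookup removes.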

Example 2: Building a Multi-Party Meeting Memory System

User: "I need to build a memory system for an AI assistant that sits in team meetings with 8-12 people. It should remember who said what and answer questions about decisions made across multiple meetings."

Approach:

Design the memory schema with speaker attribution, meeting ID, timestamp, and topic tags.
Implement the three retrieval tiers: semantic (what was discussed), graph (who agreed with whom on what), and temporal (how decisions evolved).
Build diagnostic QA pairs covering all three EverMemBench dimensions.
Pay special attention to multi-hop: "Did the engineering team agree with the timeline that marketing proposed?" requires joining fragments from different speakers across potentially different meetings.

Output:

# Memory entry schema
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MeetingMemory:
    memory_id: str                  # Unique ID; the target of supersedes/references
    content: str                    # What was said
    speaker: str                    # Who said it
    meeting_id: str                 # Which meeting
    timestamp: datetime             # When
    topic_tags: list[str]           # Subject matter
    supersedes: str | None          # ID of earlier memory this updates
    references: list[str]           # IDs of memories this responds to
    decision_type: str | None       # "proposal" | "agreement" | "rejection" | "revision"
# Retrieval strategy per query type (pseudocode strings, not executable calls)
RETRIEVAL_STRATEGY = {
    "single_hop":  "semantic_search(query, top_k=10)",
    "multi_hop":   "graph_traverse(seed=semantic_search(query, top_k=5), hops=2, expand_by=['references', 'speaker'])",
    "temporal":    "version_chain(semantic_search(query, top_k=10), prefer='latest')",
    "constraint":  "semantic_search(query, top_k=10) + graph_traverse(seed=semantic_search(query, top_k=10), filter='decision_type=rejection|revision')",
    "profile":     "aggregate_by_speaker(speaker_id, facets=['style', 'skill', 'role'])",
}

# Diagnostic QA generation template: (subcategory, question, what answering it requires)
diagnostic_pairs = [
    ("single_hop", "What budget did Sarah propose for Q3?", "one direct statement"),
    ("multi_hop", "Did engineering approve the timeline that marketing proposed in the Monday standup?", "joining two speakers' statements"),
    ("temporal", "What is the current agreed-upon launch date?", "latest of 3 revisions across meetings"),
    ("constraint", "Can we schedule a demo on Friday?", "a 'no client demos on Fridays' rule stated 2 weeks ago"),
    ("profile_style", "Draft a message to Tom in his usual communication style.", "aggregating Tom's past messages"),
]
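
The `aggregate_by_speaker` call named in the strategy table is not defined in the source. One hedged reading is a first pass that collects per-speaker evidence, which a downstream model could then turn into style, skill, and role inferences. The sketch below assumes memories are plain dicts carrying the schema's `speaker`, `meeting_id`, `content`, and `topic_tags` fields. The statistics it computes are illustrative choices, not part of EverMemBench.

```python
from collections import Counter, defaultdict


def aggregate_by_speaker(memories: list[dict]) -> dict[str, dict]:
    """Group memories by speaker and summarize raw evidence for profiling."""
    grouped = defaultdict(list)
    for memory in memories:
        grouped[memory["speaker"]].append(memory)

    profiles = {}
    for speaker, items in grouped.items():
        word_counts = [len(m["content"].split()) for m in items]
        tag_counts = Counter(tag for m in items for tag in m["topic_tags"])
        profiles[speaker] = {
            "messages": len(items),
            # Evidence spread across meetings matters for profile reliability.
            "meetings": len({m["meeting_id"] for m in items}),
            "mean_words": sum(word_counts) / len(word_counts),
            "top_topics": [tag for tag, _ in tag_counts.most_common(3)],
        }
    return profiles
```

A low `messages` or `meetings` count is a cue to treat any inferred style for that speaker as unreliable.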

Example 3: Auditing Retrieval Quality for an Existing Memory Pipeline

User: "Our conversational AI uses Mem0 for memory. It scores well on simple recall but fails on complex questions. How do I figure out what's breaking?"

Approach:

Generate QA pairs across all three EverMemBench dimensions from actual conversation data.
Run the retrieval isolation test: compare Mem0-augmented answers vs. oracle-evidence answers.
Check for strong-model degradation: test if a capable base model scores higher without Mem0 than with it.
Map failures to specific subcategories to identify the architectural gap.

Output:

Audit Results (100 QA pairs across 6 subcategories):

Subcategory                | Mem0-augmented | Oracle evidence | Gap (pts)
---------------------------|----------------|-----------------|----------
Single-hop recall          |     82%        |      98%        |  16
Multi-hop recall           |      8%        |      26%        |  18
Temporal recall            |     15%        |      54%        |  39  <-- large
Constraint awareness       |     45%        |      96%        |  51  <-- large
Proactive awareness        |     38%        |     100%        |  62  <-- largest
Profile understanding      |     29%        |      61%        |  32

Key findings:
1. Temporal gap (39 points) indicates Mem0 retrieves stale versions. Need version chains.
2. Memory awareness gap (51-62 points) indicates similarity search misses implicitly
   relevant memories. Need constraint-aware retrieval or graph expansion.
3. Multi-hop oracle is only 26%, meaning even with perfect retrieval,
   the base model struggles. Consider chain-of-thought prompting with
   explicit evidence citation to improve multi-hop reasoning.
4. Strong-model degradation detected: base Gemini scores 72% full-context
   but only 58% with Mem0, confirming retrieval artifacts harm capable models.
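
The gap column and the retrieval-versus-reasoning reading of it can be computed mechanically. The sketch below reuses the example numbers from the audit table. The function name `diagnose` and both thresholds (`gap_threshold`, `oracle_floor`) are hypothetical and should be tuned per project.

```python
def diagnose(augmented: dict[str, float], oracle: dict[str, float],
             gap_threshold: float = 20.0, oracle_floor: float = 60.0) -> dict[str, dict]:
    """Attribute each subcategory's failure to retrieval, reasoning, or both.

    A large oracle-minus-augmented gap implicates retrieval; a low oracle
    score implicates reasoning, since perfect evidence did not help.
    Both thresholds are assumed values, in percentage points.
    """
    report = {}
    for name, oracle_acc in oracle.items():
        gap = oracle_acc - augmented[name]
        causes = []
        if gap >= gap_threshold:
            causes.append("retrieval")
        if oracle_acc < oracle_floor:
            causes.append("reasoning")
        report[name] = {"gap_pts": gap, "bottleneck": causes or ["none"]}
    return report


# Example numbers taken from the audit table above (accuracy in percent).
augmented = {"Single-hop recall": 82, "Multi-hop recall": 8, "Temporal recall": 15,
             "Constraint awareness": 45, "Proactive awareness": 38,
             "Profile understanding": 29}
oracle = {"Single-hop recall": 98, "Multi-hop recall": 26, "Temporal recall": 54,
          "Constraint awareness": 96, "Proactive awareness": 100,
          "Profile understanding": 61}

report = diagnose(augmented, oracle)
# report["Multi-hop recall"]["bottleneck"] -> ["reasoning"]
# report["Temporal recall"]["bottleneck"]  -> ["retrieval", "reasoning"]
```

With these assumed thresholds, multi-hop is flagged as a reasoning problem only, which matches finding 3: better retrieval alone would not lift it far.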

Best Practices

Do: Store full provenance (speaker, group, timestamp, supersession links) with every memory entry. Without this metadata, multi-hop and temporal queries are structurally impossible to answer correctly.
Do: Test retrieval in isolation by comparing memory-augmented vs. oracle-evidence performance. This separates retrieval failures from reasoning failures and prevents wasting effort optimizing the wrong component.
Do: Include a blind-test filter when generating evaluation QA pairs. Questions answerable from world knowledge alone inflate scores and mask real memory failures.
Do: Test with your strongest model in both full-context and memory-augmented modes. If augmentation hurts the strong model, your retrieval is introducing noise that outweighs the benefit of scalability.
Avoid: Relying solely on embedding similarity for retrieval. It works for single-hop but fails systematically on temporal (prefers detailed old facts over brief updates), multi-hop (can't bridge across speakers/groups), and constraint queries (can't connect "no Friday demos" to "schedule a demo Friday").
Avoid: Treating memory as a flat key-value or chunk store. Real conversations create versioned, graph-structured information where the relationships between memories (supersession, agreement, disagreement, elaboration) carry as much signal as the content itself.

Error Handling

Retrieval returns stale information: When temporal queries consistently return outdated facts, check whether your memory store has supersession pointers. If not, implement version chains and modify retrieval to follow chains to the terminal node. As a stopgap, boost recency weighting in your similarity score.
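
The recency stopgap can be sketched as a blend of similarity and an exponentially decaying freshness term. The half-life form, the default of 7 days, and the 50/50 weighting are assumptions for illustration; the source only says to boost recency weighting.

```python
from datetime import datetime


def recency_weighted_score(similarity: float, timestamp: datetime, now: datetime,
                           half_life_days: float = 7.0,
                           recency_weight: float = 0.5) -> float:
    """Blend semantic similarity with a freshness term in [0, 1].

    Both datetimes must be either naive or timezone-aware, not mixed.
    """
    age_days = max((now - timestamp).total_seconds() / 86400.0, 0.0)
    # 1.0 for a brand-new memory, 0.5 after one half-life, 0.25 after two.
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - recency_weight) * similarity + recency_weight * recency
```

With these defaults, a brief update from today at similarity 0.7 outranks a detailed plan from two weeks ago at similarity 0.9. Recency weighting can still bury old facts that remain valid, which is why version chains are the more durable fix.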

Multi-hop accuracy near zero: If multi-hop scores are below 10%, the retrieval is likely returning fragments from only one side of the reasoning chain. Implement two-stage retrieval: first retrieve seed documents, then expand the retrieval set by following reference and speaker links to related memories before passing to the LLM.
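
The second stage of that two-stage retrieval might look like the breadth-first expansion below. It assumes memories are dicts with `references`, `speaker`, and `topic_tags`, and that the seed IDs come from a first-stage semantic search over the same store. Restricting speaker links to memories that share a topic tag is an added assumption here, chosen to keep the expanded set small.

```python
from collections import deque


def linked_ids(memory_id: str, store: dict[str, dict]) -> set[str]:
    """IDs one link away from a memory.

    Links are reference edges in either direction, plus same-speaker
    memories sharing at least one topic tag (an assumed restriction).
    """
    me = store[memory_id]
    linked = {ref for ref in me["references"] if ref in store}
    for other_id, other in store.items():
        if other_id == memory_id:
            continue
        if memory_id in other["references"]:
            linked.add(other_id)
        elif (other["speaker"] == me["speaker"]
              and set(other["topic_tags"]) & set(me["topic_tags"])):
            linked.add(other_id)
    return linked


def expand_seeds(seed_ids: list[str], store: dict[str, dict], max_hops: int = 2) -> set[str]:
    """Breadth-first expansion of the seed set, up to max_hops links away."""
    visited = set(seed_ids)
    frontier = deque((seed, 0) for seed in seed_ids)
    while frontier:
        memory_id, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in linked_ids(memory_id, store) - visited:
            visited.add(neighbor)
            frontier.append((neighbor, depth + 1))
    return visited
```

The linear scan in `linked_ids` is fine for a sketch; a real store would precompute adjacency lists.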

Strong model degrades with memory augmentation: This means retrieved context contains contradictory or outdated information that confuses the model. Filter retrieved memories for version conflicts before injecting them as context. When two memories contradict each other, keep only the more recent one or explicitly mark the conflict for the model.
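
One way to apply that filter, sketched under the assumption that each retrieved memory carries a `fact_key` identifying which fact it states. That field is hypothetical; in a store with supersession pointers, the chain itself would play this role. The function keeps the newest version per fact and can optionally list older versions as explicitly superseded.

```python
from datetime import datetime


def dedupe_versions(retrieved: list[dict], mark_conflicts: bool = False) -> list[str]:
    """Collapse conflicting versions of the same fact into context lines.

    Each memory is a dict with 'fact_key', 'timestamp' (datetime), 'content'.
    """
    by_key: dict[str, list[dict]] = {}
    for memory in retrieved:
        by_key.setdefault(memory["fact_key"], []).append(memory)

    lines = []
    for versions in by_key.values():
        versions.sort(key=lambda m: m["timestamp"])
        latest = versions[-1]
        lines.append(f"[{latest['timestamp']:%Y-%m-%d}] {latest['content']}")
        if mark_conflicts:
            # Show older versions, labeled so the model does not treat them as current.
            for old in versions[:-1]:
                lines.append(f"  (superseded {old['timestamp']:%Y-%m-%d}: {old['content']})")
    return lines
```

Dropping old versions gives the cleanest context; marking them is useful when the question is about how a decision changed.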

Memory awareness scores far below oracle: The semantic gap between queries and implicitly relevant memories is too wide for similarity search. Consider augmenting queries with LLM-generated hypothetical relevant memories (HyDE-style) or maintaining explicit constraint indexes that map action types to relevant restrictions.
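
An explicit constraint index can be as simple as a map from action types to the restrictions recorded against them. In the sketch below, `ConstraintIndex` and `classify_actions` are hypothetical names, and the keyword tagger is a toy stand-in for the LLM or classifier that would assign action types in practice.

```python
from collections import defaultdict


class ConstraintIndex:
    """Maps an action type to the constraints recorded against it."""

    def __init__(self) -> None:
        self._by_action: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add(self, action_type: str, constraint: str, source_id: str) -> None:
        # source_id keeps provenance: which memory stated the constraint.
        self._by_action[action_type].append((constraint, source_id))

    def lookup(self, action_types: list[str]) -> list[tuple[str, str]]:
        return [hit for action in action_types
                for hit in self._by_action.get(action, [])]


def classify_actions(query: str, keywords: dict[str, list[str]]) -> list[str]:
    """Toy keyword tagger; assumed placeholder for a learned classifier."""
    lowered = query.lower()
    return [action for action, words in keywords.items()
            if any(word in lowered for word in words)]
```

For the Friday demo example, tagging the query as a demo-scheduling action and looking it up returns the "no client demos on Fridays" restriction even though the two texts share little surface similarity.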

Limitations

The EverMemBench framework is validated on synthetic workplace conversations generated by Gemini-2.5-Pro. Real human conversations contain more disfluencies, implicit references, and domain-specific jargon that may shift the difficulty distribution.
The benchmark focuses on workplace collaboration (projects, departments, hierarchies). Social, educational, or personal assistant contexts may have different memory access patterns.
Multi-hop oracle performance caps at 26-88% depending on the model, meaning some failure is in reasoning, not memory. No memory architecture alone can solve this -- it requires advances in multi-hop reasoning over scattered evidence.
Profile understanding (especially communication style) requires aggregation across many interactions. For users with limited conversation history, style inference will be unreliable regardless of architecture.
The 1M+ token scale means full-context baselines are only feasible with the largest context window models. For smaller models, memory augmentation is necessary even if it introduces retrieval noise.

Reference

EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models -- Hu et al., 2026. Look for: the three-dimensional evaluation taxonomy (Section 3), the retrieval isolation methodology comparing oracle vs. augmented performance (Section 5), and the six key findings on where current memory architectures fail (Section 5.1-5.6).