name: rfdb-stale-node-cleanup
description: |
Recover from stale RFDB nodes that survive commit_batch cleanup despite
having their file field listed in changed_files. Use when:
(1) re-running an enricher/orchestrator step but old data with the same file
field stays in the graph, (2) commit_batch(changed_files=[X], ...) reports
success but a query later still returns nodes with file = X, (3) you
re-pointed structural nodes (e.g., DIRECTORY/FILE) from a synthetic file
path to real file paths and the old synthetic-path nodes won't go away,
(4) graph-stream / find_by_type returns a mix of "stale" and "new" nodes
for the same logical entity. Root cause: handle_commit_batch calls
engine.find_by_attr({file: X}) to enumerate nodes-to-delete, but that
query path returns only a PARTIAL set after compaction (L1 segments and
the file→nodes index can desync). Workaround: enumerate stale nodes via
the TS rfdb client queryNodes({file: X}) (different code path that finds
all of them) and call deleteNode(id) per node, then re-commit the
intended state. Related: rfdb-manifest-l1-carryforward (same L1 family).
author: Claude Code
version: 1.0.0
date: 2026-04-07
RFDB Stale Node Cleanup After Compaction
Problem
commit_batch(changed_files=[X], nodes=[...], ...) is supposed to atomically
replace all nodes with file = X by tombstoning the old set and inserting
the new one. After compaction (RFD-20 L1 segments), this is unreliable: only
a partial subset of the old nodes gets tombstoned and the rest survive
indefinitely, contaminating queries.
This is especially visible when:
- An orchestrator step is rewritten to emit nodes with new file fields (e.g., switching DIRECTORY/FILE structural nodes from a synthetic __grafema_virtual/... path to real directory paths). The new commit uses upsert by semantic_id, but the OLD nodes at the synthetic path are not all tombstoned.
- Multiple commit_batch invocations target the same legacy file path: each invocation tombstones a few more nodes but never gets all of them.
Context / Trigger Conditions
- After running a commit pipeline, find_by_type(NODE_TYPE) returns more rows than the latest commit produced.
- Graph-stream output mixes nodes with the new file value and a "stale" group with the old file value, both for entities you considered rewritten.
- changed_files contained the legacy path, and the orchestrator log says commitBatch succeeded with files=N matching expectation.
- Splitting nodes by file shows orphans only at the legacy/synthetic path while the rest of the graph looks correct.
- Re-running the same commit_batch reduces the orphan count but never clears it to zero.
- manifest_index.json has a recent compaction snapshot (l1_node_segments populated) older than the failing commits.
Diagnostic Path
- Get a per-file count of the affected node type:
  curl -s "http://localhost:3333/api/graph-stream?nodeTypes=TYPE&maxNodes=5000" \
    | python3 -c "
  import json, sys
  counts = {}
  for line in sys.stdin:
      try:
          d = json.loads(line)
          if d.get('type') == 'node':
              counts[d.get('file','')] = counts.get(d.get('file',''), 0) + 1
      except ValueError:
          pass  # skip lines that are not valid JSON
  for k, v in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
      print(v, k)
  "
  (The Python body is indented here for display only; remove the leading two spaces before running it.)
- Look for entries clustered under the legacy/synthetic file path that should have been replaced.
- Confirm via /api/node/<id> that one such node has the legacy file and the metadata you remember from the old code path.
- Confirm a corresponding "new" node exists (with the same name but new file). That proves the rewrite ran but cleanup didn't.
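The last diagnostic step (a "new" node existing alongside a stale one) can be checked in bulk rather than node by node. The sketch below is a minimal illustration, not part of the RFDB tooling: it assumes each graph-stream line is a JSON object with `type`, `file`, and `name` fields (`name` is an assumption based on the "same name" wording above), and `find_split_entities` is a hypothetical helper name.

```python
import json
from collections import defaultdict


def find_split_entities(ndjson_lines, legacy_prefix):
    """Return {name: sorted files} for names seen under both legacy and new files.

    Assumes node records look like {"type": "node", "name": ..., "file": ...};
    the "name" field is an assumption, adjust it to your node schema.
    """
    files_by_name = defaultdict(set)
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(record, dict) or record.get("type") != "node":
            continue
        files_by_name[record.get("name", "")].add(record.get("file", ""))

    split = {}
    for name, files in files_by_name.items():
        stale = {f for f in files if f.startswith(legacy_prefix)}
        # Report only names that have at least one stale AND one non-stale file.
        if stale and files - stale:
            split[name] = sorted(files)
    return split
```

Feeding it the saved graph-stream output (for example `find_split_entities(open('stream.ndjson'), '__grafema_virtual')`) lists every entity whose rewrite ran but whose old copy survived.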
Solution
Direct cleanup via TS rfdb client (most reliable)
The TS client's queryNodes({file: X}) uses a different storage code path
than handle_commit_batch's find_by_attr and reliably enumerates all
matching nodes, including those promoted to L1 segments.
// /tmp/cleanup-stale.mjs
import { RFDBClient } from '/path/to/grafema/packages/rfdb/dist/client.js';

const client = new RFDBClient(
  '/path/to/.grafema/rfdb.sock', // socket path is positional arg, not options object
  'cleanup',
);
await client.connect();

// Enumerate every node still attached to the legacy file path.
const stale = [];
for await (const n of client.queryNodes({ file: '__grafema_virtual/legacy-path' })) {
  stale.push(n.id);
}
console.log('Found stale nodes:', stale.length);

// Delete one node at a time so a single failure does not abort the run.
for (const id of stale) {
  try { await client.deleteNode(id); }
  catch (e) { console.error('delete failed', id, e.message); }
}

// Flush so later queries observe the tombstones.
await client._send('flush', {}).catch(e => console.error('flush:', e.message));
await client.close();
Then re-run the commit step that produces the desired state. The graph should then contain only the intended nodes; confirm this with the checks under Verification.
Caveats
- queryNodes({file: X}) may match more than literal equality. It can return any node where file matches the pattern under the storage layer's filter rules (substring or exact, depending on backend version). Use the most specific path you can to avoid wiping live data.
- Verify the count before deleting. If you expect ~300 stale and the query returns the entire graph, abort. Print the first few IDs and double-check their type/file via /api/node/<id>.
- After deleting, re-commit the intended state immediately. The graph is in a partially-empty state until you do.
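The first two caveats can be turned into a pre-flight check that runs on graph-stream output before the cleanup script touches anything. This is a sketch under stated assumptions: node records are dicts with `id` and `file` keys (the `id` key is assumed from the `/api/node/<id>` endpoint), and `plan_deletion` and `expected_max` are hypothetical names introduced here for illustration.

```python
def plan_deletion(nodes, legacy_file, expected_max):
    """Return IDs whose file equals legacy_file exactly; refuse oversized plans.

    nodes: iterable of dicts with "id" and "file" keys (assumed record shape).
    expected_max: upper bound on how many stale nodes you believe exist.
    """
    # Exact equality only, so a substring-matching backend cannot widen the set.
    exact = [n["id"] for n in nodes if n.get("file") == legacy_file]
    if len(exact) > expected_max:
        raise RuntimeError(
            f"refusing to plan {len(exact)} deletions; expected at most {expected_max}"
        )
    return exact
```

If the number of IDs returned here differs sharply from the "Found stale nodes" count printed by the cleanup script, the `queryNodes` filter is probably matching more than the literal path, and the deletion should be aborted.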
Why not just keep calling commit_batch?
handle_commit_batch (rfdb_server.rs) does:
for file in &changed_files {
    let old_ids = engine.find_by_attr(&AttrQuery { file: Some(file.clone()), .. });
    for id in &old_ids { engine.delete_node(*id); }
}
find_by_attr for a file filter consults the in-memory file_to_node_ids
index, which is rebuilt from segments at startup. After compaction, some
nodes for a given file path are in L1 with index entries that can be
incomplete or stale relative to what's still on disk. Subsequent
find_by_attr calls return only the index-known IDs, so other historical
copies of the same node are never enumerated for tombstoning.
The TS client's queryNodes routes through a different store method that
walks segments directly instead of trusting the file index, so it sees
the missed entries.
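The difference between the two lookup paths can be illustrated with a toy model. This is not RFDB's real data layout: `ToyStore`, its `segments` list, and its `file_index` dict are hypothetical stand-ins, and the desync is simulated by hand-writing an index that is missing entries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy stand-in for the storage engine; not RFDB's actual structures."""
    segments: list = field(default_factory=list)    # each segment: {node_id: file}
    file_index: dict = field(default_factory=dict)  # file -> set of node ids

    def find_by_index(self, file):
        # Index-trusting path: returns only the ids the index knows about.
        return set(self.file_index.get(file, set()))

    def find_by_scan(self, file):
        # Segment-walking path: inspects every stored node in every segment.
        return {nid for seg in self.segments for nid, f in seg.items() if f == file}


def demo():
    legacy = "__grafema_virtual/legacy-path"
    store = ToyStore(
        segments=[{1: legacy, 2: legacy}, {3: legacy, 4: "src/real.ts"}],
        # Simulated desync: the index has lost track of nodes 2 and 3.
        file_index={legacy: {1}, "src/real.ts": {4}},
    )
    via_index = store.find_by_index(legacy)
    via_scan = store.find_by_scan(legacy)
    # The last element is the set of nodes that would survive index-based deletion.
    return via_index, via_scan, via_scan - via_index
```

In this model a delete loop driven by `find_by_index` removes node 1 only, while nodes 2 and 3 stay in the segments, which matches the "partial tombstoning" symptom described above.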
Verification
After cleanup + recommit:
# Total count for the type matches what you committed
curl -s "http://localhost:3333/api/graph-stream?nodeTypes=DIRECTORY,FILE&maxNodes=2000" \
| tail -2
# → {"edgeCount":721,"elapsed":...,"nodeCount":722,"type":"done"}
# No nodes left at the legacy path
curl -s "http://localhost:3333/api/graph-stream?nodeTypes=DIRECTORY,FILE&maxNodes=2000" \
  | python3 -c "
import json, sys
n = sum(1 for line in sys.stdin
        for d in [json.loads(line)]
        if d.get('type') == 'node'
        and d.get('file', '').startswith('__grafema_virtual'))
print('stale survivors:', n)
"
# → stale survivors: 0
Notes
- This is a workaround, not a fix. The root fix is to make find_by_attr({file: X}) consistent with queryNodes({file: X}) after compaction: both should walk every storage layer (write_buffer + L0 + L1) and respect tombstones.
- After running this workaround, watch for the same symptom in future analysis runs; until the root fix lands, every step that re-points a node's file field will need this cleanup.
- This bug is in the same family as rfdb-manifest-l1-carryforward (compaction interactions). Both surface as "data exists on disk but isn't queryable correctly."
- Don't try delete_nodes via REST/HTTP; only the unix-socket protocol exposes deleteNode. Use the TS client.
- Always flush after delete via client._send('flush', {}) so subsequent queries see the tombstones.
Related Files
- packages/rfdb-server/src/bin/rfdb_server.rs:1996 (handle_commit_batch, the buggy delete path)
- packages/rfdb-server/src/bin/rfdb_server.rs:1242 (Request::DeleteNode)
- packages/rfdb/dist/base-client.js:86 (deleteNode TS client API)
- packages/rfdb/dist/base-client.js:254 (queryNodes TS client API)
- .claude/skills/rfdb-manifest-l1-carryforward/SKILL.md (sibling bug)