name: gitnexus-taint-analysis
description: "Use when working on, reviewing, or extending GitNexus's CFG/taint/PDG subsystem (the --pdg layers), or when reasoning about source→sink data-flow findings. Examples: \"How does taint analysis work here?\", \"Why didn't explain find this flow?\", \"Add a new sink/source\", \"Review the interprocedural taint code\"."
CFG & Taint Analysis with GitNexus
Expert knowledge for the opt-in --pdg program-analysis subsystem: control-flow
graphs, reaching definitions, and intra- + inter-procedural taint. Read this
before touching gitnexus/src/core/ingestion/cfg/** or
gitnexus/src/core/ingestion/taint/**, or when explaining a finding.
When to Use
- "How does the taint engine work / why is this flow (not) reported?"
- Adding a source, sink, or sanitizer to the model.
- Extending or reviewing the CFG / reaching-defs / taint / summary code.
- Understanding the explain MCP tool's findings (intra- vs inter-procedural).
- Debugging a false positive or false negative in --pdg output.
The layered substrate (build order)
Taint runs on the graph, not beside it. Each layer is opt-in behind --pdg
and a default analyze run is byte-identical (the golden parity gate is the
hard floor for every change here).
L1 CFG per-function basic blocks + control-flow edges (M1 #2081)
L2 REACHING_DEF GEN/KILL def→use data dependence (pure solver) (M2 #2082)
L3 Taint (intra) source→sink over RD facts, minus sanitizers (M3 #2083)
L4 Taint (inter) per-function summaries composed over CALLS (M4 #2084)
- Worker-built, main-thread-solved. The parse worker builds each function's CFG + harvests def/use + call-site facts onto ParsedFile.cfgSideChannel (plain, structured-clone-safe data, never AST nodes). The main thread runs the pure solvers. NEVER re-parse on the main thread (re-introduces the #1983 OOM).
- In-phase emit (KTD1). L1–L4-harvest all run INSIDE the scope-resolution pdg window (scope-resolution/pipeline/run.ts, gated input.pdg === true), because the disk-backed ParsedFile store is cleared when that phase ends; a standalone post-mro phase would read empty data. The cross-function fixpoint (L4) is the exception: it runs in its OWN registered phase (taintSummaries) AFTER scope-resolution, because it needs the COMPLETE call graph, and consumes small plain summary data threaded out via ScopeResolutionOutput.
- Pure-solver contract. computeReachingDefs, computeTaintFlows, harvestFunctionSummary, and solveInterprocTaint are pure and deterministic (no graph, no I/O, no logger; sorted outputs). Snapshot tests and content-derived edge ids depend on it.
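To make the GEN/KILL layer and the pure-solver contract concrete, here is a minimal sketch of a reaching-definitions worklist solver. It is not the GitNexus implementation: the Block and Def shapes and the function name reachingDefsAtEntry are hypothetical, chosen only for illustration. The point it shows is the contract itself: plain data in, sorted plain data out, no I/O.

```typescript
// Hypothetical shapes for illustration; these are not the GitNexus types.
interface Def { variable: string; defId: string }
interface Block { id: number; defs: Def[]; successors: number[] }

// Pure function: no I/O, no shared state, sorted output.
// Returns, per block id, the sorted ids of definitions reaching block entry.
function reachingDefsAtEntry(blocks: Block[]): Map<number, string[]> {
  const byId = new Map<number, Block>();
  const preds = new Map<number, number[]>();
  const variableOf = new Map<string, string>();
  const outSets = new Map<number, Set<string>>();
  const inSets = new Map<number, Set<string>>();
  for (const b of blocks) {
    byId.set(b.id, b);
    preds.set(b.id, []);
    outSets.set(b.id, new Set<string>());
    inSets.set(b.id, new Set<string>());
    for (const d of b.defs) variableOf.set(d.defId, d.variable);
  }
  for (const b of blocks) {
    for (const s of b.successors) preds.get(s)?.push(b.id);
  }

  const worklist = blocks.map((b) => b.id).sort((a, b) => a - b);
  while (worklist.length > 0) {
    const id = worklist.shift() as number;
    const block = byId.get(id) as Block;

    // IN[b] = union of OUT[p] over all predecessors p.
    const inSet = new Set<string>();
    for (const p of preds.get(id) ?? []) {
      outSets.get(p)?.forEach((d) => inSet.add(d));
    }
    inSets.set(id, inSet);

    // OUT[b] = GEN[b] ∪ (IN[b] − KILL[b]), applied one definition at a time.
    const out = new Set<string>(Array.from(inSet));
    for (const def of block.defs) {
      Array.from(out).forEach((d) => {
        if (variableOf.get(d) === def.variable) out.delete(d); // KILL
      });
      out.add(def.defId); // GEN
    }

    const prev = outSets.get(id) as Set<string>;
    const changed =
      out.size !== prev.size || Array.from(out).some((d) => !prev.has(d));
    if (changed) {
      outSets.set(id, out);
      // Re-queue successors, since their IN set may have grown or shrunk.
      for (const s of block.successors) {
        if (byId.has(s) && worklist.indexOf(s) === -1) worklist.push(s);
      }
    }
  }

  // Sort keys and values so identical inputs always yield identical output.
  const result = new Map<number, string[]>();
  Array.from(inSets.keys()).sort((a, b) => a - b).forEach((id) => {
    result.set(id, Array.from(inSets.get(id) as Set<string>).sort());
  });
  return result;
}
```

For a diamond where block 0 defines x, block 1 redefines x, and both block 1 and an empty block 2 flow into block 3, both definitions reach the entry of block 3, while only the first reaches block 1.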
Intra-procedural taint (L3)
Forward reachability over RD facts from matched sources to matched sinks, killed by sanitizers. Key design points worth internalizing:
- Occurrence-tagged sites. A flat per-arg binding set cannot tell exec(escape(x)) (safe) from exec(x) (finding); the harvest records nested call structure (SiteRecord.parent / via-tags) so sanitizer interposition is precise.
- Kind-set sanitizer model. A taint carries a set of neutralized SinkKinds; a sink fires unless its kind is in the set. So escape(req.body) suppresses res.send (xss) but STILL fires db.query (sql); a kind-blind kill would be a suppressed live injection (the forbidden FN direction). path.basename(t) neutralizes path-traversal only, not command-injection.
- Statement-level finding identity. NOT block-pair (block conflation drops distinct findings; exec(req.body, req.query) is two findings).
- Persisted as TAINTED edges (BasicBlock→BasicBlock); the path rides the reason column via the shared versioned codec (taint/path-codec.ts).
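The kind-set sanitizer model can be sketched as follows. This is an illustration under assumptions, not the engine's code: the SinkKind string values, the SANITIZERS and SINKS tables, and the helper names applySanitizer and sinkFires are all hypothetical; only the behavior (a sanitizer adds kinds to a neutralized set, and a sink fires unless its kind is in that set) comes from the description above.

```typescript
// Hypothetical kind names and tables, mirroring the prose examples only.
type SinkKind = "xss" | "sql" | "command" | "path-traversal";

interface Taint {
  source: string;
  neutralized: ReadonlySet<SinkKind>; // kinds this value is already safe for
}

const SANITIZERS: Record<string, SinkKind[]> = {
  escape: ["xss"],
  "path.basename": ["path-traversal"],
};

const SINKS: Record<string, SinkKind> = {
  "res.send": "xss",
  "db.query": "sql",
  exec: "command",
};

// A sanitizer never kills the taint outright; it only widens the neutralized set.
function applySanitizer(taint: Taint, sanitizer: string): Taint {
  const merged = new Set<SinkKind>();
  taint.neutralized.forEach((kind) => merged.add(kind));
  (SANITIZERS[sanitizer] ?? []).forEach((kind) => merged.add(kind));
  return { source: taint.source, neutralized: merged };
}

// A sink fires unless its own kind has been neutralized on this taint.
function sinkFires(taint: Taint, sink: string): boolean {
  const kind = SINKS[sink];
  return kind !== undefined && !taint.neutralized.has(kind);
}

const raw: Taint = { source: "req.body", neutralized: new Set<SinkKind>() };
const escaped = applySanitizer(raw, "escape");
const basenamed = applySanitizer(raw, "path.basename");
```

With these tables, escaped no longer fires res.send but still fires db.query, and basenamed still fires exec, which matches the two examples in the list above.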
Interprocedural taint (L4) — the functional/summary method
The production approach (Sharir-Pnueli 1981; the same shape as Meta's Pysa and
Mariana Trench, and FB Infer) — NOT full IFDS tabulation. Each function is
reduced to a compact summary, and summaries are composed over the already-
resolved CALLS graph.
Summary shape (taint/summary-model.ts, whole-parameter granularity):
| Edge | Meaning | Analogue |
|---|---|---|
| param→return | a param flows to the return value | TITO; reserved (the floor already covers its recall; precision pass deferred) |
| param→callee-arg | a param flows into arg j of a call (carries the path's neutralized sink kinds) | TITO into callee |
| param→sink | a param reaches a modelled sink | partial/triggered sink |
| source→return | the function generates+returns a source | generative; composed via the caller's callResults |
| source→callee-arg | a generated source flows into a call | fixpoint SEED |
| callResults | a user-function call's result flows to a sink/return/callee-arg in the caller | composes with callee source→return |
The fixpoint (taint/interproc-solver.ts): the unit is (function, parameter, source). Seed from source→callee-arg, propagate via
param→callee-arg, fire a finding when a tainted param meets param→sink.
- Cycle-safe by monotonicity. The tainted-set is monotone over a finite lattice (fn × param × source), so the worklist converges: a recursive call just re-proposes an already-visited entry. SCC condensation would only refine processing order; correctness/termination don't require it.
- Source-discriminated state (load-bearing). Key the state by the SOURCE too. Keying only by (fn, param) collapses multi-source flows: a sink param tainted by source A is marked visited and a later flow from source B is dropped before firing, which is the recurring multi-source bug class. (Bit M3; bit M4 U9.)
- Name-based call join. Match a summary's call-arg edge to a CALLS edge by CALLEE NAME, not call-site line: line-base parity (CFG 1-based vs reference site) is fragile; the callee identity is exact and context-insensitivity taints the callee's param identically at every call site.
- Persisted as TAINT_PATH edges (Function→Function), function-level hop chain in reason via the same codec; confidence < the intra-procedural 1.0.
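The fixpoint described above can be sketched as a small worklist over (function, parameter, source) states. This is a simplified illustration, not taint/interproc-solver.ts: the FunctionSummary shape, field names, and solveTaint are hypothetical, and the sketch omits neutralized sink kinds, hop chains, and callResults. It exists to show why keying the visited set by source matters and why recursion terminates.

```typescript
// Hypothetical summary shape; the real one lives in taint/summary-model.ts.
interface CallArgEdge { callee: string; arg: number }
interface FunctionSummary {
  sourceToCalleeArg: (CallArgEdge & { source: string })[]; // fixpoint seeds
  paramToCalleeArg: (CallArgEdge & { param: number })[];   // propagation
  paramToSink: { param: number; sink: string }[];          // finding triggers
}
interface State { fn: string; param: number; source: string }

function solveTaint(summaries: Map<string, FunctionSummary>): string[] {
  const visited = new Set<string>();
  const findings = new Set<string>();
  const worklist: State[] = [];

  const propose = (fn: string, param: number, source: string): void => {
    if (!summaries.has(fn)) return; // unmodelled callee: nothing to compose
    // The key includes the source, so a second source is never masked by the first.
    const key = `${fn}|${param}|${source}`;
    if (visited.has(key)) return; // monotone: each state is enqueued at most once
    visited.add(key);
    worklist.push({ fn, param, source });
  };

  // Seed from source→callee-arg edges, joined to callees by NAME.
  const names = Array.from(summaries.keys()).sort();
  for (const fn of names) {
    const summary = summaries.get(fn) as FunctionSummary;
    for (const e of summary.sourceToCalleeArg) propose(e.callee, e.arg, e.source);
  }

  while (worklist.length > 0) {
    const state = worklist.pop() as State;
    const summary = summaries.get(state.fn) as FunctionSummary;
    for (const e of summary.paramToSink) {
      if (e.param === state.param) {
        findings.add(`${state.source} -> ${state.fn} -> ${e.sink}`);
      }
    }
    for (const e of summary.paramToCalleeArg) {
      if (e.param === state.param) propose(e.callee, e.arg, state.source);
    }
  }
  return Array.from(findings).sort(); // sorted for deterministic output
}
```

If the key were only `${fn}|${param}`, the second source to reach a shared helper would be dropped as already visited and its finding would never fire. A recursive edge only re-proposes a visited state, so the loop still terminates.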
Context-insensitivity is the accepted trade-off at this tier: one summary per function, return/call-site merging accepted (security-conservative). Expect some FP from merging; the bigger FN sources are unmodeled features (below).
Known false-negative classes (documented, deferred)
The largest is closures/callbacks (arr.forEach(() => sink(y))) — taint
into a callback is dropped without per-library models (true of CodeQL's JS libs
too). Also deferred: field/property flows (obj.x = taint; sink(obj.y)),
field-sensitive access paths, guard-style sanitizers, implicit/control-dependence
flows, promise/async-await threading, and destructured/rest params before a
tainted simple param (the summary port index is the binding ordinal, not the
formal arg position — needs a formal-param index threaded from the worker
BindingEntry). The interprocedural join is also context-insensitive: when one
caller invokes two distinct same-named callees, a flow into one
over-attributes to both (sound — over-report, never a missed flow). Absence of a
finding is NOT proof of safety.
GitNexus-specific gotchas
- Function↔CFG join. FunctionCfg.functionStartLine is 1-based; Function/Method node startLine is 0-based, so join at startLine - 1. Function nodes have no column, so same-line functions ({a:()=>x(), b:()=>y()}) are ambiguous → drop (the summary driver counts unresolved) rather than cross-wire.
- No rel-property index (S1). Kuzu has no secondary index on relationship properties, and unanchored [:TAINTED*]/[:TAINT_PATH*] queries explode. TAINT_PATH is therefore MATERIALIZED + anchored at analyze time, never traversed live; explain reads it source-anchored + LIMIT-guarded.
- explain is the only discovery surface. TAINTED/TAINT_PATH are deliberately OUT of VALID_RELATION_TYPES (impact's allow-list) and the web schema (pinned in security.test.ts). explain enumerates both layers (cross-function findings carry interprocedural: true).
- One shared codec. Both the emit path and explain import taint/path-codec.ts. Two hand-rolled copies of a wire format drift, so never fork it. New metadata extends the format WITHIN the version when writer + reader ship together.
- Cache versioning. A worker-harvest shape change bumps the parse-cache pdg NAMESPACE (pdg:N), NOT SCHEMA_BUMP (which cold-invalidates every user). Persisted-graph/config changes ride RepoMeta.pdg's key-union mismatch → full writeback. Model content rides taintModelVersion.
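The Function↔CFG join and its drop-on-ambiguity rule can be sketched like this. The interfaces and the joinCfgsToNodes helper are hypothetical and much smaller than the real types. The sketch assumes the subtraction is applied to the 1-based CFG line to reach the 0-based node line, and that "ambiguous" means more than one function node or more than one CFG lands on the same line.

```typescript
// Hypothetical minimal shapes; the real types carry many more fields.
interface FunctionNode { id: string; startLine: number }  // 0-based, no column
interface FunctionCfg { functionStartLine: number }        // 1-based

interface JoinResult {
  joined: { nodeId: string; cfg: FunctionCfg }[];
  unresolved: number; // CFGs dropped because the line match was not unique
}

function joinCfgsToNodes(nodes: FunctionNode[], cfgs: FunctionCfg[]): JoinResult {
  const nodesByLine = new Map<number, FunctionNode[]>();
  for (const node of nodes) {
    const sameLine = nodesByLine.get(node.startLine) ?? [];
    sameLine.push(node);
    nodesByLine.set(node.startLine, sameLine);
  }

  const cfgCountByLine = new Map<number, number>();
  for (const cfg of cfgs) {
    const line = cfg.functionStartLine - 1; // 1-based → 0-based
    cfgCountByLine.set(line, (cfgCountByLine.get(line) ?? 0) + 1);
  }

  const result: JoinResult = { joined: [], unresolved: 0 };
  for (const cfg of cfgs) {
    const line = cfg.functionStartLine - 1;
    const candidates = nodesByLine.get(line) ?? [];
    const unique = candidates.length === 1 && cfgCountByLine.get(line) === 1;
    if (unique) {
      result.joined.push({ nodeId: candidates[0].id, cfg });
    } else {
      // Without a column there is no safe way to pair same-line functions,
      // so drop and count rather than risk cross-wiring two summaries.
      result.unresolved += 1;
    }
  }
  return result;
}
```

Dropping costs recall on same-line functions, while a wrong pairing would attach one function's flows to another, which is the worse failure.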
Adding a source / sink / sanitizer
Edit the language model in taint/typescript-model.ts (registered via the
explicit registerBuiltinTaintModels seam, keyed by SupportedLanguages). The
spec is hashable data (no functions). A sanitizer's neutralizes lists the
EXACT sink kinds it defends — never a blanket kill. Add a fixture + assert the
finding (or its absence) in test/unit/taint/ (real-source harness:
test/helpers/ts-cfg-harness.ts); the end-to-end proof is
test/integration/cfg/.
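To illustrate why the spec must stay hashable data, here is a sketch of a content-derived version computed from a model spec. Every name in it is an assumption made for illustration: the TaintModelSpec shape, canonicalize, and contentVersion are hypothetical and are not taken from taint/typescript-model.ts, and the FNV-1a-style hash over UTF-16 code units merely stands in for whatever hashing the project actually uses.

```typescript
// Hypothetical spec shape: plain data only, no functions.
interface SanitizerSpec { callee: string; neutralizes: string[] }
interface SinkSpec { callee: string; kind: string }
interface TaintModelSpec {
  sources: string[];
  sinks: SinkSpec[];
  sanitizers: SanitizerSpec[];
}

// Serialize with sorted object keys so key order never changes the result.
function canonicalize(value: unknown): string {
  if (typeof value === "function") {
    throw new Error("spec must be plain data: functions are not hashable");
  }
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`);
    return `{${fields.join(",")}}`;
  }
  return JSON.stringify(value);
}

// FNV-1a-style 32-bit hash over UTF-16 code units (illustrative only).
function contentVersion(spec: TaintModelSpec): string {
  const text = canonicalize(spec);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

const model: TaintModelSpec = {
  sources: ["req.body"],
  sinks: [{ callee: "res.send", kind: "xss" }],
  // neutralizes lists the exact kinds defended, never a blanket kill.
  sanitizers: [{ callee: "escape", neutralizes: ["xss"] }],
};
```

A predicate function inside the spec would make this impossible, since a function has no stable serialized form. With data only, any change to a sanitizer's neutralizes list changes the canonical text and therefore the derived version.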
Validation checklist for any --pdg change
1. tsc clean (schema additions are exhaustiveness-checked; watch the
api.ts getNodeQuery runtime read-path if a node label is added).
2. Targeted vitest by directory (test/unit/taint, test/unit/cfg,
test/integration/cfg) — verify by ISOLATION, not full-suite exit
(known load-flakes). `node scripts/build.js` before worker/integration runs.
3. Flag-off golden byte-identical (pipeline-graph-golden.test.ts).
4. bench/cfg/measure.mjs --check (no fingerprint drift / budget regression).
5. detect_changes() before commit; impact({direction:'upstream'}) before
editing shared symbols (KnowledgeGraph, RepoMeta, RelationshipType, codec).
Prior art (for deeper design questions)
Sharir & Pnueli 1981 (functional approach); Reps-Horwitz-Sagiv IFDS (POPL 1995); FlowDroid/StubDroid (access-path summaries); Pysa & Mariana Trench (TITO / propagations, parallel SCC fixpoint); CodeQL Models-as-Data (the richest port notation, incl. callback ports); Infer (content-keyed incremental summaries).