name: research-integrity
description: Pre-approval AI failure-mode checklist (M1-M7) for research agents. Generic across domains. Walk this before requesting an A1-A5 approval gate, before folding a result into durable project artifacts, before handoff, and before claiming a result is final.
Research Integrity (M1-M7)
This skill is a prompt-level discipline, not an MCP server or a CLI. It is the agent's own pre-flight checklist for catching seven recurring AI research failure modes before the work crosses an approval boundary or gets folded into durable project artifacts.
The skill does not replace the machine-enforced gates that already live
in the control plane (autoresearch A1–A5 approval gates,
HARNESS_INVOCATION_REQUIRED anchor gate, quality_compile,
quality_originality, and convergence gates). It is the agent-side
check that runs before those gates fire — so the gate hearing is
fair and the durable record actually reflects work that was done.
When to use
Run M1–M7 immediately before any of:
- Calling `autoresearch approve <approval_id>` for an A1, A2, A3, A4, or A5 gate.
- Folding a result into `research_contract.md` or `research_plan.md#Current Status`.
- Handing off to another agent or human via the `research-harness` "Fold Results Back" step.
- Marking a `research-team` cycle as converged.
- Requesting `autoresearch final-conclusions` on a run.
- Posting a draft to `research-writer` or invoking a `referee-review` pass.
You may run a smaller subset earlier (for example M2/M4 during a literature pull, the Extraction / transcription fidelity check when you transcribe a source into a deep-read / extraction note, or the Reference-reproduction fidelity check when a result starts claiming to match a published value) and a fuller pass at the boundary. The boundary pass is non-optional.
What this skill is NOT
- Not a substitute for the citation and reference graph verification that provider MCP tools perform (`inspire_*`, `openalex_*`, `arxiv_*`, `pdg_*`, `hepdata_*`). Those are the evidence tools; this is the discipline that decides which evidence calls are required per boundary crossing and verifies they were made.
- Not a re-implementation of any provider tool. When a mode below lists tool names under "Required evidence calls", those names point at existing MCP capabilities; the skill does not duplicate their logic, it only mandates when to call them.
- Not a structured artifact. There is no `integrity_report.json` schema. The check result is recorded inline in the response or notebook entry next to the boundary-crossing action.
- Not domain-specific. Each mode is genuinely domain-neutral. Resolver and graph tools are routed by discipline of the cited work, not by which MCP package they live in. The package name is not a domain label: for example, `inspire_validate_bibliography` lives in `hep-mcp` but its default mode audits non-INSPIRE entries, and `inspire_resolve_citekey` takes an INSPIRE `recid` (not a citekey string) and only resolves entries already in INSPIRE-HEP. Always consult the tool's actual schema and handler before reasoning about its scope.
- Not a replacement for human review. It is a stricter version of the agent's own self-review that biases toward finding the failure mode rather than confirming the work is fine.
The seven modes
M1: implementation_bug_passing_self_review
Definition. A coding error that the agent reviewed and judged clean, because the review used the same broken mental model as the implementation. The bug is invisible to the author and visible to a fresh reader.
Signs.
- "Self-review: looks correct." with no specific counter-hypothesis attempted.
- Tests pass, but were written by the same agent that wrote the bug.
- Recent diff deleted code that "looked unused," and the symptom appeared after.
- The agent's mental model of the function and the code's actual behavior have not been independently re-derived.
Minimum disconfirming check. Pick one load-bearing assumption made in the code (e.g. "this branch is only reached when X is set", "this loop iterates over already-deduplicated entries") and search the codebase for a counterexample. If you cannot construct one, that itself is a signal worth flagging in the response.
Tools that help.
- `git diff` against the prior known-good commit; sometimes the bug is the recent deletion.
- A fresh subagent or peer model running an adversarial review of the diff in isolation.
- Call-graph tracing (grep for call sites, or a code-intelligence tool if one is available) to surface invariants you may have forgotten existed.
- `pnpm -r build` and the targeted package's `vitest` / `pytest`: type-checking and tests catch a subset of M1, but they do not substitute for the disconfirming check.
M2: hallucinated_citation
Definition. A citation made without verifying that the paper exists with the cited identifier, by the cited authors, in the cited venue, in the cited year, with the cited claim.
Signs.
- Citation key looks like a templated `{Authors}{Year}{Topic}` slug but no resolver call appears in the transcript.
- "I'm pretty sure paper X says Y" with no provider lookup backing it.
- Citation count, h-index, review status, or seminal-paper claim with no provider tool call.
- Bibliography assembled from web-search snippets alone, never cross-checked against a bibliography auditor.
- A BibTeX entry like `Vaswani2017Attention` or `Smith:2023abc` is accepted as proof the paper exists, with no DOI, arXiv ID, or provider-graph resolution actually performed.
Minimum disconfirming check. Route by the discipline of the cited work, not by tool name:
1. Find the paper. Use the resolver whose underlying data covers the cited work's discipline. If you do not know the discipline, prefer the broadest provider first.
   - Any DOI → `openalex_get(id="<doi>")` (cross-domain; OpenAlex indexes ~240M works including HEP, ML, condensed matter, biomedicine, etc.).
   - Any arXiv ID → `arxiv_get_metadata` or `arxiv_search` (every arXiv category; not HEP-only).
   - HEP paper without DOI/arXiv → `inspire_search` (INSPIRE-HEP database; HEP-bound by data).
   - Other discipline (ML / cond-mat / biomed / etc.) without DOI/arXiv → `openalex_search` by title+author.
   - Anything still missing → `crossref` skill as cross-domain fallback (non-arXiv non-OpenAlex).
2. Verify the cited claim against the paper itself, not its abstract or third-party summary.
   - Any arXiv preprint (any field) → `arxiv_paper_source`.
   - HEP paper that you want INSPIRE/DOI URL enrichment for → `inspire_paper_source` (handler internally resolves to arXiv, so it also works for any arXiv-resolvable identifier; the extra value over `arxiv_paper_source` is INSPIRE-side metadata enrichment).
   - Anything else (non-arXiv) → `openalex_get` content payload, or `pdf-mcp` parser on the downloaded PDF.
3. HEP-only optional finishing step. Once the paper is confirmed in INSPIRE and you have an INSPIRE `recid`, you may call `inspire_resolve_citekey({recid})` to get the canonical INSPIRE Texkey and BibTeX. This tool takes a `recid`, not a citekey string; it does not verify that an existing BibTeX key is real. It is irrelevant for non-HEP citations because non-HEP papers are not in INSPIRE. There is no inverse `citekey → recid` lookup tool either; if you only hold a BibTeX key, extract a DOI / arXiv ID from the citation context and resolve via step 1, or use `inspire_search` with `texkeys:<key>`.
4. Bulk bibliography hygiene. Use `inspire_validate_bibliography`. Despite the name, its default mode (`scope='manual_only'`, `validate_against_inspire=false`) audits non-INSPIRE entries for locatability only: each entry must carry a DOI, an arXiv ID, or a complete journal+volume+pages tuple, otherwise the tool emits a `missing_locator` warning. It does not check BibTeX syntax or author plausibility; pair it with a separate BibTeX linter if you need those. INSPIRE cross-validation is an optional opt-in mode (`validate_against_inspire=true`). Apply the default locatability audit to bibliographies of any discipline.
Required evidence calls. At least one resolver per cited paper (routed as above), and at least one content-verify call per cited claim (not per paper — a single paper can support many claims, but each claim's textual ground must be opened).
INSPIRE Texkey is INSPIRE-specific. An entry like Smith:2023abc
is the canonical citekey convention inside INSPIRE-HEP. BibTeX entries
for non-HEP papers do not use this convention and cannot be resolved
through INSPIRE; do not treat a missing INSPIRE record as evidence the
citation is fake when the cited work is outside HEP.
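The "find the paper" routing above can be sketched as a small lookup. This is a minimal illustration, not part of any MCP package: the function name `route_resolver`, the `HEP_DISCIPLINES` set, and the discipline labels are hypothetical; only the returned tool names come from the routing rules in this section.

```python
from typing import Optional

# Hypothetical labels treated as HEP for routing purposes (an assumption,
# not a list defined by this skill).
HEP_DISCIPLINES = {"hep", "hep-ph", "hep-th", "hep-ex", "hep-lat"}


def route_resolver(doi: Optional[str] = None,
                   arxiv_id: Optional[str] = None,
                   discipline: Optional[str] = None) -> str:
    """Return the name of the first resolver to try for one cited paper."""
    if doi:
        return "openalex_get"        # any DOI: cross-domain provider
    if arxiv_id:
        return "arxiv_get_metadata"  # any arXiv category, not HEP-only
    if discipline and discipline.lower() in HEP_DISCIPLINES:
        return "inspire_search"      # HEP paper without DOI/arXiv
    # Other or unknown discipline: prefer the broadest provider first.
    return "openalex_search"
```

If the chosen resolver misses, the `crossref` skill remains the cross-domain fallback; the sketch only picks the first call, it does not make it.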
M3: hallucinated_measurement_or_result
Definition. A numerical value — a measured constant, a rate or ratio, a fit parameter, a benchmark accuracy, a sample size, a p-value, a simulation parameter — cited without verifying it against the cited source's actual table, equation, or figure.
Signs.
- Value looks "about right" from memory; uncertainty is absent or stated with the wrong number of significant digits.
- A cited measurement does not name the specific table, equation, or figure it came from in the source.
- The cited work has a canonical reference database for this kind of quantity but no call to that database appears in the transcript (HEP particle property → no `pdg_*` call; HEP experiment data point → no `hepdata_*` call; ML benchmark accuracy → no version-pinned dataset reference + metric definition lookup).
- Two papers' results are compared in prose but never aligned through a measurement-conflict check.
Minimum disconfirming check. The general rule is domain-neutral: for each cited numeric value, open the specific table, equation, or figure in the source and quote the value plus its uncertainty exactly. The additional check depends on whether the cited work's discipline has a canonical reference database:
- HEP particle properties (mass, lifetime, branching fraction, decay width, etc.) → verify against the current PDG record via `pdg_get` / `pdg_get_measurements` / `pdg_get_property`, and record the PDG year/edition that was checked.
- HEP experiment data points (cross sections, asymmetries, etc. with HEPData submissions) → fetch the table via `hepdata_get_table` and confirm the cited number matches the entry exactly.
- ML / DL benchmark results → verify against the cited paper's specific dataset version, metric definition, evaluation split, and hyperparameter table. There is no centralized canonical reference; the paper's own §experiments / §results section is the canonical source.
- Astrophysics / cosmology observations → verify against the cited survey release version (e.g. DR-N), the calibration pipeline version, and the specific catalogue table — the survey documentation, not a third-party summary, is the canonical source.
- Condensed-matter / chemistry / biology / etc. → trace to the paper's specific table or figure; if a community database exists (PDB, ICSD, etc.) verify against it. Treat third-party reviews as candidates, not authority.
Cross-paper tension detection. When two HEP papers' results are
compared, inspire_detect_measurement_conflicts and
hep_project_compare_measurements are the HEP-specific tools. For
non-HEP comparisons the discipline check is the same general rule:
align units, methodology, and uncertainty conventions before claiming
agreement or tension.
Required evidence calls. For each cited numeric value: the content-fetch call appropriate to the paper's discipline (see M2), plus the canonical-database call if one applies to that quantity's class. PDG / HEPData calls are only required when the quantity is in their scope; not having one for an ML benchmark is correct, not a gap.
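For the cross-paper comparison step, one common way to quantify agreement or tension once units and conventions are aligned is the difference in units of the combined uncertainty. The sketch below assumes the two measurements are independent with symmetric Gaussian uncertainties already expressed in the same units; the function name `tension_sigma` is hypothetical and this is not what the HEP-specific conflict tools necessarily compute.

```python
import math


def tension_sigma(value_a: float, sigma_a: float,
                  value_b: float, sigma_b: float) -> float:
    """Return |a - b| in units of the combined uncertainty.

    Assumes independent, symmetric Gaussian uncertainties in the same
    units; correlated or asymmetric errors need a different treatment.
    """
    combined = math.hypot(sigma_a, sigma_b)  # sqrt(sigma_a**2 + sigma_b**2)
    if combined == 0:
        raise ValueError("both uncertainties are zero; tension is undefined")
    return abs(value_a - value_b) / combined
```

A cited value with no stated uncertainty cannot go through this comparison at all, which is itself one of the M3 signs listed above.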
M4: shortcut_reliance
Definition. A relationship claim about papers — "X cites Y", "Y is a review of Z", "this is the seminal paper on Q", "field W mostly disagrees with claim V" — made without consulting the citation or reference graph.
Signs.
- "Most papers in this area cite X." with no citation-graph call.
- "Y is the standard reference." with no review-classification call.
- "Z built directly on W's work." with no chronological + citation-edge check.
- "These two communities cite each other heavily." with no network-analysis call.
Minimum disconfirming check. For each relationship claim, trace the edge in the citation graph. Web-search snippets are candidates, not authorities; they may be derivative summaries of derivative summaries. The graph itself is the authority. Route graph queries by the discipline of the papers in the relationship, not by tool name. As in M2, prefer the broadest provider first and fall back to a discipline-specific graph only when it adds something the broad provider cannot:
- Cross-domain default graph → OpenAlex. Use `openalex_citations` / `openalex_references` for direct edges and `openalex_search` + `openalex_filter` for typed queries. This is the right starting point for non-HEP literature (ML, biomed, cond-mat, math, social science, etc.) and is also a usable fallback for HEP papers indexed in OpenAlex.
- HEP-specialised graph → INSPIRE. When both endpoints of the relationship are HEP papers (e.g. `hep-ph`, `hep-th`, `hep-ex`, `hep-lat`, lattice QCD, HEP collaborations, phenomenology), the INSPIRE citation graph is denser and carries HEP-specific metadata that OpenAlex lacks. Use `inspire_literature(mode=get_citations)` / `(mode=get_references)`, `inspire_find_connections`, `inspire_network_analysis`, `inspire_classify_reviews`, `inspire_analyze_citation_stance`. `inspire_classify_reviews` operates on INSPIRE-resident papers; treat its judgments as authoritative only for relationships fully inside HEP.
- Cross-discipline relationship (HEP paper cited by an ML paper, biomed paper using statistical methods from physics, etc.) → query both graphs and reconcile. `inspire_find_crossover_topics` is useful when the HEP side of the crossover is the focus; for the reverse direction (non-HEP discipline finding HEP-adjacent work) OpenAlex network queries are typically more complete.
Required evidence calls. At least one citation-graph call per relationship claim, routed to the appropriate graph by the discipline rule above. A graph call to the wrong provider (e.g. asking INSPIRE about a NeurIPS ML paper that is not in INSPIRE) is not a check — it is a guaranteed miss.
M5: bug_as_insight
In-cycle exemption. If you are inside an active research-team
cycle and the Reproducibility Capsule (specifically section
"G) Sweep semantics / parameter dependence") has been filled and
the convergence gate has accepted this milestone, the per-boundary
M5 walk reduces to: verify the capsule's G/H sections are filled
and validated for this milestone, and verify the cited finding falls
under the capsule's declared sweep / branch coverage. Do not
duplicate the perturbation work the capsule already locked in. If
you are not in a research-team cycle, or the capsule does not
cover the cited observable, perform the full check below.
This exemption covers ONLY sweep/branch coverage (§G/§H); it does
NOT cover method-validity preconditions. M5b below is performed in
full regardless of the cycle and is never discharged by a
gate pass alone: the capsule's §J records the precondition residual,
but M5b independently confirms it was actually measured at the
PRODUCTION configuration (a concrete residual + command/artifact +
matching config), not self-asserted. (Never let an exemption defer to
a gate that may not have run the check at the production scale.)
Definition. Treating an artifact of a code bug, numerical instability, plotting mistake, or unit error as a genuine scientific result. The agent reports the artifact as the finding instead of investigating its source.
Signs.
- Unexpected feature in output and the response says "this is interesting" without first ruling out a code-side cause.
- Effect disappears or reverses sign when a parameter unrelated to the modeled system is changed (random seed, batch size, numerical tolerance, mesh resolution, integration order).
- Effect cannot be reproduced from a clean checkout with the recorded seeds and config.
- The "finding" coincides with a recent code change that touched the observable.
Minimum disconfirming check. Reproduce the alleged finding from a
clean checkout under the same seeds and parameters recorded in
artifacts/runs/<run_id>/. If reproducible, perturb one numerical
knob (tolerance, precision, mesh size, integration order, sample
size, domain size / number of sites / grid parity / periodic-wrap
regime) and verify the effect persists in the expected direction. If
the effect comes and goes, the code is the first hypothesis to
investigate.
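The reproduce-and-perturb check can be sketched as a small predicate over reruns. This is an illustration only: `effect_persists`, the `run` callable, and the `rel_tol` threshold are hypothetical names and choices, and "persists in the expected direction" is modelled here simply as keeping the baseline sign and staying within a relative tolerance of the baseline.

```python
from typing import Callable, Iterable


def effect_persists(run: Callable[[float], float],
                    baseline_knob: float,
                    perturbed_knobs: Iterable[float],
                    rel_tol: float = 0.1) -> bool:
    """Rerun with one numerical knob perturbed and test the effect.

    `run` maps a knob value (tolerance, mesh size, ...) to the measured
    effect. Returns False if the effect vanishes, flips sign, or moves
    by more than `rel_tol` relative to the baseline run.
    """
    baseline = run(baseline_knob)
    if baseline == 0:
        return False  # there is no effect to test
    for knob in perturbed_knobs:
        effect = run(knob)
        if effect * baseline <= 0:
            return False  # effect vanished or reversed sign
        if abs(effect - baseline) > rel_tol * abs(baseline):
            return False  # effect depends strongly on a numerical knob
    return True
```

A `False` here does not prove a bug; it makes the code the first hypothesis to investigate, as stated above.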
M5b: precondition_as_validity (no in-cycle exemption). If the
result comes from a method whose validity rests on a structural
property of the operator/method — an operator commuting with a
projector/symmetrizer, Hermiticity, self-adjointness, idempotency,
unitarity, a variational/Galerkin subspace being invariant under the
operator — you MUST evaluate that property's disconfirming residual
at the exact scale/configuration that produced the headline
number, regardless of the in-cycle exemption and regardless of how
clean the reproduction is. A precondition-violating result is
perfectly reproducible from a clean checkout and survives
knob-perturbation, so the M5 reproduce-and-perturb check passes it;
only the precondition residual at the production scale exposes it.
For a projected/effective eigenvalue, report the true-operator
residual ‖Oψ − λψ‖/‖Oψ‖, not merely that ψ has the assumed
symmetry. A precondition verified only at a smaller/cheaper scale
than the result is NOT verified.
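The true-operator residual ‖Oψ − λψ‖/‖Oψ‖ is cheap to evaluate once the full operator can be applied to the state. A minimal pure-Python sketch on dense lists follows; the helper names are hypothetical, and a production check would apply the actual operator at the production configuration rather than a small dense matrix.

```python
import math
from typing import List, Sequence


def matvec(matrix: Sequence[Sequence[complex]],
           vector: Sequence[complex]) -> List[complex]:
    """Dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def norm(vector: Sequence[complex]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(abs(x) ** 2 for x in vector))


def relative_residual(operator: Sequence[Sequence[complex]],
                      psi: Sequence[complex],
                      lam: complex) -> float:
    """Return ||O psi - lam psi|| / ||O psi|| for the full operator O."""
    o_psi = matvec(operator, psi)
    denom = norm(o_psi)
    if denom == 0:
        raise ValueError("O psi is zero; the relative residual is undefined")
    return norm([op - lam * p for op, p in zip(o_psi, psi)]) / denom
```

A residual near zero supports the eigenpair; a residual of order one means the reported λ is not an eigenvalue of the true operator, whatever symmetry ψ has.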
Tools that help.
- `autoresearch run` with explicit `run_id` for reproducibility.
- `research-team` Reproducibility Capsule (mandatory section of `research_contract.md`).
- `derivation-verify` (or a CAS) for an independent symbolic re-derivation when an analytic cross-check is available.
- `git bisect` when the symptom postdates a known-clean reference point.
M6: methodology_fabrication
In-cycle exemption. If you are inside an active research-team
cycle and the Reproducibility Capsule has bound this milestone's
method steps to artifact pointers under artifacts/runs/<run_id>/
(the capsule mandate), the per-boundary M6 walk reduces to:
verify the capsule's method-to-artifact bindings cover the
boundary-crossing claim. Do not re-trace bindings the capsule
already locked in. If you are outside a research-team cycle or the
claim crosses a method step that is not in the capsule's binding
list, perform the full check below.
Definition. Describing an experimental protocol, derivation, or training procedure that did not actually run in the form described. The method section reads cleanly but the code, configs, or run artifacts do not back it.
Signs.
- "We used X-method with Y-cutoff" but no committed file imports X-method or sets Y-cutoff.
- Hyperparameter list is plausible but no run artifact records those exact values.
- Method step is described in `research_notebook.md` but no `artifacts/runs/<run_id>/` entry shows it.
- Two methodology versions exist in the project history and the drafted text describes the merged one that never actually executed.
Minimum disconfirming check. For each methodology step in the
work crossing the boundary, produce the exact code path or command,
plus the run artifact under artifacts/runs/<run_id>/ that records
its execution. If the step is not in the artifact, it did not happen.
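The step-to-artifact check can be sketched as a lookup over a run directory. The mapping from method step to file name is an assumption for illustration (this skill does not define a manifest layout), and `unbacked_steps` is a hypothetical helper; the rule it encodes is the one above: a step with no recorded artifact did not happen.

```python
from pathlib import Path
from typing import Dict, List, Union


def unbacked_steps(run_dir: Union[str, Path],
                   bindings: Dict[str, str]) -> List[str]:
    """Return the method steps whose bound artifact is missing or empty.

    `bindings` maps a method-step description to a path relative to the
    run directory (a hypothetical layout chosen for this sketch).
    """
    run_dir = Path(run_dir)
    missing = []
    for step, relative_path in bindings.items():
        artifact = run_dir / relative_path
        if not artifact.is_file() or artifact.stat().st_size == 0:
            missing.append(step)
    return missing
```

File existence is only the floor: the artifact still has to record the exact values the method text claims.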
Tools that help. autoresearch run with explicit run_id,
the artifacts/runs/<run_id>/ manifest, research-team
Reproducibility Capsule, and an evidence-binding/export step linking
manuscript claims to artifacts/runs/<run_id>/.
M7: frame_lock
Definition. Continuing to interpret a result through the initial framing of the question even after evidence has accumulated that would fit a different framing better. The agent's reasoning never crosses the framing boundary; new findings are made to fit the existing story.
Signs.
- Every new finding "confirms" the original hypothesis.
- Anomalies described as noise without testing the alternative they would support.
- The agent's wording mirrors the original prompt's wording too closely; no reformulation has occurred.
- A result that contradicts the framing is described as a "minor caveat" rather than as a load-bearing observation.
Minimum disconfirming check. State the result one more time using the opposing framing — switch sign of the effect, swap the role of the proposed cause and the proposed consequence, or try the null hypothesis as if it were the working hypothesis. If the opposing framing sounds equally natural, the framing is not load-bearing — proceed. If the opposing framing makes the result disappear or contradicts a different observation, you may be frame-locked.
Tools that help. None machine-enforceable; this is an explicit
prose step. research-harness recovery is a good moment to perform
it, by re-reading research_contract.md and
research_plan.md#Current Status with fresh eyes after the M1–M6
material checks have been completed.
Extraction / transcription fidelity (gate it; not a gate-exempt "reading task")
A source-extraction / transcription note — a deep-read / knowledge-base note that transcribes equations, numeric values, source locators (line / section / equation / table / figure / page pointers), and term-by-term mappings onto a consuming artifact (code or data) from a primary source — is a gateable artifact, not a gate-exempt "reading task." Its primary observable is fidelity to the source: every quoted equation, value, locator, and claimed mapping must match the primary source. Relying on such a note for a central claim, or folding it into a durable artifact, without gating its fidelity is the failure this guards.
This is not a new receipt mode. It is a cross-cutting fidelity check that
augments M2 and M3; record it under the modes it touches — typically M2
(equation / locator / mapping / inference fidelity) and M3 (numeric value +
factor fidelity) — so the machine receipt modes stay within M1–M7, with no new
mode introduced.
The transcription / extraction failure checklist. Walk every item against the source, not against the note:
- (a) equation misquote — a sign, coefficient, index, operator, or argument has drifted from the source equation.
- (b) wrong numeric value — a transposed digit, the wrong table row / column, or the wrong reported uncertainty.
- (c) wrong / stale locator — a pointer (line / section / equation / page) that does not point to the claimed content.
- (d) stale / wrong mapping to the consuming artifact — the note cites a symbol, function, file, or type in the consuming code / data that is wrong or no longer exists.
- (e) false "verbatim" — a quote labeled verbatim when whitespace, markup, or notation was silently normalized.
- (f) inference-as-source — a cross-source or derived inference presented as a direct statement of the cited source.
- (g) silent factor drop — notation using a reduced / normalized / unit argument where the full object is meant, dropping a magnitude or degree factor.
Minimum disconfirming check. Run a line-by-line comparison of the note against
the primary source with "do not trust the note" — a falsification gate, not a
confirmation read. When the note will carry a central claim or be folded into a
durable artifact, this check is independent (a fresh reader / subagent, not the
note's author re-reading their own work), and at least one reviewer must be
cross-model-family doing a literal (not loose-semantic) comparison — transcription
fidelity is exactly where a same-family looser read misses sign / factor / locator
drift. Run it through the gate harness (review-swarm's source-fidelity reviewer),
re-reviewing after every fix because a correction can introduce a fresh defect
(e.g. a rewritten line that drops a magnitude factor), and declare convergence only
when the independent reviewers agree — never self-pronounced after applying a fix.
(derivation-verify re-derives whether a re-derivable result is mathematically
correct — a separate axis that does not check fidelity to the source; use it in
addition to, never instead of, the literal comparison.)
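Item (e), the false "verbatim" label, is the one part of the literal comparison that is easy to mechanize. The sketch below distinguishes a byte-exact quote from one that only matches after whitespace normalization; the function name and the three labels are hypothetical, and it does not replace the independent line-by-line review.

```python
import re


def quote_fidelity(quote: str, source: str) -> str:
    """Classify a quoted span against the fetched primary source.

    Returns "verbatim" if the quote is an exact substring, "normalized"
    if it matches only after collapsing whitespace, else "mismatch".
    """
    if quote in source:
        return "verbatim"

    def squash(text: str) -> str:
        return re.sub(r"\s+", " ", text).strip()

    if squash(quote) in squash(source):
        return "normalized"  # must not be labeled verbatim in the note
    return "mismatch"
```

A "mismatch" may still be a correct paraphrase, but it then falls under the equation, value, or inference items and needs a human or cross-family reader to judge.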
Tools that help. claim-grounding is the active execution of this check for the
quote / value / locator items — it fetches the cited source and records a span-backed
verdict, downgrading any "substantiated" verdict that carries no verbatim source span.
deep-literature-review is the producer discipline that fills the note from the
source and runs this gate before handoff; persist the fetched primary source to a
stable, auditable location so the reviewer reads exactly the bytes that were
transcribed. review-swarm is the cross-family literal-comparison harness.
Reference-reproduction fidelity (a "matches a published value" claim is computed, not asserted)
A result reported as reproducing / matching / agreeing with a published reference value is making a quantitative claim, not a citation: the deliverable is the agreement itself. Two distinct verification dimensions fail silently here even after a multi-round correctness / methodology / honesty gate has passed — because that gate checks whether the implementation matches the derived form, not whether the result's number matches the literature number it claims:
- D1 — quantitative reproduction of the reference number. The failure mode is a match asserted only qualitatively — "same order of magnitude", "same sign", "of the right scale" — while the claimed observable was never computed on a comparable state / regime / configuration and compared to the published value numerically. Minimum disconfirming check: compute the claimed observable on the comparable regime the reference used (or the nearest reachable one, recorded as such) and compare numbers; where the claim is term-level, compare term by term, since a net total can agree while individual contributions are suppressed or sign-flipped. An order-of-magnitude same-direction discrepancy, or a sign reversal, is a finding — not a pass. A "matches in scale" claim with no computed comparison is ungrounded; treat it as an undisclosed gap, not a confirmation.
- D2 — independent cross-check that did not silently lapse. The failure mode is an established independent cross-validation pattern silently lapsing, or a structurally different-model engine / a degenerate-or-limit regime being presented as validation. Minimum disconfirming check: confirm any cross-validation evaluates the same model by a different route; if the only reachable alternative engine implements a structurally different model (or the check holds only in a limit), label it as a different-model / limit-regime comparison and record the absence of an apples-to-apples check as an explicit stated limitation — never let the prior cross-check pattern quietly disappear and never let the different-model check stand in for it.
This is not a new receipt mode. It is a cross-cutting check that augments M3 (the
cited / compared number) and M5b (when the result's own validity is the claimed
match); record it under those modes so the machine receipt set stays within M1–M7,
with no new mode introduced.
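The D1 comparison can be sketched as a numeric classification of the computed observable against the published value. The labels, the factor-of-ten threshold, and the default `rel_tol` are assumptions chosen for illustration and are not the verdict vocabulary of any gate; a real comparison should use the reference's stated uncertainty where one exists.

```python
def compare_to_reference(computed: float, published: float,
                         rel_tol: float = 0.05) -> str:
    """Classify a computed value against a published reference value."""
    if published == 0:
        raise ValueError("relative comparison to a zero reference is undefined")
    if computed * published < 0:
        return "sign_reversal"           # a finding, not a pass
    ratio = abs(computed / published)
    if ratio >= 10 or ratio <= 0.1:
        return "order_of_magnitude"      # also a finding, not a pass
    if abs(computed - published) <= rel_tol * abs(published):
        return "match"
    return "discrepancy"
```

For a term-level claim, run the comparison per term; a net total can return "match" while individual contributions return "sign_reversal".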
Tools that help. numerical-reliability-gate G8 is the active gate — compute the
claimed observable on the comparable regime and compare, with an order-of-magnitude or
sign discrepancy returning reference_mismatch; its G2 carries the D2
structural-independence honesty (a different-model or limit-regime check is labeled as
such or its absence recorded, never a cross-check pass). claim-grounding routes a
"reproduces / matches a published value" claim to that computation rather than grounding
it by quoting the published number. review-swarm's reference-reproduction reviewer
is the role that recomputes the claimed observable on the comparable state instead of
statically reading the assertion.
Pre-approval ritual
Walk the modes most relevant to the gate before invoking
autoresearch approve <approval_id>. The check is owed to the next
agent who will read the durable record, not to the current task's
deadline.
| Gate | Scope | Modes that bite most often |
|---|---|---|
| A1 | mass_search (literature pool definition) | M2, M4 |
| A2 | code_changes (implementation diff) | M1, M5 |
| A3 | compute_runs (numerical result acceptance) | M3, M5, M6 |
| A4 | paper_edits (manuscript text) | M2, M3, M4, M7 |
| A5 | final_conclusions (project closeout) | All seven |
For A5 specifically, run the full M1–M7 pass and record the check result inline in the conclusion artifact rather than as a separate file.
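The gate table can be read as a lookup from gate to the modes most worth walking. The sketch below transcribes it; the helper names are hypothetical, and because the table lists the modes that bite most often, the result is a floor for the walk rather than a cap.

```python
from typing import Iterable, List

ALL_MODES = ["M1", "M2", "M3", "M4", "M5", "M6", "M7"]

# Transcribed from the gate table above.
GATE_MODES = {
    "A1": ["M2", "M4"],
    "A2": ["M1", "M5"],
    "A3": ["M3", "M5", "M6"],
    "A4": ["M2", "M3", "M4", "M7"],
    "A5": ALL_MODES,
}


def modes_for_gate(gate: str) -> List[str]:
    """Return the modes the table associates with an approval gate."""
    if gate not in GATE_MODES:
        raise ValueError(f"unknown gate: {gate!r}")
    return list(GATE_MODES[gate])


def unwalked_modes(gate: str, walked: Iterable[str]) -> List[str]:
    """Return table-listed modes for `gate` that were not walked."""
    done = set(walked)
    return [mode for mode in modes_for_gate(gate) if mode not in done]
```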
Integration with adjacent skills
- `research-harness` is the project-state recovery and routing skill. It is where M7 (frame_lock) reset typically happens: re-read `research_contract.md` and `research_plan.md#Current Status` with fresh eyes. The `HARNESS_INVOCATION_REQUIRED` anchor gate enforced in every `*-mcp` dispatcher ensures tool calls cannot proceed without re-anchoring; this skill's M7 step is the human-facing complement.
- `research-team` enforces M5 and M6 mechanically via the Reproducibility Capsule contract in `research_contract.md`. Inside an active research-team cycle, M5 and M6 are partly covered by the cycle's convergence gate; this skill's job there is the M1–M4 and M7 angles.
- `markdown-hygiene` should run before any check that relies on Markdown math being rendered correctly, otherwise numeric quotations may be misread during the M3 check.
- `referee-review` runs the integrity check from the reviewer side with a strict verdict contract. If a draft is heading to `referee-review`, run M1–M7 first so the reviewer's BLOCKING findings are not symptoms the author could have caught.
- `paper-reviser` is the right surface for acting on M2/M3/M4/M7 findings that surface during late-stage drafting.
- `claim-grounding` is the active execution of the M2/M3 obligations. Where this skill mandates that citations and cited numbers be checked against their sources, `claim-grounding` is the generic, domain-routed way to do it: for each cited claim it fetches the source and records a span-backed verdict in a `claim_grounding_report_v1` artifact, and a `substantiated` verdict that carries no verbatim source quote is mechanically downgraded. It also carries the transcription-fidelity dimension (does the note's quote / value / locator match the fetched source span, not merely "is the claim true") used by the Extraction / transcription fidelity check above. It stays a generic skill plus a `@autoresearch/shared` contract (not a `hep-mcp` tool), consistent with the criterion below.
HEP-specific augmentation (future, out of scope here)
This generic skill stays domain-neutral. Future machine checks may
belong inside @autoresearch/hep-mcp — but only when the check is
genuinely HEP-bound by its core contract, not merely because it
involves a tool whose name contains hep, inspire, or pdg.
Criterion for whether a check is truly HEP-bound (judge by what the check does, not by the package it would live in):
- A PDG-drift check that compares a cited mass / branching fraction against the current PDG record and flags excess deviation is HEP-bound — PDG only tracks HEP particle properties, no equivalent exists in other disciplines.
- A FeynRules / FeynArts model consistency check is HEP-bound — the underlying QFT model formalism is HEP-specific.
- A lattice ensemble metadata check (action, $\beta$, $a$, $V$, sea-quark content) is HEP-bound — the ensemble metadata schema is lattice-QCD-specific.
- A generic "verify each cited number against its source table" check is not HEP-bound: every discipline has tables and numbers; this belongs in this skill (M3), not in `hep-mcp`.
- A generic "verify each cited paper resolves to a real record" check is not HEP-bound: `inspire_validate_bibliography`'s default mode already audits non-INSPIRE entries for locatability, and OpenAlex / arXiv resolvers are cross-domain by data. This belongs in this skill (M2), not as a new `hep-mcp` tool.
If a candidate future tool fails the criterion, it does not become
HEP-bound by being implemented inside hep-mcp; it stays a generic
discipline obligation, captured here as a mode rather than as a tool.
PDG-drift is the leading HEP-bound candidate tracked separately for future work, not part of this skill's initial scope.
Recording the check
For the narrative record — read by the next agent or human — record the check inline in the response or notebook entry, in the order:
- Which modes you checked, by `Mx` number.
- For each checked mode: the specific disconfirming check you ran and what it returned. Quote tool calls and their results where the check is provider-graph-backed.
- Modes you explicitly judged not applicable, with a one-sentence reason.
For the machine record that gates autoresearch approve (see below),
run autoresearch integrity-record after the narrative is written:
    autoresearch integrity-record \
      --approval-id <approval_id> \
      --modes M3,M5,M6 \
      --notes "<terse summary of what was checked and the headline finding>" \
      --skip M1:no\ code\ change,M2:no\ new\ citations
--modes is the comma-separated list of Mx you actually walked.
--skip is an optional comma-separated list of Mx:reason for modes
you judged not applicable. --notes is short prose (max 500 chars) —
durable detail still belongs in the narrative record above; the
receipt is just the boundary-time machine artifact.
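When the receipt command is assembled programmatically, quoting the `--notes` and `--skip` values is the easy part to get wrong. The sketch below builds the command string from the four flags shown above and enforces the 500-character notes limit; the helper name is hypothetical, it does not execute anything, and how the CLI handles a comma inside a skip reason is not specified here, so the sketch rejects it.

```python
import shlex
from typing import Dict, List, Optional


def integrity_record_command(approval_id: str,
                             modes: List[str],
                             notes: str,
                             skip: Optional[Dict[str, str]] = None) -> str:
    """Build a shell-safe `autoresearch integrity-record` command line."""
    if len(notes) > 500:
        raise ValueError("--notes is limited to 500 characters")
    argv = ["autoresearch", "integrity-record",
            "--approval-id", approval_id,
            "--modes", ",".join(modes),
            "--notes", notes]
    if skip:
        for reason in skip.values():
            if "," in reason:
                # Assumption: commas separate Mx:reason entries, so a comma
                # inside a reason would be ambiguous.
                raise ValueError("skip reasons must not contain commas")
        argv += ["--skip",
                 ",".join(f"{mode}:{reason}" for mode, reason in skip.items())]
    return shlex.join(argv)  # quotes arguments containing spaces
```

`shlex.join` produces quoting equivalent to the backslash-escaped spaces in the example above.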
The receipt is appended to .autoresearch/integrity_log.jsonl and
checked by autoresearch approve <approval_id> (and the orch_run_approve
MCP tool) before granting. Without a matching receipt the approval
fails closed with INTEGRITY_RECEIPT_REQUIRED. This is the same
fail-closed pattern as the HARNESS_INVOCATION_REQUIRED anchor gate —
the existence of the receipt is machine-enforced, the content of
your check is your judgment.
Skip semantics for environments that need to bypass the gate (e.g.
historical project replay): AUTORESEARCH_INTEGRITY_VERIFY=skip.
NODE_ENV=test skips by default to keep existing test suites green.
Recovery from a caught failure
If a check surfaces a failure, do not cross the boundary. Fix it,
then re-walk the affected modes only. Re-run autoresearch integrity-record for the same approval_id — the latest receipt
wins, the prior entry stays in the JSONL for audit. The fix gets a
brief note in the narrative record so the next reader can see what
was caught and what was changed.
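The "latest receipt wins, prior entries stay for audit" rule amounts to a last-match scan over an append-only log. The sketch below illustrates that reading on JSON Lines text; the `approval_id` field name is an assumption, since the receipt schema is not given here, and `latest_receipt` is a hypothetical helper rather than part of the CLI.

```python
import json
from typing import Any, Dict, Optional


def latest_receipt(jsonl_text: str,
                   approval_id: str) -> Optional[Dict[str, Any]]:
    """Return the last receipt recorded for `approval_id`, if any.

    Earlier receipts are skipped over, not deleted, mirroring the
    append-only audit trail described above.
    """
    latest = None
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # "approval_id" is a hypothetical field name for this sketch.
        if record.get("approval_id") == approval_id:
            latest = record
    return latest
```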
The integrity check is owed to the next agent who will read your work — including future-you in a new conversation — not to the current task's deadline.