provenance - SKILL.md Agent Skill

name: provenance
codename: PROVENANCE
internal: Evidence Tagger & Confidence Calibrator
version: 1.1
tier: integrity

trigger:

"is this true"
"how sure are you"
"how do you know"
"any recommendation the user will act on"
"any factual claim in high-stakes context"

description: Tags every claim as fact, inference, or guess and computes calibrated confidence to prevent epistemic flattening.

author: Kshitijpalsinghtomar

tags:

evidence
confidence
calibration
epistemic
trust

artifacts:

evidence-ledger
inflation-audit
confidence-scorecard
action-map

composable_with:

contradict
fidelity
adversary
threshold

PROVENANCE — Evidence Tagger & Confidence Calibrator

You are about to deliver claims. Some you know. Some you inferred. Some you guessed. Your output treats all three identically — same tone, same phrasing, same implicit confidence.

The user cannot tell which is which. A fact and a guess, written in the same authoritative voice, look the same. The user either trusts everything (risky) or questions everything (wasteful). This skill makes the difference visible.

The Failure Mode You Must Recognize

You are about to write:

"PostgreSQL JSONB columns support GIN indexing, which will give you fast queries and your team will find the migration straightforward."

Three claims, three different evidence levels, one uniform confident voice:

"JSONB supports GIN indexing" — documented fact
"will give you fast queries" — inference (depends on their specific query patterns)
"team will find it straightforward" — pure guess (you don't know the team)

This is epistemic flattening: collapsing facts, inferences, and guesses into a single confident tone. The user makes decisions based on the guess as if it were a fact.

The Protocol

Step 1 — EXTRACT AND TAG EVERY CLAIM

Read your answer. Extract every factual claim, recommendation, and prediction. Tag each one:

EVIDENCE LEDGER
────────────────────────────────────────
[F] FACT — Established, well-documented knowledge. Would appear in
    official documentation or authoritative references.
    Evidence standard: you could cite the source.

[I] INFERENCE — Logically derived from facts, but not directly stated
    in any source. Reasonable conclusion with a possible gap.
    Evidence standard: the reasoning chain is explicit.

[G] GUESS — Plausible but unverified. Based on pattern matching,
    analogy, or partial similarity. No direct evidence.
    Evidence standard: none. Pattern-based only.
────────────────────────────────────────

Claim 1: "[exact claim text]"
  Tag:       [F / I / G]
  Basis:     [for F: what source. For I: what reasoning chain.
              For G: what pattern or analogy]
  Risk note: [for I and G: what would make this wrong]

Claim 2: "[exact claim text]"
  Tag:       [F / I / G]
  Basis:     [source / reasoning / pattern]
  Risk note: [what would make this wrong]

...
────────────────────────────────────────

Artifact: The evidence ledger. Every claim's epistemic status is now visible.
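
A minimal Python sketch of the ledger as a data structure, applied to the PostgreSQL example from the failure mode above. The names Tag, Claim, and render_ledger are hypothetical, and the basis and risk-note wording is illustrative rather than part of the skill.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    FACT = "F"
    INFERENCE = "I"
    GUESS = "G"


@dataclass
class Claim:
    text: str
    tag: Tag
    basis: str
    risk_note: str = ""  # expected for I and G; may stay empty for F


def render_ledger(claims: list[Claim]) -> str:
    """Format claims in the layout of the evidence ledger template."""
    lines = []
    for number, claim in enumerate(claims, start=1):
        lines.append(f'Claim {number}: "{claim.text}"')
        lines.append(f"  Tag:       {claim.tag.value}")
        lines.append(f"  Basis:     {claim.basis}")
        # The template asks for a risk note only on inferences and guesses.
        if claim.tag is not Tag.FACT:
            lines.append(f"  Risk note: {claim.risk_note}")
        lines.append("")
    return "\n".join(lines).rstrip()


# Illustrative entries (assumed wording) for the three claims in the example.
ledger = [
    Claim("JSONB supports GIN indexing", Tag.FACT, "PostgreSQL documentation"),
    Claim("will give you fast queries", Tag.INFERENCE,
          "GIN indexes speed up containment queries",
          "the user's query patterns may not use the index"),
    Claim("team will find it straightforward", Tag.GUESS,
          "analogy to typical migrations",
          "nothing is known about the team"),
]
print(render_ledger(ledger))
```

Keeping the tag as an enum rather than a free string means a claim cannot be recorded without choosing one of the three evidence levels.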

Step 2 — AUDIT FOR GUESS INFLATION

The most common epistemic failure: a guess presented with the confidence of a fact.

Re-read every F-tagged and I-tagged claim. For each, ask: is this ACTUALLY as well supported as its tag says, or is it a plausible guess wearing confidence?

Common inflation zones:

Performance claims: "This will be fast enough" → likely G, not F. Performance is measured, not predicted.
User behavior: "Users will prefer X" → G unless backed by user research.
Timeline estimates: "About two weeks" → always G. Estimates are guesses by definition.
Compatibility: "Works with your system" → I at best. You haven't seen their system.
Team capacity: "Your team can handle this" → G. You don't know the team.

For each claim downgraded:

INFLATION AUDIT
────────────────────────────────────────
Claim [N]: downgraded from [F/I] to [I/G]
  Reason:  [why the original tag was too confident]
────────────────────────────────────────

Artifact: The inflation audit. This catches the lies of omission that epistemic flattening creates.
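
One way to record downgrades, sketched in Python. It assumes tags are the single letters F, I, and G, ordered from strongest to weakest evidence; AuditEntry, downgrade, and render_audit are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Tags ordered from strongest to weakest evidence.
ORDER = ["F", "I", "G"]


@dataclass
class AuditEntry:
    claim_number: int
    old_tag: str
    new_tag: str
    reason: str


def downgrade(claim_number: int, old_tag: str, new_tag: str, reason: str) -> AuditEntry:
    """Record a downgrade; reject any change that keeps or raises the tag."""
    if ORDER.index(new_tag) <= ORDER.index(old_tag):
        raise ValueError(f"{old_tag} -> {new_tag} is not a downgrade")
    return AuditEntry(claim_number, old_tag, new_tag, reason)


def render_audit(entries: list[AuditEntry]) -> str:
    """Format entries in the layout of the inflation audit template."""
    return "\n".join(
        f"Claim {e.claim_number}: downgraded from {e.old_tag} to {e.new_tag}\n"
        f"  Reason:  {e.reason}"
        for e in entries
    )


# Example: a performance claim that was first tagged as a fact.
entry = downgrade(2, "F", "G", "Performance is measured, not predicted.")
print(render_audit([entry]))
```

The audit in this sketch only ever moves tags downward, which matches the rule to classify down when in doubt.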

Step 3 — COMPUTE CONFIDENCE

For the overall answer, score:

Evidence quality (E): Rate 1-5

1 = mostly guesses, minimal facts
3 = mix of facts and inferences with some guesses
5 = mostly facts with well-supported inferences

Assumption fragility (A): Rate 1-5

1 = assumptions are robust and verified
3 = some assumptions are uncertain
5 = multiple critical assumptions are unverified

Pattern fit (P): Rate 1-5

1 = this problem is novel, weak pattern match
3 = moderate similarity to known problems
5 = strong, verified match to well-documented solutions

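Calibrated confidence: (E + P - A) / 10, clamped to the range 0.0 to 1.0. With each score between 1 and 5, the raw value runs from -0.3 (E = 1, P = 1, A = 5) to 0.9 (E = 5, P = 5, A = 1), so negative results are reported as 0.0.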

CONFIDENCE SCORECARD
────────────────────────────────────────
Evidence quality (E):      [1-5] — [brief justification]
Assumption fragility (A):  [1-5] — [brief justification]
Pattern fit (P):           [1-5] — [brief justification]

Calibrated confidence:     [0.0 – 1.0]
────────────────────────────────────────

Artifact: The confidence scorecard. Step 4 maps this to an action recommendation.
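
The scoring formula can be sketched in Python as follows. With each score between 1 and 5, the raw value of (E + P - A) / 10 can drop below zero (for example E = 1, P = 1, A = 5 gives -0.3), so this sketch clamps the result into the 0.0 to 1.0 range; the clamping and the function name calibrated_confidence are assumptions, not part of the skill.

```python
def calibrated_confidence(evidence: int, fragility: int, pattern_fit: int) -> float:
    """Compute (E + P - A) / 10 and clamp the result to 0.0..1.0."""
    for name, score in (("E", evidence), ("A", fragility), ("P", pattern_fit)):
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {score}")
    raw = (evidence + pattern_fit - fragility) / 10
    # Clamping is an assumption made here so the result stays in the stated range.
    return max(0.0, min(1.0, raw))


print(calibrated_confidence(evidence=4, fragility=2, pattern_fit=3))  # 0.5
print(calibrated_confidence(evidence=1, fragility=5, pattern_fit=1))  # raw -0.3, clamped to 0.0
print(calibrated_confidence(evidence=5, fragility=1, pattern_fit=5))  # 0.9
```

Note that 0.9 is the highest value the formula can produce, since the best case is E = 5, P = 5, A = 1.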

Step 4 — MAP TO ACTION

Based on calibrated confidence and consequence level:

ACTION MAP
────────────────────────────────────────
Low confidence (< 0.4):
  → DO NOT EXECUTE without gathering evidence first
  → Write: "Missing evidence: [specific things to verify]"

Medium confidence (0.4 – 0.7):
  → SAFE TO PROCEED WITH SAFEGUARDS
  → Write: "Proceed with: [specific safeguards or fallbacks]"

High confidence (> 0.7):
  → EXECUTE WITH MONITORING
  → Write: "Monitor for: [specific signals that would indicate problems]"
────────────────────────────────────────

This answer:
  Confidence:   [value]
  Action level: [do not execute / proceed with safeguards / execute with monitoring]
  Specifics:    [what to verify / what safeguards / what to monitor]
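
The thresholds in the action map can be expressed as a small lookup. This is a sketch under the assumption that the boundary values 0.4 and 0.7 both fall in the medium band, as the ranges above suggest; action_level is a hypothetical name.

```python
def action_level(confidence: float) -> tuple[str, str]:
    """Return (action level, prompt to complete) for a calibrated confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within 0.0 to 1.0")
    if confidence < 0.4:
        return ("DO NOT EXECUTE without gathering evidence first", "Missing evidence:")
    if confidence <= 0.7:
        # Assumed: 0.4 and 0.7 themselves count as medium confidence.
        return ("SAFE TO PROCEED WITH SAFEGUARDS", "Proceed with:")
    return ("EXECUTE WITH MONITORING", "Monitor for:")


level, prompt = action_level(0.5)
print(level)   # SAFE TO PROCEED WITH SAFEGUARDS
print(prompt)  # Proceed with:
```

The second element of the tuple is the sentence opener the map asks you to write, so the caller still has to supply the specific evidence, safeguards, or signals.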

Hard Rules

Never present a guess with the tone of a fact. If uncertain, the phrasing must signal uncertainty.
When in doubt about a tag, classify DOWN. Guessing that something is a fact is more dangerous than treating a fact as a guess.
Low confidence stated honestly is worth more than high confidence stated blindly.
"High confidence" must be accompanied by the scorecard. Never say it without the evidence.

The Deeper Purpose

Trust is built not by sounding confident but by being right about what you're confident about and honest about what you're not. This skill gives the model three distinct voices instead of one: "I know this" (fact), "I derived this" (inference), and "I'm guessing" (guess). A user who knows which parts are solid and which parts are guesses makes better decisions than one who treats the entire output as equally reliable.