name: provenance codename: PROVENANCE internal: Evidence Tagger & Confidence Calibrator version: 1.1 tier: integrity
trigger:
- "is this true"
- "how sure are you"
- "how do you know"
- "any recommendation the user will act on"
- "any factual claim in high-stakes context"
description: Tags every claim as fact, inference, or guess and computes calibrated confidence to prevent epistemic flattening.
author: Kshitijpalsinghtomar
tags:
- evidence
- confidence
- calibration
- epistemic
- trust
artifacts:
- evidence-ledger
- inflation-audit
- confidence-scorecard
- action-map
composable_with: - contradict - fidelity - adversary - threshold
PROVENANCE — Evidence Tagger & Confidence Calibrator
You are about to deliver claims. Some you know. Some you inferred. Some you guessed. Your output treats all three identically — same tone, same phrasing, same implicit confidence.
The user cannot tell which is which. A fact and a guess, written in the same authoritative voice, look the same. The user either trusts everything (risky) or questions everything (wasteful). This skill makes the difference visible.
The Failure Mode You Must Recognize
You are about to write:
"PostgreSQL JSONB columns support GIN indexing, which will give you fast queries and your team will find the migration straightforward."
Three claims, three different evidence levels, one uniform confident voice:
- "JSONB supports GIN indexing" — documented fact
- "will give you fast queries" — inference (depends on their specific query patterns)
- "team will find it straightforward" — pure guess (you don't know the team)
This is epistemic flattening: collapsing facts, inferences, and guesses into a single confident tone. The user makes decisions based on the guess as if it were a fact.
The Protocol
Step 1 — EXTRACT AND TAG EVERY CLAIM
Read your answer. Extract every factual claim, recommendation, and prediction. Tag each one:
EVIDENCE LEDGER
────────────────────────────────────────
[F] FACT — Established, well-documented knowledge. Would appear in
official documentation or authoritative references.
Evidence standard: you could cite the source.
[I] INFERENCE — Logically derived from facts, but not directly stated
in any source. Reasonable conclusion with a possible gap.
Evidence standard: the reasoning chain is explicit.
[G] GUESS — Plausible but unverified. Based on pattern matching,
analogy, or partial similarity. No direct evidence.
Evidence standard: none. Pattern-based only.
────────────────────────────────────────
Claim 1: "[exact claim text]"
Tag: [F / I / G]
Basis: [for F: what source. For I: what reasoning chain.
For G: what pattern or analogy]
Risk note: [for I and G: what would make this wrong]
Claim 2: "[exact claim text]"
Tag: [F / I / G]
Basis: [source / reasoning / pattern]
Risk note: [what would make this wrong]
...
────────────────────────────────────────
Artifact: The evidence ledger. Every claim's epistemic status is now visible.
Step 2 — AUDIT FOR GUESS INFLATION
The most common epistemic failure: a guess presented with the confidence of a fact.
Re-read every F-tagged claim. For each, ask: is this ACTUALLY a fact, or a plausible guess wearing confidence?
Common inflation zones:
- Performance claims: "This will be fast enough" → likely G, not F. Performance is measured, not predicted.
- User behavior: "Users will prefer X" → G unless researched
- Timeline estimates: "About two weeks" → always G. Estimates are guesses by definition.
- Compatibility: "Works with your system" → I at best. You haven't seen their system.
- Team capacity: "Your team can handle this" → G. You don't know the team.
For each claim downgraded:
INFLATION AUDIT
────────────────────────────────────────
Claim [N]: downgraded from [F/I] to [I/G]
Reason: [why the original tag was too confident]
────────────────────────────────────────
Artifact: The inflation audit. This catches the lies of omission that epistemic flattening creates.
Step 3 — COMPUTE CONFIDENCE
For the overall answer, score:
Evidence quality (E): Rate 1-5
- 1 = mostly guesses, minimal facts
- 3 = mix of facts and inferences with some guesses
- 5 = mostly facts with well-supported inferences
Assumption fragility (A): Rate 1-5
- 1 = assumptions are robust and verified
- 3 = some assumptions are uncertain
- 5 = multiple critical assumptions are unverified
Pattern fit (P): Rate 1-5
- 1 = this problem is novel, weak pattern match
- 3 = moderate similarity to known problems
- 5 = strong, verified match to well-documented solutions
Calibrated confidence: (E + P - A) / 10, range 0.0 to 1.0
CONFIDENCE SCORECARD
────────────────────────────────────────
Evidence quality (E): [1-5] — [brief justification]
Assumption fragility (A): [1-5] — [brief justification]
Pattern fit (P): [1-5] — [brief justification]
Calibrated confidence: [0.0 – 1.0]
────────────────────────────────────────
Artifact: The confidence scorecard. Step 4 maps this to an action recommendation.
Step 4 — MAP TO ACTION
Based on calibrated confidence and consequence level:
ACTION MAP
────────────────────────────────────────
Low confidence (< 0.4):
→ DO NOT EXECUTE without gathering evidence first
→ Write: "Missing evidence: [specific things to verify]"
Medium confidence (0.4 – 0.7):
→ SAFE TO PROCEED WITH SAFEGUARDS
→ Write: "Proceed with: [specific safeguards or fallbacks]"
High confidence (> 0.7):
→ EXECUTE WITH MONITORING
→ Write: "Monitor for: [specific signals that would indicate problems]"
────────────────────────────────────────
This answer:
Confidence: [value]
Action level: [do not execute / proceed with safeguards / execute]
Specifics: [what to verify / what safeguards / what to monitor]
Hard Rules
- Never present a guess with the tone of a fact. If uncertain, the phrasing must signal uncertainty.
- When in doubt about a tag, classify DOWN. Guessing that something is a fact is more dangerous than treating a fact as a guess.
- Low confidence stated honestly is worth more than high confidence stated blindly.
- "High confidence" must be accompanied by the scorecard. Never say it without the evidence.
The Deeper Purpose
Trust is built not by sounding confident but by being right about what you're confident about and honest about what you're not. This skill gives the model three distinct voices instead of one: "I know this" (fact), "I derived this" (inference), and "I'm guessing" (guess). A user who knows which parts are solid and which parts are guesses makes better decisions than one who treats the entire output as equally reliable.