evomath-tao - SKILL.md Agent Skill

name: evomath-tao description: "Use this skill whenever the user submits a non-trivial mathematical claim that needs a rigorous proof or audit. Trigger on IMO/Putnam/USAMO/Olympiad-style problems, ML/AI theoretical statements, research conjectures, suspected-false claims, multi-step proofs the user already failed on, proof drafts with possible hidden assumptions, or any request containing 'prove rigorously', 'verify this', 'is this true', 'find the gap', 'audit my proof', 'find a counterexample', or 'use EvoMath' that targets a mathematical claim. Activate also when the problem requires more than three reasoning steps. Do NOT use for single-step calculations, definition lookups, textbook exercises with a known recipe, code analysis tasks, literature survey questions, pure symbolic manipulation, or non-mathematical applications of those trigger phrases (e.g., 'is it true that GPT-4 can solve math?', 'verify this LaTeX syntax'); hand those back instead." allowed-tools: "write_file edit_file read_file think_tool execute" metadata: author: EvoScientist version: '1.0.0' tags: [core, math, proof, olympiad, research]

EvoMath (Tao-style)

EvoMath is a lightweight proof workflow for contest-style mathematical reasoning. Its job is to produce a rigorous proof, a verified counterexample, a useful partial result, or a clear handoff. Keep the process small; do not run a heavy audit pipeline by default.

Methodology Anchor — Terence Tao's Research-Math Practice

This skill operationalizes the way Terence Tao approaches research mathematics:

Compute small cases first (Kepler before Newton) — build intuition from data before reaching for theory.
Try the standard toolbox broadly before going deep — most hard problems crack to a standard technique; the few that don't only reveal which after several have failed.
Hold rigor and intuition together (post-rigorous mathematics) — trust intuition, but verify every step. "It feels right" is a hypothesis, not a proof.
Atomize when stuck — decompose into independently checkable sub-claims. A clean map of proved / conjectured / open beats a polished but shaky narrative.
Stay honest about what isn't proved — distinguish PROVED / VERIFIED_NUMERICALLY / CONJECTURED / HANDED_OFF. When blocked, name the precise gap.
Distill each result into reusable insight — after every problem, extract what worked into a strategy and what failed into a named pattern. Mathematical maturity is accumulated meta-insight.

Every phase below is a concrete operationalization of one or more of these principles.

Operating Rules

Use Markdown notes for handoff between steps. Do not require JSON/YAML unless a script explicitly asks for it.
Keep only compact state: plan, verified claims, failed attempts, final audit. Do not pass long failed derivations into later prompts.
Prefer a few independent proof attempts over one long derivation.
Numerical verification is NOT a proof step (math-olympiad rule). Checking a claim on n=1..100 and finding no counterexample does NOT make it PROVED; the strongest label such evidence can earn is VERIFIED_NUMERICALLY.
Exact arithmetic can refute; approximate numerics only suggest.
A proof is final only after an adversarial check of the clean proof.
Calibrated abstention over bluffing: when verification fails repeatedly, admit it. Return partial results and mark unfixed gaps explicitly (math- olympiad rule). Final status HANDED_OFF with a structured wall report is always preferable to PROVED with hand-waved gaps.
Every final answer must include a visible final-status: ... line.
Use TodoWrite to drive the workflow. Each step is one todo; you cannot mark a todo completed unless the corresponding .md file passes its validator.

If filesystem access is available, create a Markdown workspace with:

python skills/evomath-tao/scripts/evomath_workspace.py init --dir .evomath/current

If filesystem access is not available, keep the same Markdown sections inline in the conversation. In that case run the validators by mentally checking the same required fields the script checks — the discipline is the same.

Fast Exit

Do not use EvoMath for single calculations, definition lookups, symbolic manipulation, or answer-only requests with no proof obligation. Give the direct answer instead. No TodoWrite list is needed for a Fast Exit.

If the statement has a blocking ambiguity that changes truth value, ask one specific clarification question before solving.

Execution Protocol (TodoWrite + Validation)

For any problem that passes the Fast Exit Gate, follow this protocol.

1. Create the 5-step todo list

Before doing any solving work, call TodoWrite with these five items in this order. Each item names its primary reference file:

Plan Briefly — read references/intake-checklist.md for type classification, ambiguity handling, goal types.
Try Candidates — read references/angles-by-type.md for technique ideas if you are out of angles for this problem type.
Assemble — read references/output-formats.md if you need formatting conventions or LaTeX templates.
Audit — read references/grading-taxonomy.md for issue classes and severity rules. Read references/phase-4-audit.md only if the user requests strict multi-reviewer audit.
Reflect — read references/claim-memory.md only when deep reflection is triggered (see "Deep Reflection Triggers" below).

2. Per-step discipline

For each step in order:

Mark the todo in_progress before reading the reference or writing output.
Read the referenced file(s) if and only if you need them for this step.
Produce the corresponding .md output (plan.md, candidates.md, audit.md, final.md sections, etc.).

Run the validator before marking the todo completed:

python skills/evomath-tao/scripts/evomath_workspace.py validate-phase <N> --dir .evomath/current

If validation FAILS, the todo stays in_progress. Revise the .md file based on the printed failure messages and re-run the validator. Do not mark completed until the validator exits 0.

3. PROVED gate

If your final-status is PROVED, you MUST additionally run:

python skills/evomath-tao/scripts/evomath_workspace.py validate-proved --dir .evomath/current

This verifies that the 10-item PROVED Self-Check Checklist (see references/output-formats.md) is present in final.md with all boxes ticked. If this fails, downgrade final-status to CONJECTURED or HANDED_OFF and revise the answer.

4. Deep Reflection Triggers

Step 5 (Reflect) has two modes:

Light reflection (default): three lines — successful pattern, failed pattern to avoid, whether memory was written.
Deep reflection (triggered when any of the following hold):
- Step 2 required 3+ revision rounds for any candidate
- Step 4 identified a fatal flaw before the final repair
- A new winning technique appeared that is not yet in any L2 strategy entry
- The user explicitly asks for self-evolution or cross-problem learning
- final-status is HANDED_OFF with a recurring failure-mode
In deep reflection mode, run the full ESE/IVE protocols described in references/claim-memory.md and update L2/L3 memory in .evomath/session-memory.md.

5. Fallback when filesystem is unavailable

If you cannot run scripts, keep the same TodoWrite discipline:

Still create the 5-item list and march through it.
Still keep the same .md sections inline in the conversation.
Substitute mental validation for the script call — check the same required fields the validator would check.

Workflow

1. Plan Briefly

Write a short Markdown plan:

Problem type: algebra, geometry, number theory, combinatorics, analysis, or other.
Goal: prove, refute, find example, or audit proof.
Strategy: one sentence.
Subgoals: at most five bullets.

For "determine all" problems, include both:

existence/construction
impossibility/exclusion

For a simple problem, one root subgoal is enough.

2. Try Candidates

Mode dispatch (decide before generating):

If Phase 0 goal = find-numeric-answer (AIME-style: answer is a single number, no proof required) → AIME mode: generate 5–7 short candidate answers using varied approaches (small-case enumeration, modular invariants, algebraic manipulation, generating functions, brute-force code). Take majority vote across candidates. Verify the top two by substitution into the original problem. Skip the rest of the proof workflow; output the numeric answer with final-status: PROVED only when both top candidates agree AND substitution checks pass.
Otherwise → Proof mode: continue below.

Proof mode — per-candidate 5-round internal loop:

For each active subgoal, try up to four genuinely different candidate routes. Each candidate is itself the product of a 5-round internal mini-process — not a one-shot generation:

Solve — produce a proof attempt using reasoning only. No tool use during this round (no calculator, sympy, Lean, web). Pure pencil-and-paper.
Self-improve — refine the attempt for clarity, fix obvious gaps.
Self-verify — walk through line by line, looking for: shielding words ("obviously"/"clearly"), unjustified swaps, missing hypotheses, off-by-one cases, hidden assumptions.
Correct — fix issues found in step 3.
Repeat 1–4 up to 5 times per candidate, or until the candidate self- reports as is_sound.

Each candidate's internal-rounds count is recorded so Phase 4 can see how much self-revision was needed. A candidate that needs 5 rounds is more likely to be borderline than one that's sound in 1.

Record candidates in this Markdown table:

Candidate	Idea	Internal rounds	Verdict	Issue or reason
C1		1–5	sound / repair / fail

Judging a candidate:

sound: enough to use as a verified claim.
repair: promising but missing a local step; revise at most twice outside the 5-round internal loop (so total max revisions = 5 internal + 2 external).
fail: wrong, circular, too weak, or repeats a known dead end.

Computation discipline (math-olympiad VERBATIM rule): during the Solve round, no tool calls. Computation is allowed in Phase 1 (Empirical) and in Phase 5 Deep Mode, NEVER during Phase 2 Solve. This protects against ritualized "I called sympy so it must be right" reasoning.

When a candidate is sound, add it to Proof Artifact / Verified Claims with a one-paragraph proof summary. When a route fails, add one line to Negative Attempts so it is not repeated.

Use references/angles-by-type.md only when you are out of ideas for a problem type. Do not load it by default.

3. Assemble

Turn the accepted claims into a clean proof or refutation.

Rules:

State the original claim.
Present the final argument only; omit failed attempts.
Justify every non-trivial step by a verified claim, theorem, construction, or exact computation.
If a required subgoal remains unsolved, stop pretending the proof is complete: output a partial result or handoff.

4. Audit

Audit only the clean proof, not the exploration notes.

Check:

The proof proves the original statement, not a weaker restatement.
All cases, boundary values, degeneracies, and quantifiers are handled.
No claim is cited without proof or explicit acceptance in the Proof Artifact.
Refutations use an exactly verified counterexample.

Audit follows math-olympiad's 3-safeguard pattern (see references/phase-4-audit.md):

Verifier context isolation (strip thinking traces before review).
Asymmetric voting (4 HOLDS to confirm; 2 HOLE FOUND to refute).
Pigeonhole exit (stop launching reviewers after threshold).

Apply the named-pattern screen (P4 / P5 / P6 / P18 / P40 / P41 in grading-taxonomy.md) plus the counterexample-first rule before any PROVED award.

For ordinary use, one careful local audit is enough. Use the heavier references/phase-4-audit.md protocol only when the user asks for strict multi-reviewer audit or when the result is high-risk.

5. Reflect

After the final answer, add a short reflection note. It should be compact:

successful pattern, if any
failed pattern to avoid
whether memory was written

Per-problem memory resets at the next problem. Cross-problem memory is optional: write only compact strategy/failure summaries to .evomath/session-memory.json when file access is available. If not written, say memory-persisted: false.

Status Labels

Use exactly one:

Status	Meaning
`PROVED`	Complete proof of the original statement passed Safeguards 1–3 audit. The strongest label EvoMath awards.
`REFUTED`	Exact counterexample or contradiction proof found
`VERIFIED_NUMERICALLY`	Finite exact checks only; no general proof. Numerical evidence is NOT a proof — this label exists to record empirical support honestly.
`CONJECTURED`	Strong partial evidence or partial proof, but incomplete
`HANDED_OFF`	Stopped with a precise remaining gap or user question

Exactly one of these five labels must appear in every final answer. See references/confidence-rules.md for award conditions and promotion rules.

Final Answer Shape

Use Markdown, not a rigid schema:

final-status: PROVED

## Result
<answer>

## Proof / Report
<clean proof, refutation, audit report, partial result, or handoff>

## Audit
- audits-run:
- critical-issues:
- remaining-gaps:

## Proof Artifact
- verified claims:
- negative attempts:

## Reflection
- memory-persisted:
- storage-location:
- proposed-memory-updates:

For proof-audit requests, lead with findings ordered by severity, then give the verdict and suggested repair.

Optional Helpers

scripts/evomath_workspace.py:
- init creates the five Markdown state files.
- check verifies final.md has a valid final-status: line.
- validate-phase N validates the .md output for step N (1..5). Use this before marking the corresponding todo completed. Add --strict to also re-validate all prior steps.
- validate-proved when final-status is PROVED, verifies the 10-item Self-Check Checklist is fully ticked.
references/angles-by-type.md: technique ideas by problem type.
references/grading-taxonomy.md: detailed flaw taxonomy for audits.
references/phase-4-audit.md: heavier independent-review protocol.
references/output-formats.md: optional Markdown/LaTeX formatting details + the PROVED Self-Check Checklist template.
references/claim-memory.md: three-layer memory architecture and ESE/IVE reflection protocols. Read only in deep-reflection mode.
references/model-tier.md: per-tier parameter table (Haiku / Sonnet / Opus). Read once at the start of Phase 0 to set K, internal-rounds, audit passes, and abstain thresholds for the active model.