research - SKILL.md Agent Skill

name: research description: > Exhaustive investigative research that definitively answers questions or explores topics. Combines deep Socratic questioning (grill-me style) with parallel agent dispatch for information gathering, producing a referenced document with evidence-backed findings. Use when the user says "research", "investigate", "deep dive", "find out about", "what do we know about", "how should I build", "what's the best approach to", asks an open-ended knowledge question, wants to evaluate options outside the codebase, needs a definitive answer to a complex question, or wants a research-driven implementation specification. Do NOT use for codebase exploration or code review.

Deep Research

Exhaustive, investigative research that produces definitive, evidence-backed answers — or, when the question demands it, a complete implementation specification informed by that research.

Two Output Modes

Determine the output mode during Phase 1 questioning:

Mode	Trigger	Output
Investigative Report	"Which X is best?", "Does X work?", "What's the evidence for X?", comparative evaluation, fact-finding	Verdict + evidence chain + recommendations
Research-Driven Specification	"How should I build X?", "What's the right architecture for X?", "Implementation spec for X" — any question where the answer is a design informed by research	Full implementation spec with architecture, code, schemas, test criteria, staged roadmap

Both modes use the same three-phase process. The difference is in Phase 3 synthesis depth.

Process Overview

Three phases — question, gather, synthesize:

Deep Grill — Relentless questioning to decompose the research request into concrete sub-questions with clear success criteria
Parallel Dispatch — Spawn research agents to investigate each sub-question independently using web search and document fetching
Synthesis — Integrate findings into a single referenced document with evidence chain, contradictions surfaced, and uncertainty flagged

Phase 1: Deep Grill (5-10 questions)

Interview the user relentlessly about their research question until you have a fully decomposed, actionable research plan. Ask questions one at a time. For each question, provide your recommended answer based on context.

Questioning Protocol

Work through these angles (adapt order to the topic):

Scope — "What specifically needs answering? What's in scope vs out?"
Success criteria — "What would a definitive answer look like? What evidence would convince you?"
Context — "Why does this matter right now? What decision does this inform?"
Current state — "What's your current approach/setup? What metrics are you seeing?" (This is critical — the research must position findings against the user's actual baseline, not just abstract comparisons.)
Prior knowledge — "What do you already know or suspect? What have you already ruled out?"
Source credibility — "What sources would you trust? Academic papers? Industry reports? Official docs? Practitioner experience?"
Sub-questions — "Are there specific angles, facets, or sub-questions to explore?"
Contradictions — "Are there competing claims or controversies you've encountered?"
Constraints — "Any time pressure, decision deadline, or scope limit?"
Output mode — "Is this a fact-finding investigation (verdict + evidence) or do you need an implementation specification (architecture + code + roadmap)?"
Adversarial check — "What would make this research useless? What blind spots should we watch for? Are there assumptions we should challenge?" (Including the user's own assumptions — extraordinary claims about current performance deserve scrutiny, not acceptance.)

Rules for Phase 1

One question per message. Wait for the answer before continuing.
Provide your recommended answer with each question so the user can just confirm or redirect.
If a question can be answered by quick web search, do the search instead of asking.
Stop when you have 3-6 concrete sub-questions with clear success criteria.
Summarize the decomposed research plan before moving to Phase 2. Get explicit confirmation.

Phase 2: Parallel Research Dispatch

Spawn 2-4 agents to investigate sub-questions in parallel. All Agent calls in a single message for concurrent execution.

Agent Configuration

Each agent gets:

subagent_type: general-purpose (has WebSearch and WebFetch access)
model: sonnet for breadth searches, opus for complex analysis
Specific sub-question to answer
Credibility criteria from Phase 1
Output requirements (see prompt template below)

Agent Prompt Template

Research the following question exhaustively. Your goal is a definitive, evidence-backed answer — not a survey of opinions.

**Question:** [specific sub-question]

**Context:** [why this matters, what decision it informs]

**Credibility criteria:** [what sources to trust, what to be skeptical of]

**Instructions:**
1. Use WebSearch to find current, authoritative information. Search multiple angles and phrasings — don't stop at the first result. Try at least 3-4 different search queries per sub-question.
2. Use WebFetch to read primary sources in full (official docs, papers, reports, original articles). Don't rely on summaries or secondary citations. When a blog cites a paper, fetch the paper.
3. For each claim, record: the exact claim, source URL, author name, publication name, date, page/section if available, and your confidence level (High/Medium/Low).
4. **Quote verbatim** when the source's exact words matter — especially warnings, design intent statements, or quantitative claims. Use: *"Author states: '[exact quote]'" (Source, Date, p. X)*
5. **Report specific numbers** — never say "good performance." Say "CAGR 7.1%, win rate 57%, max DD -15%, 110 trades, 2004-2026 on GLD."
6. Actively look for contradictions, disagreements, and nuance. When Source A says X and Source B says not-X, report both with full attribution.
7. **Negative results are findings.** "I found no published backtest in any quant blog, academic source, or traders.com archive" is extremely valuable. State what you searched and didn't find.
8. Look for at least 8-12 distinct sources per sub-question. Prefer primary sources (original papers, official docs, author statements) over secondary (blogs summarizing others' work).
9. Check for **implementation drift** — does the formula differ across platforms? Are there known bugs? Note discrepancies.
10. Note the **absence of institutional usage** as a finding — if no fund prospectus, Form ADV, or institutional whitepaper names something, say so.

**Output format:**
## Findings: [sub-question]

### Key Answer
[1-3 sentence definitive answer. If inconclusive, state what evidence would be needed to resolve it.]

### Detailed Analysis
[Full narrative analysis with inline citations. Include verbatim quotes for key claims. Organize by sub-angle if the question has multiple facets. State negative results explicitly.]

### Evidence Table
| # | Claim | Source (Author, Publication, Date) | URL | Confidence | Notes |
|---|-------|-------------------------------------|-----|------------|-------|
| 1 | [specific claim] | [Author, Publication, Date, p.X] | [URL] | High/Med/Low | [caveats] |

### Contradictions & Counter-Evidence
[Specific contradictions between sources. Which is more credible and why. What would resolve the disagreement.]

### Sources Consulted
[Numbered list: Author, "Title," Publication, Date — URL — what this source contributed to the analysis]

### Gaps & Negative Results
[What you searched for and could not find. What databases/archives returned nothing. What this absence implies.]

Dispatch Rules

Maximum 4 agents. Group related sub-questions if needed.
Each agent must be self-contained — provide all context in the prompt.
Assign the hardest/most ambiguous sub-question to opus model.
If a sub-question is simple factual lookup, consider answering it yourself instead of dispatching.

Phase 3: Synthesis & Document

After all agents return, synthesize findings into a single research document.

Synthesis Protocol

Read all agent outputs — identify agreements, contradictions, and gaps
Cross-reference — where multiple agents touched the same topic, compare their findings
Resolve contradictions — if sources disagree, note both sides and assess which is more credible
Identify gaps — what wasn't answered? What needs further investigation?
Write the document — follow the output template below

Output Document Template

Write to docs/research/YYYY-MM-DD-<topic-slug>.md (create docs/research/ if needed):

# Research: [Topic Title]

## TL;DR
- **[Bold verdict — one sentence definitive answer or "None of X has Y" negative finding]** followed by the key implication for the user's decision.
- [2-4 additional bullet points with the most actionable findings, each bolded lead-in + plain text detail]
- [Final bullet: concrete recommendation in imperative form]

## Key Findings

| [Candidate/Option] | [Criterion 1] | [Criterion 2] | [Criterion 3] | Verdict |
|---|---|---|---|---|
| **Option A** | [specific data] | [specific data] | [specific data] | **[Yes/No/Maybe — reason]** |
| **Option B** | ... | ... | ... | ... |

### A. [First research angle — e.g., "Academic and quant-literature evidence"]

[Deep analysis with inline citations including author, publication, date, page where available. Quote verbatim when the source's exact words matter — e.g., *"The author explicitly states: '[exact quote]'"*]

[State negative results as findings: "I found no published backtest anywhere" is valuable.]

[Include specific numbers: CAGR, win rate, Sharpe, drawdown, sample size, date range.]

### B. [Second research angle — e.g., "Industry/practitioner usage"]

[Same depth. Cross-reference with section A findings.]

### C. [Third angle]

[...]

### D. [Parameters/thresholds/specifics if applicable]

[Concrete actionable details: default values, recommended ranges, calibration notes]

### E–H. [Additional angles as needed]

[Continue lettered sections for each distinct research facet]

## Comparison Context

| [Factor] | [Option 1] | [Option 2] | [Baseline/Current] | ... |
|---|---|---|---|---|
| [specific metric] | [data] | [data] | [data] | ... |

[Comparative analysis positioning findings against the user's current approach or known baselines]

## Recommendations (staged, with triggers)

1. **Immediate ([timeframe]).** [Specific action.] **Trigger to proceed:** [measurable threshold]. **Trigger to halt:** [condition that means stop, revert, or rethink].
2. **Short term ([timeframe]).** [Next action with specific parameter ranges to sweep.] **Trigger to commit:** [OOS threshold across all CV folds / all test conditions].
3. **Medium term ([timeframe]).** [Larger action.] **Trigger to abandon:** [what failure looks like — be specific enough to prevent sunk-cost thinking].
4. **Do not [X].** [What to avoid and why — cite the specific source/finding that contraindicates it.]
5. **Benchmarks that would change these recommendations:** [Specific conditions under which the whole approach should be revisited — e.g., "if event masking removes ≥30% of labels, switch from drop to down-weight."]

## Caveats

- **[Source quality concern].** [Specific issue with the evidence — e.g., paywalled rules, single-author blog, image-only results not extractable.]
- **[Contested claim].** [Specific claim that appears strong but has counter-evidence or methodological weakness.]
- **[Implementation risk].** [Where formula drift, platform differences, or known bugs exist.]
- **[Generalization risk].** [Where results from one domain/timeframe/instrument may not transfer.]
- **[Missing evidence].** [What you looked for and could not find — stated as a meaningful gap.]
- **[User's own assumptions challenged].** [If the user's current approach or reported metrics are extraordinary, state what's more likely true and what audit is needed.]
- **[Frontier of knowledge].** ["No published paper directly solves X" — state what the closest work is and what gap remains.]

## Completion Table

| Research item | Status | Section |
|---|---|---|
| [Sub-question 1 from Phase 1] | Done | §A, §D |
| [Sub-question 2] | Done | §B, §E |
| [Sub-question 3] | Partial — [what remains] | §C |

## Sources

1. [Author, "Title," *Publication* Volume(Issue):Pages, Date, DOI](URL) — [what this source contributed]
2. ...
[Prefer full academic citation format: Author (Year), "Title," *Journal* Volume(Issue):Pages, DOI. Include page numbers when quoting.]

## Methodology

[What was searched, how many sources consulted, what was excluded and why, what tools/databases were used, total research scope]

Specification Mode Template (when output mode = Research-Driven Specification)

When the research question is "how should I build X?" — the output goes beyond findings into a complete implementation blueprint. Write to docs/research/YYYY-MM-DD-<topic-slug>-spec.md:

# [Topic]: Implementation Specification (v1.0, [Date])

## TL;DR
- **[Bold architectural verdict with justification from research]** — cite the authoritative source that clinches the decision. Quote verbatim when the source's exact words settle the argument.
- [2-4 bullets: key design decisions with brief "why" from research findings]
- [Final bullet: the recommended approach in imperative form]

## Executive Summary
[3-5 paragraphs: what the system does, its three logical layers/components, and estimated build effort]

## 1. Problem Statement and Motivation
[Why this exists. What gap it fills. What goes wrong without it. Include a table showing what's orthogonal to what — "Concept | Question it answers | Tooling"]

## 2. Conceptual Framework
[The mental model. Definitions. How this relates to existing concepts in the user's domain. Mathematical notation if applicable.]

## 3. Comparative Analysis (from research)
[Full comparison table of approaches/tools/methods with: theoretical basis, cost, look-ahead risk, suitability. This is where the investigative research findings land — every row must have citations.]

### Recommended approach
[Which option and why, with specific thresholds and parameters justified by the research]

## 4. Detailed Methodology
[The algorithm/process in full detail. Input contracts. Scoring catalogues. Pass/fail criteria with specific numeric thresholds.]

## 5. Integration with Existing Stack
[How this connects to what already exists. Code examples showing the API surface. Configuration that references real tools/libraries found in research.]

## 6. Build vs. Buy Analysis
| Need | Existing tool | Gap |
|---|---|---|
[For each component: what exists, what it does well, what's missing that justifies custom build]

**Gap justifying custom build:** [Explicit statement of what no off-the-shelf package does]

## 7. Ecosystem/Feasibility Assessment
[If the research uncovered platform/language/framework considerations — e.g., "Swift has no maintained HMM package" — present the ecosystem reality table here with specific package names, star counts, last-commit dates, and maintainer quotes.]

## 8. Implementation Specification
### Architecture diagram (ASCII)
### Directory layout
### Data contracts (input/output schemas)
### Configuration schema (with defaults and validation)
### Core API surface (function signatures with types)
### Test cases and acceptance criteria (numbered, specific, measurable)
### Performance & scalability targets (table: operation | target)
### Logging and observability

## 9. Output Schema
[If the system produces structured output: full JSON/YAML schema with example values and grading/scoring rubric]

## 10. Recommendations and Implementation Roadmap
### Stage 1 (Week N) — [Name]
- [Specific deliverables]
- **Decision benchmark to proceed:** [measurable threshold]

### Stage 2 (Week N) — [Name]
- [...]
- **Decision benchmark:** [...]

### Triggers that would change the recommendation
- [Specific conditions under which the architecture should be revisited]

## 11. Caveats
- **[Specific caveat].** [Why it matters, what it means for the user, how to mitigate]
[Same depth as investigative mode — no generic warnings]

## 12. References and Further Reading
- [Author (Year). *Title*. Publisher. — what this source contributed]
- [Full academic citation format: Author, "Title," *Journal* Volume(Issue):Pages.]
[Include all sources that informed architectural decisions, not just the ones that were "interesting"]

When to use Specification Mode

Use specification mode when ALL of these are true:

The research question asks "how" not just "what" or "which"
The answer requires architectural decisions
The user will build something based on this document
Multiple components/tools must be assembled into a coherent system

The specification must still be research-driven — every design decision cites the evidence that justifies it. An implementation spec without citations is just an opinion dressed up as engineering.

Output Quality Standards

The document must meet ALL of these before delivery:

TL;DR is a real verdict — not a wishy-washy "it depends." State the answer, even if the answer is "the evidence doesn't support this."
Negative results are findings — "I found no published backtest" or "no institutional disclosure names X" is stated explicitly and treated as evidence.
Verbatim quotes — when a source's exact words matter (especially warnings, caveats, or design intent), quote them with author attribution.
Specific numbers — never say "good performance" when you can say "CAGR 7.1%, win rate 57%, max DD -15%, 110 trades over 2004-2026."
Every factual claim has a citation — author, publication, date at minimum; page number when available.
Contradictions surfaced inline — don't hide disagreements in a separate section. When Source A says X and Source B says the opposite, juxtapose them at the point of analysis.
Staged recommendations with benchmarks — never "consider X." Instead: "Do X, measure Y, proceed only if Z > threshold."
Caveats are specific — not "results may vary" but "this was tested on GLD 2004-2026; the author explicitly warns it fails on S&P 500/Nasdaq 100."
Sources are diverse — flag if evidence comes primarily from one author/site.
Confidence is honest — don't overstate. "The only public backtest with numbers..." is more useful than implying broad validation.

Additional Standards for Specification Mode

Every design decision traces to research — if you recommend GARCH(1,1), cite the source that validates it for this use case. "I chose X because Y" where Y is unsourced is not acceptable.
Build-vs-buy is explicit — for every component, state what exists, what's missing, and why custom code is justified (or isn't).
Ecosystem claims are specific — not "Swift has limited ML support" but "GitHub topic search returns only Python/R/Julia/C++ HMM packages; the only Swift hit is a 2021 toy Markov chain with 5 stars and 1 commit."
Code examples are real — use actual library APIs (verified via research), not pseudo-code. Include import paths and version constraints.
Test criteria are numbered and measurable — "Hurst recovers known H ∈ {0.3, 0.5, 0.7} within ±0.05 on 5000-point fBm series" not "test that Hurst works."
Staged roadmap has decision gates — every stage ends with a specific "benchmark to proceed" threshold. No open-ended "continue when ready."
Performance targets are tables — operation | target | rationale.
Architecture diagrams use ASCII — no external image dependencies.

Rules

Evidence Standards (non-negotiable)

Never speculate. If evidence doesn't exist, say so. "I found no evidence for X" is a valid and valuable finding.
Always cite. Every factual claim needs author, publication, date at minimum. Prefer full academic format: Author (Year), "Title," Journal Volume(Issue):Pages, DOI. Page numbers when quoting.
Quote the source that clinches it. When an authoritative source's exact words settle a debate, quote verbatim with full attribution. This is the research skill's highest-value output — the definitive quote that ends the discussion.
Absence of evidence is evidence. "No maintained Swift HMM package exists (searched GitHub, SPM registry, CocoaPods, Awesome-Swift lists)" or "no peer-reviewed head-to-head comparison on intraday /ES exists" — state what you searched and didn't find, and explain why the absence matters.
Negative results are findings. "I found no published backtest in any quant blog, academic source, or industry archive" is extremely valuable. Treat it with the same weight as a positive finding.
Specific numbers always. Never say "good performance" when you can say "CAGR 7.1%, win rate 57%, max DD -15%, 110 trades over 2004-2026 on GLD." Never say "widely used" when you can say "28k GitHub stars, 8.2M npm downloads/week."

Intellectual Honesty

Be adversarial. Actively seek disconfirming evidence. Research that only finds supporting evidence is suspect.
Challenge the user's own assumptions. If the user's reported metrics are extraordinary (e.g., 84.8% win rate, PF 11.31), state plainly that this is in territory where leakage/overfitting is more plausible than true edge, and recommend the specific audit. Don't be sycophantic with data.
Flag staleness. Sources older than 2 years get a [Stale] marker. Note when a field moves fast.
Separate fact from interpretation. Data is data. Your synthesis of what it means should be clearly labeled as interpretation.
State the frontier of knowledge. When no paper or tool directly solves the user's exact problem, say so. Name the closest work and identify the specific gap. "This is a tractable but unsolved research problem" is a valid and useful conclusion.
Note when evidence doesn't transfer. A finding on daily S&P 500 data 1988–2011 does not automatically apply to intraday /ES 3-min bars in 2026. State the domain, instrument, timeframe, and date range of every cited result and flag transfer risks explicitly.

Practical Standards

No padding. If the answer is simple and well-supported, the document can be short. Don't add filler to seem thorough.
Respect scope. Answer what was asked. Note adjacent interesting findings briefly but don't expand scope without permission.
Position the user's current approach against findings. If the user has an existing setup, include a "Your current" column in comparison tables. Show them exactly where they stand relative to the evidence.
Recommendations must have fail-fast triggers. Every staged recommendation needs both a "trigger to proceed" and a "trigger to halt/revert." The user must know when to stop, not just when to continue.
Cross-reference between sections. Use "(see §N)" or "(see section on X)" to connect related findings. The document should read as a coherent argument, not a list of isolated facts.
Include a completion table. At the end, map each sub-question from Phase 1 to its resolution status and location in the document. The user should be able to verify that every angle was covered.