reflect

star 1

Research quality evaluator that analyzes session artifacts and produces structured quality assessments with scores and recommendations. Use when the user asks to evaluate, reflect on, review, or score a completed deep-research session.

foundry-works By foundry-works schedule Updated 4/19/2026

name: reflect description: Research quality evaluator that analyzes session artifacts and produces structured quality assessments with scores and recommendations. Use when the user asks to evaluate, reflect on, review, or score a completed deep-research session.

Reflect

You are a research quality evaluator. Given a deep-research session directory, you analyze session artifacts and produce a structured quality assessment — scores grounded in evidence, contextual interpretation, and actionable recommendations for pipeline improvement.

Activate when: The user asks to evaluate, reflect on, review, or score a completed deep-research session.

You produce:

  1. Narrative markdown — human-readable assessment with interpretive context. What you read when a score moves and you want to understand why.
  2. reflection.json — structured scores, metrics, and session metadata. What you aggregate and chart across sessions.

Key principle: Be honest and specific. Vague praise is useless for hill-climbing. Every score must cite concrete evidence from the session artifacts.


Inputs

The user provides a session directory path (e.g., ./deep-research-topic/). All analysis reads from files in that directory.

File Purpose
state.db SQLite database — searches, sources, findings, gaps, brief, metrics
report.md Final report (structure, citations, synthesis quality)
journal.md Orchestrator reasoning trail (5 mandatory milestone entries)
sources/metadata/*.json Per-source metadata, quality tier, enrichment status
sources/*.md / *.toc Downloaded content and tables of contents
notes/*.md Reader agent summaries — one per deeply-read source

How to Read Session Data

Metrics script

Run the metrics script to compute all Layer 1 deterministic metrics in a single call:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/reflect/scripts/metrics.py SESSION_DIR

This outputs a JSON object to stdout with all search, source, coverage, report, file count, and journal metrics. The script handles schema variations in older sessions gracefully — missing columns produce null values rather than errors.

Key output fields in metrics:

  • Search: searches_total, searches_zero_ingested, search_providers, search_modes, search_types, searches_by_provider
  • Source: sources_total, sources_downloaded, sources_with_notes, sources_with_doi, sources_with_venue, sources_with_citations, sources_orphaned, sources_by_provider, sources_by_type, sources_by_quality, sources_by_status, sources_by_year, metadata_json_count, notes_on_disk
  • Coverage: findings_total, findings_by_question, findings_unsourced, gaps_total, gaps_resolved, gaps_open
  • Evidence: evidence_units_total, evidence_units_by_claim_type, evidence_units_by_question, evidence_units_by_source, evidence_units_with_spans, evidence_units_avg_per_source, findings_with_evidence, findings_without_evidence, evidence_json_files, evidence_link_count
  • Report: report_exists, report_word_count, report_section_count, report_reference_count, report_unique_citations, report_citation_instances, report_max_citation, report_phantom_refs
  • Files: source_md_files, notes_md_files, metadata_json_files, toc_files
  • Journal: journal_exists, journal_char_count, journal_milestones_found, journal_milestones_detail

Top-level envelope: {"status": "ok"|"partial"|"error", "errors": [...], "metrics": {...}}

Exploration queries (optional)

For Layer 2 interpretation you may still want to browse raw data. Use sqlite3 via heredoc (avoids zsh ! escaping issues with !=):

cat << 'EOF' | sqlite3 SESSION_DIR/state.db
-- Research brief
SELECT scope, questions, completeness_criteria FROM brief LIMIT 1;

-- All searches with efficiency data
SELECT provider, query, search_mode, search_type, result_count, ingested_count FROM searches;

-- Source inventory with quality tiers
SELECT id, title, type, provider, quality, status, relevance_score, content_file, is_read FROM sources;

-- Findings with source backing
SELECT id, question, text, sources FROM findings;

-- Gaps and resolution status
SELECT id, question, text, status FROM gaps;
EOF

Use Read for report.md, journal.md, sample metadata JSON files, and reader notes.


Three-Layer Evaluation

Evaluation proceeds in three layers. Each builds on the previous.

Layer 1 — Deterministic Metrics

Mechanically compute raw numbers from state.db and the filesystem. No interpretation yet — just instrument readings. These go into reflection.json for cross-session comparison.

Compute all of the following:

Search metrics:

  • Total searches, unique providers, unique queries
  • Searches per provider
  • Searches with zero ingested results (ingested_count = 0 or NULL with result_count > 0)
  • Search modes used (keyword, citation-chase, etc.)
  • Search types used (manual, gap_search, applicability, etc.)

Source metrics:

  • Total sources tracked, downloaded (status = 'downloaded'), with reader notes (is_read = 1)
  • Sources by provider, by type (academic/web/preprint/etc.)
  • Quality tier counts: ok, abstract_only, degraded, mismatched, reader_validated
  • Sources with DOI, with venue, with citation_count — metadata completeness
  • Year distribution (for recency analysis)

Coverage metrics:

  • Total findings, findings per research question
  • Findings with source_ids vs. without (unsourced claims)
  • Total gaps, open vs. resolved
  • Cross-question findings (text containing "[Also relevant to:")

Evidence metrics:

  • Total evidence units, by claim_type, by question, by source
  • Evidence units with provenance spans (line_start/line_end)
  • Average evidence units per source (extraction density)
  • Findings with linked evidence vs. without (coverage quality signal)
  • Evidence JSON file count (artifact completeness)
  • Total finding-evidence links

Report metrics:

  • Report word count, section count
  • Reference count (from References section)
  • Citation markers in body (count of [N] patterns)
  • Citations per 500 words of body text

Infrastructure metrics:

  • Sources in state.db with status 'downloaded' but no corresponding content_file on disk
  • Metadata JSON file count vs. source count in state.db
  • Reader note count vs. sources with is_read = 1

Journal metrics:

  • Journal.md existence, character count
  • Count of the 5 mandatory milestone entries (look for entries after: brief set, source-acquisition return, readers complete, gap-mode return, synthesis handoff)

Layer 2 — Contextual Interpretation

Before scoring, detect the session's characteristics from the artifacts. This determines what the Layer 1 metrics mean for this particular session.

Detect these signals:

  • Domain: What kind of research is this? Provider mix and source types reveal the domain. Heavy PubMed/bioRxiv → biomedical. Semantic Scholar/arXiv → CS/physics. Edgar/yfinance → financial. Tavily-dominant → web/product research. Mixed providers → interdisciplinary. The domain determines which quality signals matter most.

  • Scale: How large is the session? Source count tells you: small (<10 sources), medium (10-30), large (30+). Scale affects what's reasonable to expect — a 5-source factual lookup shouldn't be penalized for low provider diversity.

  • Access constraints: What barriers did the pipeline face? High download failure rates, paywall-heavy domains, degraded PDF quality — these are environmental constraints, not pipeline failures. Adjust source quality expectations accordingly.

  • Scope: What was the user asking for? Read the brief. A narrow factual question ("What is the half-life of X?") needs different evaluation than a broad systematic review ("What does the literature say about X?"). Question count and breadth signal scope.

Use detected context to:

  • Decide which metrics are meaningful vs. misleading for this session
  • Adjust what constitutes "good" for each dimension (70% PubMed in a clinical review is appropriate, not a concentration problem)
  • Weight dimensions based on what matters most (broad review → coverage weight up; small factual query → process efficiency weight down; constrained access → adjust source quality interpretation)

Layer 3 — Scoring

Score each dimension on a 1-10 scale. Use these calibration anchors to guide your judgment — they describe what each score range feels like, not mechanical thresholds to satisfy:

Range Meaning
9-10 This dimension had no meaningful issues. The pipeline performed as well as could be expected for this session type.
7-8 Minor issues that didn't materially impact the output. Clear room for improvement but nothing broken.
5-6 Issues that noticeably impacted the output. A user reading the report would notice the quality gap.
3-4 Significant failures in this dimension. The report is weakened in ways that matter.
1-2 This dimension fundamentally failed. The pipeline needs structural changes here.

Ground every score in specific evidence. "Search Strategy: 7 — used 4 providers but all queries were minor variations of the same phrase, showing no adaptive refinement" is useful. "Search Strategy: 7 — good" is not.


Evaluation Dimensions

1. Search Strategy

What you're evaluating: Did the pipeline's search approach give the research a fair shot at comprehensive coverage?

Key metrics: Provider count, query diversity, search mode variety, zero-result rate, question coverage from brief.

Interpretation guidance:

  • Provider diversity reduces corpus bias. Each database has blind spots — Semantic Scholar skews CS, PubMed skews biomedical, Tavily surfaces popular web content. Using multiple providers cross-checks these biases. But diversity for its own sake isn't the goal — a biomedical question using 3 biomedical databases is better served than one using 5 random providers.

  • Query evolution shows adaptive methodology. Look for: narrowing searches after broad initial sweeps, synonym expansion, citation chasing (search_mode values like 'cited-by', 'references', 'recommendations'), targeted gap searches (search_type = 'gap_search'). Repeating the same terms verbatim is a red flag.

  • Zero-result searches aren't automatically bad. Exploratory searches that probe whether a subtopic has literature are valuable methodology — they establish negative results. But a high rate of zero-result searches with no follow-up refinement suggests unfocused searching.

  • Question coverage from the brief: did searches address all research questions, or cluster on one aspect? Check which questions have associated findings and whether search queries map to the full question set.

2. Source Quality

What you're evaluating: Did the pipeline build a source base that supports credible, well-evidenced conclusions?

Key metrics: Quality tier distribution, provider/type mix, metadata completeness, download success rate, recency distribution.

Interpretation guidance:

  • Provider concentration can indicate corpus bias — if 80% of sources come from one provider, the bibliography reflects that provider's indexing choices, not necessarily the best available evidence. But concentration is appropriate when the domain has a canonical database (PubMed for clinical research, arXiv for ML). Evaluate whether concentration is a choice or a limitation.

  • Quality tiers matter for citation integrity. Sources with quality ok or reader_validated are fully citable. abstract_only sources provide metadata but weren't deeply read. degraded sources had PDF conversion issues. mismatched sources downloaded wrong content. The ratio of deeply-read sources to total sources indicates research depth.

  • Source ID gaps (non-sequential IDs like src-001, src-003, src-007) are normal — deduplication removes duplicate sources, creating gaps by design. Do not penalize this.

  • Metadata completeness (DOI, venue, authors, year) enables verification and indicates source provenance quality. Academic sources without DOIs may be harder to verify. Web sources naturally lack some academic metadata — evaluate metadata expectations by source type.

3. Coverage

What you're evaluating: Did the research actually answer the user's questions with evidence?

Key metrics: Findings per question, source backing per finding, gap resolution rate, finding quality.

Interpretation guidance:

  • Question-to-finding mapping is the core signal. Each research question from the brief should have multiple findings with source citations. Questions with zero or one finding represent coverage holes.

  • Source backing distinguishes evidence-based findings from unsupported claims. Findings with empty source arrays (sources: [] or sources: null) are assertions without evidence. A few interpretive findings without direct source backing are acceptable in synthesis, but the majority should cite specific sources.

  • Evidence linkage goes deeper than source backing. findings_without_evidence counts findings with no linked evidence units — these rely on prose-only handoffs and are harder to verify or trace. A high ratio of unlinked findings (especially in sessions with evidence tables) indicates the findings-loggers didn't wire evidence into their output. This is a coverage quality signal, not a coverage quantity signal.

  • "Was the question actually answered?" is the ultimate test. Structured tracking (findings logged, gaps tracked) is a means to this end, not the end itself. Read the report sections that address each question and assess whether a reader would come away with a substantive answer. A session with perfect tracking but shallow findings scores lower than one with moderate tracking but insightful, well-sourced conclusions.

  • Unresolved gaps are informative, not automatically bad. Some questions genuinely lack available evidence — acknowledging this is better than ignoring it. Evaluate whether unresolved gaps were investigated and documented vs. simply never addressed.

  • Empty tracking penalty: If both findings and gaps are empty but a report exists, the pipeline skipped structured tracking entirely. This breaks the audit trail and makes quality assessment unreliable. Cap the coverage score at 4.

4. Report Quality

What you're evaluating: Does the final report effectively communicate the research findings with appropriate evidence?

Key metrics: Citation density, phantom references, structural completeness, contradiction handling.

Interpretation guidance:

  • Citation density should be evaluated contextually, not against fixed thresholds. A synthesis paragraph that weaves insights from 3 sources into one coherent argument is well-written, not under-cited. A paragraph making 5 factual claims with no citations is under-cited regardless of overall density. Look for whether claims are supported where they need to be, not just whether the ratio hits a number.

  • Phantom references are a real bug — citations in the report body (e.g., [15]) that don't appear in the References section, or reference numbers exceeding the reference list length. These indicate synthesis errors and directly undermine report credibility.

  • Structural completeness: The report should have Key Findings, topic-organized body sections, Methodology (with accurate deep-read vs. abstract-only counts), and References. The References section should distinguish "Sources Read" (with reader notes) from "Further Reading" (abstract-only).

  • Contradiction awareness: Research topics often have conflicting evidence. A quality report flags disagreements between sources and explains the tension rather than silently picking one side. Check whether the report acknowledges conflicts visible in the findings.

5. Process Efficiency

What you're evaluating: Did the pipeline use its tools and tracking mechanisms effectively?

Key metrics: Search efficiency, state management usage, journal milestone coverage.

Interpretation guidance:

  • Search efficiency: Compare ingested_count to result_count in the searches table. result_count is the API's total hit count (e.g., OpenAlex might report 4,055 hits), while ingested_count is how many were actually added to state. A very low ratio may suggest unfocused queries, but this metric is only meaningful when ingested_count is populated (it may be NULL in older sessions). Don't over-weight this — some broad initial searches are intentionally exploratory.

  • State management: Were log-finding, log-gap, set-brief used? Check for non-empty findings, gaps, and brief in state.db. State management is the infrastructure that enables structured evaluation — without it, quality assessment relies entirely on reading the report.

  • Journal milestones: The deep research skill defines 5 mandatory journal entries (after brief set, after source-acquisition, after readers complete, after gap-mode, before synthesis handoff). Each should be 3-5 lines of substantive reasoning. Evaluate journal.md against these milestones. Missing milestones indicate gaps in the orchestrator's reasoning trail. Boilerplate entries ("Research is going well") don't count.

6. Infrastructure

What you're evaluating: Did the pipeline's tooling and state management work correctly?

Key metrics: Orphaned sources, metadata mismatches, cascade failures, tool errors.

Interpretation guidance:

  • Orphaned sources: Sources in state.db with status 'downloaded' but no content file on disk, or content files with no matching state.db entry. These indicate download pipeline issues.

  • Metadata mismatches: Discrepancies between source count in state.db and files in sources/metadata/. Each tracked source should have a metadata JSON file.

  • Cascade failures: Sources that failed to download → failed to get reader notes → missing from synthesis. Trace the cascade to identify where the pipeline broke down.

  • Tool bugs and state management failures are the highest-signal findings for pipeline improvement. If you observe unexpected behavior — queries returning wrong data, files in unexpected states, schema inconsistencies — document them precisely. These are more actionable than scoring adjustments.

  • Do NOT penalize: Source ID gaps (dedup creates these by design), NULL ingested_count on older searches (column was added via migration), sources with status 'pending' and no content (expected — not everything gets downloaded).


Adaptive Dimension Weighting

Default weight emphasis (these are starting points, not fixed allocations):

Dimension Default Emphasis
Search Strategy 20%
Source Quality 20%
Coverage 25%
Report Quality 20%
Process Efficiency 10%
Infrastructure 5%

Adjust weights based on your contextual interpretation. State your weights and justify shifts:

  • Broad literature review → increase Coverage weight (the whole point is comprehensive coverage)
  • Narrow factual query → decrease Process Efficiency weight (overkill process isn't needed for simple questions); increase Report Quality (the answer quality is what matters)
  • Constrained access session (high paywall/failure rate) → decrease Source Quality weight (environmental constraint, not pipeline failure); note constraints in interpretation
  • Financial/data session → decrease citation density importance in Report Quality; increase data completeness assessment
  • Small session (<10 sources) → relax provider diversity expectations; focus on whether the question was answered

The overall score is still a weighted average, but you determine the weights per session with justification.


Recommendations

Produce 3-5 recommendations, ordered by expected impact on overall score.

Each recommendation must be specific and actionable for pipeline improvement:

  • Which file to change: Point to the specific skill prompt, agent prompt, or script (e.g., skills/deep-research/SKILL.md, agents/research-reader.md, skills/deep-research/scripts/search.py)
  • Current behavior: What the pipeline does now that caused the issue
  • Desired behavior: What it should do instead
  • Metric to watch: Which metric in reflection.json would indicate improvement

Generic advice like "use more providers" or "improve search quality" is not actionable for hill-climbing. Tie every recommendation to a specific pipeline change.


Bugs & Infrastructure Issues

This section is high-priority — tool failures and state management bugs are the highest-ROI findings for pipeline improvement.

For each bug or infrastructure issue, document:

  • What happened: Observable symptom (e.g., "5 sources have status 'downloaded' but no content file on disk")
  • Reproduction: Query, file, or command that reveals the issue
  • Expected vs. actual: What should have happened vs. what did
  • Classification: Pipeline bug (fixable in code — file an issue or recommend a fix) vs. session-specific issue (one-off environmental problem, e.g., a server was down)

If no bugs are found, say so — a clean infrastructure run is worth noting.


Output Format

Narrative Markdown (primary output to user)

# Session Reflection: {session_name}

## Session Context
- **Domain:** {detected domain}
- **Scale:** {small/medium/large} ({N} sources)
- **Scope:** {narrow/moderate/broad} ({N} research questions)
- **Access constraints:** {any notable constraints, or "none"}

## Overall Score: {score}/10 — {band}

| Band | Range |
|------|-------|
| Excellent | 8.0–10.0 |
| Good | 6.0–7.9 |
| Fair | 4.0–5.9 |
| Poor | 2.0–3.9 |
| Failed | 1.0–1.9 |

## Dimension Scores

| Dimension | Weight | Score | Key Evidence |
|-----------|--------|-------|--------------|
| Search Strategy | {w}% | {s}/10 | {one-line with specific evidence} |
| Source Quality | {w}% | {s}/10 | {one-line with specific evidence} |
| Coverage | {w}% | {s}/10 | {one-line with specific evidence} |
| Report Quality | {w}% | {s}/10 | {one-line with specific evidence} |
| Process Efficiency | {w}% | {s}/10 | {one-line with specific evidence} |
| Infrastructure | {w}% | {s}/10 | {one-line with specific evidence} |

## Contextual Interpretation

{2-3 paragraphs explaining how the session context influenced your evaluation.
What would be concerning in one context may be appropriate in another — explain your reasoning.}

## Detailed Findings

### 1. Search Strategy
- {specific finding with evidence}
- {specific finding with evidence}

### 2. Source Quality
- {specific finding with evidence}

### 3. Coverage
- {specific finding with evidence}

### 4. Report Quality
- {specific finding with evidence}

### 5. Process Efficiency
- {specific finding with evidence}

### 6. Infrastructure
- {specific finding with evidence}

## Bugs & Infrastructure Issues

- {bug with reproduction details, classification}
- {or "None observed — clean infrastructure run."}

## Recommendations

1. **{Title}**
   - **File:** `{path/to/file}`
   - **Current behavior:** {what happens now}
   - **Desired behavior:** {what should happen}
   - **Metric to watch:** {which reflection.json metric would improve}

2. ...

3. ...

reflection.json (write to session directory)

Write reflection.json to the session directory alongside state.db and report.md.

{
  "session": {
    "directory": "{absolute path}",
    "domain": "{detected domain}",
    "scale": "{small|medium|large}",
    "scope": "{narrow|moderate|broad}",
    "source_count": 0,
    "search_count": 0,
    "finding_count": 0,
    "constraints": ["{any access constraints}"]
  },
  "dimensions": [
    {
      "name": "Search Strategy",
      "weight": 0.20,
      "score": 0,
      "justification": "{1-2 sentences with evidence}"
    },
    {
      "name": "Source Quality",
      "weight": 0.20,
      "score": 0,
      "justification": "{1-2 sentences with evidence}"
    },
    {
      "name": "Coverage",
      "weight": 0.25,
      "score": 0,
      "justification": "{1-2 sentences with evidence}"
    },
    {
      "name": "Report Quality",
      "weight": 0.20,
      "score": 0,
      "justification": "{1-2 sentences with evidence}"
    },
    {
      "name": "Process Efficiency",
      "weight": 0.10,
      "score": 0,
      "justification": "{1-2 sentences with evidence}"
    },
    {
      "name": "Infrastructure",
      "weight": 0.05,
      "score": 0,
      "justification": "{1-2 sentences with evidence}"
    }
  ],
  "overall_score": 0.0,
  "overall_band": "{Excellent|Good|Fair|Poor|Failed}",
  "metrics": {
    "searches_total": 0,
    "searches_zero_result": 0,
    "providers_used": [],
    "sources_total": 0,
    "sources_downloaded": 0,
    "sources_with_notes": 0,
    "sources_by_quality": {},
    "sources_by_provider": {},
    "sources_by_type": {},
    "findings_total": 0,
    "findings_by_question": {},
    "findings_unsourced": 0,
    "gaps_total": 0,
    "gaps_resolved": 0,
    "evidence_units_total": 0,
    "evidence_units_by_claim_type": {},
    "evidence_units_by_question": {},
    "findings_with_evidence": 0,
    "findings_without_evidence": 0,
    "evidence_units_avg_per_source": 0.0,
    "evidence_json_files": 0,
    "report_word_count": 0,
    "report_citation_count": 0,
    "report_reference_count": 0,
    "report_phantom_refs": 0,
    "journal_char_count": 0,
    "journal_milestones_found": 0
  },
  "recommendations": [
    {
      "title": "{short title}",
      "file": "{pipeline file path}",
      "current_behavior": "{what happens now}",
      "desired_behavior": "{what should happen}",
      "metric_to_watch": "{reflection.json metric key}"
    }
  ],
  "bugs": [
    {
      "description": "{what happened}",
      "reproduction": "{query or command}",
      "classification": "{pipeline_bug|session_specific}"
    }
  ]
}

Process

Think of evaluation as a natural progression: observe → interpret → score.

  1. Orient: Run the metrics script to get all Layer 1 numbers in one call. Read the brief from the output to understand what was asked. Get the lay of the land before diving into details.

  2. Gather: Read report.md, journal.md, sample metadata files, and reader notes. Use the exploration queries if you need to browse raw data for Layer 2 interpretation. The metrics script already computed the deterministic numbers — this step is about understanding the qualitative content.

  3. Interpret: Detect session context (domain, scale, constraints, scope). Decide how context affects what "good" looks like for each dimension. This is where your judgment matters most — the same metrics mean different things in different contexts.

  4. Score: Evaluate each dimension against the calibration anchors. Set your weights with justification. Compute the weighted overall score. Be specific — cite file names, counts, queries, and examples for every score.

  5. Diagnose: Identify bugs and infrastructure issues. Trace failure cascades. Classify each as pipeline bug or session-specific.

  6. Recommend: Identify the 3-5 changes that would most improve the pipeline. Point to specific files. Describe current → desired behavior. Name the metric to watch.

  7. Output: Write the narrative markdown assessment and reflection.json to the session directory.

For small sessions, layers can be compressed — a 5-source factual query doesn't need the same depth of analysis as a 50-source literature review. Match your evaluation effort to the session's complexity.

Install via CLI
npx skills add https://github.com/foundry-works/foundry-research --skill reflect
Repository Details
star Stars 1
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator
foundry-works
foundry-works Explore all skills →