name: literature-discovery
description: >
  Multi-source literature discovery across academic and non-academic sources.
  Can be the FIRST step in a research project: give it a topic and it creates
  the library from scratch. Searches OpenAlex, Semantic Scholar, arXiv, and the
  general web (blogs, whitepapers, NIST docs, vendor publications, industry
  reports). Use this skill whenever the user says "research this topic,"
  "find papers on," "discover literature," "I want to research," "find sources
  about," "what's been written about," "build me a bibliography," "start
  researching," "find more papers," "expand my bibliography," "what am I
  missing," "find related work," "literature search," "fill research gaps," or
  any request to search for literature (academic or otherwise) on a topic. This
  is the recommended ENTRY POINT for new research projects. Also trigger when
  gap analysis results suggest missing coverage areas, or when the user gives a
  topic without specifying what to do with it.
version: 0.3.0
Literature Discovery
The starting point for any research project. Give it a topic — it finds the literature, creates the library, and loads everything into Supabase. Works with or without an existing library.
Discovery Engines
This skill has two discovery engines that work together:
API Engine — Direct calls to academic databases (OpenAlex, Semantic Scholar, arXiv) and web search. Fast, structured, returns metadata-rich results.
Multi-Model Swarm — Fans the same research query out to 2-3 LLMs via OpenRouter (Perplexity Sonar Pro + Gemini Flash or Claude Sonnet). Each model searches its own knowledge base and web-grounded sources, finding material the others miss. Results are merged, deduplicated, and synthesized by Claude.
By default, both engines run. The API engine catches the structured academic literature. The swarm catches the practitioner content, niche reports, and sources that don't show up in academic indexes.
Two Modes
Mode 1: Topic-First (New Research)
The user has a topic but no bibliography yet. This is the most common entry point.
Flow:
- User says "I want to research STIG automation" (or any topic)
- Create a new library in Supabase
- Search academic AND non-academic sources
- Present candidates for user review
- Ingest approved sources into the library
- Run gap analysis to identify what's still missing
Mode 2: Library Expansion (Existing Research)
The user already has a library and wants to find what's missing.
Flow:
- Assess current library state
- Choose discovery strategy based on gaps
- Search, deduplicate against existing citations
- Present and ingest approved sources
Prerequisites
- Supabase MCP: creating libraries, reading/writing citations
- Built-in WebSearch: for general web search across blogs, whitepapers, NIST/DISA docs
- WebFetch and Bash (curl): for direct API calls (OpenAlex, Semantic Scholar, arXiv, CrossRef)
- OpenRouter API key + Council plugin v0.7.0+: the multi-model swarm reuses Council's scripts/openrouter_dispatch.py (and scripts/openrouter_key.py), which auto-resolves the key from the OPENROUTER_API_KEY env var or the team-shared vault file at MoxyWolf Vault/_Shared Knowledge/Agents and Plugins/openrouter.env
Step 1: Understand the Research Need
Ask the user what they want to research. If they've given a clear topic, proceed. If vague, ask a few sharpening questions:
- What specific aspect of this topic interests you?
- Are you looking for academic research, industry practices, or both?
- Any particular time period? (last 5 years, historical, etc.)
- Any specific domains? (cybersecurity, compliance, AI, etc.)
Step 2: Create or Select Library
If no library exists for this topic:
INSERT INTO research_libraries (name, description, metadata)
VALUES (
  '{topic_name}',
  '{user_description}',
  '{"topics": [...], "created_by": "dorian", "source_types": ["academic", "industry", "government"]}'::jsonb
)
RETURNING id, name
Tell the user: "Created library '[name]' — now searching for sources."
If a library already exists:
SELECT id, name FROM research_libraries ORDER BY created_at DESC
Ask which one, or auto-select if name matches.
Step 3: Search Across All Source Types
Academic Sources
OpenAlex (Primary — largest coverage, free, no auth)
GET https://api.openalex.org/works?search={query}&per_page=50&sort=relevance_score:desc
Add &mailto=dorianc@moxywolf.com for polite pool access (faster rate limits).
Filter options:
- By year: &filter=from_publication_date:2020-01-01
- By concept: &filter=concepts.id:C41008148
- By open access: &filter=is_oa:true
Key response fields: doi, title, authorships, publication_year,
primary_location.source.display_name, abstract_inverted_index, cited_by_count
Reconstruct abstracts from inverted index (see references/api-response-formats.md).
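Rate limit: OpenAlex has documented a cap of 10 requests/sec (100,000/day). Adding mailto places requests in the polite pool, which is described as giving faster, more consistent responses; check the current OpenAlex docs before relying on a higher rate.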
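Reconstructing the abstract is a matter of inverting the word-to-positions map. The sketch below assumes the abstract_inverted_index field is a dictionary of word to list of integer positions; the function name reconstruct_abstract is illustrative and is not taken from references/api-response-formats.md.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if not inverted_index:
        # OpenAlex omits the field (or returns null) when no abstract exists.
        return ""
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


example = {
    "Automating": [0],
    "STIG": [1, 4],
    "checks": [2],
    "for": [3],
    "baselines": [5],
}
print(reconstruct_abstract(example))
# Automating STIG checks for STIG baselines
```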
Semantic Scholar (Strong for citation graphs)
GET https://api.semanticscholar.org/graph/v1/paper/search?query={query}&limit=50&fields=title,authors,year,abstract,externalIds,citationCount
Rate limit: 100 requests/5 minutes.
arXiv (Preprints — CS, ML, Physics, Math)
GET https://export.arxiv.org/api/query?search_query=all:{query}&start=0&max_results=50&sortBy=relevance
Rate limit: 1 request/3 seconds. Always respect this.
For cybersecurity/compliance topics, use category filter: cat:cs.CR
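The three request URLs above can be assembled in one place so that query escaping stays consistent. This is a minimal sketch: build_search_urls and its parameters are hypothetical names, and combining all:{query} with cat:{category} through AND is an assumption about how the category filter is meant to be applied.

```python
from urllib.parse import urlencode


def build_search_urls(query, mailto=None, from_date=None, arxiv_category=None):
    """Return {source: url} for OpenAlex, Semantic Scholar, and arXiv."""
    openalex = {"search": query, "per_page": 50, "sort": "relevance_score:desc"}
    if from_date:
        openalex["filter"] = f"from_publication_date:{from_date}"
    if mailto:
        openalex["mailto"] = mailto

    semantic_scholar = {
        "query": query,
        "limit": 50,
        "fields": "title,authors,year,abstract,externalIds,citationCount",
    }

    arxiv_query = f"all:{query}"
    if arxiv_category:
        # Assumption: the category filter is ANDed onto the keyword search.
        arxiv_query += f" AND cat:{arxiv_category}"
    arxiv = {
        "search_query": arxiv_query,
        "start": 0,
        "max_results": 50,
        "sortBy": "relevance",
    }

    def encode(params):
        # Keep ':' ',' and '@' readable; everything else is percent-encoded.
        return urlencode(params, safe=":,@")

    return {
        "openalex": "https://api.openalex.org/works?" + encode(openalex),
        "semantic_scholar": "https://api.semanticscholar.org/graph/v1/paper/search?"
        + encode(semantic_scholar),
        "arxiv": "https://export.arxiv.org/api/query?" + encode(arxiv),
    }


for name, url in build_search_urls("stig automation", arxiv_category="cs.CR").items():
    print(name, url)
```

The builder only produces URLs; pacing the calls (for example, one arXiv request every 3 seconds) is still the caller's job.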
Non-Academic Sources
These are just as important for practitioner-oriented research.
Web Search (built-in)
Use Claude's built-in WebSearch tool, constructing a targeted query for each source type:
- Industry blogs/whitepapers: "{topic}" site:blog OR whitepaper OR "technical report"
- Government/standards docs: "{topic}" site:nist.gov OR site:disa.mil OR site:cyber.gov
- Vendor documentation: "{topic}" site:docs.* OR "technical documentation"
- Conference talks/slides: "{topic}" conference OR presentation OR "talk" filetype:pdf
- Reddit/forums: "{topic}" site:reddit.com OR site:stackoverflow.com
NIST/Government Sources (direct)
For compliance and cybersecurity topics, search NIST directly:
WebFetch: https://csrc.nist.gov/publications?keywords={query}
Extract publication titles, abstracts, and PDF links.
DISA STIG Sources
For STIG-related research:
WebFetch: https://public.cyber.mil/stigs/
Search for relevant STIGs, SRGs, and related documentation.
Source Classification
For each discovered source, classify it:
| Type | Entry Type | Example |
|---|---|---|
| Journal article | article | IEEE, ACM papers |
| Conference paper | inproceedings | Black Hat, RSA talks |
| Preprint | preprint | arXiv papers |
| Government publication | government | NIST SP 800-series |
| Industry whitepaper | whitepaper | Vendor security reports |
| Blog post | blog | Practitioner insights |
| Standard/Framework | standard | DISA STIGs, CIS Benchmarks |
| Book/Chapter | book | Textbooks, reference guides |
| Other web source | web | Everything else |
Step 3b: Multi-Model Research Swarm
After the API and web searches complete, run the multi-model swarm to catch what structured APIs miss. This uses OpenRouter to query multiple LLMs in parallel, each with different training data, web access, and knowledge bases.
How It Works
Send the same research prompt to 2-3 models via Council's scripts/openrouter_dispatch.py
helper, which hits the OpenRouter chat completions endpoint in parallel. Each model
returns sources it knows about. Claude then merges, deduplicates, and validates the
combined results.
Models and Their Strengths
| Model | OpenRouter ID | Strength | Notes |
|---|---|---|---|
| Perplexity Sonar Pro | perplexity/sonar-pro | Web-grounded search, real-time citations with verified URLs | Primary. Always use. Best source of real, current URLs. |
| Google Gemini 2.5 Flash | google/gemini-2.5-flash-preview-05-20 | Fast, strong on government/standards docs | Use Flash, not Pro: Pro's reasoning burns the token budget before generating output. |
| Anthropic Claude Sonnet | anthropic/claude-sonnet-4 | Deep reasoning, strong cross-domain connections | Good complement to Perplexity. Honest about what it doesn't know. |
Models to Avoid
| Model | Why |
|---|---|
| openai/gpt-4o | Fabricates URLs. Every single URL it returned in testing was example.com. Cannot be trusted for source discovery. |
| deepseek/deepseek-r1 | Usable: the dispatch helper's 180s default timeout covers R1's reasoning phase. Still slower than chat variants, so default to deepseek/deepseek-chat unless deep reasoning is worth the latency. |
| google/gemini-2.5-pro-preview-05-06 | Spends most of its token budget on internal reasoning, then hits max_tokens before outputting results. Use Flash instead. |
Execution Strategy
Run Perplexity + one or two other models as a parallel batch via Council's
scripts/openrouter_dispatch.py (see "Executing via the dispatch helper" below).
The 180s default timeout comfortably accommodates 3 models in one batch; only
split if you're chaining deepseek-r1 with another reasoning model.
The Swarm Prompt
Send this prompt (adapted per topic) to each model:
You are a research discovery agent. Your job is to find ALL significant sources
— academic papers, government publications, industry whitepapers, blog posts,
conference talks, and standards documents — on the following topic:
TOPIC: {topic}
CONTEXT: {library_description}
For each source you find, provide:
1. Title (exact)
2. Authors (if known)
3. Year of publication
4. Type: academic | government | industry | blog | standard | book
5. URL or DOI (if you have it — ONLY real ones, never fabricate)
6. A 2-3 sentence description of what the source covers
7. Why it's relevant to this research topic
Find at least 15 sources. Prioritize:
- Seminal/foundational works that everyone in this field cites
- Recent publications (last 3 years) showing current state of the art
- Government standards and guidance documents (NIST, DISA, DoD)
- Practitioner perspectives (blogs, conference talks, vendor whitepapers)
- Contrarian or critical viewpoints that challenge mainstream thinking
DO NOT fabricate citations. If you're unsure about a URL or DOI, say so.
It's better to give a title without a link than a fake link.
Format your response as a JSON array:
[
{
"title": "...",
"authors": "...",
"year": 2024,
"type": "academic",
"url": "https://...",
"doi": "10.1234/...",
"description": "...",
"relevance": "..."
}
]
Executing via the dispatch helper
Reuse Council's scripts/openrouter_dispatch.py (resolves at
${CLAUDE_PLUGIN_ROOT}/../council/scripts/openrouter_dispatch.py when both plugins
are installed in the same marketplace, or copy it to this plugin if you prefer
no cross-plugin coupling).
Build a swarm jobs file at ${WORKSPACE_OUTPUTS}/lit-discovery-{run_id}/swarm-jobs.json:
[
{
"id": "perplexity",
"model": "perplexity/sonar-pro",
"messages": [{"role": "user", "content": "{swarm_prompt}"}],
"temperature": 0.3,
"max_tokens": 4000
},
{
"id": "gemini",
"model": "google/gemini-2.5-flash-preview-05-20",
"messages": [{"role": "user", "content": "{swarm_prompt}"}],
"temperature": 0.3,
"max_tokens": 4000
},
{
"id": "claude",
"model": "anthropic/claude-sonnet-4",
"messages": [{"role": "user", "content": "{swarm_prompt}"}],
"temperature": 0.3,
"max_tokens": 4000
}
]
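Since every job differs only in its id and model, the jobs file can be generated rather than hand-edited. The sketch below mirrors the JSON shape shown above; build_swarm_jobs and DEFAULT_MODELS are hypothetical names, not part of the Council helper.

```python
import json

# Job id -> OpenRouter model id, matching the roster in this document.
DEFAULT_MODELS = {
    "perplexity": "perplexity/sonar-pro",
    "gemini": "google/gemini-2.5-flash-preview-05-20",
    "claude": "anthropic/claude-sonnet-4",
}


def build_swarm_jobs(swarm_prompt, models=None, temperature=0.3, max_tokens=4000):
    """Return one job dict per model, all carrying the same prompt."""
    models = DEFAULT_MODELS if models is None else models
    return [
        {
            "id": job_id,
            "model": model_id,
            "messages": [{"role": "user", "content": swarm_prompt}],
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
        for job_id, model_id in models.items()
    ]


jobs = build_swarm_jobs("TOPIC: STIG automation\nCONTEXT: compliance tooling")
print(json.dumps(jobs, indent=2))
```

Writing the result to swarm-jobs.json is then a single json.dump call into the run directory.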
Then dispatch in parallel (COUNCIL_PLUGIN is the Council plugin root, e.g. ${CLAUDE_PLUGIN_ROOT}/../council):
python3 "${COUNCIL_PLUGIN}/scripts/openrouter_dispatch.py" \
--jobs "${WORKSPACE_OUTPUTS}/lit-discovery-${RUN_ID}/swarm-jobs.json" \
--out "${WORKSPACE_OUTPUTS}/lit-discovery-${RUN_ID}/swarm" \
--timeout 180
The helper writes one {id}.json per model. Read each, extract response.choices[0].message.content, then process as below.
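Models often wrap the requested JSON array in prose, so extraction benefits from being tolerant. The sketch below assumes each {id}.json has been loaded into a dict with the response.choices[0].message.content path described above; extract_sources is a hypothetical helper, and the first-bracket-to-last-bracket heuristic is a simplification that can fail if the model appends bracketed footnotes after the array.

```python
import json
import re


def extract_sources(result):
    """Pull the JSON array of sources out of one model's dispatch result."""
    content = result["response"]["choices"][0]["message"]["content"]
    # Grab everything from the first '[' to the last ']' (greedy match).
    match = re.search(r"\[.*\]", content, flags=re.DOTALL)
    if not match:
        return []
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        # Malformed output: treat as "no sources" rather than crashing the run.
        return []
    # Keep only dict entries that at least carry a title.
    return [item for item in parsed if isinstance(item, dict) and item.get("title")]


sample = {
    "response": {
        "choices": [
            {
                "message": {
                    "content": 'Here are the sources:\n[{"title": "A", "year": 2024}, {"year": 2020}]\nDone.'
                }
            }
        ]
    }
}
print(extract_sources(sample))
# [{'title': 'A', 'year': 2024}]
```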
Processing Swarm Results
After all models respond:
- Parse each model's JSON response into source objects
- Tag each source with the model that found it: "discovered_by": ["perplexity", "gemini"]
- Merge by matching on title similarity (>85%) or exact DOI/URL match
- Score by cross-model agreement:
  - Found by 3+ models = high confidence
  - Found by 2 models = medium confidence
  - Found by 1 model only = verify before including
- Validate URLs: for sources found by only one model, attempt a HEAD request via WebFetch to confirm the URL is real. Discard 404s.
- Deduplicate against API results: remove anything already found by OpenAlex, Semantic Scholar, or arXiv in Step 3
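The tag, merge, and score steps above can be sketched as follows. The document does not name a similarity metric, so difflib's SequenceMatcher ratio stands in for "title similarity" as an assumption; merge_swarm, same_source, and the "verify" label are illustrative names.

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join((text or "").lower().split())


def same_source(a, b, threshold=0.85):
    """True on an exact DOI/URL match or a title similarity above the threshold."""
    for key in ("doi", "url"):
        if a.get(key) and a.get(key) == b.get(key):
            return True
    title_a, title_b = _normalize(a.get("title")), _normalize(b.get("title"))
    if not title_a or not title_b:
        return False
    return SequenceMatcher(None, title_a, title_b).ratio() > threshold


def merge_swarm(results_by_model, threshold=0.85):
    """Merge per-model source lists, tagging discovered_by and confidence."""
    merged = []
    for model, sources in results_by_model.items():
        for source in sources:
            for existing in merged:
                if same_source(existing, source, threshold):
                    if model not in existing["discovered_by"]:
                        existing["discovered_by"].append(model)
                    break
            else:
                # No match found: this is a new source.
                merged.append({**source, "discovered_by": [model]})
    for item in merged:
        count = len(item["discovered_by"])
        item["confidence"] = "high" if count >= 3 else "medium" if count == 2 else "verify"
    return merged


results = {
    "perplexity": [{"title": "Automating STIG Compliance at Scale"}],
    "gemini": [{"title": "Automating STIG compliance at scale."}],
    "claude": [{"title": "CIS Benchmarks Overview"}],
}
for item in merge_swarm(results):
    print(item["title"], item["discovered_by"], item["confidence"])
```

The first record seen wins when two entries merge; a fuller version might also backfill a missing DOI or URL from the later record.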
What the Swarm Catches That APIs Miss
- Practitioner blog posts and opinion pieces
- Vendor-specific whitepapers and case studies
- DISA/DoD memos and policy documents not indexed in academic databases
- Conference presentations and webinar recordings
- Reddit/forum discussions with real-world implementation insights
- International equivalents and comparative standards
- Historical documents and foundational thinking pre-dating digital indexes
- Cross-domain connections (e.g., healthcare compliance mapped to defense)
Cost Awareness
Actual costs from live testing (March 2026):
- Perplexity Sonar Pro: ~$0.024 per query (best ROI — most real sources per dollar)
- Gemini 2.5 Flash: ~$0.005-0.01 per query
- Claude Sonnet: ~$0.02-0.04 per query
Total per discovery run (2 models): ~$0.03-0.07. Negligible for the coverage improvement.
Presenting Swarm Results
After merging with API results, flag swarm-only discoveries in the presentation:
🔍 Multi-Model Discovery ([count] unique sources not found by API search)
22. [Title] ([Year]) — found by Perplexity + Gemini Flash
Type: whitepaper
URL: [link]
Why: [relevance explanation]
23. [Title] ([Year]) — found by Perplexity only (URL verified)
Type: blog
URL: [link]
Why: [relevance explanation]
This makes it clear which sources came from the swarm vs. structured APIs, so the user can weigh confidence accordingly.
Step 4: Deduplicate
If expanding an existing library:
SELECT doi, title, url FROM citations WHERE library_id = {id}
Remove matches on DOI, URL, or title similarity >90%.
For new libraries, deduplicate across search results (same paper found by multiple APIs).
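A candidate can be checked against the rows returned by the SELECT above with a small predicate. This is a sketch under assumptions: DOIs are compared after stripping common resolver prefixes and lowercasing, URLs after dropping a trailing slash, and SequenceMatcher's ratio stands in for the unspecified title-similarity measure. The names normalize_doi and is_duplicate are hypothetical.

```python
from difflib import SequenceMatcher


def normalize_doi(doi):
    """Lowercase a DOI and strip resolver prefixes; return None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def _title(text):
    return " ".join((text or "").lower().split())


def is_duplicate(candidate, existing_rows, threshold=0.90):
    """True if the candidate matches any existing row on DOI, URL, or title."""
    cand_doi = normalize_doi(candidate.get("doi"))
    cand_url = (candidate.get("url") or "").rstrip("/")
    cand_title = _title(candidate.get("title"))
    for row in existing_rows:
        if cand_doi and cand_doi == normalize_doi(row.get("doi")):
            return True
        if cand_url and cand_url == (row.get("url") or "").rstrip("/"):
            return True
        row_title = _title(row.get("title"))
        if cand_title and row_title:
            if SequenceMatcher(None, cand_title, row_title).ratio() > threshold:
                return True
    return False


library = [{"doi": "10.6028/NIST.SP.800-207", "title": "Zero Trust Architecture", "url": None}]
print(is_duplicate({"doi": "https://doi.org/10.6028/nist.sp.800-207", "title": "ZTA"}, library))
print(is_duplicate({"title": "Automating STIG Compliance"}, library))
```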
Step 5: Present Candidates
Group results by source type for clarity:
Literature Discovery: "{topic}"
══════════════════════════════
Found [X] sources across academic and industry channels.
📄 Academic Papers ([count])
1. [Title] ([Year]) — [Journal]
Authors: [...] | Citations: [count]
DOI: [doi]
2. ...
🏛️ Government & Standards ([count])
3. [NIST SP 800-171 Rev 3] — NIST
Published: [date]
URL: [link]
4. ...
📝 Industry & Practitioner ([count])
5. [Blog Title] — [Site]
Author: [...] | Published: [date]
URL: [link]
6. ...
Which of these should I add to your library?
Say 'all', specific numbers, 'academic only', 'industry only', or 'none'.
Step 6: Ingest Approved Sources
For each approved source:
INSERT INTO citations (
  library_id, citation_key, entry_type, title, authors, year,
  journal, abstract, doi, arxiv_id, url, bibtex_raw,
  verification_status, source
) VALUES (
  {library_id},
  '{generated_citation_key}',
  '{entry_type}',
  '{title}',
  '{authors}',
  {year},
  '{journal_or_publisher}',
  '{abstract_or_description}',
  '{doi}',
  '{arxiv_id}',
  '{url}',
  '{bibtex_raw_if_available}',
  -- VALUES cannot reference the row's own columns, so test the placeholders directly
  CASE WHEN NULLIF('{doi}', '') IS NOT NULL OR NULLIF('{arxiv_id}', '') IS NOT NULL
       THEN 'verified' ELSE 'unverified' END,
  '{source_api}'
)
ON CONFLICT (library_id, citation_key) DO NOTHING
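Titles and abstracts routinely contain apostrophes, which break a statement built by plain string substitution. Where the SQL is assembled as text, a small escaping helper is one option; the sketch below assumes PostgreSQL with standard_conforming_strings enabled, and sql_literal is a hypothetical helper. It emits the surrounding quotes itself, so any placeholder it fills should not also be wrapped in quotes. Parameterized queries are preferable wherever the client supports them.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal, doubling single quotes."""
    if value is None or value == "":
        # Missing metadata becomes NULL rather than an empty string.
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


print(sql_literal("O'Reilly Security Report"))
print(sql_literal(2024))
print(sql_literal(None))
```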
Generating Citation Keys
For non-academic sources without a BibTeX key, generate one:
- {first_author_surname}_{year} for authored works
- {org_acronym}_{year}_{short_title} for org publications (e.g., nist_2024_sp800171)
- {site}_{year}_{slug} for blog posts (e.g., krebs_2024_stig_automation)
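The three key patterns can be generated mechanically. In this sketch the function names are hypothetical, and the normalization is inferred from the two examples: org short titles are compacted with no separator (sp800171), while blog slugs keep underscores between words (stig_automation).

```python
import re


def _words(text):
    """Lowercase alphanumeric runs, e.g. 'SP 800-171' -> ['sp', '800', '171']."""
    return re.findall(r"[a-z0-9]+", text.lower())


def authored_key(first_author_surname, year):
    return f"{''.join(_words(first_author_surname))}_{year}"


def org_key(org_acronym, year, short_title):
    return f"{''.join(_words(org_acronym))}_{year}_{''.join(_words(short_title))}"


def blog_key(site, year, slug_phrase):
    return f"{''.join(_words(site))}_{year}_{'_'.join(_words(slug_phrase))}"


print(authored_key("O'Brien", 2021))             # obrien_2021
print(org_key("NIST", 2024, "SP 800-171"))       # nist_2024_sp800171
print(blog_key("krebs", 2024, "STIG Automation"))  # krebs_2024_stig_automation
```

Two works by the same author in the same year would collide under the first pattern; because the insert uses ON CONFLICT DO NOTHING, a suffix such as a or b is needed to keep the second one.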
Generating BibTeX for Non-Academic Sources
For web sources, generate a BibTeX entry so the library stays export-compatible:
@misc{krebs_2024_stig_automation,
author = {Krebs, Brian},
title = {Automating STIG Compliance at Scale},
year = {2024},
url = {https://example.com/article},
note = {Blog post. Accessed: 2026-03-18},
abstract = {Brief description of the content...}
}
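An entry in the shape shown above can be produced from a source record with a short formatter. This is a sketch: to_bibtex_misc is a hypothetical name, empty fields are simply omitted, and no escaping of BibTeX special characters (braces, percent signs) is attempted.

```python
def to_bibtex_misc(key, author, title, year, url, note, abstract=""):
    """Format a web source as an @misc BibTeX entry, skipping empty fields."""
    fields = [
        ("author", author),
        ("title", title),
        ("year", str(year) if year else ""),
        ("url", url),
        ("note", note),
        ("abstract", abstract),
    ]
    present = [(name, value) for name, value in fields if value]
    # Align the '=' signs on the longest field name that is actually present.
    width = max((len(name) for name, _ in present), default=0)
    lines = [f"  {name.ljust(width)} = {{{value}}}" for name, value in present]
    return "@misc{" + key + ",\n" + ",\n".join(lines) + "\n}"


print(
    to_bibtex_misc(
        key="krebs_2024_stig_automation",
        author="Krebs, Brian",
        title="Automating STIG Compliance at Scale",
        year=2024,
        url="https://example.com/article",
        note="Blog post. Accessed: 2026-03-18",
    )
)
```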
Step 7: Summary and Next Steps
Library Built: {name}
═══════════════════
Total sources added: [X]
├─ Academic papers: [count]
├─ Government/standards: [count]
├─ Industry/practitioner: [count]
└─ Other: [count]
Verification status:
├─ Verified (DOI/arXiv): [count]
└─ Unverified (web): [count]
Next steps:
→ "Verify my citations" — validate DOIs and check for broken links
→ "Find more papers" — run another discovery round with refined terms
→ "Synthesize my research" — build thematic map + writing perspective
→ "Import my BibTeX" — add your own collected references on top
Discovery Strategies (for Library Expansion)
When expanding an existing library, choose based on its state:
Strategy 1: Keyword Expansion
Broaden search terms based on themes already in the library.
Strategy 2: Citation Chain
Follow references and citations from the library's most-cited papers via Semantic Scholar.
Strategy 3: Gap-Driven
Target specific gaps identified in research_gaps table.
Strategy 4: Temporal
Fill year-range gaps — find recent work or foundational papers.
Strategy 5: Source-Type Diversification
If the library is all academic, find industry sources. If all industry, find the academic backing.
Scheduled Discovery
Weekly automated scan (configure via /schedule):
- For each active library, search for sources published in the last 7 days
- Filter by relevance to library topics
- Stage candidates — do NOT auto-ingest
- Summary for next session: "Found [X] new sources this week for [library name]."
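For the weekly scan, the "last 7 days" window maps onto the from_publication_date filter described in the OpenAlex section. A minimal sketch, with weekly_openalex_url as a hypothetical helper and the today parameter added only so the date arithmetic can be checked deterministically:

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def weekly_openalex_url(topic, today=None, days=7):
    """Build an OpenAlex search URL limited to works published in the last `days` days."""
    today = today or date.today()
    since = today - timedelta(days=days)
    params = {
        "search": topic,
        "filter": f"from_publication_date:{since.isoformat()}",
        "per_page": 50,
    }
    # Keep ':' unescaped so the filter stays readable in logs.
    return "https://api.openalex.org/works?" + urlencode(params, safe=":")


print(weekly_openalex_url("stig automation", today=date(2026, 3, 18)))
# https://api.openalex.org/works?search=stig+automation&filter=from_publication_date:2026-03-11&per_page=50
```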
Reference
- references/api-response-formats.md: Parsing guides for OpenAlex, Semantic Scholar, arXiv, CrossRef, DataCite
- references/openrouter-swarm.md: Multi-model swarm configuration: model roster, prompt templates, merge algorithm, cost breakdown
- For the Supabase schema, read the migration SQL in the research-pipeline skill.