web-to-rag

star 0

Import content into local RAG. Triggers: "add to RAG", "scrape docs", "import YouTube", "embed PDF", "build knowledge base", "embedda sito", "importa video", "aggiungi PDF", "crea knowledge base".

Tapiocapioca By Tapiocapioca schedule Updated 1/4/2026

name: web-to-rag description: | Import content into local RAG. Triggers: "add to RAG", "scrape docs", "import YouTube", "embed PDF", "build knowledge base", "embedda sito", "importa video", "aggiungi PDF", "crea knowledge base".

Web to RAG

Import websites, YouTube videos, and PDFs into AnythingLLM.

Quick Reference

Source Detection Tool
YouTube youtube.com, youtu.be yt-dlp-server:8501
PDF .pdf extension pdftotext
Docs site /sitemap.xml exists Crawl4AI sitemap
Generic Everything else Crawl4AI BFS
MCP Tool Purpose
mcp__crawl4ai__md Scrape page to markdown
mcp__anythingllm__list_workspaces List workspaces
mcp__anythingllm__embed_text Add content to workspace
mcp__anythingllm__chat_with_workspace Query RAG (mode: "query")

Prerequisites

# Verify containers
docker ps | grep -E "crawl4ai|anythingllm"

# Start if needed
docker start crawl4ai anythingllm

# YouTube/audio support
docker start yt-dlp-server whisper-server

For "Client not initialized" error:

mcp__anythingllm__initialize_anythingllm
  apiKey: "YOUR_API_KEY"
  baseUrl: "http://localhost:3001"

Core Workflow

1. Detect content type from URL
2. Find or create workspace (name from domain)
3. Crawl content (max 3 parallel)
4. Embed with metadata
5. Report results with test query

Rate Limits

  • 3 URLs maximum per batch
  • Wait before next batch
  • Confirm if > 50 pages

Workspace Naming

URL Workspace
docs.anthropic.com anthropic-docs
react.dev/learn react-dev
example.com/blog example-blog

Embedding Format

---
source: {url}
title: {page_title}
scraped: {timestamp}
---
{cleaned_content}

Common Mistakes

Mistake Fix
Rate limit exceeded Max 3 parallel, wait between batches
Empty content embedded Skip empty pages
Wrong chat mode Use mode: "query"
Uninitialized client Call initialize_anythingllm first
External links crawled Filter to same domain

Error Handling

Error Action
403 Forbidden Use Playwright fallback
429 Rate Limited Pause 60s, reduce parallelism
Timeout Retry once, skip
Empty content Skip

Extended Workflows

See references/:

Report Template

## Scraping Report

**URL:** {url}
**Type:** {docs|generic|youtube|pdf}
**Workspace:** {name}

### Stats
- Found: X
- Processed: Y
- Failed: Z

### Test Query
mcp__anythingllm__chat_with_workspace
  slug: "{workspace}"
  message: "What does this documentation cover?"
  mode: "query"
Install via CLI
npx skills add https://github.com/Tapiocapioca/claude-code-skills-DEAD --skill web-to-rag
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator
Tapiocapioca
Tapiocapioca Explore all skills →