web-ops - SKILL.md Agent Skill

name: web-ops
description: Web research and content retrieval using SearXNG, Firecrawl, and ArchiveBox on CT107. Use when searching for information, scraping web content, archiving pages, or researching topics. Provides a unified interface to the CT107 web stack.

Web Operations Skill

Interfaces with the web research stack on CT107 (res container) for search, scraping, and archiving.

Services Overview

Service	URL	Purpose	Status
SearXNG	http://10.10.10.107:8080	Meta-search engine	✅
Firecrawl	http://10.10.10.107:3002	Web scraping API	✅
ArchiveBox	http://10.10.10.107:8000	Web archiving	✅

SearXNG (Search)

Basic Search

# Simple query
curl -s 'http://10.10.10.107:8080/search?q=minimax+api+documentation&format=json' | jq '.results[:5]'

JSON API

# Search with JSON output
curl -s 'http://10.10.10.107:8080/search?q=fastapi+routing+best+practices&format=json' | python3 -c '
import sys, json
data = json.load(sys.stdin)
for r in data.get("results", [])[:5]:
    # Pull fields into variables first: quotes cannot be escaped inside the f-string expressions
    title = r.get("title", "N/A")[:50]
    url = r.get("url", "N/A")[:60]
    snippet = r.get("content", "N/A")[:100]
    print(f"{title}...")
    print(f"  {url}")
    print(f"  {snippet}...")
    print()
'

Search Categories

# General search
curl -s 'http://10.10.10.107:8080/search?q=python+programming&category_general=on'

# News search
curl -s 'http://10.10.10.107:8080/search?q=ai+security&category_news=on'

# IT/Technical search
curl -s 'http://10.10.10.107:8080/search?q=docker+networking&category_it=on'

Advanced Parameters

# Language filter
curl -s 'http://10.10.10.107:8080/search?q=minimax&language=en'

# Time filter (day, week, month, year)
curl -s 'http://10.10.10.107:8080/search?q=claude+code&time_range=week'

# Safe search level (0 = off, 1 = moderate, 2 = strict)
curl -s 'http://10.10.10.107:8080/search?q=tutorial&safesearch=0'
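The category and filter parameters above can be combined in one request. The following is a minimal sketch of a URL builder that percent-encodes the query and adds only the filters that are set. The function name `build_search_url` and its keyword arguments are hypothetical; the parameter names are the ones used in the curl examples above.

```python
from urllib.parse import urlencode

SEARXNG_BASE = "http://10.10.10.107:8080"  # SearXNG address from the Services Overview


def build_search_url(query, category=None, language=None, time_range=None, safesearch=None):
    """Return a SearXNG /search URL with the query and filters percent-encoded."""
    params = {"q": query, "format": "json"}
    if category:
        params[f"category_{category}"] = "on"  # e.g. category_news=on
    if language:
        params["language"] = language
    if time_range:
        params["time_range"] = time_range  # day, week, month or year
    if safesearch is not None:
        params["safesearch"] = str(safesearch)
    return f"{SEARXNG_BASE}/search?{urlencode(params)}"


print(build_search_url("claude code", category="it", time_range="week"))
# prints http://10.10.10.107:8080/search?q=claude+code&format=json&category_it=on&time_range=week
```

Building the query string with `urlencode` keeps spaces and characters such as `&` in the search terms from corrupting the URL.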

Firecrawl (Scraping)

Health Check

curl -s http://10.10.10.107:3002/health

Basic Scrape

# Scrape single page
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.minimaxi.com", "formats": ["markdown"]}' | jq '.data.markdown[:500]'

Scrape with Options

# Scrape with specific options
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html"],
    "onlyMainContent": true,
    "includeTags": ["h1", "h2", "p", "code"],
    "excludeTags": ["nav", "footer", "ads"]
  }' | jq '.'

Crawl Website

# Crawl a website (limit caps the number of pages crawled, here 10)
curl -s -X POST http://10.10.10.107:3002/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.minimaxi.com",
    "limit": 10,
    "scrapeOptions": {
      "formats": ["markdown"]
    }
  }' | jq '.id'

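# Check crawl status (set CRAWL_ID to the ID from the previous response;
# assumes the Firecrawl v1 status endpoint GET /v1/crawl/<id>)
curl -s "http://10.10.10.107:3002/v1/crawl/$CRAWL_ID" | jq '.status'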
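The crawl request returns a job ID rather than the pages themselves, so the results have to be polled for. Below is a rough sketch of a polling loop. It assumes the status endpoint is `GET /v1/crawl/<id>` and that a running job reports `"status": "scraping"`; both should be checked against the Firecrawl version deployed on CT107. The function names are hypothetical.

```python
import json
import time
import urllib.request

FIRECRAWL_BASE = "http://10.10.10.107:3002"  # Firecrawl address from the Services Overview


def get_crawl_status(crawl_id):
    """Fetch the crawl job as a dict (assumed endpoint: GET /v1/crawl/<id>)."""
    with urllib.request.urlopen(f"{FIRECRAWL_BASE}/v1/crawl/{crawl_id}", timeout=30) as resp:
        return json.load(resp)


def wait_for_crawl(crawl_id, fetch=get_crawl_status, interval=5.0, max_checks=60):
    """Poll until the job leaves the assumed "scraping" state or max_checks is reached."""
    for _ in range(max_checks):
        job = fetch(crawl_id)
        if job.get("status") != "scraping":
            return job
        time.sleep(interval)
    raise TimeoutError(f"crawl {crawl_id} still running after {max_checks} checks")
```

The `fetch` argument exists so the loop can be exercised with a stub instead of a live Firecrawl instance.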

Map Website Structure

# Get all links from a website
curl -s -X POST http://10.10.10.107:3002/v1/map \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.minimaxi.com"}' | jq '.links[:20]'

ArchiveBox (Archiving)

Health Check

curl -s http://10.10.10.107:8000/admin/login/ | head -1

Add URL to Archive

# Archive a page via API (if API enabled)
# -f makes curl exit non-zero on HTTP errors so the fallback message is printed
curl -sf -X POST http://10.10.10.107:8000/api/v1/archive \
  -H "Content-Type: application/json" \
  -d '{"url": "https://important-docs.example.com", "tags": ["reference", "api"]}' 2>/dev/null || echo 'Check ArchiveBox API settings'

View Archive

# List recent snapshots
curl -s http://10.10.10.107:8000/archive/ | grep -o 'href="[^"]*"' | head -20
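Matching `href` attributes with grep breaks on single-quoted or unusually formatted markup. As an alternative, here is a small sketch that extracts links with the standard library HTML parser. The class and function names are hypothetical, and it assumes the archive listing is ordinary HTML with `<a>` tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, limit=20):
    """Return up to `limit` link targets from an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links[:limit]
```

The HTML passed in would be the body returned by the listing request above.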

Combined Workflows

Research Pipeline

# 1. Search for information
SEARCH_RESULTS=$(curl -s 'http://10.10.10.107:8080/search?q=fastapi+best+practices&format=json')

# 2. Extract top URLs
URLS=$(echo "$SEARCH_RESULTS" | jq -r '.results[:3] | .[].url')

# 3. Scrape each URL for content (jq builds the JSON body so the URL is escaped safely)
for url in $URLS; do
  echo "Scraping: $url"
  jq -nc --arg url "$url" '{url: $url, formats: ["markdown"]}' | \
    curl -s -X POST http://10.10.10.107:3002/v1/scrape \
      -H "Content-Type: application/json" \
      -d @- | \
    jq -r '.data.markdown[:1000]'
  echo "---"
done
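The same pipeline can be written in Python, where the two error-prone steps (picking URLs and quoting the request body) become plain functions. This sketch covers only those two steps; the function names are hypothetical, and the response shape (`results[].url`) is the one the jq filters above rely on.

```python
import json


def top_urls(search_response, limit=3):
    """Return up to `limit` distinct http(s) URLs from a SearXNG JSON response."""
    seen, urls = set(), []
    for result in search_response.get("results", []):
        if len(urls) >= limit:
            break
        url = result.get("url", "")
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def scrape_payload(url):
    """Serialize a /v1/scrape request body; json.dumps handles the quoting."""
    return json.dumps({"url": url, "formats": ["markdown"]})
```

Deduplicating before scraping avoids sending Firecrawl the same page twice when several search engines return it.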

Documentation Scraping

# Scrape documentation site
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.minimaxi.com/api-reference",
    "formats": ["markdown"],
    "onlyMainContent": true
  }' | jq -r '.data.markdown' > /tmp/minimax-docs.md

echo "Documentation saved to /tmp/minimax-docs.md"

News Monitoring

# Search for recent news
curl -s 'http://10.10.10.107:8080/search?q=cybersecurity+threats&time_range=day&category_news=on&format=json' | \
  jq '.results[:5] | .[] | {title: .title, url: .url, publishedDate: .publishedDate}'

Research Patterns

API Documentation Lookup

# Search for API docs (--data-urlencode encodes the spaces in the query)
QUERY="minimax anthropic api authentication"
RESULTS=$(curl -s -G 'http://10.10.10.107:8080/search' --data-urlencode "q=$QUERY" --data-urlencode 'format=json')

# Get first result and scrape
FIRST_URL=$(echo "$RESULTS" | jq -r '.results[0].url')
echo "Scraping: $FIRST_URL"
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d "{\"url\": \"$FIRST_URL\", \"formats\": [\"markdown\"]}" | \
  jq -r '.data.markdown[:2000]'

Error Research

# Search for error solutions (--data-urlencode encodes the spaces in the query)
ERROR="invalid api key minimax"
curl -s -G 'http://10.10.10.107:8080/search' --data-urlencode "q=$ERROR" --data-urlencode 'format=json' | \
  jq '.results[:5] | .[] | {title: .title, url: .url}'

Competitor Analysis

# Compare solutions
curl -s 'http://10.10.10.107:8080/search?q=openclaw+alternative+tools&format=json' | \
  jq '.results[:5] | .[] | {title, url, content: .content[:200]}'

Python Integration

SearXNG Python Helper

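import requests

def search_searxng(query, limit=5):
    """Return the first `limit` results for a SearXNG query."""
    # Passing params lets requests URL-encode the query string
    response = requests.get(
        "http://10.10.10.107:8080/search",
        params={"q": query, "format": "json"},
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    return data.get('results', [])[:limit]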

# Usage
results = search_searxng("fastapi routing", 3)
for r in results:
    print(f"{r['title']}: {r['url']}")

Firecrawl Python Helper

import requests

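def scrape_page(url):
    """Return the page at `url` as markdown, or '' if none was returned."""
    response = requests.post(
        "http://10.10.10.107:3002/v1/scrape",
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json().get('data', {}).get('markdown', '')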

# Usage
content = scrape_page("https://docs.minimaxi.com")
print(content[:1000])

Safety & Ethics

Respect robots.txt
Don't overload servers (rate limit requests)
Cache results when possible
Don't scrape sensitive/personal data
Follow terms of service
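The first two guidelines can be enforced in code rather than left to discipline. Below is a minimal sketch: a robots.txt check built on the standard library parser and a throttle that spaces out consecutive requests. The names `allowed_by_robots` and `Throttle` and the `web-ops` user agent are hypothetical, and the robots.txt text is assumed to have been fetched already.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="web-ops"):
    """Check a URL against the text of an already-fetched robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraping loop would call `throttle.wait()` before each request and skip any URL for which `allowed_by_robots` returns `False`.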

Troubleshooting

SearXNG Not Responding

# Check service (SearXNG serves its health check at /healthz)
curl -s http://10.10.10.107:8080/healthz

# Restart if needed (on CT107)
# docker restart searxng

Firecrawl Errors

# Check health
curl -s http://10.10.10.107:3002/health

# Common issues:
# - URL unreachable (check network)
# - Rate limited (wait and retry)
# - Content blocked (try different user-agent)
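For the "wait and retry" case, a retry wrapper with exponential backoff is one common approach. The sketch below is generic rather than Firecrawl-specific: `retry_with_backoff` is a hypothetical helper, and it assumes the wrapped call raises `OSError` (which includes `urllib.error.URLError` and `HTTPError`) on failure.

```python
import time


def retry_with_backoff(call, attempts=4, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the delays between attempts are 1, 2, and 4 seconds, after which the last error is re-raised.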

ArchiveBox Access

# Default admin credentials may be needed
# Check CT107 configuration

References

SearXNG Docs: https://docs.searxng.org/
Firecrawl Docs: https://docs.firecrawl.dev/
ArchiveBox Docs: https://docs.archivebox.io/