web-ops

Web research and content retrieval using SearXNG, Firecrawl, and ArchiveBox on CT107. Use when searching for information, scraping web content, archiving pages, or researching topics. Provides a unified interface to the CT107 web stack.

By goranjovic55 · Updated 2/22/2026

name: web-ops
description: Web research and content retrieval using SearXNG, Firecrawl, and ArchiveBox on CT107. Use when searching for information, scraping web content, archiving pages, or researching topics. Provides a unified interface to the CT107 web stack.

Web Operations Skill

Interface with the web research stack on CT107 (the res container) for search, scraping, and archiving.

Services Overview

Service      URL                         Purpose              Status
SearXNG      http://10.10.10.107:8080    Meta-search engine
Firecrawl    http://10.10.10.107:3002    Web scraping API
ArchiveBox   http://10.10.10.107:8000    Web archiving

SearXNG (Search)

Basic Search

# Simple query
curl -s 'http://10.10.10.107:8080/search?q=minimax+api+documentation&format=json' | jq '.results[:5]'

JSON API

# Search with JSON output
curl -s 'http://10.10.10.107:8080/search?q=fastapi+routing+best+practices&format=json' | python3 -c '
import sys, json
data = json.load(sys.stdin)
for r in data.get("results", [])[:5]:
    # Pull fields into variables first: quotes and backslashes inside
    # f-string expressions are not portable across Python versions
    title = r.get("title") or "N/A"
    url = r.get("url") or "N/A"
    content = r.get("content") or "N/A"
    print(f"{title[:50]}...")
    print(f"  {url[:60]}")
    print(f"  {content[:100]}...")
    print()
'

Search Categories

# General search
curl -s 'http://10.10.10.107:8080/search?q=python+programming&category_general=on'

# News search
curl -s 'http://10.10.10.107:8080/search?q=ai+security&category_news=on'

# IT/Technical search
curl -s 'http://10.10.10.107:8080/search?q=docker+networking&category_it=on'

Advanced Parameters

# Language filter
curl -s 'http://10.10.10.107:8080/search?q=minimax&language=en'

# Time filter (day, week, month, year)
curl -s 'http://10.10.10.107:8080/search?q=claude+code&time_range=week'

# Safe search (0 = off, 1 = moderate, 2 = strict)
curl -s 'http://10.10.10.107:8080/search?q=tutorial&safesearch=0'
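The parameters above can be combined in a single request. The following is a minimal Python sketch, assuming the same SearXNG instance; `build_search_url` is a hypothetical helper name introduced here for illustration and is not part of SearXNG.

```python
from urllib.parse import urlencode

SEARXNG_BASE = "http://10.10.10.107:8080"  # SearXNG address used throughout this skill


def build_search_url(query, language=None, time_range=None, safesearch=None,
                     category=None, fmt="json"):
    """Build a SearXNG /search URL (hypothetical helper, not a SearXNG API)."""
    params = {"q": query, "format": fmt}
    if language is not None:
        params["language"] = language
    if time_range is not None:
        params["time_range"] = time_range
    if safesearch is not None:
        params["safesearch"] = safesearch
    if category is not None:
        # Mirrors the category_<name>=on form used in the curl examples
        params[f"category_{category}"] = "on"
    # urlencode percent-encodes values, so queries containing spaces are safe
    return f"{SEARXNG_BASE}/search?{urlencode(params)}"


print(build_search_url("claude code", language="en", time_range="week", category="it"))
```

Building the query string with `urlencode` avoids hand-escaping spaces and symbols, which is easy to get wrong when the query comes from a variable.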

Firecrawl (Scraping)

Health Check

curl -s http://10.10.10.107:3002/health

Basic Scrape

# Scrape single page
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.minimaxi.com", "formats": ["markdown"]}' | jq '.data.markdown[:500]'

Scrape with Options

# Scrape with specific options
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html"],
    "onlyMainContent": true,
    "includeTags": ["h1", "h2", "p", "code"],
    "excludeTags": ["nav", "footer", "ads"]
  }' | jq '.'
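The same request body can be assembled programmatically. This is a sketch only: `build_scrape_payload` is a hypothetical helper, and the field names are taken from the curl example above rather than from a verified schema.

```python
import json


def build_scrape_payload(url, formats=("markdown",), only_main_content=None,
                         include_tags=None, exclude_tags=None):
    """Assemble a /v1/scrape request body (hypothetical helper).

    Optional fields are omitted when not set, so the service defaults apply.
    """
    payload = {"url": url, "formats": list(formats)}
    if only_main_content is not None:
        payload["onlyMainContent"] = only_main_content
    if include_tags:
        payload["includeTags"] = list(include_tags)
    if exclude_tags:
        payload["excludeTags"] = list(exclude_tags)
    return payload


body = json.dumps(build_scrape_payload(
    "https://example.com",
    formats=["markdown", "html"],
    only_main_content=True,
    exclude_tags=["nav", "footer"],
))
print(body)  # pass this string as the POST body
```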

Crawl Website

# Crawl a website (limit caps the number of pages crawled)
curl -s -X POST http://10.10.10.107:3002/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.minimaxi.com",
    "limit": 10,
    "scrapeOptions": {
      "formats": ["markdown"]
    }
  }' | jq '.id'

# Check crawl status (set CRAWL_ID to the ID from the previous response)
curl -s "http://10.10.10.107:3002/v1/crawl/$CRAWL_ID" | jq '.status'
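Because the crawl request only returns an ID, the result has to be polled for. Below is a minimal polling sketch in Python. It assumes the status response is a JSON object with a `status` field, and that `completed`, `failed`, and `cancelled` are the terminal values; both are assumptions to check against the deployed Firecrawl version. `wait_for_crawl` and the `fetch_status` callable are hypothetical names.

```python
import time


def wait_for_crawl(crawl_id, fetch_status, interval=2.0, max_polls=30,
                   sleep=time.sleep):
    """Poll until the crawl reaches a terminal state or max_polls is exhausted.

    fetch_status is a caller-supplied callable that takes a crawl ID and
    returns the decoded JSON status as a dict (hypothetical signature).
    """
    terminal = ("completed", "failed", "cancelled")  # assumed status values
    for _ in range(max_polls):
        status = fetch_status(crawl_id)
        if status.get("status") in terminal:
            return status
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"crawl {crawl_id} still running after {max_polls} polls")
```

With `requests`, `fetch_status` could be something like `lambda cid: requests.get(f"http://10.10.10.107:3002/v1/crawl/{cid}", timeout=30).json()`. Passing the fetcher in keeps the polling loop testable without a live service.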

Map Website Structure

# Get all links from a website
curl -s -X POST http://10.10.10.107:3002/v1/map \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.minimaxi.com"}' | jq '.links[:20]'

ArchiveBox (Archiving)

Health Check

curl -s http://10.10.10.107:8000/admin/login/ | head -1

Add URL to Archive

# Archive a page via API (if API enabled)
# -f makes curl exit non-zero on HTTP errors so the fallback message is printed
curl -sf -X POST http://10.10.10.107:8000/api/v1/archive \
  -H "Content-Type: application/json" \
  -d '{"url": "https://important-docs.example.com", "tags": ["reference", "api"]}' 2>/dev/null || echo 'Check ArchiveBox API settings'

View Archive

# List recent snapshots
curl -s http://10.10.10.107:8000/archive/ | grep -o 'href="[^"]*"' | head -20
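Extracting links with `grep` works for a quick look but breaks on single-quoted or multi-line attributes. A sketch of the same extraction using Python's standard `html.parser` follows; `LinkCollector` and `extract_links` are hypothetical names, and the HTML is assumed to have been fetched already.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, limit=20):
    """Return up to `limit` hrefs in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links[:limit]


sample = '<a href="/archive/1/">one</a> <a>no link</a> <a href=\'/archive/2/\'>two</a>'
print(extract_links(sample))
```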

Combined Workflows

Research Pipeline

# 1. Search for information
SEARCH_RESULTS=$(curl -s 'http://10.10.10.107:8080/search?q=fastapi+best+practices&format=json')

# 2. Extract top URLs (quote the variable so the JSON is passed through intact)
URLS=$(printf '%s' "$SEARCH_RESULTS" | jq -r '.results[:3] | .[].url')

# 3. Scrape each URL for content
for url in $URLS; do
  echo "Scraping: $url"
  # Build the JSON body with jq so special characters in the URL are escaped
  curl -s -X POST http://10.10.10.107:3002/v1/scrape \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg url "$url" '{url: $url, formats: ["markdown"]}')" | \
    jq -r '.data.markdown[:1000]'
  echo "---"
done
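The same search-then-scrape flow can be sketched in Python. In this sketch `research` is a hypothetical function, and the `search` and `scrape` callables are supplied by the caller, for example thin wrappers around the SearXNG and Firecrawl endpoints. Their signatures are assumptions made for illustration.

```python
def research(query, search, scrape, top_n=3, max_chars=1000):
    """Search, take the top_n result URLs, and scrape each one.

    search(query) -> list of result dicts with a "url" key
    scrape(url)   -> markdown string
    Both callables are caller-supplied (hypothetical signatures).
    """
    pages = []
    for result in search(query)[:top_n]:
        url = result.get("url")
        if not url:
            continue  # skip results that carry no URL
        try:
            pages.append({"url": url, "markdown": scrape(url)[:max_chars]})
        except Exception as exc:
            # Record the failure and keep going with the remaining URLs
            pages.append({"url": url, "error": str(exc)})
    return pages
```

Recording per-page errors instead of aborting means one unreachable site does not discard the results already collected.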

Documentation Scraping

# Scrape documentation site
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.minimaxi.com/api-reference",
    "formats": ["markdown"],
    "onlyMainContent": true
  }' | jq -r '.data.markdown' > /tmp/minimax-docs.md

echo "Documentation saved to /tmp/minimax-docs.md"

News Monitoring

# Search for recent news
curl -s 'http://10.10.10.107:8080/search?q=cybersecurity+threats&time_range=day&category_news=on&format=json' | \
  jq '.results[:5] | .[] | {title: .title, url: .url, publishedDate: .publishedDate}'

Research Patterns

API Documentation Lookup

# Search for API docs (--data-urlencode encodes the spaces in the query)
QUERY="minimax anthropic api authentication"
RESULTS=$(curl -s -G 'http://10.10.10.107:8080/search' \
  --data-urlencode "q=$QUERY" --data-urlencode 'format=json')

# Get first result and scrape
FIRST_URL=$(printf '%s' "$RESULTS" | jq -r '.results[0].url')
echo "Scraping: $FIRST_URL"
curl -s -X POST http://10.10.10.107:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg url "$FIRST_URL" '{url: $url, formats: ["markdown"]}')" | \
  jq -r '.data.markdown[:2000]'

Error Research

# Search for error solutions (--data-urlencode encodes the spaces in the query)
ERROR="invalid api key minimax"
curl -s -G 'http://10.10.10.107:8080/search' \
  --data-urlencode "q=$ERROR" --data-urlencode 'format=json' | \
  jq '.results[:5] | .[] | {title: .title, url: .url}'

Competitor Analysis

# Compare solutions
curl -s 'http://10.10.10.107:8080/search?q=openclaw+alternative+tools&format=json' | \
  jq '.results[:5] | .[] | {title, url, content: .content[:200]}'

Python Integration

SearXNG Python Helper

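import requests

SEARXNG_URL = "http://10.10.10.107:8080/search"

def search_searxng(query, limit=5):
    """Return up to `limit` SearXNG results for `query`."""
    # Pass the query via params so characters such as & and # are URL-encoded
    response = requests.get(
        SEARXNG_URL,
        params={"q": query, "format": "json"},
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json().get("results", [])[:limit]

# Usage
results = search_searxng("fastapi routing", 3)
for r in results:
    print(f"{r['title']}: {r['url']}")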

Firecrawl Python Helper

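import requests

def scrape_page(url):
    """Return the page at `url` as markdown, or '' if none was returned."""
    response = requests.post(
        "http://10.10.10.107:3002/v1/scrape",
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()  # surface HTTP errors early
    # "data" may be missing or null on failure, so fall back to an empty dict
    return (response.json().get("data") or {}).get("markdown", "")

# Usage
content = scrape_page("https://docs.minimaxi.com")
print(content[:1000])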

Safety & Ethics

  • Respect robots.txt
  • Don't overload servers (rate limit requests)
  • Cache results when possible
  • Don't scrape sensitive/personal data
  • Follow terms of service
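Two of the points above, rate limiting and caching, can be sketched as a small wrapper around any fetch function. `PoliteFetcher` is a hypothetical name, the one-second default interval is an arbitrary choice, and the cache is a plain in-memory dict that does not expire entries.

```python
import time


class PoliteFetcher:
    """Wrap a fetch callable with a minimum delay between calls and a cache."""

    def __init__(self, fetch, min_interval=1.0, clock=time.monotonic,
                 sleep=time.sleep):
        self._fetch = fetch              # callable taking a URL, returning content
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None           # time of the most recent real request
        self._cache = {}

    def get(self, url):
        if url in self._cache:
            return self._cache[url]      # cached: no request and no delay
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)        # space out consecutive requests
        self._last_call = self._clock()
        result = self._fetch(url)
        self._cache[url] = result
        return result
```

Wrapping a scrape function, as in `PoliteFetcher(my_scrape_function).get(url)` with a caller-supplied function, means repeated lookups of the same URL hit the target site only once.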

Troubleshooting

SearXNG Not Responding

# Check service (SearXNG serves its health check at /healthz)
curl -s http://10.10.10.107:8080/healthz

# Restart if needed (on CT107)
# docker restart searxng

Firecrawl Errors

# Check health
curl -s http://10.10.10.107:3002/health

# Common issues:
# - URL unreachable (check network)
# - Rate limited (wait and retry)
# - Content blocked (try different user-agent)
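For the rate-limited case, "wait and retry" is commonly implemented as exponential backoff. The following is a generic sketch rather than anything Firecrawl-specific: `retry_with_backoff` is a hypothetical helper, and the retry count and delays are arbitrary defaults.

```python
import time


def retry_with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep,
                       retry_on=(Exception,)):
    """Run call(); on failure wait base_delay * 2**attempt, then try again.

    After `retries` failed retries the last exception is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... by default
```

Narrowing `retry_on` to the specific exception types a client raises for rate limiting avoids retrying errors that will never succeed, such as a malformed URL.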

ArchiveBox Access

# Default admin credentials may be needed
# Check CT107 configuration
