process-logs - SKILL.md Agent Skill

name: process-logs
description: Process error logs from admin panel - fetch new errors, analyze, create tasks, fix, and mark resolved. Checks Axiom/Pino if DB entries are filtered.
version: 1.9.0

Process Error Logs

Automated workflow for processing error logs from /admin/logs.

CRITICAL REQUIREMENTS

YOU MUST FOLLOW THESE RULES. NO EXCEPTIONS.

1. BEADS IS MANDATORY

EVERY error MUST have a Beads task. No direct fixes without tracking.

# ALWAYS run this FIRST for each error:
bd create --type=bug --priority=<1-3> --title="Fix: <error_message>" --files "<relevant_files>"
bd update <task_id> --status=in_progress

2. TASK COMPLEXITY ROUTING

Route tasks by complexity:

Complexity	Examples	Action
Simple	Typo fix, single import, config value	Execute directly
Medium	Multi-file fix, migration, API change	Delegate to subagent
Complex	Architecture change, new feature	Ask user first

Subagent selection for MEDIUM tasks:

DB/migration → database-architect
API/tRPC → fullstack-nextjs-specialist
Types → typescript-types-specialist
UI → nextjs-ui-designer

Execute directly for SIMPLE tasks:

Single-line fix (typo, wrong value)
Import path correction
Config constant change
Comment fix
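The routing rules above can be sketched as a small lookup. This is only an illustration: the `Complexity`, `Area`, and `Route` types and the `routeTask` function are hypothetical names, not part of the skill or of any tool; only the subagent names come from the list above.

```typescript
// Hypothetical types for illustration; the skill itself defines no such API.
type Complexity = "simple" | "medium" | "complex";
type Area = "db" | "api" | "types" | "ui";

type Route =
  | { action: "execute_directly" }
  | { action: "delegate"; subagent: string }
  | { action: "ask_user" };

// Subagent names are copied from the selection list above.
const SUBAGENT_BY_AREA: Record<Area, string> = {
  db: "database-architect",
  api: "fullstack-nextjs-specialist",
  types: "typescript-types-specialist",
  ui: "nextjs-ui-designer",
};

// Simple tasks are executed directly, medium tasks are delegated by area,
// complex tasks always go back to the user first.
function routeTask(complexity: Complexity, area: Area): Route {
  switch (complexity) {
    case "simple":
      return { action: "execute_directly" };
    case "medium":
      return { action: "delegate", subagent: SUBAGENT_BY_AREA[area] };
    case "complex":
      return { action: "ask_user" };
  }
}
```

The area only matters for MEDIUM tasks; for the other two levels the table above decides the action on its own.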

3. DOCS L1/L2 IS MANDATORY

ALWAYS query documentation before implementing:

@neuledge/context MCP first with package@version from lockfile; Context7 MCP fallback: resolve-library-id -> query-docs

4. BUG FIXING PRINCIPLES

This is PRODUCTION. Every bug matters.

Fix fundamentally, not superficially:

Find and fix the ROOT CAUSE, not just symptoms
If error happens in function X but cause is in function Y → fix Y
Don't add workarounds/hacks that mask the problem
Ask: "Why did this happen?" until you reach the actual cause

Never ignore errors:

Every error indicates a real problem
"Works most of the time" is NOT acceptable
External service errors → add retry logic or graceful degradation
Config warnings → fix config or make truly optional

Propose improvements:

If you see code that could be better → create separate Beads task
If fix reveals related issues → document them
If pattern repeats → suggest refactoring to prevent future bugs
Format: bd create --type=chore --title="Improve: <description>"

Quality over speed:

Take time to understand the full context
Test the fix mentally: "What else could break?"
Check for similar patterns elsewhere in codebase
One good fix > multiple quick patches

5. LOG NOTES (MANDATORY)

Always write notes when updating log status. Keep it brief, in English.

Status	What to write in notes
`resolved`	Root cause + fix applied. Example: `Missing constraint. Added 'approved' to enum via migration.`
`auto_muted`	System-assigned. Don't change. Skip these errors in processing.
`ignored`	Never use. Fix or ask user.
`to_verify`	Why pending + what to check. Auto-resolved after 14d if no recurrence. Example: `External API timeout. Monitor for 24h.`
`in_progress`	Beads task ID. Example: `Working on mc2-5ch`

Format: <root_cause>. <action_taken>. — Max 100 chars.

Examples:

ESM import conflict. Renamed generator.ts to generator-node.ts.
Constraint missing 'approved'. Added via migration 20250115_fix_status.
Cloudflare 500. External issue, retry logic already exists. Monitoring.
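A minimal sketch of how a `resolved` note could be checked against the format above before it is written. The helper `checkResolvedNote` is hypothetical, and the "at least two sentences" test is only an approximation of `<root_cause>. <action_taken>.`; notes for other statuses (for example `in_progress`) follow the table instead.

```typescript
const MAX_NOTE_LENGTH = 100; // limit stated in the format rule above

// Hypothetical helper: returns a list of problems, empty when the note looks fine.
function checkResolvedNote(note: string): string[] {
  const problems: string[] = [];
  const trimmed = note.trim();
  if (trimmed.length === 0) {
    problems.push("note is empty");
  }
  if (trimmed.length > MAX_NOTE_LENGTH) {
    problems.push(`note is ${trimmed.length} chars, max is ${MAX_NOTE_LENGTH}`);
  }
  // Split on a period followed by whitespace or end of text, so that
  // dots inside file names such as generator.ts do not count as sentence ends.
  const sentences = trimmed.split(/\.\s+|\.$/).filter((s) => s.length > 0);
  if (sentences.length < 2) {
    problems.push("expected root cause and action as separate sentences");
  }
  return problems;
}
```

For instance, the first example note above passes, while a bare `Fixed` is flagged because it names neither a root cause nor an action.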

6. AUTO-MUTED ERRORS

Some errors are automatically ignored by the system with status auto_muted. These are expected events, NOT bugs.

Current auto-mute rules (from src/shared/logger/auto-classification.ts, total: 58):

Pattern	Reason	Description
`Redis connection (ended\|closed)`	graceful_shutdown	Redis disconnects during app restart
`graceful.*shutdown`	graceful_shutdown	Server shutdown events during deploys
`/api/trpc/health.*404`	monitoring_probe	tRPC health endpoint probes (Uptime Kuma)
`/health.*404`	monitoring_probe	Generic health check probes
`Cloudflare.*5\d{2}`	external_service	Cloudflare edge errors (502, 503, 521)
`ECONNRESET.*external`	external_service	External API connection resets
`Layer failed, trying next`	cascading_repair	Repair layer failed, trying next layer
`Critique-revise attempt failed`	cascading_repair	Layer 2 retry attempt failed
`Zod.validation failed.Layer`	cascading_repair	Layer 1 validation failed, escalating
`Job stalled`	job_lifecycle	BullMQ job restarted (long LLM operations)
`Unexpected exit code: 10`	job_lifecycle	Worker TTL timeout (10 min), will retry
`No RAG chunks found`	expected_behavior	Course without docs, generates w/o RAG
`Mermaid.fallback.used`	graceful_fallback	Diagram gen failed, fallback to text
`/trpc/.*401`	expected_behavior	Unauthenticated tRPC request, 401 correct
`Cache directory does not exist`	expected_behavior	Cache missing on fresh env, created later
`ModelConfigBunker.sync.fail`	external_service	Network issue, has retry with backoff
`Invalid status for approval`	ui_race_condition	User clicked approve but course progressed
`Job .+ not found`	expected_behavior	Frontend polls job status after cleanup
`Failed to log generation trace`	expected_behavior	Trace insert failed during pool pressure
`Patcher.REJECTED.truncated`	graceful_fallback	Truncated content detected, returns original
`Preprocessing failed.*using raw`	graceful_fallback	Preprocessing failed, using raw LLM output
`Stage 5.*Primary model attempt`	cascading_repair	Stage 5 primary model unavailable, will retry
`JSON repair failed after all`	graceful_fallback	JSON repair exhausted, LLM output too malformed
`ModelConfigBunker.*LKG file`	graceful_fallback	LKG atomic write race, has Redis+DB fallback
`could not renew lock for job`	job_lifecycle	BullMQ lock renewal failed, will restart
`Missing key for job.*moveToDelayed`	job_lifecycle	BullMQ race condition, job already done
`Critical language consistency`	expected_behavior	Cyrillic false positive in Russian courses
`Critical heuristic failures`	expected_behavior	Heuristic skipped LLM review (false positive)
`Rate limit exceeded`	expected_behavior	tRPC rate limiter working as designed
`/trpc/lessonContent.*429`	expected_behavior	HTTP 429 from rate limiter on partial generate
`/trpc/jobs\.getStatus 404`	expected_behavior	HTTP 404 from job status poll after cleanup
`Sufficiency verdict.*defaulting`	graceful_fallback	Phase 0.5 Zod validation fallback, non-blocking
`Batch section insert failed.*fallback`	graceful_fallback	Batch insert duplicate → individual fallback
`Failed to create section record`	graceful_fallback	Individual section insert skipped (already exists)
`Content failed sanity check.*non-blocking`	expected_behavior	Sanity check warning, content still accepted
`Unavailable For Legal Reasons\|content policy violation`	content_policy	Jina API content policy rejection (PII/legal)
`Using STALE phase config due to database error`	graceful_fallback	ModelConfigBunker stale config during DB outage
`\[Phase 6\] Max retries reached.*best-effort`	graceful_fallback	Phase 6 summary retries exhausted, best-effort
`MISCONF Redis.*unable to persist`	infrastructure	Redis RDB/AOF disk issue - infra problem, not app
`Concurrency limit exceeded`	external_service	Jina API server-side concurrency enforcement
`getaddrinfo EAI_AGAIN`	infrastructure	DNS resolution failure during deploy/restart
`Course not ready.*generation`	expected_behavior	Frontend polls after course completed or not ready

Total rules: 60 (test validates sync with code)

Metadata-aware matching (v1.10):

shouldAutoMute() now accepts optional context?: { message?: string } parameter. For tRPC errors where error_message is generic ("tRPC error"), the actual error from metadata.message is also checked against all patterns. This catches ~2000+ errors per week that previously slipped through.
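The metadata-aware check described above might look roughly like the sketch below. It is an assumption-laden illustration, not the code in auto-classification.ts: the function name `shouldAutoMuteSketch`, the `AutoMuteRule` shape, and the return type are hypothetical, and only two patterns from the table are included.

```typescript
// Hypothetical rule shape; the real file may store more fields.
interface AutoMuteRule {
  pattern: RegExp;
  reason: string;
}

// Illustrative subset of the rules listed in the table above.
const RULES: AutoMuteRule[] = [
  { pattern: /Rate limit exceeded/i, reason: "expected_behavior" },
  { pattern: /Job stalled/i, reason: "job_lifecycle" },
];

// Checks error_message first and, when provided, the message taken from metadata.
// Returns the matching rule, or undefined when nothing matches.
function shouldAutoMuteSketch(
  errorMessage: string,
  context?: { message?: string },
): AutoMuteRule | undefined {
  const candidates = [errorMessage, context?.message].filter(
    (text): text is string => typeof text === "string" && text.length > 0,
  );
  return RULES.find((rule) => candidates.some((text) => rule.pattern.test(text)));
}
```

With a generic `"tRPC error"` and no context nothing matches; passing `{ message: "Rate limit exceeded for user" }` as context lets the same error match the rate-limit rule.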

Test environment auto-muting:

Errors from NODE_ENV=test (vitest) are automatically muted at insert time via muteTestEnvironmentLog() in error-service.ts. They get environment = 'test' and status = 'auto_muted' immediately. This prevents test errors from polluting the admin logs UI and triggering auto-reopen.

When you see auto_muted errors:

Skip them in processing — they don't need fixes
If you see a pattern that should be auto-muted, add it to auto-classification.ts
Note: auto-muted patterns are now filtered BEFORE DB insert (pre-insert filter), so most won't appear in error_logs at all. Only errors from logPermanentFailure() (canonical path) may still get auto-muted after insert.

How to add a new auto-mute rule:

Edit packages/course-gen-platform/src/shared/logger/auto-classification.ts:

{
  pattern: /your-pattern/i,
  reason: 'category',  // one of the Reason values in the table above, e.g. graceful_shutdown | monitoring_probe | external_service | expected_behavior
  description: 'Why this is expected',
}

Update this SKILL.md with the new pattern

When NOT to auto-mute:

Errors that SOMETIMES indicate real problems
New error types (analyze first, then decide)
Anything affecting user experience

7. SEARCH SIMILAR PROBLEMS FIRST (MANDATORY)

Before fixing ANY error, search BOTH sources:

7a. Search in Beads (closed bug tasks)

# Search by error keywords
bd search "<keyword>" --type=bug --status=closed

# Example searches:
bd search "constraint violation"
bd search "tRPC timeout"
bd search "undefined property"

What to look for in Beads:

Similar error patterns in task titles
Root cause analysis in task descriptions
Fix approach and files changed

7b. Search in error_logs (resolved errors)

-- Search similar errors by message (use mcp__supabase__execute_sql)
SELECT el.id, el.error_message, el.severity, lis.status, lis.notes, el.created_at
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.log_id = el.id AND lis.log_type = 'error_log'
WHERE to_tsvector('english', el.error_message) @@ plainto_tsquery('english', '<keyword>')
  AND lis.status = 'resolved'
ORDER BY el.created_at DESC
LIMIT 5;

What to search for:

Key error terms: constraint, undefined, timeout, not found
Function/module names from stack trace
Error codes or specific identifiers

7c. If found similar resolved issue

From Beads: Read task description for root cause and fix approach
From error_logs: Read the notes field — contains root cause and fix
Apply same solution pattern if applicable
Reference in your notes: Similar to mc2-xxx / <date>. Same fix applied.

7d. If NOT found — create Beads task (MANDATORY)

Every new error MUST have a Beads task before fixing:

# 1. Create task with all required fields
bd create --type=bug --priority=<1-3> --title="Fix: <error_message>" --files "<relevant_files>"

# 2. Start working
bd update <task_id> --status=in_progress

# 3. After fix - close with detailed reason
bd close <task_id> --reason="Root cause: <why>. Fix: <what was done>."

Beads task MUST include:

Clear title with error essence
Priority based on severity (CRITICAL=1, ERROR=2, WARNING=3)
Files that will be modified (--files)
Closing reason with root cause explanation
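As an illustration of the severity-to-priority rule above, the sketch below formats the `bd create` command text. It does not run `bd`; the helper name `buildBdCreate` is hypothetical, and joining multiple files with a comma is an assumption, since the skill only shows `--files "<relevant_files>"`.

```typescript
type Severity = "CRITICAL" | "ERROR" | "WARNING";

// Mapping stated above: CRITICAL=1, ERROR=2, WARNING=3.
const PRIORITY_BY_SEVERITY: Record<Severity, 1 | 2 | 3> = {
  CRITICAL: 1,
  ERROR: 2,
  WARNING: 3,
};

// Hypothetical helper: builds the command string only.
// ASSUMPTION: multiple files are comma-separated inside --files.
function buildBdCreate(severity: Severity, errorMessage: string, files: string[]): string {
  // Escape double quotes so the title stays a single shell argument.
  const title = `Fix: ${errorMessage}`.replace(/"/g, '\\"');
  const priority = PRIORITY_BY_SEVERITY[severity];
  return `bd create --type=bug --priority=${priority} --title="${title}" --files "${files.join(",")}"`;
}
```

Deriving the priority from the log's severity keeps Beads tasks and the admin log ordering consistent.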

Usage

Invoke via: /process-logs or "обработай логи ошибок" (Russian for "process the error logs")

Workflow

Step 1: Fetch New Errors

IMPORTANT: The /admin/logs UI shows errors from TWO tables:

error_logs — system errors, validation failures, worker errors
generation_trace (where error_data IS NOT NULL) — LLM generation errors

Both tables must be checked. Logs without a log_issue_status record, or with an explicit status = 'new', show as "Новый" (new) in the UI.

1a. Check error_logs

-- Use mcp__supabase__execute_sql
-- NOTE: This excludes auto_muted errors (they are handled automatically)
SELECT el.id, el.severity, el.error_message, el.metadata, el.stack_trace,
       el.course_id, el.lesson_id, el.request_id, el.trpc_path, el.trpc_input, el.attempted_value
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.log_id = el.id AND lis.log_type = 'error_log'
WHERE lis.id IS NULL OR (lis.status NOT IN ('resolved', 'ignored', 'auto_muted'))
ORDER BY
  CASE el.severity WHEN 'CRITICAL' THEN 1 WHEN 'ERROR' THEN 2 ELSE 3 END,
  el.created_at DESC
LIMIT 20;

1b. Check generation_trace (LLM errors)

-- generation_trace with error_data shows as ERROR in UI
SELECT gt.id, gt.created_at, gt.stage, gt.phase, gt.step_name, gt.course_id,
       (gt.error_data->>'message')::text as error_message
FROM generation_trace gt
LEFT JOIN log_issue_status lis ON gt.id = lis.log_id AND lis.log_type = 'generation_trace'
WHERE gt.error_data IS NOT NULL
  AND (lis.id IS NULL OR lis.status NOT IN ('resolved', 'ignored', 'auto_muted'))
ORDER BY gt.created_at DESC
LIMIT 20;

1c. Quick count check

-- Quick check: how many "new" errors in each table?
-- "New" means no log_issue_status record OR an explicit status = 'new'
SELECT
  'error_logs' as source,
  (SELECT COUNT(*) FROM error_logs el
   LEFT JOIN log_issue_status lis ON el.id = lis.log_id AND lis.log_type = 'error_log'
   WHERE lis.id IS NULL OR lis.status = 'new') as new_count
UNION ALL
SELECT
  'generation_trace' as source,
  (SELECT COUNT(*) FROM generation_trace gt
   LEFT JOIN log_issue_status lis ON gt.id = lis.log_id AND lis.log_type = 'generation_trace'
   WHERE gt.error_data IS NOT NULL AND (lis.id IS NULL OR lis.status = 'new')) as new_count;

Step 1.5: Filter by Environment (IMPORTANT)

CRITICAL: Both DEV and STAGE are production-like servers. ALL errors on these environments must be investigated and fixed. Only LOCAL (NULL) can be bulk-resolved.

The error_logs table has an environment column that indicates where the error occurred:

Value	Environment	Action
`NULL`	Local dev	Bulk resolve — local testing/development only
`'dev'`	Dev server	MUST FIX — real errors affecting developers
`'stage'`	Staging (prod)	MUST FIX — real production errors
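The table above reduces to one rule: a NULL environment is local and may be bulk-resolved, anything else must be fixed. A small sketch, with the hypothetical names `ErrorRow`, `actionFor`, and `partitionByAction` standing in for whatever shape the query results take:

```typescript
// Hypothetical row shape: only the column discussed above plus an id.
interface ErrorRow {
  id: string;
  environment: string | null; // NULL = local, otherwise 'dev' or 'stage'
}

type EnvAction = "bulk_resolve" | "must_fix";

// NULL means a local run; every non-NULL environment is a server and must be fixed.
function actionFor(row: ErrorRow): EnvAction {
  return row.environment === null ? "bulk_resolve" : "must_fix";
}

// Splits rows into the two buckets so they can be handled separately.
function partitionByAction(rows: ErrorRow[]): Record<EnvAction, ErrorRow[]> {
  const result: Record<EnvAction, ErrorRow[]> = { bulk_resolve: [], must_fix: [] };
  for (const row of rows) {
    result[actionFor(row)].push(row);
  }
  return result;
}
```

Treating every non-NULL value as `must_fix` means a new server environment is investigated by default rather than silently bulk-resolved.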

Always check environment distribution first:

-- Check how many errors per environment (includes both NULL status AND status='new')
SELECT environment, COUNT(*) as count
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.fingerprint = el.fingerprint AND lis.log_type = 'error_log'
WHERE lis.id IS NULL OR lis.status = 'new'
GROUP BY environment
ORDER BY count DESC;

Bulk resolve LOCAL errors only:

-- Bulk resolve ONLY local environment errors (environment IS NULL)
-- NEVER bulk resolve dev or stage errors - they must be investigated individually!
-- NOTE: This handles both NULL status AND status='new'
WITH local_fingerprints AS (
  SELECT DISTINCT ON (el.fingerprint) el.id, el.fingerprint
  FROM error_logs el
  LEFT JOIN log_issue_status lis ON lis.fingerprint = el.fingerprint AND lis.log_type = 'error_log'
  WHERE (lis.id IS NULL OR lis.status = 'new')
    AND el.environment IS NULL
    AND el.fingerprint IS NOT NULL
  ORDER BY el.fingerprint, el.created_at DESC
)
INSERT INTO log_issue_status (log_type, log_id, status, notes, fingerprint, updated_at)
SELECT 'error_log', lf.id, 'resolved', 'Local environment: Testing/development errors', lf.fingerprint, NOW()
FROM local_fingerprints lf
ON CONFLICT (log_type, log_id) DO UPDATE SET status = 'resolved', notes = EXCLUDED.notes, updated_at = NOW();

Focus on server errors (dev + stage):

-- Get only SERVER errors (dev and stage environments)
-- NOTE: Includes both NULL status AND status='new'
SELECT
  el.environment,
  el.fingerprint,
  el.severity,
  MIN(el.error_message) as error_message,
  COUNT(*) as count,
  MAX(el.created_at) as last_seen
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.fingerprint = el.fingerprint AND lis.log_type = 'error_log'
WHERE (lis.id IS NULL OR lis.status = 'new')
  AND el.fingerprint IS NOT NULL
  AND el.environment IS NOT NULL  -- Exclude local (NULL)
GROUP BY el.environment, el.fingerprint, el.severity
ORDER BY
  CASE el.severity WHEN 'CRITICAL' THEN 1 WHEN 'ERROR' THEN 2 ELSE 3 END,
  COUNT(*) DESC
LIMIT 20;

Why this matters:

Local testing generates thousands of errors (incomplete data, experiments)
Dev and stage servers have real errors that need investigation
Bulk resolving only local (NULL) errors saves time without missing real bugs

Step 1.7: Check to_verify Fingerprints

Run this on EVERY skill invocation, before processing new errors. It auto-resolves stale to_verify fingerprints and reopens those whose errors recurred:

1.7a. Run auto-resolution

-- Use mcp__supabase__execute_sql
-- Resolves inactive to_verify (14d no recurrence) and reopens recurred ones
SELECT resolve_inactive_to_verify(14);

Returns JSON:

{
  "resolved_count": 3,
  "reopened_count": 1,
  "resolved_fingerprints": ["abc...", "def..."],
  "reopened_fingerprints": ["ghi..."],
  "inactive_days": 14
}

1.7b. Handle results

resolved_count > 0: Fixes confirmed. Include count in Step 3 summary.
reopened_count > 0: Errors recurred — fixes didn't work. These fingerprints are now in_progress and will appear in Step 2 processing. Prioritize them.
Both 0: No to_verify fingerprints pending. Continue to Step 2.
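The three cases above can be expressed as a small function over the returned JSON. The `ToVerifyResult` interface mirrors the sample JSON shown earlier; `ToVerifyPlan` and `planFromToVerify` are hypothetical names used only for this sketch.

```typescript
// Mirrors the JSON returned by resolve_inactive_to_verify(14) as shown above.
interface ToVerifyResult {
  resolved_count: number;
  reopened_count: number;
  resolved_fingerprints: string[];
  reopened_fingerprints: string[];
  inactive_days: number;
}

// Hypothetical output: lines for the Step 3 summary plus fingerprints to handle first.
interface ToVerifyPlan {
  summaryLines: string[];
  prioritize: string[];
}

function planFromToVerify(result: ToVerifyResult): ToVerifyPlan {
  const summaryLines: string[] = [];
  if (result.resolved_count > 0) {
    summaryLines.push(
      `Auto-resolved (${result.inactive_days}d no recurrence): ${result.resolved_count}`,
    );
  }
  if (result.reopened_count > 0) {
    summaryLines.push(`Reopened (error recurred): ${result.reopened_count}`);
  }
  if (summaryLines.length === 0) {
    summaryLines.push("No to_verify fingerprints pending");
  }
  // Reopened fingerprints are the ones whose fixes did not hold.
  return { summaryLines, prioritize: [...result.reopened_fingerprints] };
}
```

The `prioritize` list feeds Step 2, and the summary lines match the rows of the to_verify table in the Step 3 report.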

1.7c. Query reopened details (if reopened_count > 0)

-- Get details of reopened fingerprints for Step 2 processing
SELECT lis.fingerprint, lis.notes,
       (SELECT MIN(el.error_message) FROM error_logs el WHERE el.fingerprint = lis.fingerprint) as error_message,
       (SELECT COUNT(*) FROM error_logs el
        WHERE el.fingerprint = lis.fingerprint
          AND el.created_at > lis.updated_at - INTERVAL '14 days') as recent_count
FROM log_issue_status lis
WHERE lis.status = 'in_progress'
  AND lis.notes LIKE 'Recurred after fix%'
  AND lis.updated_at > NOW() - INTERVAL '5 minutes';

Step 2: For EACH Error (Loop)

FOR each error:
  1. CREATE BEADS TASK (MANDATORY):
     bd create --type=bug --priority=<1-3> --title="Fix: <message>" --files "<files>"
     bd update <id> --status=in_progress

  2. ANALYZE error type and SELECT subagent:
     - DB constraint → database-architect
     - tRPC/API → fullstack-nextjs-specialist
     - Types → typescript-types-specialist
     - UI → nextjs-ui-designer

  3. USE Docs L1/L2 for relevant docs

  4. DELEGATE using Task tool (MEDIUM tasks; execute SIMPLE fixes directly, ask user first for COMPLEX):
     Task(subagent_type="<selected>", prompt="Fix error: <details>...")

  5. VERIFY results (MANDATORY):
     - Read tool: check modified files
     - Bash: pnpm type-check && pnpm build
     - If errors → re-delegate

  6. MARK resolved in DB (notes format: <root_cause>. <action_taken>. Max 100 chars):
     -- For error_logs:
     INSERT INTO log_issue_status (log_type, log_id, status, notes, updated_at)
     VALUES ('error_log', '<id>', 'resolved', '<root_cause>. <action_taken>.', NOW())
     ON CONFLICT (log_type, log_id) DO UPDATE SET status = 'resolved', notes = EXCLUDED.notes, updated_at = NOW();

     -- For generation_trace:
     INSERT INTO log_issue_status (log_type, log_id, status, notes, updated_at)
     VALUES ('generation_trace', '<id>', 'resolved', '<root_cause>. <action_taken>.', NOW())
     ON CONFLICT (log_type, log_id) DO UPDATE SET status = 'resolved', notes = EXCLUDED.notes, updated_at = NOW();

  7. CLOSE Beads task:
     bd close <id> --reason="Root cause: <why>. Fix: <what was done>."

Step 3: Summary Report

## Log Processing Summary

| Severity | Fixed | Pending | To Verify |
| -------- | ----- | ------- | --------- |
| CRITICAL | X     | Y       | Z         |
| ERROR    | X     | Y       | Z         |
| WARNING  | X     | Y       | Z         |

### to_verify Auto-Resolution

| Action                            | Count |
| --------------------------------- | ----- |
| Auto-resolved (14d no recurrence) | X     |
| Reopened (error recurred)         | Y     |

### Beads Tasks Created:

- mc2-xxx: <description> → <status>

### Pending (need user input):

- <log_id>: <reason>

Subagent Delegation Examples

DB Constraint Error

Task(
  subagent_type="database-architect",
  prompt="Fix DB constraint violation in error_logs.
  Error: <full_error_message>
  Context: <stack_trace>
  Course: <course_id>
  Create migration to fix the constraint."
)

tRPC/API Error

Task(
  subagent_type="fullstack-nextjs-specialist",
  prompt="Fix tRPC error in <trpc_path>.
  Error: <full_error_message>
  Input: <trpc_input>
  Stack: <stack_trace>
  Fix the API endpoint."
)

Type Error

Task(
  subagent_type="typescript-types-specialist",
  prompt="Fix TypeScript type error.
  Error: <full_error_message>
  File: <file_path>
  Fix types and ensure compatibility."
)

Verification Checklist

Before marking ANY error as resolved:

Beads task exists for this error
Subagent was used (if not trivial fix)
Modified files reviewed with Read tool
pnpm type-check passes
pnpm build passes
No new errors introduced
Beads task closed with reason

Error Categories

Pattern	Category	Subagent	Priority
`violates.*constraint`	DB constraint	`database-architect`	1
`tRPC error`	API bug	`fullstack-nextjs-specialist`	2
`Type.*error`	Type error	`typescript-types-specialist`	2
`Error querying`	Query bug	`database-architect`	2
Config missing	Config issue	ASK USER	3
External service	External	mark `to_verify`	3
Redis shutdown	Expected	SKIP (auto_muted)	-
Health probe 404	Expected	SKIP (auto_muted)	-

Errors with status auto_muted are automatically ignored by the system. Skip them.
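The regex rows of the table above could be applied as a first-match lookup, sketched below. The names `ErrorCategory`, `ERROR_CATEGORIES`, and `categorize` are hypothetical, and the case-insensitive `/i` flag is an assumption borrowed from the auto-mute rule example; the table itself does not say how patterns are matched. Rows without a regex (config, external service, auto-muted) are left to manual judgment.

```typescript
interface ErrorCategory {
  pattern: RegExp;
  category: string;
  subagent: string;
  priority: 1 | 2 | 3;
}

// Only the rows of the table that carry a regex pattern.
// ASSUMPTION: case-insensitive matching.
const ERROR_CATEGORIES: ErrorCategory[] = [
  { pattern: /violates.*constraint/i, category: "DB constraint", subagent: "database-architect", priority: 1 },
  { pattern: /tRPC error/i, category: "API bug", subagent: "fullstack-nextjs-specialist", priority: 2 },
  { pattern: /Type.*error/i, category: "Type error", subagent: "typescript-types-specialist", priority: 2 },
  { pattern: /Error querying/i, category: "Query bug", subagent: "database-architect", priority: 2 },
];

// Returns the first matching row, or undefined when the error needs manual triage.
function categorize(errorMessage: string): ErrorCategory | undefined {
  return ERROR_CATEGORIES.find((entry) => entry.pattern.test(errorMessage));
}
```

An `undefined` result is a signal to analyze the error by hand, not a reason to skip it.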

Reference Docs

Admin Logs Guide: .claude/docs/admin-logs-guide.md
Error Types: packages/course-gen-platform/src/shared/logger/types.ts
Logs Router: packages/course-gen-platform/src/server/routers/admin/logs.ts
CLAUDE.md: Main orchestration rules

Architecture Note

The /admin/logs page aggregates errors from two sources:

Table	log_type	What it contains
`error_logs`	`'error_log'`	System errors, validation, worker failures
`generation_trace`	`'generation_trace'`	LLM errors (where `error_data IS NOT NULL`)

Status is tracked in log_issue_status table with composite key (log_type, log_id).

UI Logic: Status shows as "Новый" (new) when:

No log_issue_status record exists for the fingerprint
OR record exists with explicit status = 'new'

IMPORTANT: Always check BOTH conditions when querying for new errors.
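In code form, the two conditions above amount to a single predicate. This is a sketch only; `LogIssueStatusRow` and `showsAsNew` are hypothetical names, and the row is reduced to the one column that matters here.

```typescript
// Hypothetical minimal view of a log_issue_status row.
interface LogIssueStatusRow {
  status: string;
}

// A log shows as new when there is no status row at all,
// or when the row exists with an explicit 'new' status.
function showsAsNew(statusRow: LogIssueStatusRow | null | undefined): boolean {
  return statusRow == null || statusRow.status === "new";
}
```

In SQL the same predicate is the `lis.id IS NULL OR lis.status = 'new'` condition on the LEFT JOIN result.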

Grouped View (fingerprint)

The UI has two views:

List view — individual logs, status by log_id
Grouped view — errors grouped by fingerprint, status by fingerprint

Auto-sync trigger (trg_sync_log_status_fingerprint):

When you INSERT/UPDATE log_issue_status for an error_log
The trigger automatically copies fingerprint from error_logs
This ensures grouped view shows correct status

IMPORTANT: You don't need to manually handle fingerprint — the trigger does it automatically. Just use the standard INSERT INTO log_issue_status by log_id.

Error Logging Architecture (Post-Optimization v1.9)

Error flow after volume optimization:

logger.warn/error()
    |
[Proxy Interceptor]
    ├── Pino → stdout → Axiom  (ALWAYS, no filter)
    └── writeToErrorLogs()
          ├── shouldAutoMute() → SKIP if matches auto-mute rules (58 patterns)
          ├── shouldWriteToDb() → SKIP if rate-limited (>5/min per fingerprint)
          └── INSERT into error_logs

logPermanentFailure()  (canonical path, bypasses proxy filters)
    └── INSERT/UPSERT into error_logs → applyAutoMuteStatus()

Key points:

WARN/ERROR always go to Pino/Axiom regardless of DB filters
logPermanentFailure() is the canonical DB write — NOT affected by pre-insert filter
baseLogger.warn/error() bypasses proxy entirely (Pino only, no DB)
Rate limiter prevents outage floods (max 5 per fingerprint per minute)
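The pre-insert path in the diagram above can be sketched as two checks in sequence. Everything here is illustrative: `FingerprintRateLimiter` and `decideDbWrite` are hypothetical names, and the sliding one-minute window is an assumption, since the skill only states the limit of 5 per minute per fingerprint, not how rate-limiter.ts counts.

```typescript
const MAX_PER_MINUTE = 5; // limit stated in the diagram above
const WINDOW_MS = 60_000;

// ASSUMPTION: sliding window over accepted writes; the real limiter may differ.
class FingerprintRateLimiter {
  private readonly hits = new Map<string, number[]>();

  allow(fingerprint: string, nowMs: number): boolean {
    // Keep only the writes accepted within the last minute.
    const recent = (this.hits.get(fingerprint) ?? []).filter((t) => nowMs - t < WINDOW_MS);
    const allowed = recent.length < MAX_PER_MINUTE;
    if (allowed) {
      recent.push(nowMs);
    }
    this.hits.set(fingerprint, recent);
    return allowed;
  }
}

type WriteDecision = "skip_auto_muted" | "skip_rate_limited" | "insert";

// Same order as the diagram: auto-mute check, then rate limit, then insert.
function decideDbWrite(
  message: string,
  fingerprint: string,
  nowMs: number,
  isAutoMuted: (message: string) => boolean,
  limiter: FingerprintRateLimiter,
): WriteDecision {
  if (isAutoMuted(message)) return "skip_auto_muted";
  if (!limiter.allow(fingerprint, nowMs)) return "skip_rate_limited";
  return "insert";
}
```

Either skip only affects the error_logs insert; per the key points above, the Pino/Axiom stream still receives the event.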

If you suspect missing errors in error_logs (filtered by optimization):

Check Pino/Axiom logs for the full unfiltered stream:

Axiom dashboard: all WARN/ERROR logs are always captured
These are NOT affected by DB-level filtering
Use Axiom search when investigating an error that doesn't appear in /admin/logs
Pino logs to stdout are never filtered — they capture everything

Files involved in filtering:

src/shared/logger/index.ts — proxy interceptor + writeToErrorLogs (pre-insert filter)
src/shared/logger/auto-classification.ts — auto-mute patterns (58 rules)
src/shared/logger/rate-limiter.ts — per-fingerprint rate limiter
src/shared/logger/error-service.ts — logPermanentFailure (canonical path, has own auto-mute)