process-logs


Process error logs from admin panel - fetch new errors, analyze, create tasks, fix, and mark resolved. Checks Axiom/Pino if DB entries are filtered.

By maslennikov-ig. Updated 6/12/2026

name: process-logs
description: Process error logs from admin panel - fetch new errors, analyze, create tasks, fix, and mark resolved. Checks Axiom/Pino if DB entries are filtered.
version: 1.9.0

Process Error Logs

Automated workflow for processing error logs from /admin/logs.

CRITICAL REQUIREMENTS

YOU MUST FOLLOW THESE RULES. NO EXCEPTIONS.

1. BEADS IS MANDATORY

EVERY error MUST have a Beads task. No direct fixes without tracking.

# ALWAYS run this FIRST for each error:
bd create --type=bug --priority=<1-3> --title="Fix: <error_message>" --files "<relevant_files>"
bd update <task_id> --status=in_progress

2. TASK COMPLEXITY ROUTING

Route tasks by complexity:

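| Complexity | Examples                              | Action               |
| ---------- | ------------------------------------- | -------------------- |
| Simple     | Typo fix, single import, config value | Execute directly     |
| Medium     | Multi-file fix, migration, API change | Delegate to subagent |
| Complex    | Architecture change, new feature      | Ask user first       |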

Subagent selection for MEDIUM tasks:

  • DB/migration → database-architect
  • API/tRPC → fullstack-nextjs-specialist
  • Types → typescript-types-specialist
  • UI → nextjs-ui-designer

Execute directly for SIMPLE tasks:

  • Single-line fix (typo, wrong value)
  • Import path correction
  • Config constant change
  • Comment fix
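
The routing rules above can be read as a small lookup. The sketch below is illustrative only: `routeTask`, the `Complexity` and `Area` types, and the area keys are hypothetical names, not part of the skill. Only the subagent names are taken from the list above.

```typescript
// Hypothetical helper: maps a task's complexity and area to the action described above.
type Complexity = "simple" | "medium" | "complex";
type Area = "db" | "api" | "types" | "ui";

// Subagent names come from the list above; the Area keys are an assumption.
const SUBAGENT_BY_AREA: Record<Area, string> = {
  db: "database-architect",
  api: "fullstack-nextjs-specialist",
  types: "typescript-types-specialist",
  ui: "nextjs-ui-designer",
};

function routeTask(complexity: Complexity, area: Area): string {
  switch (complexity) {
    case "simple":
      return "execute directly";
    case "medium":
      return `delegate to ${SUBAGENT_BY_AREA[area]}`;
    case "complex":
      return "ask user first";
  }
}
```

For example, `routeTask("medium", "db")` yields `"delegate to database-architect"`, while any complex task yields `"ask user first"` regardless of area.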

3. DOCS L1/L2 IS MANDATORY

ALWAYS query documentation before implementing:

Query the @neuledge/context MCP first, using package@version from the lockfile. Fall back to the Context7 MCP: resolve-library-id -> query-docs

4. BUG FIXING PRINCIPLES

This is PRODUCTION. Every bug matters.

Fix fundamentally, not superficially:

  • Find and fix the ROOT CAUSE, not just symptoms
  • If error happens in function X but cause is in function Y → fix Y
  • Don't add workarounds/hacks that mask the problem
  • Ask: "Why did this happen?" until you reach the actual cause

Never ignore errors:

  • Every error indicates a real problem
  • "Works most of the time" is NOT acceptable
  • External service errors → add retry logic or graceful degradation
  • Config warnings → fix config or make truly optional

Propose improvements:

  • If you see code that could be better → create separate Beads task
  • If fix reveals related issues → document them
  • If pattern repeats → suggest refactoring to prevent future bugs
  • Format: bd create --type=chore --title="Improve: <description>"

Quality over speed:

  • Take time to understand the full context
  • Test the fix mentally: "What else could break?"
  • Check for similar patterns elsewhere in codebase
  • One good fix > multiple quick patches

5. LOG NOTES (MANDATORY)

Always write notes when updating log status. Keep it brief, in English.

| Status      | What to write in notes |
| ----------- | ---------------------- |
| resolved    | Root cause + fix applied. Example: Missing constraint. Added 'approved' to enum via migration. |
| auto_muted  | System-assigned. Don't change. Skip these errors in processing. |
| ignored     | Never use. Fix or ask user. |
| to_verify   | Why pending + what to check. Auto-resolved after 14d if no recurrence. Example: External API timeout. Monitor for 24h. |
| in_progress | Beads task ID. Example: Working on mc2-5ch |

Format: <root_cause>. <action_taken>. — Max 100 chars.

Examples:

  • ESM import conflict. Renamed generator.ts to generator-node.ts.
  • Constraint missing 'approved'. Added via migration 20250115_fix_status.
  • Cloudflare 500. External issue, retry logic already exists. Monitoring.
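
The note format and the 100-character limit can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `formatLogNote` helper that is not part of the skill or the codebase.

```typescript
// Hypothetical helper: builds a status note as "<root_cause>. <action_taken>."
// and enforces the 100-character limit stated above.
const MAX_NOTE_LENGTH = 100;

function formatLogNote(rootCause: string, actionTaken: string): string {
  // Strip trailing periods so the joined note does not end up with "..".
  const clean = (s: string): string => s.trim().replace(/\.+$/, "");
  const note = `${clean(rootCause)}. ${clean(actionTaken)}.`;
  if (note.length > MAX_NOTE_LENGTH) {
    throw new Error(`Note is ${note.length} chars; max is ${MAX_NOTE_LENGTH}`);
  }
  return note;
}
```

Calling `formatLogNote("ESM import conflict", "Renamed generator.ts to generator-node.ts")` reproduces the first example above. A note that exceeds the limit throws instead of being silently truncated, so the author shortens it deliberately.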

6. AUTO-MUTED ERRORS

Some errors are automatically ignored by the system with status auto_muted. These are expected events, NOT bugs.

Current auto-mute rules (from src/shared/logger/auto-classification.ts, total: 58):

| Pattern | Reason | Description |
| ------- | ------ | ----------- |
| `Redis connection (ended\|closed)` | graceful_shutdown | Redis disconnects during app restart |
| `graceful.*shutdown` | graceful_shutdown | Server shutdown events during deploys |
| `/api/trpc/health.*404` | monitoring_probe | tRPC health endpoint probes (Uptime Kuma) |
| `/health.*404` | monitoring_probe | Generic health check probes |
| `Cloudflare.*5\d{2}` | external_service | Cloudflare edge errors (502, 503, 521) |
| `ECONNRESET.*external` | external_service | External API connection resets |
| `Layer failed, trying next` | cascading_repair | Repair layer failed, trying next layer |
| `Critique-revise attempt failed` | cascading_repair | Layer 2 retry attempt failed |
| `Zod.*validation failed.*Layer` | cascading_repair | Layer 1 validation failed, escalating |
| `Job stalled` | job_lifecycle | BullMQ job restarted (long LLM operations) |
| `Unexpected exit code: 10` | job_lifecycle | Worker TTL timeout (10 min), will retry |
| `No RAG chunks found` | expected_behavior | Course without docs, generates w/o RAG |
| `Mermaid.*fallback.*used` | graceful_fallback | Diagram gen failed, fallback to text |
| `/trpc/.*401` | expected_behavior | Unauthenticated tRPC request, 401 correct |
| `Cache directory does not exist` | expected_behavior | Cache missing on fresh env, created later |
| `ModelConfigBunker.*sync.*fail` | external_service | Network issue, has retry with backoff |
| `Invalid status for approval` | ui_race_condition | User clicked approve but course progressed |
| `Job .+ not found` | expected_behavior | Frontend polls job status after cleanup |
| `Failed to log generation trace` | expected_behavior | Trace insert failed during pool pressure |
| `Patcher.*REJECTED.*truncated` | graceful_fallback | Truncated content detected, returns original |
| `Preprocessing failed.*using raw` | graceful_fallback | Preprocessing failed, using raw LLM output |
| `Stage 5.*Primary model attempt` | cascading_repair | Stage 5 primary model unavailable, will retry |
| `JSON repair failed after all` | graceful_fallback | JSON repair exhausted, LLM output too malformed |
| `ModelConfigBunker.*LKG file` | graceful_fallback | LKG atomic write race, has Redis+DB fallback |
| `could not renew lock for job` | job_lifecycle | BullMQ lock renewal failed, will restart |
| `Missing key for job.*moveToDelayed` | job_lifecycle | BullMQ race condition, job already done |
| `Critical language consistency` | expected_behavior | Cyrillic false positive in Russian courses |
| `Critical heuristic failures` | expected_behavior | Heuristic skipped LLM review (false positive) |
| `Rate limit exceeded` | expected_behavior | tRPC rate limiter working as designed |
| `/trpc/lessonContent.*429` | expected_behavior | HTTP 429 from rate limiter on partial generate |
| `/trpc/jobs\.getStatus 404` | expected_behavior | HTTP 404 from job status poll after cleanup |
| `Sufficiency verdict.*defaulting` | graceful_fallback | Phase 0.5 Zod validation fallback, non-blocking |
| `Batch section insert failed.*fallback` | graceful_fallback | Batch insert duplicate → individual fallback |
| `Failed to create section record` | graceful_fallback | Individual section insert skipped (already exists) |
| `Content failed sanity check.*non-blocking` | expected_behavior | Sanity check warning, content still accepted |
| `Unavailable For Legal Reasons\|content policy violation` | content_policy | Jina API content policy rejection (PII/legal) |
| `Using STALE phase config due to database error` | graceful_fallback | ModelConfigBunker stale config during DB outage |
| `\[Phase 6\] Max retries reached.*best-effort` | graceful_fallback | Phase 6 summary retries exhausted, best-effort |
| `MISCONF Redis.*unable to persist` | infrastructure | Redis RDB/AOF disk issue - infra problem, not app |
| `Concurrency limit exceeded` | external_service | Jina API server-side concurrency enforcement |
| `getaddrinfo EAI_AGAIN` | infrastructure | DNS resolution failure during deploy/restart |
| `Course not ready.*generation` | expected_behavior | Frontend polls after course completed or not ready |

Total rules: 60 (test validates sync with code)

Metadata-aware matching (v1.10):

shouldAutoMute() now accepts optional context?: { message?: string } parameter. For tRPC errors where error_message is generic ("tRPC error"), the actual error from metadata.message is also checked against all patterns. This catches ~2000+ errors per week that previously slipped through.
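
A minimal sketch of the metadata-aware check described above. Only the function name and the optional `context?: { message?: string }` parameter come from this document. The first parameter, the boolean return type, the `AutoMuteRule` shape, and the two-rule list are assumptions for illustration; the real rule list lives in auto-classification.ts.

```typescript
// Sketch only: not the actual implementation in auto-classification.ts.
interface AutoMuteRule {
  pattern: RegExp;
  reason: string;
  description: string;
}

// Two patterns copied from the table above, for illustration.
const RULES: AutoMuteRule[] = [
  { pattern: /Rate limit exceeded/i, reason: "expected_behavior", description: "tRPC rate limiter working as designed" },
  { pattern: /Job stalled/i, reason: "job_lifecycle", description: "BullMQ job restarted (long LLM operations)" },
];

// Checks error_message first, then the optional message taken from metadata.
function shouldAutoMute(errorMessage: string, context?: { message?: string }): boolean {
  const candidates = [errorMessage, context?.message].filter(
    (m): m is string => typeof m === "string" && m.length > 0,
  );
  // The patterns have no `g` flag, so RegExp.test() keeps no state between calls.
  return RULES.some((rule) => candidates.some((m) => rule.pattern.test(m)));
}
```

With this shape, `shouldAutoMute("tRPC error")` is false, but `shouldAutoMute("tRPC error", { message: "Rate limit exceeded" })` is true, which is the case the metadata check is meant to catch.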

Test environment auto-muting:

Errors from NODE_ENV=test (vitest) are automatically muted at insert time via muteTestEnvironmentLog() in error-service.ts. They get environment = 'test' and status = 'auto_muted' immediately. This prevents test errors from polluting the admin logs UI and triggering auto-reopen.

When you see auto_muted errors:

  • Skip them in processing — they don't need fixes
  • If you see a pattern that should be auto-muted, add it to auto-classification.ts
  • Note: auto-muted patterns are now filtered BEFORE DB insert (pre-insert filter), so most won't appear in error_logs at all. Only errors from logPermanentFailure() (canonical path) may still get auto-muted after insert.

How to add a new auto-mute rule:

  1. Edit packages/course-gen-platform/src/shared/logger/auto-classification.ts:

    {
      pattern: /your-pattern/i,
      reason: 'category',  // one of the Reason values from the table above, e.g. graceful_shutdown | monitoring_probe | external_service
      description: 'Why this is expected',
    }
    
  2. Update this SKILL.md with the new pattern

When NOT to auto-mute:

  • Errors that SOMETIMES indicate real problems
  • New error types (analyze first, then decide)
  • Anything affecting user experience

7. SEARCH SIMILAR PROBLEMS FIRST (MANDATORY)

Before fixing ANY error, search BOTH sources:

7a. Search in Beads (closed bug tasks)

# Search by error keywords
bd search "<keyword>" --type=bug --status=closed

# Example searches:
bd search "constraint violation"
bd search "tRPC timeout"
bd search "undefined property"

What to look for in Beads:

  • Similar error patterns in task titles
  • Root cause analysis in task descriptions
  • Fix approach and files changed

7b. Search in error_logs (resolved errors)

-- Search similar errors by message (use mcp__supabase__execute_sql)
SELECT el.id, el.error_message, el.severity, lis.status, lis.notes, el.created_at
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.log_id = el.id AND lis.log_type = 'error_log'
WHERE to_tsvector('english', el.error_message) @@ plainto_tsquery('english', '<keyword>')
  AND lis.status = 'resolved'
ORDER BY el.created_at DESC
LIMIT 5;

What to search for:

  • Key error terms: constraint, undefined, timeout, not found
  • Function/module names from stack trace
  • Error codes or specific identifiers

7c. If found similar resolved issue

  1. From Beads: Read task description for root cause and fix approach
  2. From error_logs: Read the notes field — contains root cause and fix
  3. Apply same solution pattern if applicable
  4. Reference in your notes: Similar to mc2-xxx / <date>. Same fix applied.

7d. If NOT found — create Beads task (MANDATORY)

Every new error MUST have a Beads task before fixing:

# 1. Create task with all required fields
bd create --type=bug --priority=<1-3> --title="Fix: <error_message>" --files "<relevant_files>"

# 2. Start working
bd update <task_id> --status=in_progress

# 3. After fix - close with detailed reason
bd close <task_id> --reason="Root cause: <why>. Fix: <what was done>."

Beads task MUST include:

  • Clear title with error essence
  • Priority based on severity (CRITICAL=1, ERROR=2, WARNING=3)
  • Files that will be modified (--files)
  • Closing reason with root cause explanation
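
The severity-to-priority mapping and the `bd create` invocation can be assembled programmatically. This is a hedged sketch: `buildCreateCommand` and `shellQuote` are hypothetical helpers, and joining several files with commas is an assumption, since this document only shows `--files "<relevant_files>"`.

```typescript
// Hypothetical helper: builds the `bd create` command shown above.
type Severity = "CRITICAL" | "ERROR" | "WARNING";

// Mapping taken from the list above: CRITICAL=1, ERROR=2, WARNING=3.
const PRIORITY_BY_SEVERITY: Record<Severity, 1 | 2 | 3> = {
  CRITICAL: 1,
  ERROR: 2,
  WARNING: 3,
};

// Quote a value for a POSIX shell using single quotes.
function shellQuote(value: string): string {
  return `'${value.replace(/'/g, `'\\''`)}'`;
}

function buildCreateCommand(severity: Severity, errorMessage: string, files: string[]): string {
  const title = `Fix: ${errorMessage}`;
  return [
    "bd create",
    "--type=bug",
    `--priority=${PRIORITY_BY_SEVERITY[severity]}`,
    `--title=${shellQuote(title)}`,
    // Assumption: multiple files are passed as one comma-separated value.
    `--files ${shellQuote(files.join(","))}`,
  ].join(" ");
}
```

Quoting the title matters because error messages often contain quotes or shell metacharacters.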

Usage

Invoke via: /process-logs or "обработай логи ошибок" (Russian for "process the error logs")

Workflow

Step 1: Fetch New Errors

IMPORTANT: The /admin/logs UI shows errors from TWO tables:

  1. error_logs — system errors, validation failures, worker errors
  2. generation_trace (where error_data IS NOT NULL) — LLM generation errors

Both tables must be checked. Logs without a log_issue_status record show as "Новый" (new) in the UI.

1a. Check error_logs

-- Use mcp__supabase__execute_sql
-- NOTE: This excludes auto_muted errors (they are handled automatically)
SELECT el.id, el.severity, el.error_message, el.metadata, el.stack_trace,
       el.course_id, el.lesson_id, el.request_id, el.trpc_path, el.trpc_input, el.attempted_value
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.log_id = el.id AND lis.log_type = 'error_log'
WHERE lis.id IS NULL OR (lis.status NOT IN ('resolved', 'ignored', 'auto_muted'))
ORDER BY
  CASE el.severity WHEN 'CRITICAL' THEN 1 WHEN 'ERROR' THEN 2 ELSE 3 END,
  el.created_at DESC
LIMIT 20;

1b. Check generation_trace (LLM errors)

-- generation_trace with error_data shows as ERROR in UI
SELECT gt.id, gt.created_at, gt.stage, gt.phase, gt.step_name, gt.course_id,
       (gt.error_data->>'message')::text as error_message
FROM generation_trace gt
LEFT JOIN log_issue_status lis ON gt.id = lis.log_id AND lis.log_type = 'generation_trace'
WHERE gt.error_data IS NOT NULL
  AND (lis.id IS NULL OR lis.status NOT IN ('resolved', 'ignored', 'auto_muted'))
ORDER BY gt.created_at DESC
LIMIT 20;

1c. Quick count check

-- Quick check: how many "new" errors in each table?
-- "New" means no log_issue_status record OR an explicit status = 'new'
SELECT
  'error_logs' as source,
  (SELECT COUNT(*) FROM error_logs el
   LEFT JOIN log_issue_status lis ON el.id = lis.log_id AND lis.log_type = 'error_log'
   WHERE lis.id IS NULL OR lis.status = 'new') as new_count
UNION ALL
SELECT
  'generation_trace' as source,
  (SELECT COUNT(*) FROM generation_trace gt
   LEFT JOIN log_issue_status lis ON gt.id = lis.log_id AND lis.log_type = 'generation_trace'
   WHERE gt.error_data IS NOT NULL AND (lis.id IS NULL OR lis.status = 'new')) as new_count;

Step 1.5: Filter by Environment (IMPORTANT)

CRITICAL: Both DEV and STAGE are production-like servers. ALL errors on these environments must be investigated and fixed. Only LOCAL (NULL) can be bulk-resolved.

The error_logs table has an environment column that indicates where the error occurred:

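| Value   | Environment    | Action                                       |
| ------- | -------------- | -------------------------------------------- |
| NULL    | Local dev      | Bulk resolve: local testing/development only |
| 'dev'   | Dev server     | MUST FIX: real errors affecting developers   |
| 'stage' | Staging (prod) | MUST FIX: real production errors             |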

Always check environment distribution first:

-- Check how many errors per environment (includes both NULL status AND status='new')
SELECT environment, COUNT(*) as count
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.fingerprint = el.fingerprint AND lis.log_type = 'error_log'
WHERE lis.id IS NULL OR lis.status = 'new'
GROUP BY environment
ORDER BY count DESC;

Bulk resolve LOCAL errors only:

-- Bulk resolve ONLY local environment errors (environment IS NULL)
-- NEVER bulk resolve dev or stage errors - they must be investigated individually!
-- NOTE: This handles both NULL status AND status='new'
WITH local_fingerprints AS (
  SELECT DISTINCT ON (el.fingerprint) el.id, el.fingerprint
  FROM error_logs el
  LEFT JOIN log_issue_status lis ON lis.fingerprint = el.fingerprint AND lis.log_type = 'error_log'
  WHERE (lis.id IS NULL OR lis.status = 'new')
    AND el.environment IS NULL
    AND el.fingerprint IS NOT NULL
  ORDER BY el.fingerprint, el.created_at DESC
)
INSERT INTO log_issue_status (log_type, log_id, status, notes, fingerprint, updated_at)
SELECT 'error_log', lf.id, 'resolved', 'Local environment: Testing/development errors', lf.fingerprint, NOW()
FROM local_fingerprints lf
ON CONFLICT (log_type, log_id) DO UPDATE SET status = 'resolved', notes = EXCLUDED.notes, updated_at = NOW();

Focus on server errors (dev + stage):

-- Get only SERVER errors (dev and stage environments)
-- NOTE: Includes both NULL status AND status='new'
SELECT
  el.environment,
  el.fingerprint,
  el.severity,
  MIN(el.error_message) as error_message,
  COUNT(*) as count,
  MAX(el.created_at) as last_seen
FROM error_logs el
LEFT JOIN log_issue_status lis ON lis.fingerprint = el.fingerprint AND lis.log_type = 'error_log'
WHERE (lis.id IS NULL OR lis.status = 'new')
  AND el.fingerprint IS NOT NULL
  AND el.environment IS NOT NULL  -- Exclude local (NULL)
GROUP BY el.environment, el.fingerprint, el.severity
ORDER BY
  CASE el.severity WHEN 'CRITICAL' THEN 1 WHEN 'ERROR' THEN 2 ELSE 3 END,
  COUNT(*) DESC
LIMIT 20;

Why this matters:

  • Local testing generates thousands of errors (incomplete data, experiments)
  • Dev and stage servers have real errors that need investigation
  • Bulk resolving only local (NULL) errors saves time without missing real bugs
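
The environment rule can also be applied to rows already fetched from error_logs. A small sketch under assumptions: `ErrorRow` and `partitionByEnvironment` are hypothetical names, and rows with environment 'test' are not treated specially here.

```typescript
// Hypothetical row shape; only `environment` matters for the split.
interface ErrorRow {
  id: string;
  environment: string | null; // NULL means local dev, per the table above
}

// Splits rows into local ones (safe to bulk resolve) and server ones (must be fixed).
function partitionByEnvironment(rows: ErrorRow[]): { local: ErrorRow[]; server: ErrorRow[] } {
  const local: ErrorRow[] = [];
  const server: ErrorRow[] = [];
  for (const row of rows) {
    (row.environment === null ? local : server).push(row);
  }
  return { local, server };
}
```

Only the `local` group is a candidate for the bulk-resolve statement; every row in `server` still needs individual investigation.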

Step 1.7: Check to_verify Fingerprints

Run this on EVERY skill invocation, before processing new errors. It auto-resolves stale to_verify fingerprints and reopens the ones that recurred.

1.7a. Run auto-resolution

-- Use mcp__supabase__execute_sql
-- Resolves inactive to_verify (14d no recurrence) and reopens recurred ones
SELECT resolve_inactive_to_verify(14);

Returns JSON:

{
  "resolved_count": 3,
  "reopened_count": 1,
  "resolved_fingerprints": ["abc...", "def..."],
  "reopened_fingerprints": ["ghi..."],
  "inactive_days": 14
}

1.7b. Handle results

  • resolved_count > 0: Fixes confirmed. Include count in Step 3 summary.
  • reopened_count > 0: Errors recurred — fixes didn't work. These fingerprints are now in_progress and will appear in Step 2 processing. Prioritize them.
  • Both 0: No to_verify fingerprints pending. Continue to Step 2.
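
The three cases above can be expressed as a function over the returned JSON. This is an illustrative sketch: the `ToVerifyResult` interface mirrors the JSON shown in 1.7a, while `nextActions` is a hypothetical helper.

```typescript
// Shape of the JSON returned by resolve_inactive_to_verify(), as shown in 1.7a.
interface ToVerifyResult {
  resolved_count: number;
  reopened_count: number;
  resolved_fingerprints: string[];
  reopened_fingerprints: string[];
  inactive_days: number;
}

// Hypothetical helper: turns the result into the follow-up actions described above.
function nextActions(result: ToVerifyResult): string[] {
  const actions: string[] = [];
  if (result.resolved_count > 0) {
    actions.push(`Report ${result.resolved_count} auto-resolved fingerprint(s) in the Step 3 summary`);
  }
  if (result.reopened_count > 0) {
    actions.push(`Prioritize ${result.reopened_count} reopened fingerprint(s) in Step 2`);
  }
  if (actions.length === 0) {
    actions.push("No to_verify fingerprints pending; continue to Step 2");
  }
  return actions;
}
```

For the sample JSON in 1.7a (3 resolved, 1 reopened) this returns two actions: one for the summary and one for prioritizing the reopened fingerprint.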

1.7c. Query reopened details (if reopened_count > 0)

-- Get details of reopened fingerprints for Step 2 processing
SELECT lis.fingerprint, lis.notes,
       (SELECT MIN(el.error_message) FROM error_logs el WHERE el.fingerprint = lis.fingerprint) as error_message,
       (SELECT COUNT(*) FROM error_logs el
        WHERE el.fingerprint = lis.fingerprint
          AND el.created_at > lis.updated_at - INTERVAL '14 days') as recent_count
FROM log_issue_status lis
WHERE lis.status = 'in_progress'
  AND lis.notes LIKE 'Recurred after fix%'
  AND lis.updated_at > NOW() - INTERVAL '5 minutes';

Step 2: For EACH Error (Loop)

FOR each error:
  1. CREATE BEADS TASK (MANDATORY):
     bd create --type=bug --priority=<1-3> --title="Fix: <message>" --files "<files>"
     bd update <id> --status=in_progress

  2. ANALYZE error type and complexity (SIMPLE → fix directly, COMPLEX → ask user). For MEDIUM tasks SELECT subagent:
     - DB constraint → database-architect
     - tRPC/API → fullstack-nextjs-specialist
     - Types → typescript-types-specialist
     - UI → nextjs-ui-designer

  3. USE Docs L1/L2 for relevant docs

  4. DELEGATE using Task tool (MEDIUM tasks; fix SIMPLE tasks directly):
     Task(subagent_type="<selected>", prompt="Fix error: <details>...")

  5. VERIFY results (MANDATORY):
     - Read tool: check modified files
     - Bash: pnpm type-check && pnpm build
     - If errors → re-delegate

  6. MARK resolved in DB:
     -- For error_logs:
     INSERT INTO log_issue_status (log_type, log_id, status, notes, updated_at)
     VALUES ('error_log', '<id>', 'resolved', '<root_cause>. <action_taken>.', NOW())
     ON CONFLICT (log_type, log_id) DO UPDATE SET status = 'resolved', notes = EXCLUDED.notes, updated_at = NOW();

     -- For generation_trace:
     INSERT INTO log_issue_status (log_type, log_id, status, notes, updated_at)
     VALUES ('generation_trace', '<id>', 'resolved', '<root_cause>. <action_taken>.', NOW())
     ON CONFLICT (log_type, log_id) DO UPDATE SET status = 'resolved', notes = EXCLUDED.notes, updated_at = NOW();

  7. CLOSE Beads task:
     bd close <id> --reason="Root cause: <why>. Fix: <what was done>."

Step 3: Summary Report

## Log Processing Summary

| Severity | Fixed | Pending | To Verify |
| -------- | ----- | ------- | --------- |
| CRITICAL | X     | Y       | Z         |
| ERROR    | X     | Y       | Z         |
| WARNING  | X     | Y       | Z         |

### to_verify Auto-Resolution

| Action                            | Count |
| --------------------------------- | ----- |
| Auto-resolved (14d no recurrence) | X     |
| Reopened (error recurred)         | Y     |

### Beads Tasks Created:

- mc2-xxx: <description> → <status>

### Pending (need user input):

- <log_id>: <reason>

Subagent Delegation Examples

DB Constraint Error

Task(
  subagent_type="database-architect",
  prompt="Fix DB constraint violation in error_logs.
  Error: <full_error_message>
  Context: <stack_trace>
  Course: <course_id>
  Create migration to fix the constraint."
)

tRPC/API Error

Task(
  subagent_type="fullstack-nextjs-specialist",
  prompt="Fix tRPC error in <trpc_path>.
  Error: <full_error_message>
  Input: <trpc_input>
  Stack: <stack_trace>
  Fix the API endpoint."
)

Type Error

Task(
  subagent_type="typescript-types-specialist",
  prompt="Fix TypeScript type error.
  Error: <full_error_message>
  File: <file_path>
  Fix types and ensure compatibility."
)

Verification Checklist

Before marking ANY error as resolved:

  • Beads task exists for this error
  • Subagent was used (if not trivial fix)
  • Modified files reviewed with Read tool
  • pnpm type-check passes
  • pnpm build passes
  • No new errors introduced
  • Beads task closed with reason

Error Categories

| Pattern              | Category     | Subagent                    | Priority |
| -------------------- | ------------ | --------------------------- | -------- |
| violates.*constraint | DB constraint | database-architect         | 1        |
| tRPC error           | API bug      | fullstack-nextjs-specialist | 2        |
| Type.*error          | Type error   | typescript-types-specialist | 2        |
| Error querying       | Query bug    | database-architect          | 2        |
| Config missing       | Config issue | ASK USER                    | 3        |
| External service     | External     | mark to_verify              | 3        |
| Redis shutdown       | Expected     | SKIP (auto_muted)           | -        |
| Health probe 404     | Expected     | SKIP (auto_muted)           | -        |

Errors with status auto_muted are automatically ignored by the system. Skip them.
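
The category table can be applied as a first-match lookup. This sketch is an assumption-laden illustration: `categorize` and `CategoryRule` are hypothetical, the patterns are treated as case-insensitive regular expressions, and the two auto_muted rows are left out because those errors are skipped before categorization.

```typescript
interface CategoryRule {
  pattern: RegExp;
  category: string;
  handler: string; // subagent name, or the action from the Subagent column
  priority: 1 | 2 | 3;
}

// Rows copied from the table above; case-insensitive matching is an assumption.
const CATEGORY_RULES: CategoryRule[] = [
  { pattern: /violates.*constraint/i, category: "DB constraint", handler: "database-architect", priority: 1 },
  { pattern: /tRPC error/i, category: "API bug", handler: "fullstack-nextjs-specialist", priority: 2 },
  { pattern: /Type.*error/i, category: "Type error", handler: "typescript-types-specialist", priority: 2 },
  { pattern: /Error querying/i, category: "Query bug", handler: "database-architect", priority: 2 },
  { pattern: /Config missing/i, category: "Config issue", handler: "ASK USER", priority: 3 },
  { pattern: /External service/i, category: "External", handler: "mark to_verify", priority: 3 },
];

// Returns the first matching rule, or undefined when the error needs manual analysis.
function categorize(errorMessage: string): CategoryRule | undefined {
  return CATEGORY_RULES.find((rule) => rule.pattern.test(errorMessage));
}
```

An `undefined` result is a signal to analyze the error by hand, not to ignore it.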

Reference Docs

  • Admin Logs Guide: .claude/docs/admin-logs-guide.md
  • Error Types: packages/course-gen-platform/src/shared/logger/types.ts
  • Logs Router: packages/course-gen-platform/src/server/routers/admin/logs.ts
  • CLAUDE.md: Main orchestration rules

Architecture Note

The /admin/logs page aggregates errors from two sources:

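| Table            | log_type           | What it contains                            |
| ---------------- | ------------------ | ------------------------------------------- |
| error_logs       | 'error_log'        | System errors, validation, worker failures  |
| generation_trace | 'generation_trace' | LLM errors (where error_data IS NOT NULL)   |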

Status is tracked in log_issue_status table with composite key (log_type, log_id).

UI Logic: Status shows as "Новый" (new) when:

  1. No log_issue_status record exists for the fingerprint
  2. OR record exists with explicit status = 'new'

IMPORTANT: Always check BOTH conditions when querying for new errors.
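
The two conditions reduce to a single predicate. A minimal sketch, assuming a hypothetical `isNewLog` helper and a status record shape with only the `status` field:

```typescript
// Hypothetical shape of a log_issue_status row; only `status` is needed here.
interface LogIssueStatus {
  status: string;
}

// A log is "new" when no status record exists, or when the record says 'new'.
function isNewLog(statusRecord: LogIssueStatus | null | undefined): boolean {
  return statusRecord == null || statusRecord.status === "new";
}
```

This is the application-side equivalent of `lis.id IS NULL OR lis.status = 'new'` in the SQL queries.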

Grouped View (fingerprint)

The UI has two views:

  1. List view — individual logs, status by log_id
  2. Grouped view — errors grouped by fingerprint, status by fingerprint

Auto-sync trigger (trg_sync_log_status_fingerprint):

  • When you INSERT/UPDATE log_issue_status for an error_log
  • The trigger automatically copies fingerprint from error_logs
  • This ensures grouped view shows correct status

IMPORTANT: You don't need to manually handle fingerprint — the trigger does it automatically. Just use the standard INSERT INTO log_issue_status by log_id.

Error Logging Architecture (Post-Optimization v1.9)

Error flow after volume optimization:

logger.warn/error()
    |
[Proxy Interceptor]
    ├── Pino → stdout → Axiom  (ALWAYS, no filter)
    └── writeToErrorLogs()
          ├── shouldAutoMute() → SKIP if matches auto-mute rules (58 patterns)
          ├── shouldWriteToDb() → SKIP if rate-limited (>5/min per fingerprint)
          └── INSERT into error_logs

logPermanentFailure()  (canonical path, bypasses proxy filters)
    └── INSERT/UPSERT into error_logs → applyAutoMuteStatus()

Key points:

  • WARN/ERROR always go to Pino/Axiom regardless of DB filters
  • logPermanentFailure() is the canonical DB write — NOT affected by pre-insert filter
  • baseLogger.warn/error() bypasses proxy entirely (Pino only, no DB)
  • Rate limiter prevents outage floods (max 5 per fingerprint per minute)
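
To make the rate-limiting step concrete, here is a sketch of a per-fingerprint limiter that allows at most 5 writes per minute. The sliding-window approach, the class name, and the `now` parameter are assumptions for illustration; the actual logic in rate-limiter.ts may differ.

```typescript
// Sketch of a per-fingerprint limiter: at most 5 DB writes per fingerprint per minute.
const MAX_WRITES_PER_WINDOW = 5;
const WINDOW_MS = 60_000;

class FingerprintRateLimiter {
  // Timestamps (ms) of recent allowed writes, keyed by fingerprint.
  private readonly writes = new Map<string, number[]>();

  // Returns true if a write for this fingerprint is allowed at time `now` (ms).
  shouldWriteToDb(fingerprint: string, now: number = Date.now()): boolean {
    // Keep only the writes that are still inside the one-minute window.
    const recent = (this.writes.get(fingerprint) ?? []).filter((t) => now - t < WINDOW_MS);
    if (recent.length >= MAX_WRITES_PER_WINDOW) {
      this.writes.set(fingerprint, recent);
      return false;
    }
    recent.push(now);
    this.writes.set(fingerprint, recent);
    return true;
  }
}
```

Passing `now` explicitly keeps the limiter deterministic in tests. Entries rejected here still reach Pino/Axiom, because only the DB insert is skipped.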

If you suspect missing errors in error_logs (filtered by optimization):

Check Pino/Axiom logs for the full unfiltered stream:

  • Axiom dashboard: all WARN/ERROR logs are always captured
  • These are NOT affected by DB-level filtering
  • Use Axiom search when investigating an error that doesn't appear in /admin/logs
  • Pino logs to stdout are never filtered — they capture everything

Files involved in filtering:

  • src/shared/logger/index.ts — proxy interceptor + writeToErrorLogs (pre-insert filter)
  • src/shared/logger/auto-classification.ts — auto-mute patterns (58 rules)
  • src/shared/logger/rate-limiter.ts — per-fingerprint rate limiter
  • src/shared/logger/error-service.ts — logPermanentFailure (canonical path, has own auto-mute)
Install via CLI
npx skills add https://github.com/maslennikov-ig/MC-2 --skill process-logs
Repository Details
Stars: 2
Forks: 0
Branch: main
Path: SKILL.md