production-standards

Best-practice rubric for production readiness in Mattin AI — concurrency, fault tolerance/resilience, isolation, observability, and operations. Background reference for reliability-auditor, production-readiness-analyst, and implementation experts.

By lksnext-ai-lab · Updated 6/4/2026

name: production-standards
description: Best-practice rubric for production readiness in Mattin AI (concurrency, fault tolerance/resilience, isolation, observability, and operations). Background reference for reliability-auditor, production-readiness-analyst, and implementation experts.
user-invocable: false
allowed-tools: Read, Grep, Glob

production-standards — Best-practice rubric

The standard each reliability/production finding is held against. Concrete, stack-specific (FastAPI async · SQLAlchemy 2.x · PostgreSQL · LangChain/LangGraph 1.x · Qdrant · Docker/Caddy/K8s · multi-tenant by app_id). A finding states which rule below it violates and the conforming pattern.

A. Concurrency

| # | Standard | Conforming pattern in this stack |
|---|---|---|
| C1 | No check-then-act on shared state. | Counters/quotas (UsageRecord, MarketplaceUsage, request counts, TierConfig limits): atomic UPDATE t SET n = n + :d WHERE ... or an optimistic version column, not read→+1→write in Python. |
| C2 | Guard contended rows. | select(...).with_for_update() (pessimistic) or optimistic concurrency for rows multiple requests mutate; keep the locked transaction short. |
| C3 | One session per request/task. | Depends(get_db) async session scoped to the request; never share a session across await fan-out or across asyncio tasks; never reuse across replicas. |
| C4 | Pool sized for concurrency. | DB/HTTP pool size ≥ expected concurrent requests per replica; fail-fast/timeout on pool checkout; surface pool exhaustion in metrics. |
| C5 | Never block the event loop. | All I/O awaited and async-native; CPU/blocking work offloaded (thread/process executor or a task); no sync DB driver / requests / blocking file ops in async def. |
| C6 | Bound fan-out. | asyncio.gather over user-sized inputs is wrapped with a Semaphore; batch embedding/vectorization has a concurrency cap. |
| C7 | Idempotent external writes. | Public/MCP write operations tolerate retries (idempotency key or natural idempotency); no duplicate side effects on client retry. |
| C8 | Single-runner for scheduled work. | Crawl/cron/scheduler jobs in a multi-replica (K8s) deployment use a distributed lock / leader election, not "every replica runs it". |
| C9 | Per-thread checkpointer safety. | Concurrent messages on the same thread_{agent_id}_{session_id} are serialized; no interleaved checkpoint writes corrupting memory. |
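
As an illustration of C1, the sketch below folds the quota check and the increment into a single UPDATE so that no read→+1→write window exists. It is a minimal sketch under stated assumptions: the standard-library sqlite3 module stands in for PostgreSQL so the example runs anywhere, and the table usage_record, its columns, and the helper consume_quota are hypothetical names, not the project's actual schema or API. In the real stack the same statement would be issued through the request-scoped async session.

```python
import sqlite3


def consume_quota(conn: sqlite3.Connection, app_id: int, amount: int, limit: int) -> bool:
    """Atomically add `amount` to the counter unless that would exceed `limit`.

    The guard and the increment live in one UPDATE statement, so two
    concurrent callers cannot both read the same old value and then
    overwrite each other's increment.
    """
    cur = conn.execute(
        # Hypothetical table/columns; sqlite3 is a stand-in for PostgreSQL.
        "UPDATE usage_record SET used = used + :d "
        "WHERE app_id = :app_id AND used + :d <= :limit",
        {"d": amount, "app_id": app_id, "limit": limit},
    )
    conn.commit()
    # rowcount is 1 when the row was updated, 0 when the guard rejected
    # the change (quota exceeded) or no row exists for this app_id.
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE usage_record (app_id INTEGER PRIMARY KEY, used INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO usage_record (app_id, used) VALUES (1, 0)")
    # Three attempts of 2 units against a limit of 5: the third is rejected.
    print([consume_quota(conn, 1, 2, limit=5) for _ in range(3)])
```

The caller decides what a False result means (for example a 429 response); the point of the sketch is only that the database, not Python, arbitrates the race.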

B. Fault tolerance / resilience

| # | Standard | Conforming pattern |
|---|---|---|
| F1 | Timeout every external call. | Explicit timeouts on LLM, embeddings, Qdrant/PGVector, web scraping, SharePoint, outbound MCP, and DB statements. No unbounded waits. |
| F2 | Retry transient failures only. | Exponential backoff + jitter, bounded attempts, only for idempotent/transient errors (.with_retry() for LangChain runnables). No retry on 4xx/validation. |
| F3 | Fallbacks for provider outages. | .with_fallbacks() for model/provider failure; degrade gracefully (silo/RAG down → answer without retrieval) rather than hard-failing the request. |
| F4 | Bulkhead slow dependencies. | A slow/blocked dependency cannot exhaust the shared pool or event loop for unrelated traffic; separate limits per dependency where it matters. |
| F5 | Handle partial failure. | Batch uploads / multi-file vectorization / crawl jobs continue on per-item failure and record it (dead-letter/error capture); they do not abort the whole batch silently. |
| F6 | Graceful lifecycle. | FastAPI lifespan startup fails fast on missing critical deps; SIGTERM drains in-flight requests, flushes checkpoints, and closes pools before exit. |
| F7 | Health signals. | Liveness + readiness endpoints (readiness reflects DB/vector/critical-dep health) wired to the Compose/Helm health checks. |
| F8 | Clean client errors. | HTTPException with correct codes; never leak stack traces/internals; Retry-After on rate-limit/429. |
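
F1 and F2 combine naturally: each attempt gets its own timeout, and only failures known to be transient are retried, with exponential backoff and jitter. The following is a rough sketch using only asyncio, not the project's implementation. The names call_with_retry and TransientError are hypothetical, and the default values are placeholders rather than recommended settings. Retrying after a timeout is only safe when the wrapped operation is idempotent.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


async def call_with_retry(
    op: Callable[[], Awaitable[T]],
    *,
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
) -> T:
    """Run `op` with a per-attempt timeout and bounded, jittered retries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # F1: never wait on an external call without a bound.
            return await asyncio.wait_for(op(), timeout=timeout_s)
        except (asyncio.TimeoutError, TransientError):
            # F2: bounded attempts; the last failure is surfaced to the caller.
            if attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay_s, with full jitter
            # so that many clients do not retry in lockstep.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0.0, ceiling))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Any other exception type (a validation error, for instance) propagates on the first attempt, which matches the "no retry on 4xx/validation" rule.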

C. Isolation

| # | Standard | Conforming pattern |
|---|---|---|
| I1 | Tenant data isolation. | Every query filtered by app_id; vector collections silo_{id}; signed static URLs scoped; no cross-tenant bleed via caches or the checkpointer. |
| I2 | Resource isolation / no noisy neighbor. | Per-app rate limits (App.agent_rate_limit) and tier quotas (TierConfig) enforced on /public/v1 and /mcp/v1; per-request timeouts; max payload/file size (App.max_file_size_mb). |
| I3 | Fault isolation. | One tenant/agent/provider failure is contained: it does not degrade other tenants or take down the process. |
| I4 | Network isolation. | Only Caddy publishes port 80; backend/DB/Qdrant stay on the internal network; CORS restricted to App.agent_cors_origins (never * in prod); secrets never sent to clients. |
| I5 | Runtime isolation / statelessness. | App is stateless for horizontal scaling: no in-process mutable global that breaks across replicas; shared state lives in PostgreSQL/cache. |
| I6 | Resource limits. | Container CPU/memory requests+limits set in the Helm charts; sensible Uvicorn/worker counts; no unbounded in-memory growth. |
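
To make the per-app limit in I2 concrete, here is one possible shape for it: a token bucket keyed by app_id, returning the number of seconds a rejected caller should wait (usable as a Retry-After value). This is an illustrative sketch only. TokenBucket and PerAppLimiter are hypothetical names, and the state is held in process memory purely to keep the example self-contained; under I5 and I6 a real deployment would presumably keep these counters in a shared store and evict idle entries.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate_per_s` tokens per second, up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float, now: float) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = now

    def try_acquire(self, now: float) -> float:
        """Return 0.0 if the request is allowed, else seconds until a token is free."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.rate_per_s


class PerAppLimiter:
    """One bucket per app_id, so one tenant cannot spend another's budget."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate_per_s = rate_per_s
        self._capacity = capacity
        self._clock = clock
        # In-memory and per-replica: an assumption made for the sketch only.
        self._buckets: Dict[int, TokenBucket] = {}

    def check(self, app_id: int) -> float:
        now = self._clock()
        bucket = self._buckets.get(app_id)
        if bucket is None:
            bucket = TokenBucket(self._rate_per_s, self._capacity, now)
            self._buckets[app_id] = bucket
        return bucket.try_acquire(now)
```

A handler would call check(app_id) before doing work and, on a non-zero result, respond with 429 and a Retry-After header rounded up from the returned value.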

D. Observability & operations (cross-cut)

| # | Standard | Conforming pattern |
|---|---|---|
| O1 | Structured logging. | Project logger (never print); right levels; request/correlation context; no secrets/keys/PII in logs. |
| O2 | Tracing. | LangSmith per-app (App.langsmith_api_key) + global fallback via backend/tools/langsmith_config.py; failures are traceable end-to-end. |
| O3 | Config & secrets. | All config via env/settings (backend/utils/config.py), documented in .env.example, validated at startup; no secrets in code/images/logs. |
| O4 | Migrations. | Every schema change has an Alembic migration with a tested downgrade; no irreversible destructive step without a documented plan/backfill. |
| O5 | Deploy & rollback. | Docker/Compose + Helm current; image tags pinned; a rollback path exists; per-environment values correct. |
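
The "validated at startup" part of O3 (and the fail-fast startup in F6) could look roughly like the sketch below. It is not the content of backend/utils/config.py: the Settings fields, the environment variable names, and load_settings are all hypothetical and chosen for illustration. The one deliberate design point is that error messages name the missing variable but never echo a value, since values may be secrets.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


class ConfigError(RuntimeError):
    """Raised at startup when configuration is missing or malformed."""


@dataclass(frozen=True)
class Settings:
    database_url: str
    qdrant_url: str
    request_timeout_s: float


# Hypothetical variable names, for illustration only.
REQUIRED = ("DATABASE_URL", "QDRANT_URL")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from the environment, failing fast on bad input."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Report names only; values may be secrets and must not reach logs.
        raise ConfigError("missing required settings: " + ", ".join(missing))
    raw_timeout = env.get("REQUEST_TIMEOUT_S", "30")
    try:
        timeout = float(raw_timeout)
    except ValueError:
        raise ConfigError("REQUEST_TIMEOUT_S must be a number") from None
    if timeout <= 0:
        raise ConfigError("REQUEST_TIMEOUT_S must be positive")
    return Settings(env["DATABASE_URL"], env["QDRANT_URL"], timeout)
```

Calling load_settings() once during application startup means a misconfigured replica refuses to start instead of failing on its first request.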

Severity guidance

CRITICAL/HIGH: data races on counters (C1/C2), missing timeouts (F1), no fallback on the core chat path (F3), cross-tenant leaks (I1), unenforced rate limits/quotas (I2), event-loop blocking on a hot path (C5).
MEDIUM: missing bulkheads (F4), partial-failure gaps (F5), missing readiness probe (F7).
LOW: tuning (pool sizes under C4, retry caps under F2) without evidence of a problem.

Install via CLI
npx skills add https://github.com/lksnext-ai-lab/ai-core-tools --skill production-standards