diagnostics

name: diagnostics
description: "Diagnose fleet-rlm runtime failures, API contract drift, sandbox errors, and observability issues. Use when something is broken: Daytona connection failures, websocket mismatches, escalation not triggering, budget exhaustion, or missing traces."

Symptom-to-Cause Decision Tree

"Sandbox won't start"
- Check DAYTONA_API_KEY and DAYTONA_API_URL are set and non-empty
- Run daytona-smoke to verify connectivity
- Verify API key has not expired (keys rotate every 90 days)
- Check runtime diagnostics: GET /api/v1/runtime/status
- Check managed sandboxes: GET /api/v1/sandboxes
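
The first check above can be scripted before reaching for the API. The following is a minimal sketch, assuming only the two variable names listed in the checklist; the helper name `missing_env_vars` is hypothetical and is not part of fleet-rlm.

```python
import os
from typing import Mapping, Sequence

# Variable names come from the checklist above.
REQUIRED_DAYTONA_VARS = ("DAYTONA_API_KEY", "DAYTONA_API_URL")


def missing_env_vars(
    env: Mapping[str, str],
    names: Sequence[str] = REQUIRED_DAYTONA_VARS,
) -> list[str]:
    """Return the names that are unset or contain only whitespace.

    Hypothetical helper for illustration; not part of fleet-rlm.
    """
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars(os.environ)
    print("missing:", ", ".join(missing) if missing else "none")
```

A value made up only of spaces is reported as missing, which matches the "set and non-empty" wording of the check.
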
"No response from websocket"
- Likely contract drift between frontend and backend
- Check canonical fields: repo_url, repo_ref, context_paths, batch_concurrency
- Verify /api/v1/ws/execution endpoint is registered and not shadowed
- See references/contract-surfaces.md for full field list
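
One common form of contract drift is a field that was renamed on one side only, for example sent in camelCase by the frontend. The sketch below flags payload keys that look like variants of the four canonical fields named above. It is an illustration under stated assumptions: the function name `find_drifted_fields` is hypothetical, and the full field list lives in references/contract-surfaces.md, not here.

```python
# Canonical field names as listed in the checklist above.
CANONICAL_FIELDS = ("repo_url", "repo_ref", "context_paths", "batch_concurrency")


def _normalize(name: str) -> str:
    """Collapse case and separators so repoUrl, repo-url and repo_url compare equal."""
    return name.replace("_", "").replace("-", "").lower()


def find_drifted_fields(payload: dict) -> dict[str, str]:
    """Map each near-miss key in the payload to the canonical name it resembles.

    Hypothetical helper for illustration; not part of fleet-rlm.
    """
    canonical_by_norm = {_normalize(field): field for field in CANONICAL_FIELDS}
    drifted = {}
    for key in payload:
        canonical = canonical_by_norm.get(_normalize(key))
        if canonical is not None and key != canonical:
            drifted[key] = canonical
    return drifted
```

An empty result does not prove the contract is intact; it only rules out this one kind of renaming.
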
"Escalation not triggering"
- Inspect the routing_decision payload (tools_react, router_rlm, url_document_rlm, ...) emitted on the turn
- Verify execution_mode is not overridden to "direct" in request
- Confirm FLEET_RLM_USE_ESCALATING_RUNTIME=true in environment
- Check that the escalation module is registered: uv run fleet-rlm optimize list
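
Two of the checks above, the environment flag and the request override, can be combined into one small function. This is a sketch only: `escalation_blockers` is a hypothetical name, and the comparison against the lowercase string "true" is an assumption about how the flag is parsed, so the runtime itself may accept other spellings.

```python
from typing import Any, Mapping


def escalation_blockers(env: Mapping[str, str], request: Mapping[str, Any]) -> list[str]:
    """List the reasons from the checklist that would prevent escalation.

    Hypothetical helper for illustration; not part of fleet-rlm.
    """
    blockers = []
    # Assumption: the flag is considered enabled only when it reads "true".
    flag = env.get("FLEET_RLM_USE_ESCALATING_RUNTIME", "").strip().lower()
    if flag != "true":
        blockers.append("FLEET_RLM_USE_ESCALATING_RUNTIME is not set to true")
    if request.get("execution_mode") == "direct":
        blockers.append('execution_mode is overridden to "direct"')
    return blockers
```
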
"Budget exhausted too quickly"
- Trace max_llm_calls subdivision across child tasks
- Check child lease allocation via _remaining_llm_budget()
- Reduce delegation fan-out (fewer concurrent sub_rlm calls)
- Look for recursive delegation patterns (delegation depth > 1)
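
To see why fan-out and delegation depth matter, consider a toy model in which every parent splits its whole budget evenly among its children. That even split is an assumption made for illustration; the actual lease allocation behind `_remaining_llm_budget()` may differ, and `leaf_budget` is a hypothetical name.

```python
def leaf_budget(max_llm_calls: int, fan_out: int, depth: int) -> int:
    """Calls left for one task at the given delegation depth.

    Toy model: each level divides its budget evenly among fan_out children.
    Not the fleet-rlm algorithm.
    """
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    if depth < 0:
        raise ValueError("depth must not be negative")
    budget = max_llm_calls
    for _ in range(depth):
        budget //= fan_out
    return budget
```

Under this model a budget of 40 calls with a fan-out of 4 leaves 10 calls per child at depth 1, 2 at depth 2, and none at depth 3, which is why reducing fan-out and avoiding recursive delegation are the first things to try.
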
"Volume data missing after restart"
- Confirm volume_name is set on the interpreter configuration
- Data in /home/daytona/memory/ persists across restarts
- Workspace files (/home/daytona/workspace/) are transient — wiped on restart
- Check mounted volume contents: GET /api/v1/runtime/volume/tree
- Check a specific volume file: GET /api/v1/runtime/volume/file
"UI showing stale data"
- Likely cause: OpenAPI drift between the backend routes and the generated frontend client
- Run make api-check to detect schema mismatches
- Regenerate with make api-sync
- See references/contract-surfaces.md for sync commands
"No MLflow traces appearing"
- Verify MLFLOW_ENABLED=true in environment
- Check MLFLOW_TRACKING_URI is reachable: curl <uri>/health
- Confirm initialize_mlflow() is called at application startup
- Check experiment exists (quote the URL so the shell does not interpret the `?`): curl "<uri>/api/2.0/mlflow/experiments/get-by-name?experiment_name=fleet-rlm"
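
The two curl checks above can also be run from Python with the standard library. The sketch below builds both URLs from a tracking URI and returns the HTTP status of a GET request. The function names are hypothetical, and the experiment name `fleet-rlm` is taken from the command above.

```python
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import urlencode


def mlflow_check_urls(tracking_uri: str, experiment_name: str = "fleet-rlm") -> dict[str, str]:
    """Build the health and experiment lookup URLs for a tracking server."""
    base = tracking_uri.rstrip("/")
    query = urlencode({"experiment_name": experiment_name})
    return {
        "health": f"{base}/health",
        "experiment": f"{base}/api/2.0/mlflow/experiments/get-by-name?{query}",
    }


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None
```

A `None` from the health URL points at reachability, while a 404 from the experiment URL suggests the server is up but the experiment has not been created.
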
"Tests failing unexpectedly"
- See Test Lane Selector below to pick the right test subset
- Check if failure is environment-specific (missing env vars, port conflicts)
- Run diagnostics script first: uv run python src/fleet_rlm/scaffold/skills/diagnostics/scripts/diagnose.py

Risky Backend Files

Files most likely to cause breakage when modified — review changes carefully:

| File | Responsibility |
| --- | --- |
| `api/routers/ws/stream.py` | Live chat streaming loop |
| `api/routers/ws/commands.py` | Command dispatch (run, stop, history) |
| `api/routers/ws/turn_lifecycle.py` | Run/turn lifecycle state machine |
| `api/runtime_services/settings.py` | Settings routes (read/write config) |
| `api/runtime_services/diagnostics.py` | Status and diagnostics endpoints |
| `runtime/execution/streaming_events.py` | Streaming event construction and serialization |

Test Lane Selector

| Change type | Command | What it validates |
| --- | --- | --- |
| Fast confidence | `make test-fast` | Unit + contracts parallel, integration serial |
| Quality gate | `make quality-gate` | Lint + format + types + tests + docs + frontend |
| Runtime/WS focus | `uv run pytest -q tests/unit/api/test_auth.py tests/unit/api/test_chat_persistence.py tests/unit/api/test_events.py tests/unit/api/test_ws_session_restore.py tests/unit/api/test_ws_turn_setup.py tests/unit/runtime/test_escalating_module.py` | WebSocket session lifecycle, auth, event streaming |
| Daytona focus | `uv run pytest -q tests/unit/integrations/test_daytona_config.py tests/unit/integrations/test_daytona_concurrency.py tests/unit/integrations/test_daytona_runtime.py tests/unit/integrations/test_memory_db.py tests/unit/integrations/test_volume_seed_skills.py tests/unit/runtime/test_volume_memory_tools.py tests/unit/runtime/test_phase3_tools.py` | Sandbox creation, volumes, memory persistence |
| MLflow/observability | `uv run pytest -q tests/unit/quality/test_module_registry.py tests/unit/quality/test_optimization_runner.py tests/unit/quality/test_workspace_metrics.py tests/unit/api/test_bootstrap_observability.py tests/unit/integrations/test_observability.py` | Tracing, metrics logging, experiment tracking |

Environment Diagnostics

Run the full diagnostic check:

uv run python src/fleet_rlm/scaffold/skills/diagnostics/scripts/diagnose.py

This script checks:

- Required environment variables are set
- Daytona API connectivity
- MLflow server reachability
- Port availability (3000, 5001, 8000)
- Volume mount status
- OpenAPI spec consistency
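
A port availability check like the one listed above can be approximated by trying to bind the port locally. This is a sketch of the general technique, not the code in diagnose.py; `port_is_free` is a hypothetical name, and only the port numbers come from the list.

```python
import socket

# Ports named in the checklist above.
DIAGNOSTIC_PORTS = (3000, 5001, 8000)


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the port can be bound on the given host right now.

    Hypothetical helper for illustration; not taken from diagnose.py.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions error.
            return False
    return True


if __name__ == "__main__":
    for port in DIAGNOSTIC_PORTS:
        state = "free" if port_is_free(port) else "in use"
        print(f"port {port}: {state}")
```

The result is only a snapshot: another process can claim the port between this check and the moment a server starts.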
