name: ergo-devops
description: Diagnose production issues in running Ergo Framework nodes via MCP server. Covers performance bottlenecks, process leaks, memory issues, network problems, event fanout, restart loops, and stuck processes. Use when investigating issues on live Ergo nodes.
Ergo MCP DevOps
Diagnose production issues in running Ergo Framework distributed systems via MCP.
Network Transparency
Every tool accepts a node parameter -- the framework establishes connections automatically. Never call network_connect before querying. Just pass node=<name>.
All nodes are equal. The MCP endpoint is just a proxy, not a primary node.
MCP Connection
claude mcp add --transport http ergo http://localhost:9922/mcp
Process Model
States: Init -> Sleep <-> Running -> Terminated. A process is in WaitResponse while it waits for the reply to a sync Call, and becomes a Zombee after a forced kill.
| Metric | Meaning |
|---|---|
| MessagesMailbox | Current queue depth |
| MailboxLatency | Age of oldest message (-tags=latency) |
| RunningTime/Uptime | Utilization ratio (NOT CPU) |
| Drain (MessagesIn/Wakeups) | ~1 = idle, >>1 = overloaded |
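
The two ratio metrics in the table can be computed from raw counters. This is a minimal sketch, assuming RunningTime is reported in nanoseconds and Uptime in seconds (the units used by the liveness formula); the function names and the sample numbers are hypothetical.

```python
def utilization(running_time_ns: int, uptime_s: int) -> float:
    """Fraction of the process lifetime spent inside callbacks (not CPU time)."""
    if uptime_s <= 0:
        return 0.0
    return running_time_ns / (uptime_s * 1_000_000_000)


def drain(messages_in: int, wakeups: int) -> float:
    """Average number of messages handled per wakeup."""
    if wakeups <= 0:
        return 0.0
    return messages_in / wakeups


# Hypothetical snapshot: 120 s busy out of 600 s alive, 9000 messages over 300 wakeups.
print(utilization(120 * 10**9, 600))
print(drain(9000, 300))
```

A drain close to 1 means the process wakes up for roughly one message at a time; a value far above 1 means messages pile up between wakeups.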
Liveness Score
Detects blocked callbacks (mutex, channel, IO inside handler).
Formula: Liveness = RunningTime_ns / (Uptime_s * MailboxLatency_ns)
Collect:
process_list min_mailbox=1 min_uptime=60 min_messages_in=1 sort_by=mailbox_latency limit=10
Filters: keep processes that have pending messages (latency > 0), have been alive for more than 60s, and have received at least one message (not idle). Exclude zombees from the results, then compute liveness for each remaining process.
| Latency | Liveness | Diagnosis | Action |
|---|---|---|---|
| high | high (>0.01) | Overloaded but alive | Scale out |
| high | low (<0.001) | Callback blocked | Find blocking call (pprof) |
| 0 | n/a | Healthy, mailbox empty | Nothing |
| transient (ms) | any | Caught mid-callback | Normal |
Same latency, different cause: 18s latency + RunningTime 5000s = overloaded (liveness 0.027). 18s latency + RunningTime 82us = stuck (liveness ~0).
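The formula and thresholds above can be sketched as a small helper. This is illustrative only: the function names, the "inconclusive" label for scores between the two thresholds, and the uptime of 10,288 s are assumptions (the example does not state an uptime; that value is chosen so the overloaded case lands on 0.027).

```python
def liveness(running_time_ns: int, uptime_s: int, mailbox_latency_ns: int):
    """Liveness = RunningTime_ns / (Uptime_s * MailboxLatency_ns).

    Returns None when latency is 0 (empty mailbox) or negative (not measured).
    """
    if mailbox_latency_ns <= 0 or uptime_s <= 0:
        return None
    return running_time_ns / (uptime_s * mailbox_latency_ns)


def diagnose(score) -> str:
    """Map a liveness score onto the rows of the diagnosis table."""
    if score is None:
        return "n/a"
    if score > 0.01:
        return "overloaded"
    if score < 0.001:
        return "blocked"
    return "inconclusive"  # assumption: the table defines no row for this range


UPTIME_S = 10_288          # assumed uptime, not given in the example
LATENCY_NS = 18 * 10**9    # 18 s mailbox latency

busy = liveness(5000 * 10**9, UPTIME_S, LATENCY_NS)   # RunningTime 5000 s
stuck = liveness(82_000, UPTIME_S, LATENCY_NS)        # RunningTime 82 us
print(round(busy, 3), diagnose(busy))
print(diagnose(stuck))
```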
MailboxSize = -1 means unlimited (default): the mailbox can accumulate messages indefinitely → memory growth. If MailboxSize > 0, senders get ErrProcessMailboxFull on overflow.
Default Call timeout = 5 seconds. WaitResponse > 5s is abnormal.
LogMessages in node_info: [0]=Trace [1]=Debug [2]=Info [3]=Warning [4]=Error [5]=Panic. Watch [4] and [5].
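A minimal sketch of reading that array, assuming LogMessages arrives as a six-element list of counters in the order given above; the helper name and the sample values are hypothetical.

```python
LEVELS = ["Trace", "Debug", "Info", "Warning", "Error", "Panic"]


def log_alerts(log_messages):
    """Return the non-zero Error and Panic counters, keyed by level name."""
    named = dict(zip(LEVELS, log_messages))
    return {lvl: named[lvl] for lvl in ("Error", "Panic") if named.get(lvl, 0) > 0}


# Hypothetical counters: 2 errors, no panics.
print(log_alerts([0, 12, 340, 5, 2, 0]))
```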
Framework Internals
Important delivery errors: ErrProcessUnknown = target doesn't exist, ErrProcessMailboxFull = target overloaded, ErrTimeout = ACK lost (5s), ErrNoConnection = node unreachable.
Supervisor restart intensity: Sliding window (default 5 restarts / 5 sec). Exceeded = supervisor dies with ErrSupervisorRestartsExceeded, cascades up. Inspect supervisor for restarts_count.
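To make the sliding-window rule concrete, here is a generic sketch of such a limiter. It is not Ergo's implementation: the class name, the boundary handling (a restart exactly one period old no longer counts), and the reading that the sixth restart inside the window is the one that exceeds the limit are all assumptions.

```python
from collections import deque


class RestartWindow:
    """Allow at most `intensity` restarts within any `period_s` seconds."""

    def __init__(self, intensity: int = 5, period_s: float = 5.0):
        self.intensity = intensity
        self.period_s = period_s
        self.restarts = deque()  # timestamps of restarts still inside the window

    def allow(self, now_s: float) -> bool:
        # Forget restarts that have slid out of the window.
        while self.restarts and now_s - self.restarts[0] >= self.period_s:
            self.restarts.popleft()
        if len(self.restarts) >= self.intensity:
            return False  # limit exceeded: the supervisor would give up here
        self.restarts.append(now_s)
        return True


window = RestartWindow()
print([window.allow(t) for t in (0, 1, 2, 3, 4)])  # five restarts fit
print(window.allow(4.5))                            # sixth within 5 s does not
```

A child that crashes once a minute never trips the limit; a child that crashes on every start trips it within seconds, which is why very young processes are the signal to look for.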
Pool backpressure: Capacity = PoolSize * WorkerMailboxSize. All full = messages dropped. Inspect pool for messages_unhandled.
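The capacity formula as arithmetic, in a short sketch. The function name and the numbers are hypothetical, and treating a non-positive WorkerMailboxSize as unbounded is an assumption carried over from the MailboxSize = -1 convention.

```python
def pool_capacity(pool_size: int, worker_mailbox_size: int):
    """Capacity = PoolSize * WorkerMailboxSize; None means unbounded."""
    if worker_mailbox_size <= 0:
        return None  # assumption: same "unlimited" convention as MailboxSize = -1
    return pool_size * worker_mailbox_size


# Hypothetical pool: 8 workers, 100 slots each.
print(pool_capacity(8, 100))
```

Once every worker mailbox is full, additional messages count toward messages_unhandled rather than queueing.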
Connection pool: Default 3 TCP connections per peer. Same sender = same TCP (ordered). Different senders = parallel. Pool size determined by acceptor (receiving side) during handshake. PoolDSN non-null = this side dialed. Reconnections counter = instability.
Goroutine labels (with -tags=pprof): Actors: {"pid":"<ABC.0.1005>"}. Meta: {"meta":"Alias#...", "role":"reader"}.
Error counters in node_info: SendErrorsLocal (process unknown/terminated/mailbox full), SendErrorsRemote (connection failures), CallErrorsLocal, CallErrorsRemote. Non-zero = problems.
Connection counters in network_info: ConnectionsEstablished - ConnectionsLost = active. Growing ConnectionsLost = instability. HandshakeErrors = auth problems.
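A sketch of checking both counter groups from tool output. It assumes node_info and network_info results have been parsed into flat dictionaries keyed by the counter names above; the real response shape may differ, and the sample values are made up.

```python
ERROR_COUNTERS = (
    "SendErrorsLocal",
    "SendErrorsRemote",
    "CallErrorsLocal",
    "CallErrorsRemote",
)


def nonzero_errors(node_info: dict) -> dict:
    """Return only the error counters that are above zero."""
    return {k: node_info[k] for k in ERROR_COUNTERS if node_info.get(k, 0) > 0}


def active_connections(network_info: dict) -> int:
    """ConnectionsEstablished - ConnectionsLost."""
    return network_info["ConnectionsEstablished"] - network_info["ConnectionsLost"]


# Hypothetical parsed results.
print(nonzero_errors({"SendErrorsLocal": 0, "SendErrorsRemote": 7, "CallErrorsLocal": 0}))
print(active_connections({"ConnectionsEstablished": 12, "ConnectionsLost": 9}))
```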
Shutdown: processes that spend more than 5s on a single message are logged. state=running with queue=0 means the process is stuck in a callback.
Tools Quick Reference
Profiling (use filter/exclude on remote nodes)
pprof_goroutines debug=1 filter="ProcessRun" exclude="toolPprof" limit=20 # actor goroutines
pprof_goroutines debug=2 filter="<behavior>" limit=5 # specific actor trace
pprof_cpu duration=5 exclude="runtime" limit=15 # CPU hotspots
pprof_heap limit=20 filter="myapp" # memory allocators
All pprof tools support filter (include matching) and exclude (remove matching) by substring.
Response header shows total N, matched M, showing K.
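The filter/exclude semantics and the three header numbers can be illustrated with a small sketch. This is a model of the described behavior, not the server's code: the function name, the order in which include and exclude are applied, and the sample stack lines are assumptions.

```python
def select(entries, include=None, exclude=None, limit=None):
    """Return (total, matched, shown) using plain substring matching."""
    matched = [
        e for e in entries
        if (include is None or include in e)
        and (exclude is None or exclude not in e)
    ]
    shown = matched if limit is None else matched[:limit]
    return len(entries), len(matched), shown


# Hypothetical one-line goroutine summaries.
stacks = [
    "goroutine 12: ProcessRun myapp.(*Worker).HandleMessage",
    "goroutine 13: ProcessRun toolPprof",
    "goroutine 14: runtime_pollWait net.(*conn).Read",
    "goroutine 15: ProcessRun myapp.(*Router).HandleCall",
]
total, matched, shown = select(stacks, include="ProcessRun", exclude="toolPprof", limit=1)
print(f"total {total}, matched {matched}, showing {len(shown)}")
```

When matched is much larger than showing, tighten the filter before raising limit, especially on remote nodes.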
Network
network_ping name=<node> # quick alive check with RTT
network_node_info node=A name=B # A's view of connection to B
network_node_info node=B name=A # B's view (always check BOTH sides)
Connection info includes Reconnections counter -- non-zero indicates instability.
Samplers
sample_start tool=<tool> arguments={...} interval_ms=2000 duration_sec=60 linger_sec=30
sample_listen log_levels=["warning","error"] duration_sec=60 linger_sec=30
sample_read sampler_id=<id>
sample_list # shows "completed, lingering Ns"
sample_stop sampler_id=<id> # immediate terminate
linger_sec (default 30): sampler stays alive after completion for data retrieval. No rush to read before duration expires.
Proxy samplers: sample_start with node=X spawns the sampler on node X. The tool runs locally on X. To read/list/stop, sample_read/sample_list/sample_stop MUST also include node=X.
Passive samplers also capture messages and calls sent to them -- use as test receivers: send_message to=<sampler_id> type_name=X message={...}, then sample_read to see captured entries with from, type, message.
Stopping: read data first (sample_read), present to user, then sample_stop. Exception: user says "just kill it".
Typed Messages
message_types filter="order" # discover types
message_type_info type_name="TestOrder" # inspect fields
send_message to=<target> type_name="TestOrder" message={...} # send typed struct
call_process to=<target> type_name="GetStatusRequest" request={...} # sync call
Short name works (e.g., TestOrder). Go→JSON: string→"v", number→42.5, bool→true, *T→value or null, []T→[...], map→{...}, struct→nested {...}.
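A sketch of building a message payload that follows the mapping above. The struct and its fields are invented for illustration (the real fields come from message_type_info), and whether keys use Go field names or JSON tags is an assumption.

```python
import json

# Hypothetical Go type, for illustration only:
#   type TestOrder struct {
#       ID       string
#       Amount   float64
#       Paid     bool
#       Coupon   *string            // nil pointer -> null
#       Tags     []string
#       Meta     map[string]string
#       Customer Customer           // nested struct -> nested object
#   }
message = {
    "ID": "A-17",
    "Amount": 42.5,
    "Paid": True,
    "Coupon": None,
    "Tags": ["rush"],
    "Meta": {"src": "web"},
    "Customer": {"Name": "Ann"},
}

payload = json.dumps(message)
print(payload)
```

The resulting JSON object is what goes into message={...} for send_message or request={...} for call_process.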
Proxy Timeout
All tools accept timeout parameter (seconds, default 30, max 120) for remote calls. Use higher timeout for heavy operations:
pprof_cpu node=X duration=10 timeout=60
pprof_goroutines node=X debug=2 limit=100 timeout=60
Diagnostic Playbooks
1. Cluster Health
- cluster_nodes -- discover all
- In parallel: node_info node=X + network_ping name=X for every node
- Table: Node, Uptime, Processes, Zombee, Heap, Errors, Ping RTT
- Flag anomalies by comparing rows
2. Performance Bottleneck
- process_list sort_by=mailbox limit=10 -- find suspects
- Pattern: High mailbox + high drain = overloaded. High mailbox + low drain = slow callbacks
- pprof_cpu duration=5 exclude="runtime" -- where is CPU spent?
- process_info on suspect -- confirm with queue sizes
3. CPU Profiling
Snapshot: pprof_cpu duration=5 limit=20 -- top functions
Application only: pprof_cpu duration=5 filter="myapp" exclude="runtime" -- filter noise
Continuous (goroutine sampling):
sample_start tool=pprof_goroutines arguments={"debug":1,"filter":"ProcessRun","exclude":"toolPprof","limit":20} interval_ms=500 duration_sec=30
4. Memory Growth
- pprof_heap limit=20 -- top allocators (inuse + cumulative alloc columns)
- runtime_stats -- heap_alloc, num_gc
- Trend:
sample_start tool=runtime_stats interval_ms=5000 duration_sec=300
5. Process Leak
- node_info -- Spawned vs Terminated
- process_list max_uptime=60 limit=50 -- recent spawns
- Trend:
sample_start tool=node_info interval_ms=10000
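Once the sampler has collected several node_info snapshots, the Spawned vs Terminated comparison can be turned into a trend check. This is a sketch under assumptions: the snapshots are taken to be dictionaries with Spawned and Terminated keys, and the helper names, the growth threshold, and the sample values are hypothetical.

```python
def live_counts(samples):
    """Processes alive at each sample: Spawned - Terminated."""
    return [s["Spawned"] - s["Terminated"] for s in samples]


def is_leaking(samples, min_growth: int = 1) -> bool:
    """True when the live count grows by at least min_growth at every step."""
    live = live_counts(samples)
    return len(live) >= 2 and all(b - a >= min_growth for a, b in zip(live, live[1:]))


# Hypothetical snapshots taken 10 s apart.
samples = [
    {"Spawned": 1000, "Terminated": 900},
    {"Spawned": 1200, "Terminated": 1050},
    {"Spawned": 1400, "Terminated": 1180},
]
print(live_counts(samples), is_leaking(samples))
```

Steady churn (Spawned and Terminated rising together) is normal; a live count that climbs at every sample is the leak signal.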
6. Restart Loop
- process_list max_uptime=10 limit=50 -- very young processes
- sample_listen log_levels=["error","panic"] duration_sec=60
- process_inspect on supervisor -- restarts_count
7. Zombie Processes
- process_list state=zombee
- pprof_goroutines debug=2 filter="<behavior_name>" limit=5 -- find by behavior name
8. Network Issues
Always check both sides:
network_node_info node=A name=B # A's view
network_node_info node=B name=A # B's view
| Symptom | Root Cause |
|---|---|
| Ping fails or RTT > 1s | Connection problem or overloaded node |
| MessagesOut >> MessagesIn (cross-check) | Silent data loss |
| High Reconnections | Unstable connection |
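
The MessagesOut vs MessagesIn cross-check from the table can be sketched as a comparison of the two views. It assumes each network_node_info result has been parsed into a dictionary with MessagesOut and MessagesIn keys; the function name, the tolerance for in-flight messages, and the numbers are hypothetical.

```python
def unaccounted(sender_view: dict, receiver_view: dict, tolerance: int = 0) -> int:
    """Messages the sender counted as sent that the receiver never counted as received.

    Gaps at or below `tolerance` are treated as in-flight traffic and reported as 0.
    """
    gap = sender_view["MessagesOut"] - receiver_view["MessagesIn"]
    return gap if gap > tolerance else 0


# Hypothetical views: A's counters for its link to B, and B's counters for its link to A.
a_view = {"MessagesOut": 1000, "MessagesIn": 40}
b_view = {"MessagesOut": 40, "MessagesIn": 400}
print(unaccounted(a_view, b_view))   # A -> B direction
print(unaccounted(b_view, a_view))   # B -> A direction
```

Run the check in both directions; loss is often asymmetric.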
9. Goroutine Investigation
pprof_goroutines debug=1 filter="ProcessRun" exclude="toolPprof" limit=20 # actors only
pprof_goroutines debug=2 filter="waitResponse" limit=10 # stuck in Call
pprof_goroutines debug=1 exclude="runtime_pollWait" limit=30 # non-IO goroutines
10. Event System
- event_list utilization_state=no_subscribers -- wasted publishing
- event_list sort_by=subscribers limit=10 -- highest fanout
- sample_listen event=<name> duration_sec=30 -- see data
Build Tags
| Tag | Enables | Without It |
|---|---|---|
| -tags=pprof | Per-process goroutine traces | pid param errors |
| -tags=latency | MailboxLatency measurement | Returns -1 |
Common Mistakes
- Connecting before querying -- never use network_connect just to query, the framework connects automatically
- Unfiltered pprof on remote -- too large, will time out. Always use filter/exclude
- One-sided network check -- always compare both sides of the connection
- Snapshot vs trend -- use samplers for trends, not just single queries
- Missing timeout -- use timeout=60 for heavy remote operations
- pprof "not found" -- sleeping processes park goroutines, use a sampler to poll
- Dumping everything -- always use sort_by, limit, filter