ergo-devops - SKILL.md Agent Skill

name: ergo-devops description: Diagnose production issues in running Ergo Framework nodes via MCP server. Covers performance bottlenecks, process leaks, memory issues, network problems, event fanout, restart loops, and stuck processes. Use when investigating issues on live Ergo nodes.

Ergo MCP DevOps

Diagnose production issues in running Ergo Framework distributed systems via MCP.

Network Transparency

Every tool accepts node parameter -- the framework establishes connections automatically. Never call network_connect before querying. Just pass node=<name>.

All nodes are equal. The MCP endpoint is just a proxy, not a primary node.

MCP Connection

claude mcp add --transport http ergo http://localhost:9922/mcp

Process Model

States: Init -> Sleep <-> Running -> Terminated. WaitResponse during sync Call. Zombee on forced kill.

Metric	Meaning
MessagesMailbox	Current queue depth
MailboxLatency	Age of oldest message (-tags=latency)
RunningTime/Uptime	Utilization ratio (NOT CPU)
Drain (MessagesIn/Wakeups)	~1 = idle, >>1 = overloaded

Liveness Score

Detects blocked callbacks (mutex, channel, IO inside handler).

Formula: Liveness = RunningTime_ns / (Uptime_s * MailboxLatency_ns)

Collect:

process_list min_mailbox=1 min_uptime=60 min_messages_in=1 sort_by=mailbox_latency limit=10

Filters: pending messages (latency > 0), alive > 60s, received messages (not idle). Exclude zombees from results. Compute liveness for each.

Latency	Liveness	Diagnosis	Action
high	high (>0.01)	Overloaded but alive	Scale out
high	low (<0.001)	Callback blocked	Find blocking call (pprof)
0	n/a	Healthy, mailbox empty	Nothing
transient (ms)	any	Caught mid-callback	Normal

Same latency, different cause: 18s latency + RunningTime 5000s = overloaded (liveness 0.027). 18s latency + RunningTime 82us = stuck (liveness ~0).

MailboxSize = -1 means unlimited (default). Can accumulate messages indefinitely → memory growth. If > 0, ErrProcessMailboxFull on overflow.

Default Call timeout = 5 seconds. WaitResponse > 5s is abnormal.

LogMessages in node_info: [0]=Trace [1]=Debug [2]=Info [3]=Warning [4]=Error [5]=Panic. Watch [4] and [5].

Framework Internals

Important Delivery errors: ErrProcessUnknown = doesn't exist, ErrProcessMailboxFull = overloaded, ErrTimeout = ACK lost (5s), ErrNoConnection = unreachable.

Supervisor restart intensity: Sliding window (default 5 restarts / 5 sec). Exceeded = supervisor dies with ErrSupervisorRestartsExceeded, cascades up. Inspect supervisor for restarts_count.

Pool backpressure: Capacity = PoolSize * WorkerMailboxSize. All full = messages dropped. Inspect pool for messages_unhandled.

Connection pool: Default 3 TCP connections per peer. Same sender = same TCP (ordered). Different senders = parallel. Pool size determined by acceptor (receiving side) during handshake. PoolDSN non-null = this side dialed. Reconnections counter = instability.

Goroutine labels (with -tags=pprof): Actors: {"pid":"<ABC.0.1005>"}. Meta: {"meta":"Alias#...", "role":"reader"}.

Error counters in node_info: SendErrorsLocal (process unknown/terminated/mailbox full), SendErrorsRemote (connection failures), CallErrorsLocal, CallErrorsRemote. Non-zero = problems.

Connection counters in network_info: ConnectionsEstablished - ConnectionsLost = active. Growing ConnectionsLost = instability. HandshakeErrors = auth problems.

Shutdown: Processes >5s per message are logged. state=running, queue=0 = stuck in callback.

Tools Quick Reference

Profiling (use filter/exclude on remote nodes)

pprof_goroutines debug=1 filter="ProcessRun" exclude="toolPprof" limit=20   # actor goroutines
pprof_goroutines debug=2 filter="<behavior>" limit=5                         # specific actor trace
pprof_cpu duration=5 exclude="runtime" limit=15                              # CPU hotspots
pprof_heap limit=20 filter="myapp"                                           # memory allocators

All pprof tools support filter (include matching) and exclude (remove matching) by substring. Response header shows total N, matched M, showing K.

Network

network_ping name=<node>                           # quick alive check with RTT
network_node_info node=A name=B                    # A's view of connection to B
network_node_info node=B name=A                    # B's view (always check BOTH sides)

Connection info includes Reconnections counter -- non-zero indicates instability.

Samplers

sample_start tool=<tool> arguments={...} interval_ms=2000 duration_sec=60 linger_sec=30
sample_listen log_levels=["warning","error"] duration_sec=60 linger_sec=30
sample_read sampler_id=<id>
sample_list                                         # shows "completed, lingering Ns"
sample_stop sampler_id=<id>                         # immediate terminate

linger_sec (default 30): sampler stays alive after completion for data retrieval. No rush to read before duration expires.

Proxy samplers: sample_start with node=X spawns the sampler on node X. The tool runs locally on X. To read/list/stop, sample_read/sample_list/sample_stop MUST also include node=X.

Passive samplers also capture messages and calls sent to them -- use as test receivers: send_message to=<sampler_id> type_name=X message={...}, then sample_read to see captured entries with from, type, message.

Stopping: read data first (sample_read), present to user, then sample_stop. Exception: user says "just kill it".

Typed Messages

message_types filter="order"                                    # discover types
message_type_info type_name="TestOrder"                         # inspect fields
send_message to=<target> type_name="TestOrder" message={...}    # send typed struct
call_process to=<target> type_name="GetStatusRequest" request={...}  # sync call

Short name works (e.g., TestOrder). Go→JSON: string→"v", number→42.5, bool→true, *T→value or null, []T→[...], map→{...}, struct→nested {...}.

Proxy Timeout

All tools accept timeout parameter (seconds, default 30, max 120) for remote calls. Use higher timeout for heavy operations:

pprof_cpu node=X duration=10 timeout=60
pprof_goroutines node=X debug=2 limit=100 timeout=60

Diagnostic Playbooks

1. Cluster Health

cluster_nodes -- discover all
In parallel: node_info node=X + network_ping name=X for every node
Table: Node, Uptime, Processes, Zombee, Heap, Errors, Ping RTT
Flag anomalies by comparing rows

2. Performance Bottleneck

process_list sort_by=mailbox limit=10 -- find suspects
Pattern: High mailbox + high drain = overloaded. High mailbox + low drain = slow callbacks
pprof_cpu duration=5 exclude="runtime" -- where is CPU spent?
process_info on suspect -- confirm with queue sizes

3. CPU Profiling

Snapshot: pprof_cpu duration=5 limit=20 -- top functions

Application only: pprof_cpu duration=5 filter="myapp" exclude="runtime" -- filter noise

Continuous (goroutine sampling):

sample_start tool=pprof_goroutines arguments={"debug":1,"filter":"ProcessRun","exclude":"toolPprof","limit":20} interval_ms=500 duration_sec=30

4. Memory Growth

pprof_heap limit=20 -- top allocators (inuse + cumulative alloc columns)
runtime_stats -- heap_alloc, num_gc
Trend: sample_start tool=runtime_stats interval_ms=5000 duration_sec=300

5. Process Leak

node_info -- Spawned vs Terminated
process_list max_uptime=60 limit=50 -- recent spawns
Trend: sample_start tool=node_info interval_ms=10000

6. Restart Loop

process_list max_uptime=10 limit=50 -- very young processes
sample_listen log_levels=["error","panic"] duration_sec=60
process_inspect on supervisor -- restarts_count

7. Zombie Processes

process_list state=zombee
pprof_goroutines debug=2 filter="<behavior_name>" limit=5 -- find by behavior name

8. Network Issues

Always check both sides:

network_node_info node=A name=B   # A's view
network_node_info node=B name=A   # B's view

Symptom	Root Cause
Ping fails or RTT > 1s	Connection problem or overloaded node
MessagesOut >> MessagesIn (cross-check)	Silent data loss
High Reconnections	Unstable connection

9. Goroutine Investigation

pprof_goroutines debug=1 filter="ProcessRun" exclude="toolPprof" limit=20    # actors only
pprof_goroutines debug=2 filter="waitResponse" limit=10                       # stuck in Call
pprof_goroutines debug=1 exclude="runtime_pollWait" limit=30                  # non-IO goroutines

10. Event System

event_list utilization_state=no_subscribers -- wasted publishing
event_list sort_by=subscribers limit=10 -- highest fanout
sample_listen event=<name> duration_sec=30 -- see data

Build Tags

Tag	Enables	Without It
`-tags=pprof`	Per-process goroutine traces	pid param errors
`-tags=latency`	MailboxLatency measurement	Returns -1

Common Mistakes

Connecting before querying -- never use network_connect just to query, framework connects automatically
Unfiltered pprof on remote -- too large, will timeout. Always use filter/exclude
One-sided network check -- always compare both sides of connection
Snapshot vs trend -- use samplers for trends, not just single queries
Missing timeout -- use timeout=60 for heavy remote operations
pprof "not found" -- sleeping processes park goroutines, use sampler to poll
Dumping everything -- always use sort_by, limit, filter