name: production-log-inspection
description: >-
Inspect the Bike Index production Rails log at tmp/production.log —
JSON-per-request format produced by Lograge, far too large to read
end-to-end. Trigger when the user asks to
review, investigate, audit, or pull stats from a production log file
(slow requests, error spikes, status-code distribution, exception
stack traces, per-endpoint hit counts, suspicious traffic). Also
triggers when chasing a specific incident from logs (e.g. "what
happened at 04:42 UTC?", "why did /search/registrations 500?").
Honeybadger MCP is the right tool for aggregated exception triage
across time; this skill is for ad-hoc analysis of a specific log
file already on disk.
Inspecting production logs
The Bike Index production log lives at tmp/production.log. It's large, so:
- Never
Readorcatthe whole file. Usegrep/awk/head/tailto slice. - Synthesize by default; paste at most one short example when it's load-bearing. Lograge lines are long and mostly structural JSON — dumping multiple into a reply is unreadable and burns context. A single representative line for an exception or a slow request is fine; a wall of grep output is not.
Log line format
Each request produces one JSON-content line (Lograge), prefixed by a syslog-style header:
I, [2026-04-27T04:42:04.467657 #277641] INFO -- : [0591f694-…] {"method":"GET","path":"/search/registrations","format":"html","controller":"Search::RegistrationsController","action":"index","status":500,"allocations":1073409,"duration":51237.65,"view":0.0,"db":51234.09,"remote_ip":"71.212.12.114","u_id":148942,"params":{…},"@timestamp":"…","@version":"1","message":"…"}
Key fields:
| Field | Notes |
|---|---|
duration, view, db |
Milliseconds. A 60s query is 60000, not 60. |
status |
HTTP status code |
controller, action, path |
Routing info |
u_id |
User id (null for anonymous) |
remote_ip |
Forwarded client IP |
allocations |
Ruby object allocations — high allocations + long duration is a strong signal of a bad query plan |
params |
Object literal — may contain commas/colons; don't split lines on , |
When a request raises, you also get separate, non-JSON lines with the same request id prefix containing the exception class, message, and stack trace, followed by another JSON line for /500 (the ErrorsController#server_error render). That secondary /500 line is noise for most analyses — it inflates 500 counts unless you filter it out.
Common queries
Time range covered.
head -1 tmp/production.log
tail -1 tmp/production.log
Status-code distribution.
grep -oE '"status":[0-9]+' tmp/production.log | sort | uniq -c | sort -rn
Slow requests over a threshold (in ms). Use awk rather than a regex — durations are floats with arbitrary digit counts and a regex like "duration":[5-9][0-9]{4} will silently miss values:
awk -F'"duration":' '$2 != "" {split($2,a,","); if (a[1]+0 > 60000) print}' tmp/production.log
Pipe that into grep -oE '"path":"[^"]+"' | sort | uniq -c | sort -rn to group by path, or grep -oE '"status":[0-9]+' for status mix.
Distribution stats (p50/p90/p99).
awk -F'"duration":' '$2 != "" {split($2,a,","); print a[1]+0}' tmp/production.log \
| sort -n \
| awk 'BEGIN{c=0}{v[c++]=$1; s+=$1} END{print "n="c, "p50="v[int(c*.5)], "p90="v[int(c*.9)], "p99="v[int(c*.99)], "max="v[c-1], "mean="s/c}'
Most-hit endpoints.
grep -oE '"controller":"[^"]+","action":"[^"]+"' tmp/production.log | sort | uniq -c | sort -rn | head -20
5xx counts by endpoint. Filter to "status":5 first to keep the line set small:
grep '"status":5' tmp/production.log \
| grep -oE '"controller":"[^"]+","action":"[^"]+"' | sort | uniq -c | sort -rn
Finding exception stack traces
The JSON request line tells you a request 500'd but not why. Grab the request id from the JSON line, then grep -n the whole file for that id — Rails writes the exception class and backtrace as separate lines with the same [request-id] prefix:
grep -n "0591f694-49b4-4c9c-b49e-4fef89ae8d7b" tmp/production.log
Look for lines starting with E, (ERROR severity) and lines beginning [<id>] ActiveRecord::… / [<id>] ActionView::… / [<id>] Caused by: / [<id>] app/…:NN. The trailing app/… lines are the user-code frames (Rails strips gem frames by default).
To find clusters of the same exception, search for the exception class plus a line of context:
grep -B0 -A1 "PG::TRSerializationFailure" tmp/production.log | head -40
Pitfalls
- Each errored request creates ≥2 JSON lines — the original + the rendered
/500page. Counting"status":500over-counts unless you exclude"controller":"ErrorsController"or filter to one of them. grep | sort | uniqon JSON fragments is fine for counting, but don'tawk -F,orcut -d,on a whole line —params:{…}contains commas. Anchor splits to the field name (-F'"duration":').- Durations are ms, not seconds. A "slow" search is
> 60000, not> 60. - The replication-conflict cancel error (
PG::TRSerializationFailure: canceling statement due to conflict with recovery) means the replica killed the query because WAL recovery was blocked — it's a symptom of a slow query holding the replica too long, not a bug in the SQL itself. Look for the underlying duration to find the real cause. - Bots and scanners produce a lot of noise in 4xx and 5xx counts (path-traversal probes,
.well-known/*lookups, npm CDN-style 404s). Eyeball the path before treating an error spike as a real issue. - The Bash tool truncates long lines in piped output with
[... omitted end of long line], even when the pipeline ends in a file redirect. Fields at the end of a Lograge JSON line (@timestamp,@version,message) get clipped if you stage withgrep <line-pattern> | grep -oE '<trailing-field>'. Workarounds: (a)awkdirectly on the file — no pipe, full lines preserved; (b)grep -oE '<pattern>' file.logso the match itself (short) is what enters the pipeline, not the whole line. - One scanner IP can dominate counts. A single bot can rack up tens of thousands of 4xx/5xx and make a real user-facing issue look bigger than it is. Always check
"remote_ip"distribution before treating an error spike as a real signal — group by IP first, then re-run analyses excluding the dominant scanner.
When jq is and isn't worth it
The lines are JSON, so jq is tempting — but you have to strip the syslog prefix first, and on a full-day log it's noticeably slower than grep/awk. Use jq when you need to group by two or more JSON fields at once, or when params/payload structure matters; otherwise awk -F'"key":' patterns above are faster.
# strip prefix, then jq — only worth it for multi-field aggregations
sed -E 's/^[^{]+//' tmp/production.log \
| jq -r 'select(.status==500) | "\(.controller)#\(.action)\t\(.duration)"' \
| sort | uniq -c | sort -rn | head
Honeybadger vs. log files
For "what's currently broken in production" or "is exception X happening more this week", Honeybadger MCP (mcp__honeybadger__list_faults, get_fault, query_insights) is the right tool — it has aggregation, dedup, and time-series. Reach for this skill when:
- The user has already downloaded or pointed you at a specific log file, or
- The question is about requests that didn't raise (slow successful queries, traffic patterns, status-code mix), which Honeybadger doesn't capture.