Explore AI Agent Skills & Claude Prompts

Render the leadership value-prop ledger for the inference effort: the deployable wins vs the FlashInfer-TRTLLM + best-tuned-vLLM 0.21/0.22 baseline, grouped DONE / IN-PROGRESS / NOT-DONE / CLOSED-NEGATIVE, each row data-backed by a perf-lake campaign (live sol_rigor + verdict tier), plus the ranked GRIND FRONTIER of next levers (the always-be-grinding performance ratchet). Joins the curated perf-tune-report/configs/value-findings.yaml registry with the live campaigns via `perftunereport value_view`. Flags any finding whose backing campaign is missing or ungrounded. Use when you need to show value / report to leadership / answer "what have we uncovered that we can deploy, revalidate, or pursue". Triggers on "value ledger", "value prop", "show value", "show the value prop", "leadership view", "what wins do we have", "vs flashinfer", "value-findings", "what can we deploy", or any combination of "value / wins / findings / leadership / report" with "inference / vllm / flashinfer / perf".

inference-perf-bench

Canonical inference perf-bench skill (formal name. The colloquial alias is `ai-bench` - identical behaviour). Drives NVIDIA AIPerf + the replay-playback dataset against an in-cluster vLLM endpoint to measure TTFT, ITL, throughput, tok/s/user, request latency, and prefix cache hit rate. Iterative 9-phase workflow. Use when promoting a model to staging/prod, after vllmArgs / vLLM / KV-cache-dtype changes, or for A/B comparison across configs. Triggers on "perf-bench", "AIPerf", "Replay Playback", "throughput sweep", "TTFT P95", "concurrency sweep on inference", "/run-perf-bench", "benchmark throughput", "benchmark latency", "run aiperf", "performance test", or any combination of "perf / latency / throughput / tps / TTFT / ITL" with "inference / vllm / serverless / bench".

inference-aa-workload

Reproduce the Artificial Analysis (AA) language-model performance workload shapes against an OpenAI-compatible chat endpoint using NVIDIA AIPerf. Drives the three AA text shapes (1k input / >=1k answer, 10k / >=1.5k, 100k / >=2k) with temperature 0, top_p 1, and the vLLM-style min_tokens + ignore_eos "at least N answer tokens" guarantee. Two modes: synthetic (AIPerf generates the prompt at the token mean) and dataset-replay (a generated o200k_base-counted JSONL replayed identically). Ships a self-contained script and an `aa` perf_tune_report cell_run backend. Use when comparing a hosted inference endpoint to AA leaderboard numbers or reproducing AA's methodology. Triggers on "artificial analysis workload", "AA benchmark", "AA 1k/10k/100k shapes", "reproduce artificialanalysis.ai", "AA methodology", "aa-10k", "compare to AA leaderboard", or any combination of "artificial analysis / AA" with "workload / shape / benchmark / dataset".

zymtrace-anchored-query

Reusable wrapper for the knowledge-base-first SQL pattern, adapted to the zymtrace ClickHouse profiling backend. Operator names the metric / question / time range. Skill anchors the `zymtrace_profiling.events` schema first (DESCRIBE + label-value + cardinality probes), derives the safe narrow SQL, runs it via `kubectl port-forward` + `curl -X POST`, and saves the raw payload to a provenance-bearing bundle per the perf-lake-contract. Workload-agnostic. The ClickHouse cousin of `prometheus-anchored-query`. Triggers on "zymtrace anchored query", "clickhouse anchored query", "zymtrace query", "save zymtrace payload", "anchored zymtrace", "anchored clickhouse", or any combination of "zymtrace / clickhouse / profile" with "anchored / safe / provenance / query / saved-payload".

analyze-zymtrace-workload

Investigate a GPU or CPU workload through the zymtrace MCP. The MCP does most of the analysis. This skill enforces the cross-view discipline -- always pull the matching opposite-side flamegraph (CPU for GPU workloads, GPU for CPU workloads) with the same filter. Most bottlenecks hide on the side the customer didn't ask about. Triggers on "analyze my GPU workload", "where's the bottleneck in vllm", "investigate my training job", "find the hot kernel", "GPU isn't saturated", "investigate using flamegraph", "use zymtrace mcp to analyze", or any combination of "analyze / investigate / bottleneck / hot kernel" with "GPU / CPU / vllm / training / flamegraph / zymtrace".