name: test-macafm description: Run the maclocal-api (AFM/MLX) test suite — automated assertions and smart analysis. Use when asked to test, validate, regression-check, or benchmark AFM before release, after code changes, or for model onboarding.
test-macafm
Run the maclocal-api test suite: automated pass/fail assertions and smart analysis (the smart suite's AI judge is opt-in — default off; ask the user before enabling it).
Triggers
Use this skill when the user asks to:
- Test or validate the server (e.g., "run the tests", "test AFM", "validate the build")
- Regression check after code changes
- Onboard a new model (verify it works correctly with the server)
- Release check before tagging or pushing
- Benchmark or profile model performance
First Questions to Ask
- Model — Which model to test? (Ask if not specified. Default: whatever's loaded.)
- Tier — smoke / standard / full? (Suggest based on context.)
- Binary path — Default
.build/release/afm. Ask if user has a custom build location. - Port — Default 9998. Ask if user's server is on a different port.
- Server running? — Is the server already running, or should tests start it?
Tier Decision Tree
| Tier | Time | When to use | What runs |
|---|---|---|---|
| smoke | ~2 min | Quick sanity check, any small model, CI | test-assertions.sh --tier smoke |
| standard | ~15 min | After feature changes, mid-size model | test-assertions.sh --tier standard |
| full | ~60 min | Release validation, production model | test-assertions.sh --tier full + mlx-model-test.sh (smart suite; AI judge opt-in — ask the user) with test-llm-comprehensive.txt + promptfoo agentic evals |
Quick guide:
- "Just run a quick test" → smoke
- "Test before merging" → standard
- "Full release validation" or "onboard new model" → full
- User doesn't specify → suggest standard
Execution Workflow
1. Build Check
# Ensure release build is current
swift build -c release
2. Start Server (if not running)
MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
.build/release/afm mlx -m MODEL --port 9998 \
--tool-call-parser afm_adaptive_xml \
--enable-prefix-caching \
--enable-grammar-constraints &
# Wait for server to be ready
until curl -sf http://127.0.0.1:9998/v1/models >/dev/null 2>&1; do sleep 1; done
Recommended flags for testing:
--tool-call-parser afm_adaptive_xml— best tool call parser with JSON-in-XML fallback--enable-prefix-caching— 67-79% prompt token savings on repeated requests--enable-grammar-constraints— EBNF constrained decoding forces valid XML tool calls, improving success from 60% to 100% on realistic workloads
3. Run Automated Assertions
./Scripts/test-assertions.sh --tier TIER --model MODEL --port 9998
Interpret results immediately. If any FAIL, investigate before proceeding.
4. Run Smart Analysis (full tier only)
The smart analysis harness manages its own server (port 9877) — do NOT pass --port.
It uses test-llm-comprehensive.txt which has an [all] baseline prompt and [@ label]
template sections.
The AI judge is OPT-IN — default OFF. By default, run the smart suite WITHOUT an AI
judge: it executes every prompt and records the model's raw outputs to the report for
manual review, with no claude/codex scoring. Before running, ask the user (e.g.
via AskUserQuestion) whether to enable the AI judge — it adds latency/cost and invokes an
external CLI:
"Run the smart suite with an AI judge (claude) scoring each response, or without it (just record outputs for manual review)? Default: without."
Default — no AI judge (records outputs only; omit --smart entirely):
AFM_BIN=.build/release/afm ./Scripts/mlx-model-test.sh \
--model MODEL \
--prompts Scripts/test-llm-comprehensive.txt
Only if the user opts in — append --smart 1:claude (the --smart flag accepts a batch
mode prefix and tool list):
AFM_BIN=.build/release/afm ./Scripts/mlx-model-test.sh \
--model MODEL \
--prompts Scripts/test-llm-comprehensive.txt \
--smart 1:claude
Smart analysis options (only when the AI judge is enabled):
--smart claudeor--smart codex— batch mode 0 (one big swoop, may fail on large test suites)--smart 1:claudeor--smart 1:codex— batch mode 1 (test-by-test, more reliable)--smart 1:claude,codex— run multiple AI judges--tests 1,5,10— run only specific test numbers (1-indexed)
Note: The [all] prompt runs for every test variant. With high max_tokens (e.g., 32768
on code tests), thinking models may generate very long reasoning for the baseline prompt.
Total run time for full suite: ~45-90 min depending on model speed.
5. Run GPU Shader Profile (full tier, or when investigating perf)
Generates an interactive HTML report with measured DRAM bandwidth, GPU utilization/power timelines, and per-kernel Metal shader names from xctrace Shader Timeline.
One-time setup (creates custom Instruments template with Shader Timeline enabled):
python3 Scripts/create-shader-template.py
Run the profile (no server needed — uses single-prompt mode):
python3 Scripts/gpu-profile-report.py MODEL [max_tokens] [prompt]
# Default: 4096 tokens, built-in GPU analysis prompt
# Example: python3 Scripts/gpu-profile-report.py mlx-community/Qwen3.5-35B-A3B-4bit
This does everything automatically:
- Warms up mactop (bandwidth monitor, no sudo)
- Runs inference with
--gpu-profile --gpu-trace 15 - Collects 300ms bandwidth/GPU/power samples via PTY during inference
- Extracts shader kernel names from the xctrace trace
- Generates
/tmp/afm-gpu-profile.htmland opens in browser
Or use individual flags on any AFM invocation:
afm mlx -m MODEL --gpu-profile -s "prompt" # Zero-overhead stats
afm mlx -m MODEL --gpu-profile-bw -s "prompt" # + mactop bandwidth (~5s)
afm mlx -m MODEL --gpu-trace 10 -s "prompt" # xctrace shader trace
Live bandwidth monitor (run in separate terminal during server requests):
./Scripts/gpu-profile.sh bandwidth
What the report shows:
- Device info (chip, memory, architecture)
- Prefill/decode tok/s with exact timing
- Memory breakdown (model weights vs KV cache)
- DRAM bandwidth timeline chart (measured via mactop)
- GPU utilization & power timeline chart
- Per-kernel Metal shader names (from Shader Timeline)
- Exact command line for reproducibility
What to look for:
- GPU utilization <100% during decode → CPU-GPU pipeline bubbles
- Bandwidth utilization >80% → memory-bound, kernel optimization won't help
- Bandwidth utilization <20% with MoE model → normal (only active experts read)
- Key kernels:
affine_qmv_fast(decode bottleneck),steel_gemm_fused(prefill),sdpa_vector(attention)
5b. API-Based GPU Profiling (per-request, no CLI flags needed)
Clients can request GPU profiling data via the X-AFM-Profile HTTP header.
No server flags required — works on any running AFM server.
Two levels:
# Summary: GPU power, memory, bandwidth, tok/s
curl http://127.0.0.1:9999/v1/chat/completions \
-H "X-AFM-Profile: true" \
-d '{"model":"m","messages":[{"role":"user","content":"Hi"}]}'
# Extended: summary + 300ms time-series samples (for charts/dashboards)
curl http://127.0.0.1:9999/v1/chat/completions \
-H "X-AFM-Profile: extended" \
-d '{"model":"m","messages":[{"role":"user","content":"Hi"}]}'
Response fields (afm_profile):
gpu_power_avg_w/gpu_power_peak_w— GPU power via native IOReport (no mactop)memory_weights_gib/memory_kv_gib/memory_peak_gib— memory breakdown in GiBprefill_tok_s/decode_tok_s— throughputest_bandwidth_gbs— DRAM bandwidth from IOReport power (calibrated at startup via MLX GPU stress)chip/theoretical_bw_gbs— hardware contextgpu_samples— number of 300ms readings taken
Extended adds (afm_profile_extended):
summary— same asafm_profilesamples[]— per-300ms readings:{t, bw_gbs, gpu_pct, gpu_power_w, dram_power_w}
How it works internally:
- IOReport
Energy Model+GPU Statschannels sampled every 300ms via DispatchSource timer - DRAM bandwidth derived from DRAM power using chip-specific calibration constant
- Calibration runs once at startup: 1 GiB MLX GPU stress test (~2s, async, non-blocking)
- Per-request isolation: concurrent profiled requests are guarded (second request skips gracefully)
- Zero overhead when header not sent (one string lookup per request)
- Works for both streaming (SSE event before
[DONE]) and non-streaming
What to look for:
gpu_power_peak_w~28W during decode on M3 Ultra (matches mactop)est_bandwidth_gbs~170-180 GB/s for Qwen3.5-35B-A3B-4bit (21% of 800 GB/s theoretical)- Short requests (<300ms): at least 1 sample (timer first-fires at 100ms)
afm_profileabsent from response when header not sent (no null pollution)
6. Run Promptfoo Agentic Evals (full tier, or when validating tool calling / structured output)
The promptfoo agentic eval suite tests AFM's tool-calling and structured-output across multiple server configurations and real-world agent framework schemas. It manages its own server lifecycle.
Prerequisites: promptfoo CLI must be installed (npm install -g promptfoo).
Run the full suite:
AFM_MODEL=MODEL \
AFM_BINARY=.build/arm64-apple-macosx/release/afm \
MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh all
Run individual suites:
# Just structured output tests
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh structured
# Just tool calling (all 3 parser profiles)
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh toolcall
# Just grammar constraint validation (8 server phases)
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh grammar-constraints
# Just one agent framework
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh opencode
Available modes: all, structured, structured-stress, toolcall, toolcall-quality, grammar-constraints, agentic, frameworks, opencode, pi, openclaw, hermes, default, adaptive-xml, adaptive-xml-grammar
Suite Coverage (~137 test cases across 16 configs)
| Suite | Tests | Profiles | What it validates |
|---|---|---|---|
| structured | 6 | 1 (api json_schema) | response_format=json_schema strict compliance |
| structured-stress | 4 | 1 | Nested arrays, enums, nullable types in schema |
| toolcall | 7 | 3 (default, adaptive-xml, grammar) | Basic tool call parsing: weather, time, multi-tool |
| toolcall-quality | 6 | 3 | BFCL-inspired when-to-call decisions (should model use a tool?) |
| grammar-constraints | 17 | 8 server phases | Schema + tool enforcement across: no-grammar, grammar-enabled, adaptive-xml, concurrent, prefix-cache, mixed-strict, header downgrade/enforce |
| agentic | 4 | 3 | Multi-turn coding workflow tool chains |
| frameworks | 8 | 3 | Agent framework tool shapes (OpenCode, Pi, OpenClaw, Hermes) |
| opencode | 37 | 3 | OpenCode built-in tools (primary-source derived) |
| pi | 20 | 3 | Pi coding-agent tools |
| openclaw | 12 | 3 | OpenClaw tool coverage |
| hermes | 12 | 3 | Hermes agentic framework tools |
Server Profiles (managed automatically by the script)
| Profile | AFM flags | Purpose |
|---|---|---|
default |
(none) | Baseline: auto-detected tool call format |
adaptive-xml |
--tool-call-parser afm_adaptive_xml |
Adaptive XML with JSON-in-XML fallback |
adaptive-xml-grammar |
--tool-call-parser afm_adaptive_xml --enable-grammar-constraints |
Adaptive XML + EBNF grammar enforcement |
grammar-enabled |
--enable-grammar-constraints |
Grammar without adaptive XML |
grammar-enabled-adaptive-xml |
Both flags | Regression guard: grammar + adaptive XML |
grammar-enabled-concurrent |
--enable-grammar-constraints --concurrent 2 |
Grammar under concurrency |
grammar-enabled-prefix-cache |
--enable-grammar-constraints --enable-prefix-caching |
Grammar + prefix caching interaction |
grammar-enabled-concurrent-cache |
All three flags | Full feature stack |
Custom Provider & Judges
providers/afm_provider.mjs— Custom promptfoo provider with two transports:api(OpenAI-compatible HTTP) andcli-guided-json(direct binary invocation). Supports extract modes:content,tool_calls,normalized_message,full_response. CapturesresponseHeadersfor grammar header assertions.judges/assert-grammar-header.mjs— ValidatesX-Grammar-Constraintsresponse header: expects"downgraded"when grammar not available, absent when grammar active.judges/classify-failures.mjs— Post-run AI-based failure classifier: categorizes each failure asafm_bug(server/protocol),model_quality(wrong tool/args), orharness_bug(false negative).
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
AFM_MODEL |
mlx-community/Qwen3.5-35B-A3B-4bit |
Model to test |
AFM_BINARY |
.build/arm64-apple-macosx/release/afm |
Binary path |
AFM_PROMPTFOO_OUT_DIR |
/Volumes/edata/promptfoo/data/maclocal-api/current |
Report output dir |
AFM_PROMPTFOO_PORT |
9999 |
Server port |
MACAFM_MLX_MODEL_CACHE |
(none) | Model cache dir |
Output
JSON reports per suite+profile in $AFM_PROMPTFOO_OUT_DIR:
structured-MODEL_SLUG.jsontoolcall-{default,adaptive-xml,adaptive-xml-grammar}-MODEL_SLUG.jsongrammar-{schema,tools}-{no-grammar,grammar-enabled,adaptive-xml,concurrent,prefix-cache}-MODEL_SLUG.json{agentic,frameworks,opencode,pi,openclaw,hermes}-{default,adaptive-xml,adaptive-xml-grammar}-MODEL_SLUG.json
7. Review Reports
- Assertion report:
test-reports/assertions-report-*.html - Smart analysis:
test-reports/smart-analysis-{tool}-*.md - HTML report:
test-reports/mlx-model-report-*.html - GPU profile:
/tmp/afm-gpu-profile.html(+/tmp/afm-metal.tracefor Instruments) - JSONL data:
test-reports/assertions-report-*.jsonl,test-reports/mlx-model-report-*.jsonl - Promptfoo evals:
$AFM_PROMPTFOO_OUT_DIR/{suite}-{profile}-MODEL_SLUG.json(default:/Volumes/edata/promptfoo/data/maclocal-api/current/)
7. Stop Server (if we started it)
kill %1 # or whatever the background job is
Interpreting Results
Assertion Test Failures
| Group | Common failures | What to check |
|---|---|---|
| Stop | Stop string found in output | Check MLXModelService.swift stop buffer logic, streaming vs non-streaming paths |
| Logprobs | Schema invalid, logprob > 0 | Check resolveLogprobs() and buildChoiceLogprobs() |
| Think | <think> tags in content |
Check extractThinkContent() and extractThinkTags() |
| Tools | No tool_calls, invalid JSON args | Check extractToolCallsFallback(), model's tool call format |
| Cache | cached_tokens always 0 | Check enablePrefixCaching, findPrefixLength(), PromptCacheBox |
| Concurrent | Non-200 responses | Check SerialAccessContainer locking, request queuing |
| Error | Wrong HTTP status codes | Check controller validation logic |
| Kwargs | Thinking not disabled by enable_thinking: false |
Check chat_template_kwargs merging into additionalContext in MLXModelService.swift |
| Perf | Low tok/s, high TTFT | Check model quantization, Metal kernel performance |
| OpenAI-compat | Stream usage chunk missing, logprobs absent | Check StreamingUsageChunk encoding, empty choices on final chunk |
| Guided JSON | Schema validation failure, invalid JSON | Check --guided-json / response_format pipeline, grammar constraints |
| Batch | Garbage output, wrong answers at B>1 | Check BatchScheduler, KV cache isolation, mask generation |
Smart Analysis False Positives
Known patterns where AI judges score incorrectly (see references/interpreting-scores.md):
- Stop sequences truncating output scored as "low quality" — truncation IS the expected behavior
- Empty content when stop fires on first visible token — correct behavior
- JSON mode not constraining thinking models — prompt injection, not grammar-constrained
- "Missing reasoning" when model doesn't support
<think>— correct, not a bug - Thinking model consuming entire
max_tokensbudget on reasoning with empty visible content — model behavior, not a server bug [all]baseline prompt scored low when it runs with a code/math test's highmax_tokensand system prompt — irrelevant context for the baseline prompt
Promptfoo Eval Failures
| Category | Typical pass rate | What failures mean |
|---|---|---|
| structured, structured-stress | 100% | Server bug in response_format pipeline — investigate immediately |
| toolcall (all profiles) | 100% | Server bug in tool call parsing — investigate immediately |
| toolcall-quality | ~80% | Model chose wrong tool or missed when-to-call — model quality, not server |
| grammar-schema / grammar-tools (non-concurrent) | 100% | Grammar constraint enforcement broken — server bug |
| grammar-schema / grammar-tools (concurrent) | ~50-70% | Known race condition in --concurrent 2 grammar path — not release blocker |
| grammar-header / grammar-mixed | 100% | X-Grammar-Constraints header or mixed-strict wiring broken — server bug |
| agentic | ~75-100% | Multi-turn failures are usually model quality; 0% pass = server bug |
| frameworks | 100% | Framework tool shapes must parse correctly — server bug if failing |
| opencode | ~70-80% | Complex 37-tool scenarios; model can't always pick correct tool — model quality |
| pi | ~80-90% | Model prompt injection resistance varies — model quality |
| openclaw | ~80-85% | Model quality on OpenClaw-specific schemas |
| hermes | ~90-100% | Hermes format failures on adaptive-xml profiles = parser difference, not bug |
Key rule: structured, toolcall, grammar-* (non-concurrent), frameworks suites should be 100% pass. Any failure there is a server bug. Everything else has model-quality variance.
Post-run failure classification (optional): Run judges/classify-failures.mjs on any result JSON to get AI-based afm_bug vs model_quality vs harness_bug classification.
When to Escalate
- SDPA regression: NaN or garbage in long-context tests → check MLX version, see MEMORY.md
- Tool call format mismatch: Unknown format → check
ToolCallFormat.infer()and model's config.json - Build failure: Vendor patch conflict → run
Scripts/apply-mlx-patches.sh --check
Concurrency Benchmark
Full-harness concurrency sweep that starts the server, runs warmup, tests all concurrency levels, collects GPU metrics via mactop, saves JSON results, and generates a comparison chart.
Script
Scripts/benchmarks/benchmark_afm_vs_mlxlm.py
Usage
# AFM-only concurrency sweep (recommended for quick benchmarks)
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py --afm-only
# Full AFM vs mlx-lm comparison (both servers, fair A/B)
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py
# Re-generate graph from existing results
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py --graph
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py --graph Scripts/benchmark-results/FILE.json
What it does
- Detects hardware (chip, memory)
- Starts server(s) with
--concurrent N - 60s GPU settle + multi-round warmup (JIT kernel compilation)
- Sweeps concurrency levels:
[1, 2, 4, 8, 12, 16, 20, 24, 32, 40, 50] - At each level: fires N simultaneous streaming 4096-token requests, measures aggregate tok/s, per-request tok/s, GPU power/temp/usage via mactop
- Saves JSON to
Scripts/benchmark-results/concurrency-benchmark-TIMESTAMP.json - Generates PNG chart to
Scripts/benchmark-results/concurrency-benchmark-TIMESTAMP.png
Configuration (top of script)
| Variable | Default | Purpose |
|---|---|---|
MODEL_ID |
mlx-community/Qwen3.5-35B-A3B-4bit |
Model to benchmark |
MAX_TOKENS |
4096 | Tokens per request (forces long decode) |
MAX_CONCURRENT |
50 | --concurrent flag value (must be >= max level) |
LEVELS |
[1,2,4,8,12,16,20,24,32,40,50] |
Concurrency levels to test |
AFM_PORT |
9999 | Port for AFM server |
Reference results (March 18, v0.9.7, M3 Ultra 512GB, --concurrent 28)
B Agg t/s Per-req Wall GPU% GPU W
1 118.7 118.7 34.5s 94% 28.5W
2 193.9 97.0 42.2s 93% 41.6W
4 298.4 74.6 54.9s 97% 62.7W
8 407.3 50.9 80.5s 96% 75.5W
12 493.4 41.1 99.6s 98% 83.4W
16 573.9 35.9 114.2s 99% 88.2W
20 581.6 29.1 140.8s 98% 79.1W
24 629.6 27.4 149.6s 99% 83.2W
Additional batch validation scripts
| Script | Purpose |
|---|---|
Scripts/feature-mlx-concurrent-batch/batch_stress_mactop.py |
Quick stress test at arbitrary concurrency (client-only, needs running server on port 9876) |
Scripts/feature-mlx-concurrent-batch/batch_stress_ioreg.py |
Same but uses ioreg for GPU stats (less accurate) |
Scripts/feature-mlx-concurrent-batch/validate_responses.py |
Known-answer correctness at B={1,2,4,8} |
Scripts/feature-mlx-concurrent-batch/validate_mixed_workload.py |
Mixed short+long workload batch validation |
Scripts/feature-mlx-concurrent-batch/validate_multiturn_prefix.py |
Multi-turn prefix cache under concurrency |
Key File Reference
| File | Purpose |
|---|---|
Scripts/benchmarks/benchmark_afm_vs_mlxlm.py |
Full concurrency benchmark harness (server lifecycle, warmup, sweep, GPU metrics, chart generation) |
Scripts/test-assertions.sh |
Automated pass/fail assertion tests (unit/smoke/standard/full tiers, includes swift test) |
Scripts/test-llm-comprehensive.txt |
Comprehensive smart analysis test suite (model-generic, [@ label] template mode, has [all] baseline) |
Scripts/test-Qwen3.5-35B-A3B-4bit.txt |
Model-specific test suite for Qwen3.5-35B-A3B-4bit (same tests as comprehensive, hardcoded model) |
Scripts/test-edge-cases.txt |
Legacy smart analysis test prompts (smaller set) |
Scripts/test-sampling-params.sh |
Sampling parameter tests (seed, temp, top_p, etc.) |
Scripts/test-structured-outputs.sh |
JSON schema / structured output tests |
Scripts/test-tool-call-parsers.py |
Unit tests for tool call parsing |
Scripts/mlx-model-test.sh |
Test harness: runs prompts, collects results, generates reports |
Scripts/test-chat-template-kwargs.sh |
Standalone chat_template_kwargs tests (includes --no-think CLI + precedence) |
Scripts/regression-test.sh |
Quick regression smoke test |
Scripts/feature-codex-optimize-api/test-openai-compat-evals.py |
OpenAI-python SDK compatibility evals (non-stream, stream, logprobs, vllm bench) |
Scripts/feature-codex-optimize-api/test-guided-json-evals.py |
Guided JSON / structured output evals (API, streaming, CLI, SDK parse, edge cases) |
Scripts/feature-mlx-concurrent-batch/validate_responses.py |
Batched generation correctness: known-answer questions at B={1,2,4,8} |
Scripts/feature-mlx-concurrent-batch/validate_mixed_workload.py |
Mixed short+long workload batch validation with GPU metrics |
Scripts/feature-mlx-concurrent-batch/validate_multiturn_prefix.py |
Multi-turn prefix cache validation under concurrency |
Scripts/gpu-profile-report.py |
Full GPU shader profiling harness: mactop BW + --gpu-profile + --gpu-trace + HTML report |
Scripts/gpu-profile.sh |
GPU profiling helpers: bandwidth monitor, capture, trace, power |
Scripts/create-shader-template.py |
One-time: patches Metal System Trace template for per-kernel shader names |
Tests/MacLocalAPITests/StreamingUsageChunkTests.swift |
Unit tests: streaming usage chunks, finish reasons, Foundation commonPrefixLength |
Tests/MacLocalAPITests/ConcurrentBatchTests.swift |
Unit tests: RequestSlot, StreamChunk, BatchScheduler internals |
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh |
Promptfoo agentic eval orchestrator: 11 modes, 8 server profiles, 16 configs |
Scripts/feature-promptfoo-agentic/providers/afm_provider.mjs |
Custom promptfoo provider: api + cli-guided-json transports, 4 extract modes |
Scripts/feature-promptfoo-agentic/judges/assert-grammar-header.mjs |
Custom assertion: validates X-Grammar-Constraints response header |
Scripts/feature-promptfoo-agentic/judges/classify-failures.mjs |
AI-based failure classifier: afm_bug vs model_quality vs harness_bug |
Scripts/feature-promptfoo-agentic/promptfooconfig.*.yaml |
16 promptfoo config files (~137 test cases total) |
Scripts/feature-promptfoo-agentic/datasets/ |
16 YAML dataset files across structured, toolcall, grammar, agentic directories |
Validation Checklist
Smoke Tier
- Server reachable, model loaded
- Basic completion returns content
- Stop sequences work (absent from output, correct finish_reason)
- Logprobs schema valid
- Think extraction works (if model supports it)
- Basic tool call works
- Error handling (empty messages, malformed JSON)
Standard Tier (adds)
- All smoke checks
- Streaming stop sequence parity
- Streaming logprobs
- Prompt cache: cached_tokens=0 first, >0 second
- Concurrent requests (2 and 3 simultaneous)
- Multi-tool calls
- Additional stop edge cases
- chat_template_kwargs:
enable_thinking=falsedisables thinking (if model supports it) - chat_template_kwargs: streaming parity
- chat_template_kwargs: default behavior unaffected
- OpenAI-compat evals:
test-openai-compat-evals.py(non-stream, stream, logprobs, usage chunk) - Guided JSON evals:
test-guided-json-evals.py(API schema, streaming schema, SDK parse)
Full Tier (adds)
- All standard checks
- Performance: TTFT < 5s, tok/s > 1
- Long context (2K, 4K tokens) no crash/NaN
- Smart analysis: test-llm-comprehensive.txt (AI judge opt-in — default off, ask the user; enable with
--smart 1:claude/--smart 1:codex) - Streaming parity (assembled content matches non-streaming)
- Cache timing improvement visible
- Batch correctness:
validate_responses.pyat B={1,2,4,8} - Batch mixed workload:
validate_mixed_workload.py(short+long decode, GPU metrics) - Batch prefix cache:
validate_multiturn_prefix.py(multi-turn conversations under concurrency) - GPU shader profile:
gpu-profile-report.py(bandwidth, power, kernel names, HTML report) - API profile:
X-AFM-Profile: truereturnsafm_profilewith GPU power + bandwidth - API profile:
X-AFM-Profile: extendedreturnsafm_profile_extendedwith samples array - API profile: no header → no
afm_profilefields in response (no null pollution) - API profile: streaming → profile SSE event before
[DONE] - API profile: concurrent profiled requests → second skips gracefully
- Promptfoo structured: 100% pass (json_schema + stress)
- Promptfoo toolcall: 100% pass (all 3 profiles)
- Promptfoo grammar-constraints (non-concurrent): 100% pass
- Promptfoo frameworks: 100% pass (all 3 profiles)
- Promptfoo opencode/pi/openclaw/hermes: >70% pass (model quality variance expected)
- Promptfoo grammar-header: downgrade/enforce headers correct