name: coverage-kaizen
description: "Systematic coverage gap analysis and test writing for the APR Model QA Playbook. Uses pmat query --coverage-gaps for highest-ROI targets, provides test patterns for each crate (proptest for gen, Evidence assertions for runner, MQS scoring for report), handles CommandRunner trait extensions, clippy pedantic/nursery landmines, and ExecutionConfig construction. Target: 95% library coverage."
disable-model-invocation: false
user-invocable: true
allowed-tools: "Read, Grep, Glob, Bash"
argument-hint: "target: crate name (apr-qa-gen, apr-qa-runner, apr-qa-report, apr-qa-certify), function name, or coverage goal (e.g., 96%)"
Coverage Kaizen
Continuous improvement workflow for maintaining >= 95% library test coverage across the APR Model QA Playbook workspace.
Quick Start
Find Coverage Gaps
# Top coverage gaps ranked by ROI (MANDATORY: always use pmat query)
pmat query --coverage-gaps --rank-by impact --limit 20 --exclude-tests
# Coverage gaps for a specific crate
pmat query --coverage-gaps --limit 30 --exclude-file "tests" | grep "apr-qa-runner"
# Current coverage percentage
cargo llvm-cov --workspace --lib 2>&1 | grep "^TOTAL"
Verify Compliance
# PMAT compliance check (>= 95%)
make coverage-check
# Or manually
./scripts/coverage-check.sh
# Full HTML coverage report
make coverage
# Opens: target/llvm-cov/html/index.html
Coverage Commands
| Command | What It Does |
|---|---|
| `make coverage` | HTML report (library code only) |
| `make coverage-summary` | Terminal summary |
| `make coverage-check` | Verify >= 95% threshold |
| `cargo llvm-cov --workspace --lib` | Raw coverage data |
| `cargo llvm-cov --workspace --lib --html` | HTML with source annotation |
Never use cargo tarpaulin. It's slow, unreliable, and causes hangs.
Kaizen Workflow
Step 1: Identify Targets
pmat query --coverage-gaps --rank-by impact --limit 20 --exclude-tests
Pick the function with the highest impact_score first (best ROI per test written).
Step 2: Read the Function
# Use pmat query to read source (NOT cat/Read)
pmat query "function_name" --include-source --limit 1
Step 3: Write Tests
Follow the crate-specific patterns below. Key rules:
- Tests follow Popperian falsification (design to fail, not to pass)
- Use `Evidence::corroborated()` (4 args) and `Evidence::falsified()` (5 args)
- Use `..Default::default()` for `ExecutionConfig` in tests
- Allow `clippy::unwrap_used` and `clippy::expect_used` in test code (already `cfg_attr`-allowed)
Step 4: Verify Improvement
# Re-check coverage
pmat query --coverage-gaps --rank-by impact --limit 20 --exclude-tests
# Verify threshold
make coverage-check
Step 5: Run Full Gate
make check # fmt-check + lint + test + docs-check
Crate-Specific Test Patterns
apr-qa-gen (Scenario Generation)
Key types: QaScenario, Oracle, OracleResult, ModelId, Modality, Backend, Format
Pattern: Proptest strategies
use proptest::prelude::*;
use crate::proptest_impl::*;
proptest! {
    #[test]
    fn scenario_always_has_valid_id(scenario in scenario_strategy()) {
        prop_assert!(!scenario.id.is_empty());
        prop_assert!(scenario.id.contains('_'));
    }
}
Pattern: Oracle evaluation
#[test]
fn arithmetic_oracle_correct_addition() {
    let oracle = ArithmeticOracle::new();
    let result = oracle.evaluate("What is 3+4?", "The answer is 7.");
    assert!(matches!(result, OracleResult::Corroborated { .. }));
}

#[test]
fn garbage_oracle_detects_repetition() {
    let oracle = GarbageOracle::new();
    let result = oracle.evaluate("test", "abcabcabcabcabcabcabcabcabc");
    assert!(matches!(result, OracleResult::Falsified { .. }));
}
Pattern: Oracle selection
#[test]
fn selects_arithmetic_for_math_prompt() {
    let oracle = select_oracle("What is 5+3?");
    assert_eq!(oracle.name(), "arithmetic");
}
Available proptest strategies:
- `model_id_strategy()` - Random model IDs from supported families
- `modality_strategy()` - Run/Chat/Serve
- `backend_strategy()` - Cpu/Gpu
- `format_strategy()` - Gguf/SafeTensors/Apr
- `arithmetic_prompt_strategy()` - Verifiable math prompts
- `code_prompt_strategy()` - Code completion prompts
- `edge_case_prompt_strategy()` - Empty, unicode, XSS, SQL injection
- `any_prompt_strategy()` - Weighted combination
- `scenario_strategy()` - Complete random scenarios
- `temperature_strategy()` - 0.0, 0.7, 1.0, or random
- `max_tokens_strategy()` - 1, 32, 128, 512, 2048, or random
apr-qa-runner (Execution Engine)
Key types: Evidence, Outcome, EvidenceCollector, Executor, ExecutionConfig, CommandRunner
Pattern: Evidence construction
use crate::evidence::{Evidence, Outcome};
use apr_qa_gen::scenario::QaScenario;
fn test_scenario() -> QaScenario {
    QaScenario::new(
        ModelId::new("test/model"),
        Modality::Run,
        Backend::Cpu,
        Format::Gguf,
        "What is 2+2?".to_string(),
        42,
    )
}

#[test]
fn corroborated_evidence_is_pass() {
    let e = Evidence::corroborated("F-QUAL-001", test_scenario(), "output", 100);
    assert!(e.outcome.is_pass());
    assert_eq!(e.reason, "Test passed");
    assert_eq!(e.exit_code, Some(0));
}

#[test]
fn falsified_evidence_is_fail() {
    let e = Evidence::falsified("F-QUAL-001", test_scenario(), "bad output", "", 100);
    assert!(e.outcome.is_fail());
}
Pattern: EvidenceCollector
#[test]
fn collector_counts_outcomes() {
    let mut collector = EvidenceCollector::new();
    collector.add(Evidence::corroborated("F-001", test_scenario(), "", 0));
    collector.add(Evidence::falsified("F-002", test_scenario(), "fail", "", 0));
    assert_eq!(collector.pass_count(), 1);
    assert_eq!(collector.fail_count(), 1);
    assert_eq!(collector.total(), 2);
}
Pattern: ExecutionConfig construction in tests
#[test]
fn executor_respects_timeout() {
    let config = ExecutionConfig {
        default_timeout_ms: 5000,
        dry_run: true,
        ..Default::default()
    };
    let mut executor = Executor::with_config(config);
    // ...
}
Pattern: Custom CommandRunner for testing
When you need to test executor behavior with controlled subprocess responses:
use std::path::Path;
use std::sync::Arc;

struct MyTestRunner;

impl CommandRunner for MyTestRunner {
    // Canned inference response; the arguments are ignored, hence the `_` prefixes.
    fn run_inference(&self, _model_path: &Path, _prompt: &str,
                     _max_tokens: u32, _no_gpu: bool, _extra_args: &[&str]) -> CommandOutput {
        CommandOutput::success("The answer is 4.\nCompleted in 100ms")
    }
    fn convert_model(&self, _source: &Path, _target: &Path) -> CommandOutput {
        CommandOutput::success("")
    }
    // MUST implement ALL 28 methods - see CommandRunner Trait section below
    // Most can stub with CommandOutput::success("")
    fn inspect_model(&self, _: &Path) -> CommandOutput { CommandOutput::success("") }
    fn validate_model(&self, _: &Path) -> CommandOutput { CommandOutput::success("") }
    // ... (all 28 methods)
}

#[test]
fn test_with_custom_runner() {
    let config = ExecutionConfig::default();
    let runner = Arc::new(MyTestRunner);
    let mut executor = Executor::with_runner(config, runner);
    // ...
}
apr-qa-report (Scoring & Reports)
Key types: MqsScore, MqsCalculator, GatewayResult, CategoryScores
Pattern: MQS scoring
use crate::mqs::MqsCalculator;
use apr_qa_runner::evidence::{Evidence, EvidenceCollector};
#[test]
fn perfect_score_all_corroborated() {
    let mut collector = EvidenceCollector::new();
    collector.add(Evidence::corroborated("F-QUAL-001", scenario(), "ok", 100));
    collector.add(Evidence::corroborated("F-PERF-001", scenario(), "ok", 100));
    let score = MqsCalculator::calculate("test/model", collector.all());
    assert!(score.gateways_passed);
    assert!(score.raw_score > 0);
}

#[test]
fn gateway_failure_zeroes_score() {
    let mut collector = EvidenceCollector::new();
    collector.add(Evidence::crashed("G1-LOAD-001", scenario(), "segfault", 139, 100));
    let score = MqsCalculator::calculate("test/model", collector.all());
    assert!(!score.gateways_passed);
    assert_eq!(score.raw_score, 0);
}
Pattern: JUnit report generation
#[test]
fn junit_report_valid_xml() {
    let collector = build_test_collector();
    let xml = junit::generate_report("test/model", collector.all());
    assert!(xml.starts_with("<?xml"));
    assert!(xml.contains("<testsuites"));
}
Pattern: Grade assertions
Be careful with float comparison - clippy float_cmp is strict. Use ranges:
// WRONG (clippy::float_cmp)
assert_eq!(score.normalized_score, 95.0);
// CORRECT
assert!(score.normalized_score >= 90.0);
assert!(score.normalized_score <= 100.0);
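
When a test needs to pin a float to a specific value rather than a band, a tolerance check avoids `float_cmp` as well. The following is a minimal sketch: `approx_eq` is a hypothetical helper (not something the workspace is known to provide), and `normalized_score` is a local stand-in for `score.normalized_score`.

```rust
/// Hypothetical helper: true when `actual` is within `tolerance` of `expected`.
fn approx_eq(actual: f64, expected: f64, tolerance: f64) -> bool {
    (actual - expected).abs() <= tolerance
}

#[test]
fn normalized_score_is_in_expected_band() {
    // Stand-in for `score.normalized_score` from a real MQS calculation.
    let normalized_score = 95.0_f64;

    // Band check, equivalent to the two range assertions above.
    assert!((90.0..=100.0).contains(&normalized_score));

    // Point check with an explicit tolerance instead of `assert_eq!`.
    assert!(approx_eq(normalized_score, 95.0, 1e-9));
}
```

The tolerance is a judgment call per test; pick one that reflects how the score is computed rather than reusing a single constant everywhere.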
apr-qa-certify (Certification Tracking)
Key types: ModelCertification, CertificationStatus, SizeCategory
Pattern: CSV parsing
#[test]
fn parse_csv_round_trip() {
    let models = vec![ModelCertification { /* ... */ }];
    let csv = write_csv(&models);
    let parsed = parse_csv(&csv).unwrap();
    assert_eq!(parsed.len(), models.len());
    assert_eq!(parsed[0].model_id, models[0].model_id);
}
Pattern: README table generation
#[test]
fn generated_table_has_headers() {
    let models = vec![sample_model()];
    let table = generate_table(&models);
    assert!(table.contains("| Model |"));
    assert!(table.contains("| Status |"));
}
CommandRunner Trait (28 Methods)
When implementing a custom CommandRunner for tests, you MUST implement all 28 methods. There are currently 4 custom implementations in executor.rs tests that serve as references.
Complete method list:
| # | Method | Signature |
|---|---|---|
| 1 | `run_inference` | `(&self, model: &Path, prompt: &str, max_tokens: u32, no_gpu: bool, extra_args: &[&str]) -> CommandOutput` |
| 2 | `convert_model` | `(&self, source: &Path, target: &Path) -> CommandOutput` |
| 3 | `inspect_model` | `(&self, model: &Path) -> CommandOutput` |
| 4 | `validate_model` | `(&self, model: &Path) -> CommandOutput` |
| 5 | `bench_model` | `(&self, model: &Path) -> CommandOutput` |
| 6 | `check_model` | `(&self, model: &Path) -> CommandOutput` |
| 7 | `profile_model` | `(&self, model: &Path, warmup: u32, measure: u32) -> CommandOutput` |
| 8 | `profile_ci` | `(&self, model: &Path, min_throughput: Option<f64>, max_p99: Option<f64>, warmup: u32, measure: u32) -> CommandOutput` |
| 9 | `diff_tensors` | `(&self, model_a: &Path, model_b: &Path, json: bool) -> CommandOutput` |
| 10 | `compare_inference` | `(&self, model_a: &Path, model_b: &Path, prompt: &str, max_tokens: u32, tolerance: f64) -> CommandOutput` |
| 11 | `profile_with_flamegraph` | `(&self, model: &Path, output: &Path, no_gpu: bool) -> CommandOutput` |
| 12 | `profile_with_focus` | `(&self, model: &Path, focus: &str, no_gpu: bool) -> CommandOutput` |
| 13 | `validate_model_strict` | `(&self, model: &Path) -> CommandOutput` |
| 14 | `fingerprint_model` | `(&self, model: &Path, json: bool) -> CommandOutput` |
| 15 | `validate_stats` | `(&self, fp_a: &Path, fp_b: &Path) -> CommandOutput` |
| 16 | `pull_model` | `(&self, hf_repo: &str) -> CommandOutput` |
| 17 | `inspect_model_json` | `(&self, model: &Path) -> CommandOutput` |
| 18 | `run_ollama_inference` | `(&self, model_tag: &str, prompt: &str, temperature: f64) -> CommandOutput` |
| 19 | `pull_ollama_model` | `(&self, model_tag: &str) -> CommandOutput` |
| 20 | `create_ollama_model` | `(&self, model_tag: &str, modelfile: &Path) -> CommandOutput` |
| 21 | `serve_model` | `(&self, model: &Path, port: u16) -> CommandOutput` |
| 22 | `http_get` | `(&self, url: &str) -> CommandOutput` |
| 23 | `profile_memory` | `(&self, model: &Path) -> CommandOutput` |
| 24 | `run_chat` | `(&self, model: &Path, prompt: &str, no_gpu: bool, extra_args: &[&str]) -> CommandOutput` |
| 25 | `http_post` | `(&self, url: &str, body: &str) -> CommandOutput` |
| 26 | `spawn_serve` | `(&self, model: &Path, port: u16, no_gpu: bool) -> CommandOutput` |
Stub template for most methods:
fn method_name(&self, /* args */) -> CommandOutput {
    CommandOutput::success("")
}
When adding a new method to the trait: You must update ALL 4 custom implementations in executor.rs tests plus the MockCommandRunner in command.rs.
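
Writing the same stub body for every method is repetitive, so one option is a small `macro_rules!` helper that expands a list of method names and argument types into success stubs. This is only a sketch under stated assumptions: the `CommandOutput` struct and the three-method `CommandRunner` trait below are miniature stand-ins for the real ones, and `stub_success!` and `StubRunner` are hypothetical names that do not exist in the workspace.

```rust
use std::path::Path;

/// Stand-in for the real `CommandOutput`; the field layout is an assumption.
#[derive(Debug, Clone, PartialEq)]
struct CommandOutput {
    stdout: String,
    success: bool,
}

impl CommandOutput {
    fn success(stdout: impl Into<String>) -> Self {
        Self { stdout: stdout.into(), success: true }
    }
}

/// Three-method miniature of the trait (signatures taken from the table above).
trait CommandRunner {
    fn inspect_model(&self, model: &Path) -> CommandOutput;
    fn validate_model(&self, model: &Path) -> CommandOutput;
    fn pull_model(&self, hf_repo: &str) -> CommandOutput;
}

/// Hypothetical helper: expands `name(ArgType, ...)` entries into methods
/// that ignore their arguments and return `CommandOutput::success("")`.
macro_rules! stub_success {
    ($($name:ident($($arg:ty),*));* $(;)?) => {
        $(fn $name(&self, $(_: $arg),*) -> CommandOutput {
            CommandOutput::success("")
        })*
    };
}

struct StubRunner;

impl CommandRunner for StubRunner {
    // The one method this test runner cares about gets a real body.
    fn inspect_model(&self, _model: &Path) -> CommandOutput {
        CommandOutput::success("format: gguf")
    }

    // Everything else is stubbed in one place.
    stub_success! {
        validate_model(&Path);
        pull_model(&str);
    }
}
```

With this shape, adding a method to the trait means adding one line per test runner instead of a full function body. Whether that is worth a macro in the real test module is a design choice for the maintainers.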
Clippy Landmine Reference
The workspace uses clippy::pedantic + clippy::nursery + strict custom rules. These are the lints that most commonly trip up new test code.
Workspace-Level Denials
| Lint | Level | Impact |
|---|---|---|
| `unsafe_code` | deny | No unsafe anywhere, `#![forbid(unsafe_code)]` in lib.rs files |
| `unwrap_used` | deny | No `.unwrap()` in library code (allowed in tests via `cfg_attr`) |
| `panic` | deny | No `panic!()` in library code |
| `expect_used` | warn | Prefer `map_err` / `?` over `.expect()` |
Common Pedantic/Nursery Traps
| Lint | What Triggers It | Fix |
|---|---|---|
| `float_cmp` | `assert_eq!(f64, f64)` | Use a range: `assert!((0.9..=1.0).contains(&x))` (a chained `x >= 0.9 && x <= 1.0` can trip `manual_range_contains`) |
| `option_if_let_else` | `if let Some(x) = opt { a } else { b }` | Use `opt.map_or(b, \|x\| a)` |
| `manual_let_else` | `let x = match opt { Some(v) => v, None => return }` | Use `let Some(x) = opt else { return };` |
| `doc_link_with_quotes` | `/// See ["quoted"]` in doc comments | Wrap in backticks: `"quoted"` |
| `or_fun_call` | `.unwrap_or(String::new())` | Use `.unwrap_or_default()` |
| `cast_precision_loss` | `x as f64` when x is u64 | Already `#![allow]`'d in most crates |
| `cast_possible_truncation` | `x as u32` when x is u64 | Already `#![allow]`'d in runner |
| `missing_const_for_fn` | Pure function without `const` | Already `#![allow]`'d in all crates |
| `struct_excessive_bools` | Struct with many bool fields | Already `#![allow]`'d on ExecutionConfig |
| `too_many_lines` | Function > 100 lines | Add `#[allow(clippy::too_many_lines)]` |
| `too_many_arguments` | Function with > 7 args | Add `#[allow(clippy::too_many_arguments)]` |
| `needless_pass_by_value` | `fn f(s: String)` when `&str` works | Already `#![allow]`'d in most crates |
| `doc_markdown` | Unlinked type names in docs (HuggingFace) | Already `#![allow]`'d in most crates |
| `suboptimal_flops` | `a * b + c` instead of `a.mul_add(b, c)` | Already `#![allow]`'d in report |
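
The three rewrite-style fixes in the table (`option_if_let_else`, `manual_let_else`, `or_fun_call`) are easier to recognise side by side. The functions below are hypothetical illustrations written for this section, not code from the workspace; each comment shows the form that would trigger the lint and the body shows the accepted form.

```rust
/// option_if_let_else
/// Triggers: `if let Some(s) = label { s.len() } else { 0 }`
fn label_len(label: Option<&str>) -> usize {
    // A method path is used instead of `|s| s.len()` so that
    // `redundant_closure_for_method_calls` stays quiet too.
    label.map_or(0, str::len)
}

/// manual_let_else
/// Triggers: `let text = match line { Some(t) => t, None => return "" };`
fn first_word(line: Option<&str>) -> &str {
    let Some(text) = line else {
        return "";
    };
    text.split_whitespace().next().unwrap_or("")
}

/// or_fun_call
/// Triggers: `name.unwrap_or(String::new())`
fn name_or_empty(name: Option<String>) -> String {
    name.unwrap_or_default()
}

#[test]
fn rewrites_preserve_behaviour() {
    assert_eq!(label_len(Some("gguf")), 4);
    assert_eq!(first_word(Some("hello world")), "hello");
    assert_eq!(name_or_empty(None), "");
}
```

`let ... else` requires Rust 1.65 or newer.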
Test-Specific Allowances
These are already allowed in #[cfg(test)] blocks via cfg_attr:
- `clippy::unwrap_used` - OK to unwrap in tests
- `clippy::expect_used` - OK to expect in tests
- `clippy::redundant_closure_for_method_calls`
- `clippy::redundant_clone`
- `clippy::float_cmp` (only in apr-qa-report)
- `clippy::uninlined_format_args` (only in apr-qa-runner)
- `clippy::cast_sign_loss` (only in apr-qa-runner)
Per-Crate Allow Lists
Each crate has specific #![allow(...)] in its lib.rs. Check the relevant lib.rs before writing tests to know which lints are pre-allowed.
Most restrictive: apr-qa-certify (almost no allows)
Most lenient: apr-qa-report (20+ allows for scoring math)
Evidence Constructor Cheat Sheet
corroborated(gate_id, scenario, output, duration_ms)         → 4 args, reason="Test passed"
falsified(gate_id, scenario, reason, output, duration_ms)    → 5 args
timeout(gate_id, scenario, timeout_ms)                       → 3 args
crashed(gate_id, scenario, stderr, exit_code, duration_ms)   → 5 args
skipped(gate_id, scenario, reason)                           → 3 args
Key facts:
- `Evidence.output` is `String`, NOT `Option<String>`
- `Evidence.stderr` is `Option<String>` (only `Some` for `Crashed`)
- `Evidence.exit_code` is `Option<i32>` (`Some(0)` for corroborated, `None` for falsified)
- Constructor uses `impl Into<String>` so both `&str` and `String` work
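
The key facts above can be sketched with a self-contained stand-in. Everything here is illustrative: `Scenario` is a hypothetical replacement for `QaScenario`, the `Evidence` struct is a reduced model of the real one, and the `u64` type for `duration_ms` is an assumption. Only the argument order, the `"Test passed"` reason, and the `Option` behaviour come from the cheat sheet.

```rust
/// Hypothetical stand-in for `QaScenario`; the real type has more fields.
#[derive(Debug, Clone, PartialEq)]
struct Scenario {
    prompt: String,
}

impl Scenario {
    fn new(prompt: impl Into<String>) -> Self {
        Self { prompt: prompt.into() }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Corroborated,
    Falsified,
}

/// Reduced model of `Evidence`, covering two of the five constructors.
#[derive(Debug, Clone, PartialEq)]
struct Evidence {
    gate_id: String,
    scenario: Scenario,
    outcome: Outcome,
    reason: String,
    output: String,            // always present, never an Option
    stderr: Option<String>,    // only Some for crashes
    exit_code: Option<i32>,
    duration_ms: u64,          // assumed integer type
}

impl Evidence {
    /// 4 args: the reason is fixed to "Test passed".
    fn corroborated(
        gate_id: impl Into<String>,
        scenario: Scenario,
        output: impl Into<String>,
        duration_ms: u64,
    ) -> Self {
        Self {
            gate_id: gate_id.into(),
            scenario,
            outcome: Outcome::Corroborated,
            reason: "Test passed".to_string(),
            output: output.into(),
            stderr: None,
            exit_code: Some(0),
            duration_ms,
        }
    }

    /// 5 args: note that `reason` comes before `output`.
    fn falsified(
        gate_id: impl Into<String>,
        scenario: Scenario,
        reason: impl Into<String>,
        output: impl Into<String>,
        duration_ms: u64,
    ) -> Self {
        Self {
            gate_id: gate_id.into(),
            scenario,
            outcome: Outcome::Falsified,
            reason: reason.into(),
            output: output.into(),
            stderr: None,
            exit_code: None,
            duration_ms,
        }
    }
}
```

Because the parameters are `impl Into<String>`, a call can mix `&str` literals and owned `String` values without conversions at the call site. The easiest mistake is swapping `reason` and `output` in `falsified`, since both accept any string.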
Gate ID Conventions
Gate IDs map to MQS categories via prefix:
| Prefix | Category | Max Points |
|---|---|---|
| `F-QUAL-*` | Quality | 200 |
| `F-PERF-*` | Performance | 150 |
| `F-STAB-*` | Stability | 200 |
| `F-COMP-*` | Compatibility | 150 |
| `F-EDGE-*` | Edge Cases | 150 |
| `F-REGR-*` | Regression | 150 |
| `F-CONV-*` | Compatibility (conversion) | 150 |
| `F-CONV-RT*` | Regression (round-trip) | 150 |
| `F-CONTRACT-*` | Compatibility (contract) | 150 |
| `G0-*` | Stability (integrity) | 200 |
| `G1-*` through `G4-*` | Gateway (zeroes all) | - |
When writing tests, use F-{CATEGORY}-{NNN} format for gate IDs to ensure correct MQS category scoring.
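
One subtlety in the prefix table is that `F-CONV-RT*` overlaps with `F-CONV-*` but maps to a different category, so a prefix matcher has to test the longer prefix first. The sketch below illustrates that ordering. It is not the actual MQS implementation: the `Category` enum, `category_for`, and `max_points` are hypothetical names, and only the prefixes and point values come from the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Category {
    Quality,
    Performance,
    Stability,
    Compatibility,
    EdgeCases,
    Regression,
    Gateway,
}

/// Ordered prefix table; `F-CONV-RT` must come before `F-CONV-`.
const PREFIXES: &[(&str, Category)] = &[
    ("F-QUAL-", Category::Quality),
    ("F-PERF-", Category::Performance),
    ("F-STAB-", Category::Stability),
    ("F-COMP-", Category::Compatibility),
    ("F-EDGE-", Category::EdgeCases),
    ("F-REGR-", Category::Regression),
    ("F-CONV-RT", Category::Regression),
    ("F-CONV-", Category::Compatibility),
    ("F-CONTRACT-", Category::Compatibility),
    ("G0-", Category::Stability),
    ("G1-", Category::Gateway),
    ("G2-", Category::Gateway),
    ("G3-", Category::Gateway),
    ("G4-", Category::Gateway),
];

/// Returns the category of the first matching prefix, if any.
fn category_for(gate_id: &str) -> Option<Category> {
    for (prefix, category) in PREFIXES {
        if gate_id.starts_with(*prefix) {
            return Some(*category);
        }
    }
    None
}

/// Max points per category as listed in the table; gateways carry none.
fn max_points(category: Category) -> Option<u32> {
    match category {
        Category::Quality | Category::Stability => Some(200),
        Category::Performance
        | Category::Compatibility
        | Category::EdgeCases
        | Category::Regression => Some(150),
        Category::Gateway => None,
    }
}
```

A gate ID that matches no prefix returns `None` here; how the real calculator treats unknown prefixes is not stated in this document.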
Common Test Count Gotcha
When adding new test phases to the executor's execute() method (like contract tests, parity tests):
- Existing tests that assert `total_scenarios` counts will break because the executor now runs more tests
- Fix: Update the expected counts in affected tests
- Prevention: Search for `total_scenarios` assertions before adding phases: `pmat query --literal "total_scenarios" --exclude-tests --limit 10`
ExecutionConfig Construction
In library code (apr-qa-cli/src/lib.rs): Constructed explicitly field-by-field, NO ..Default::default(). When adding a new field, you must update both construction sites in lib.rs.
In test code: Use ..Default::default():
let config = ExecutionConfig {
    dry_run: true,
    default_timeout_ms: 5000,
    ..Default::default()
};
Current field count: 21 fields. Check executor.rs line ~81 for the latest.
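
The difference between the two construction styles is what happens when a field is added. The sketch below uses a three-field stand-in for `ExecutionConfig`: `default_timeout_ms` and `dry_run` appear in this document, while `verbose`, the default values, and the two helper functions are assumptions made for illustration.

```rust
/// Three-field stand-in for the real 21-field `ExecutionConfig`.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionConfig {
    default_timeout_ms: u64,
    dry_run: bool,
    verbose: bool, // hypothetical field
}

impl Default for ExecutionConfig {
    fn default() -> Self {
        // Default values here are assumptions, not the workspace's real defaults.
        Self { default_timeout_ms: 60_000, dry_run: false, verbose: false }
    }
}

/// Test style: override what the test cares about, inherit the rest.
/// Adding a field to the struct does not break this function.
fn dry_run_config() -> ExecutionConfig {
    ExecutionConfig {
        dry_run: true,
        default_timeout_ms: 5000,
        ..Default::default()
    }
}

/// Library style: every field is spelled out.
/// Adding a field makes this fail to compile until it is updated,
/// which is why each construction site has to be touched.
fn explicit_config() -> ExecutionConfig {
    ExecutionConfig {
        default_timeout_ms: 60_000,
        dry_run: false,
        verbose: false,
    }
}
```

The compile error in the explicit style is the point: it forces a decision about every new field in production code, while tests stay insulated from unrelated additions.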
See Also
References
- test-patterns.md - Extended test pattern cookbook
- clippy-landmines.md - Full clippy configuration reference
Commands
make test # Run all tests
make lint # Clippy with zero warnings
make check # Full gate: fmt + lint + test + docs
make coverage # HTML coverage report
make coverage-check # Verify >= 95%
Key Files
| File | Purpose |
|---|---|
| `Cargo.toml` (root) | Workspace lint configuration |
| `crates/*/src/lib.rs` | Per-crate allow lists |
| `crates/apr-qa-runner/src/command.rs` | CommandRunner trait (28 methods) |
| `crates/apr-qa-runner/src/executor.rs` | ExecutionConfig + 4 test runners |
| `crates/apr-qa-runner/src/evidence.rs` | Evidence constructors |
| `scripts/coverage-check.sh` | 95% threshold check |