scenario-test - SKILL.md Agent Skill

name: scenario-test
description: Use when you need to validate code changes in Azure before committing, especially for infrastructure, auth, job orchestration, or integration points that cannot be tested locally

AI-Driven Scenario Testing

Analyze code changes, identify which user flows exercise those code paths, generate minimal test scripts, and execute end-to-end scenario tests against production infrastructure with isolated test data.

Core principle: Test code changes with realistic user scenarios at minimal cost by reusing existing prod infrastructure and isolating test data through naming conventions.

When to Use

Infrastructure changes (Bicep, env vars, managed identity)
Auth/credential changes that only work inside Azure
Job orchestration logic (controller, failover)
Integration points (LLM, email, blob storage)
Any change that cannot be validated with local unit tests

When NOT to Use

Pure unit-testable changes — Use pytest tests/unit instead
Documentation-only changes — README, comments, docs/
Frontend-only changes — Use local dev server, Playwright
Changes already covered by CI — If pushing to trigger CI anyway
Local config changes — .env, local scripts

Invocation

Explicit: /scenario-test
Natural: "test my changes", "verify this fix", "run scenario test"
With scope: /scenario-test HEAD~2..HEAD or /scenario-test "fixed managed identity auth"
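
How the optional scope argument is interpreted is not spelled out here. The following is a minimal sketch, assuming that an argument containing `..` and no spaces is a git revision range and anything else is a free-text description. `Scope` and `parse_scope` are hypothetical names, not part of the skill, and the default range simply mirrors the `HEAD~1..HEAD` diff used in Phase 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """Hypothetical container for the parsed /scenario-test argument."""
    revision_range: str           # range to hand to `git diff`
    description: Optional[str]    # free-text hint from the user, if any


def parse_scope(arg: Optional[str], default_range: str = "HEAD~1..HEAD") -> Scope:
    """Classify the argument as a git range or a description (assumed heuristic)."""
    if arg is None or not arg.strip():
        return Scope(default_range, None)
    text = arg.strip().strip('"')
    # Assumption: a revision range contains ".." and no whitespace.
    if ".." in text and " " not in text:
        return Scope(text, None)
    return Scope(default_range, text)
```

With this heuristic, `/scenario-test HEAD~2..HEAD` narrows the diff, while `/scenario-test "fixed managed identity auth"` keeps the default diff and carries the description along as context.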

The Process

digraph scenario_test {
    rankdir=TB;

    "Analyze code changes" [shape=box];
    "Trace affected code paths" [shape=box];
    "Identify minimal user flow" [shape=box];
    "Determine success criteria" [shape=box];
    "Generate test script" [shape=box];
    "CHECKPOINT: Show script, get approval" [shape=diamond style=filled fillcolor=yellow];
    "User approves?" [shape=diamond];
    "Execute setup phase" [shape=box];
    "CHECKPOINT: Ready to trigger?" [shape=diamond style=filled fillcolor=yellow];
    "Execute and poll for completion" [shape=box];
    "Verify success criteria" [shape=box];
    "Report PASS/FAIL" [shape=box];
    "Cleanup test data (always)" [shape=box style=filled fillcolor=lightgreen];
    "Abort" [shape=box];

    "Analyze code changes" -> "Trace affected code paths";
    "Trace affected code paths" -> "Identify minimal user flow";
    "Identify minimal user flow" -> "Determine success criteria";
    "Determine success criteria" -> "Generate test script";
    "Generate test script" -> "CHECKPOINT: Show script, get approval";
    "CHECKPOINT: Show script, get approval" -> "User approves?";
    "User approves?" -> "Execute setup phase" [label="yes"];
    "User approves?" -> "Abort" [label="no"];
    "Execute setup phase" -> "CHECKPOINT: Ready to trigger?";
    "CHECKPOINT: Ready to trigger?" -> "Execute and poll for completion" [label="yes"];
    "CHECKPOINT: Ready to trigger?" -> "Cleanup test data (always)" [label="abort"];
    "Execute and poll for completion" -> "Verify success criteria";
    "Verify success criteria" -> "Report PASS/FAIL";
    "Report PASS/FAIL" -> "Cleanup test data (always)";
}

Phase 1: Analysis

Step 1: Identify Changed Components

Read the git diff to understand what changed:

git diff HEAD~1..HEAD --name-only
git diff HEAD~1..HEAD

Map changed files to components:

src/providers/*.py → Provider classes
src/jobs/*.py → Job logic (controller, processor, poller)
src/services/*.py → Business logic
src/api/*.py → API endpoints
infra/*.bicep → Infrastructure configuration
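
The mapping above can be applied mechanically to the output of `git diff --name-only`. This is a sketch only: the patterns are copied from the list, while `map_to_components` and the "Other" bucket for unmatched paths are illustrative additions.

```python
from fnmatch import fnmatch

# Patterns copied from the mapping above; the first matching pattern wins.
COMPONENT_PATTERNS = [
    ("src/providers/*.py", "Provider classes"),
    ("src/jobs/*.py", "Job logic (controller, processor, poller)"),
    ("src/services/*.py", "Business logic"),
    ("src/api/*.py", "API endpoints"),
    ("infra/*.bicep", "Infrastructure configuration"),
]


def map_to_components(changed_files):
    """Group changed file paths by component; unmatched paths go under 'Other'."""
    components = {}
    for path in changed_files:
        label = next(
            (name for pattern, name in COMPONENT_PATTERNS if fnmatch(path, pattern)),
            "Other",  # hypothetical bucket, not part of the original mapping
        )
        components.setdefault(label, []).append(path)
    return components
```

Paths that land in "Other" (for example README.md) are a hint that the change may fall under "When NOT to Use".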

Step 2: Trace Call Graph Upstream

For each changed component, trace who calls it:

Example: ContainerAppsProvider.start_job() changed
  ← called by: controller.py:trigger_processor_job()
  ← called by: controller.py:main()
  ← triggered by: Scheduled job OR manual trigger
  ← condition: pending_count >= 10 OR oldest_pending >= 7 days
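
The trigger condition in this example determines how much test data a scenario needs. Below is a sketch of that condition as a pure function. The thresholds come from the example, but the function name, constants, and signature are assumptions rather than the controller's actual code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

BATCH_SIZE_THRESHOLD = 10            # from "pending_count >= 10"
MAX_PENDING_AGE = timedelta(days=7)  # from "oldest_pending >= 7 days"


def should_trigger_processor(
    pending_count: int,
    oldest_pending_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True when either batching condition from the example is met."""
    if pending_count >= BATCH_SIZE_THRESHOLD:
        return True
    if oldest_pending_at is None:
        # No pending deliveries, so there is nothing to age out.
        return False
    return now - oldest_pending_at >= MAX_PENDING_AGE


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(should_trigger_processor(10, None, now))  # count threshold met
```

Under these assumptions, inserting 10 pending deliveries is the cheapest way to make the controller trigger the processor, which matches the `count=10` default of `create_test_deliveries`.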

Step 3: Trace Call Graph Downstream

What happens after the changed code runs:

Example: start_job() succeeds
  → processor job starts
  → claims deliveries, transcribes, summarizes, emails

Step 4: Identify Minimal Trigger Point

Determine what's needed to exercise the code path:

To exercise this code path, we need:
  - Pending deliveries in DB (to meet batch threshold)
  - Controller job to run

We do NOT need:
  - Real RSS feeds (deliveries inserted directly)
  - Real audio files (if testing auth, not processing)
  - Real email recipients (test user domain)

Step 5: Determine Success Criteria

Select appropriate criteria based on what's being tested:

Change Category	Success Criteria	Rationale
Auth/Identity	Job execution exists, no auth errors	Auth happens at job start
Job orchestration	Correct job triggered, execution exists	Testing job selection logic
Transcription	Transcript blob created, non-empty	Need actual processing
Summarization	Summary blob created, content reasonable	Need LLM output
Email delivery	Delivery marked `sent`, no errors	Full end-to-end
API endpoints	HTTP 200, correct response	Direct API test

Phase 2: Script Generation

Generate a Python test script using the helper library. Save to scripts/scenario_testing/generated/scenario_test_{description}_{uuid}.py.

Script Template

#!/usr/bin/env python3
"""Scenario test: {description}

Auto-generated by scenario-test skill
Generated: {timestamp}
Cleanup: Automatic on completion/error

Code change tested:
    {file}: {change_summary}

Success criteria:
    {criteria}
"""

import sys
import uuid

sys.path.insert(0, ".")

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from scripts.scenario_testing import (
    # Constants
    CONTROLLER_JOB,
    # Test data
    cleanup_test_data,
    create_test_channel,
    create_test_deliveries,
    create_test_subscription,
    create_test_user,
    # Azure ops
    deploy_infra,
    get_job_logs,
    trigger_job,
    wait_for_job_completion,
    # Verification
    check_logs_for_errors,
    job_execution_exists,
    # Reporting
    report_failure,
    report_success,
)
from src.config import get_settings

TEST_ID = "{scenario}-{uuid}"


def get_db_session():
    settings = get_settings()
    engine = create_engine(settings.database_url)
    Session = sessionmaker(bind=engine)
    return Session()


def setup(db):
    """Insert test data into prod DB."""
    # Generated based on scenario requirements
    pass


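def execute() -> tuple[bool, str]:
    """Trigger the flow and wait for completion.

    Returns (success, execution_name), which main() unpacks.
    """
    # Generated based on what needs to be triggered
    raise NotImplementedError("execute() is filled in per scenario")


def verify(execution_name: str) -> bool:
    """Check success criteria."""
    # Generated based on success criteria
    raise NotImplementedError("verify() is filled in per scenario")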


def cleanup(db):
    """Remove all test data."""
    cleanup_test_data(db, TEST_ID)


def main():
    db = get_db_session()
    try:
        setup(db)
        success, execution_name = execute()
        if not success:
            report_failure(f"Execution failed: {execution_name}")
            sys.exit(1)

        passed = verify(execution_name)
        if passed:
            report_success("Scenario test passed")
            sys.exit(0)
        else:
            logs = get_job_logs(CONTROLLER_JOB, execution_name)
            report_failure("Scenario test failed", logs)
            sys.exit(1)
    finally:
        cleanup(db)
        db.close()


if __name__ == "__main__":
    main()
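
The template calls `sys.exit()` inside the `try` block and relies on `finally` to clean up afterwards. The standalone sketch below, which uses no helper library and only illustrative names, shows why that works: `sys.exit()` raises `SystemExit`, and the exception still unwinds through `finally`.

```python
import sys

events = []


def run(passed: bool) -> None:
    """Mimic the shape of main(): exit inside try, clean up in finally."""
    try:
        events.append("verify")
        # sys.exit raises SystemExit, which still passes through `finally`.
        sys.exit(0 if passed else 1)
    finally:
        events.append("cleanup")


def exit_code_of(passed: bool) -> int:
    """Run the flow and capture the exit code instead of ending the process."""
    try:
        run(passed)
    except SystemExit as exc:
        return exc.code
    return -1  # not reached: run() always exits
```

Cleanup therefore runs on both PASS and FAIL. It does not run if the process is killed outright (for example with SIGKILL), so leftover test records are still possible in that case.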

Phase 3: Checkpoint 1 - Script Approval

Present to user:

Summary of what changed
Code paths affected
Generated test script
What will be written to prod DB
Success criteria

Ask: "Ready to execute this test? [yes/modify/abort]"

Phase 4: Setup Execution

If approved:

Deploy infrastructure changes if any: deploy_infra("infra/main.bicep", "infra/parameters.prod.json")
Insert test data using helper functions
Verify setup complete

Phase 5: Checkpoint 2 - Trigger Approval

Present:

Setup complete confirmation
What job/API will be triggered
Expected behavior

Ask: "Ready to trigger the flow? [yes/abort]"

If abort: Run cleanup immediately.

Phase 6: Execution

Trigger the job or API
Poll for completion (no artificial timeout)
Collect logs from relevant components
Report real-time status updates
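
A polling loop with no artificial timeout and status updates on change could look like the sketch below. It is not the implementation of `wait_for_job_completion`: the status strings, the `get_status` callable, and the 10-second interval are assumptions, since the `JobStatus` values are not listed in this document.

```python
import time
from typing import Callable

# Assumed terminal states; the real JobStatus values may differ.
TERMINAL_STATES = {"Succeeded", "Failed"}


def poll_until_done(
    get_status: Callable[[], str],
    on_update: Callable[[str], None] = print,
    interval_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal state is seen, reporting each status change once."""
    last = None
    while True:  # no timeout: the loop ends only on a terminal state
        status = get_status()
        if status != last:
            on_update(status)
            last = status
        if status in TERMINAL_STATES:
            return status
        sleep(interval_seconds)
```

Passing `sleep` and `get_status` as parameters keeps the loop testable without touching Azure. Because nothing bounds the loop, the user must be able to interrupt it, and cleanup must still run when they do.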

Phase 7: Verification

Check success criteria:

Parse logs for error patterns
Verify expected state changes
Report PASS or FAIL with evidence
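
Log parsing for error patterns can be as simple as a line-by-line regex scan. The sketch below is modeled on the signature of `check_logs_for_errors(logs, patterns=None)`, but it is not that helper's code. The function name and the default patterns are assumptions, chosen to catch generic errors plus two Azure SDK authentication exception names.

```python
import re
from typing import Iterable, Optional

# Assumed defaults; the helper library's real defaults are not shown here.
DEFAULT_ERROR_PATTERNS = (
    r"\bERROR\b",
    r"Traceback \(most recent call last\)",
    r"ClientAuthenticationError",
    r"CredentialUnavailableError",
)


def find_error_lines(logs: str, patterns: Optional[Iterable[str]] = None) -> list[str]:
    """Return every log line that matches at least one error pattern."""
    compiled = [re.compile(p) for p in (patterns or DEFAULT_ERROR_PATTERNS)]
    return [
        line
        for line in logs.splitlines()
        if any(rx.search(line) for rx in compiled)
    ]
```

An empty result supports a PASS only together with the expected state change. A job that never started also produces no error lines.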

Phase 8: Cleanup (Always Runs)

cleanup_test_data(db, TEST_ID)

Report: "Cleanup complete. 0 test records remaining."

Test Data Conventions

Naming Patterns

Entity	Pattern	Example
Users	`test-{scenario}-{uuid}@podsum-test.local`	`test-auth-a1b2c3@podsum-test.local`
Channels	`TEST-{scenario}-{uuid}`	`TEST-auth-a1b2c3`
Auth tokens	Linked to test users only	—
Subscriptions	Linked to test users/channels	—
Episodes	Linked to test channels	—
Deliveries	Linked to test subscriptions	—
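
The naming patterns in the table can be generated from a single test ID. This is a sketch under one assumption: the `{uuid}` part is a 6-character hex suffix, as in the `a1b2c3` example. The function names are illustrative and are not the helper library's API.

```python
import uuid

TEST_EMAIL_DOMAIN = "podsum-test.local"


def make_test_id(scenario: str) -> str:
    """Return '{scenario}-{uuid}' using a short random hex suffix (length assumed)."""
    return f"{scenario}-{uuid.uuid4().hex[:6]}"


def user_email_for(test_id: str) -> str:
    """Users pattern: test-{scenario}-{uuid}@podsum-test.local"""
    return f"test-{test_id}@{TEST_EMAIL_DOMAIN}"


def channel_name_for(test_id: str) -> str:
    """Channels pattern: TEST-{scenario}-{uuid}"""
    return f"TEST-{test_id}"
```

Deriving every name from one test ID lets a single `cleanup_test_data(db, TEST_ID)` call find everything a run created.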

Isolation Guarantees

Email safety: Test users use the @podsum-test.local domain, which ACS rejects
Query safety: All test data queries filter by test prefixes
Cascade cleanup: Deleting test users cascades to related records
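
Because the tests write to the prod DB, it can help to make the query-safety rule an explicit guard that runs before any delete. The sketch below assumes the prefixes and domain from the naming patterns. The function names are hypothetical and nothing here is confirmed to exist in the helper library.

```python
def is_test_email(email: str) -> bool:
    """True only for addresses that follow the test user naming pattern."""
    return email.startswith("test-") and email.endswith("@podsum-test.local")


def is_test_channel(name: str) -> bool:
    """True only for channel names that follow the TEST- prefix pattern."""
    return name.startswith("TEST-")


def assert_only_test_records(emails, channel_names) -> None:
    """Raise before a destructive query if any target is not test data."""
    offenders = [e for e in emails if not is_test_email(e)]
    offenders += [c for c in channel_names if not is_test_channel(c)]
    if offenders:
        raise ValueError(f"Refusing to touch non-test records: {offenders}")
```

Calling a guard like this on the rows selected for cleanup turns a wrong filter into a loud failure rather than a silent deletion of real user data.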

Helper Library Reference

Import from scripts.scenario_testing:

Test Data Creation

create_test_user(db, test_id) -> User
create_test_channel(db, test_id) -> Channel
create_test_subscription(db, user, channel) -> Subscription
create_test_episode(db, channel, test_id) -> Episode
create_test_deliveries(db, subscription, count=10) -> list[Delivery]

Azure Operations

deploy_infra(bicep_path, parameters_path) -> bool
trigger_job(job_name) -> tuple[bool, str]
wait_for_job_completion(job_name, execution_name) -> JobStatus
get_job_logs(job_name, execution_name) -> str

Verification

job_execution_exists(job_name, execution_name) -> bool
check_logs_for_errors(logs, patterns=None) -> list[str]
check_blob_exists(container, path) -> bool
check_delivery_status(db, delivery_id) -> str | None

Cleanup & Reporting

cleanup_test_data(db, test_id) -> dict[str, int]
report_success(message)
report_failure(message, logs=None)

Common Scenarios

Managed Identity Fix

Changed: infra/main.bicep (AZURE_CLIENT_ID for controller)
Affects: DefaultAzureCredential in ContainerAppsProvider
Code path: controller → start_job() → Azure SDK auth
Trigger: Batch threshold met, controller runs
Success: Processor job execution exists, no auth errors

Batch Threshold Change

Changed: src/jobs/controller.py (threshold logic)
Affects: When batch processing triggers
Code path: controller.main() → check_pending() → trigger_processor()
Trigger: Insert N pending deliveries, run controller
Success: Processor triggered (or not) per new threshold

Email Template Change

Changed: src/services/delivery.py (email content)
Affects: Email body sent to users
Code path: processor → generate_email() → send_email()
Trigger: Full processing flow with test delivery
Success: Delivery marked sent, email content matches template

Red Flags

Mistake	Consequence	Prevention
Test data without unique ID	Collision with parallel runs	Always include UUID in test IDs
Forgetting cleanup on error	Orphaned test data in prod	`finally:` block always runs cleanup
Running against wrong environment	Test data in wrong place	Skill validates resource group
Wrong success criteria	Test passes but doesn't validate fix	Explain choice, user can override
Not deploying infra first	Testing old config	Deploy infra before inserting test data