shift-left-testing

name: shift-left-testing
description: >-
  Move quality earlier in the development lifecycle. Covers dev/QA pairing patterns, Three Amigos sessions, TDD facilitation (Red-Green-Refactor), PR review checklists for testability, and Definition of Done with quality gates. Includes shift-left maturity model for team assessment. Use when: "shift left," "TDD," "dev-QA pairing," "definition of done," "testability," "quality culture," "QA in sprint planning." Not for: writing the unit tests themselves (use unit-testing); automated PR test-quality review at scale (use ai-qa-review); multi-quarter QA direction or roadmap (use test-strategy). Related: unit-testing, ai-qa-review, test-strategy.
license: MIT
metadata:
  author: kindlmann
  version: "2.0"
  category: process

Move quality validation earlier in the development lifecycle where defects are cheaper, faster, and simpler to fix. A missing validation rule caught in refinement is a five-minute conversation; the same bug in production is an incident, a hotfix, and a postmortem. This skill covers the practices, patterns, and cultural shifts that embed quality into every phase — from story refinement to PR merge — plus a maturity model to find the next concrete step.

Quick Route

Situation	Go to
QA only sees features after dev is "done"	Dev/QA Pairing → QA in Sprint Planning
Need a pre-dev requirements conversation	Dev/QA Pairing → Three Amigos Sessions
Deciding whether to TDD this work	TDD Facilitation → When TDD vs. Test-After
Reviewing a PR for test quality	PR Review Checklist
Reviewing an AI-generated/AI-using PR	PR Review Checklist → When the Change Is AI-Generated
"Where is my team and what's next?"	Shift-Left Maturity Model

Discovery Questions

Check .agents/qa-project-context.md first — if it exists, use it (team composition, dev/QA workflow, sprint structure, quality goals) and skip anything answered there.

Current Dev/QA Workflow

When does QA first see a feature? After PR is raised? After merge to staging? Only when a bug appears? The answer reveals how far right your quality currently sits.
Who writes tests, and when? Developers only? QA only after dev is "done"? Both but on different timelines? Understanding current ownership is essential before changing it.
How are requirements communicated? Written specs? Verbal handoffs? Figma links with no acceptance criteria? Ambiguous requirements are the #1 source of defects that shift-left prevents.
Is there interest in TDD? Has the team tried it before? Did it stick or collapse? Understanding past attempts prevents repeating failed approaches.
What does the PR review process look like? Who reviews? Is testability a review criterion? Are tests required before merge? PR review is the lowest-friction place to introduce quality checks.
What is the team's Definition of Done? Written or unwritten? Does it include testing? Is it enforced or aspirational? The DoD is the contractual boundary between "in progress" and "done."
How does QA participate in sprint planning? Not at all? Consulted on estimates? Actively refining stories? Sprint planning participation determines how early QA thinking enters the cycle.

Core Principles

1. Quality Is Everyone's Responsibility

Quality is not a phase performed by the QA team after development. It is a property of the entire workflow: product managers write testable requirements, developers write tests alongside code, code reviewers check for testability, and QA engineers design the strategy and catch what automation misses. When quality belongs to everyone, defects are caught by whoever encounters them first.

2. Earlier Detection = Cheaper Fixes (Directionally)

A missing validation rule caught during story refinement is a five-minute conversation. The same defect found in production is an incident, a hotfix, a postmortem, and eroded user trust. The cost of fixing a defect rises sharply the further right it is caught — refinement < design < development < QA < staging < production. That direction is real and well-attested; the exact multipliers are not. The widely cited "1x → 100x" table traces to an undated, unsourced IBM Systems Science Institute training chart with no published methodology, so treat any precise figure as illustrative, not measured. Lead with the concrete cost story above, not invented numbers. Shift-left practices aim to catch defects in the cheap left-hand columns — refinement through development — before QA, staging, or prod ever see them.

3. QA Is Embedded, Not a Gate

Traditional QA acts as a gate at the end of development: code is "thrown over the wall" for testing. Shift-left embeds QA throughout the process. QA contributes to story refinement, pairs with developers on test design, reviews PRs for testability, and validates early through continuous testing. The gate model creates bottlenecks and adversarial dynamics. The embedded model creates collaboration and shared ownership.

4. Testability Is a Design Concern

Code that is hard to test is usually hard to maintain, hard to debug, and likely to contain defects. Testability should be a first-class design constraint alongside performance, security, and usability. When developers ask "how will we test this?" during design -- before writing a single line of code -- the resulting architecture is cleaner, more modular, and more reliable.

5. Start Small, Prove Value, Then Expand

Introducing every shift-left practice simultaneously overwhelms teams. Pick one practice (usually PR review checklists or Three Amigos), prove its value with data (fewer bugs escaping, faster PR cycles), then use that success to justify the next practice. Cultural change happens one demonstrated win at a time.

Dev/QA Pairing Patterns

QA in Sprint Planning

What it looks like: QA engineers attend sprint planning and actively participate in story refinement. They ask clarifying questions about edge cases, identify missing acceptance criteria, and flag risk areas before development begins.

Concrete actions during planning:

Review each story for testable acceptance criteria. Every acceptance criterion should be verifiable -- "user can sort the table" is testable; "table is user-friendly" is not.
Identify edge cases and negative scenarios. What happens with empty data? Max length input? Concurrent users? Network failure mid-operation?
Flag integration risks. Does this story touch a third-party API? Does it change database schema? Does it affect existing test data?
Estimate QA effort. Automation time, exploratory testing time, environment setup. Include this in sprint capacity.
Define test approach per story. Unit tests for business logic, integration tests for API changes, E2E for user-facing flows.

Template: QA questions for each story

Story: [PROJ-1234] Add coupon code to checkout
───────────────────────────────────────────────
QA questions before development starts:
1. What happens if the coupon is expired?
2. What happens if the coupon is already used (single-use)?
3. Can multiple coupons be stacked?
4. What error message does the user see for invalid codes?
5. Does the discount update the total in real-time or on submit?
6. Is there a rate limit on coupon validation attempts?

Test approach:
- Unit: coupon validation logic, discount calculation, expiry check
- Integration: coupon API endpoint, database state after redemption
- E2E: apply coupon in checkout flow, verify discount on confirmation
- Exploratory: edge cases with currency rounding, max discount limits

Three Amigos Sessions

A structured 15-30 minute conversation between three perspectives before development begins.

The three perspectives:

Product/Business: What does the user need? Why does this matter?
Development: How will we build it? What are the technical constraints?
QA/Testing: How will we verify it? What could go wrong?

Optional fourth amigo (AI participant): A coding agent can generate edge cases and counter-scenarios from the acceptance criteria mid-session. Treat AI output as a checklist to validate, not a decision — humans still own the criteria.

Session format (30 minutes max):

1. Product presents the story (5 min) -- user need, acceptance criteria
2. Development asks clarifying questions (5 min) -- feasibility, dependencies
3. QA asks testing questions (5 min) -- edge cases, error states, testability
4. Group identifies gaps (10 min) -- missing criteria added, assumptions made explicit
5. Agreement and next steps (5 min) -- updated story, risks documented, test approach agreed

When to use: Stories with risk score Medium+, anything touching payments/auth/data integrity, stories with ambiguous requirements, cross-team stories.

When to skip: Simple bug fixes with clear repro steps, copy/text-only changes, dependency updates with no behavioral change.

QA Pairing on Test-First Design

QA and developer collaborate on test cases before implementation. This is not full TDD -- it is test thinking applied collaboratively.

How it works:

1. Developer and QA sit together (or share screen) for 20-30 minutes
2. QA describes the scenarios they plan to test
3. Developer writes the test signatures (function names, inputs, expected outputs)
4. Together they identify which tests are unit, integration, and E2E
5. Developer implements the feature with these tests as the target

Example output from a pairing session: a set of agreed test signatures spanning unit, integration, and E2E levels, written before implementation. See references/tdd-examples.md for the full coupon-feature pairing output.
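As a rough illustration of what "agreed test signatures" can look like before any implementation exists, the sketch below records each signature as data and groups the names by level. It is written in Python purely for illustration. The `Signature` class, the `by_level` helper, and the individual coupon scenarios are hypothetical and are not taken from references/tdd-examples.md; only the `COUPON_EXPIRED` code echoes the checklist later in this document.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Signature:
    """One agreed test: its level, a spec-style name, the input, the expected result."""
    level: str      # "unit", "integration", or "e2e"
    name: str       # reads as a specification
    given: str      # input or starting state
    expected: str   # observable outcome


# Hypothetical output of a pairing session on the coupon story.
AGREED_SIGNATURES: List[Signature] = [
    Signature("unit", "rejects_expired_coupon",
              "coupon whose expiry date has passed", "error code COUPON_EXPIRED"),
    Signature("unit", "rejects_already_redeemed_single_use_coupon",
              "single-use coupon redeemed once", "validation error"),
    Signature("integration", "marks_coupon_redeemed_in_database",
              "valid coupon applied via the API", "coupon row flagged as redeemed"),
    Signature("e2e", "shows_discount_on_confirmation_page",
              "valid coupon entered at checkout", "discounted total on confirmation"),
]


def by_level(signatures: List[Signature]) -> Dict[str, List[str]]:
    """Group test names by level so a missing level is obvious at a glance."""
    grouped: Dict[str, List[str]] = {"unit": [], "integration": [], "e2e": []}
    for sig in signatures:
        grouped[sig.level].append(sig.name)
    return grouped
```

Calling `by_level(AGREED_SIGNATURES)` at the end of the session gives a quick check that every level the story needs has at least one agreed test.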

QA Reviewing PRs

QA engineers review pull requests with a focus on testability and test quality, complementing the code review performed by other developers.

Getting started for teams new to QA PR reviews:

Start with one QA reviewer on high-risk PRs only. Do not try to review every PR on day one.
Time-box reviews to 15 minutes. QA is checking for test quality, not re-reviewing business logic.
Use the PR Review Checklist below. It provides concrete criteria, so feedback rests on a shared standard rather than personal taste.
Leave comments as suggestions, not demands. Frame as "Consider adding a test for the empty state" rather than "Missing tests."
Track value. Note when QA review catches a gap. After 2-4 weeks, share the count with the team to demonstrate ROI.

TDD Facilitation

Red-Green-Refactor

TDD follows a strict three-step cycle. Each step has a clear purpose and a clear exit condition.

┌──────────────────────────────────────────────────────┐
│  RED: Write a failing test                           │
│  - Test describes the desired behavior               │
│  - Test MUST fail (if it passes, it tests nothing)   │
│  - Write the minimum test to specify one behavior    │
│                                                      │
│  GREEN: Make the test pass                           │
│  - Write the minimum code to pass the test           │
│  - No extra features, no premature optimization      │
│  - It is OK if the code is ugly                      │
│                                                      │
│  REFACTOR: Clean up                                  │
│  - Improve code structure without changing behavior  │
│  - All tests still pass after refactoring            │
│  - Remove duplication, improve naming, simplify      │
└──────────────────────────────────────────────────────┘

Example: TDD for a password strength validator — first failing test, minimum passing code, then a behavior-preserving refactor into a rules array. See references/tdd-examples.md for the full Red-Green-Refactor walk-through.
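A compressed sketch of where such a cycle might end up, assuming Python and three illustrative rules. The rule set, the messages, and the name `validate_password` are assumptions made for this sketch; the full walk-through in references/tdd-examples.md may differ.

```python
import re
from typing import Callable, List, Tuple

# Each rule pairs a predicate with the message reported when it fails.
# The specific rules and messages are assumptions for illustration.
Rule = Tuple[Callable[[str], bool], str]

# REFACTOR outcome: the checks live in one rules list instead of an if-chain,
# so adding a rule means adding one entry, not another branch.
RULES: List[Rule] = [
    (lambda pw: len(pw) >= 8, "must be at least 8 characters"),
    (lambda pw: re.search(r"[A-Z]", pw) is not None, "must contain an uppercase letter"),
    (lambda pw: re.search(r"\d", pw) is not None, "must contain a digit"),
]


def validate_password(password: str) -> List[str]:
    """Return the message of every failed rule; an empty list means valid."""
    return [message for check, message in RULES if not check(password)]


# In the RED step a test like this is written first and fails, because
# validate_password does not exist yet. GREEN adds just enough code to pass it.
def test_rejects_password_shorter_than_eight_characters():
    assert "must be at least 8 characters" in validate_password("Ab1")


# Added in a later cycle; it must keep passing through every refactor.
def test_accepts_password_meeting_every_rule():
    assert validate_password("Abcdefg1") == []
```

Returning a list of messages rather than a boolean keeps each rule individually assertable, which is what lets the refactor be checked as behavior-preserving.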

When TDD vs. Test-After: Decision Guide

TDD is not always the right choice. Use this guide to decide.

Scenario	Approach	Why
Pure business logic (validators, calculators, transformers)	TDD	Clear inputs/outputs, fast feedback, tests document behavior
Bug fix with known reproduction	TDD	Write failing test first = proof the fix works
API endpoint with clear contract	TDD	Request/response is a natural test boundary
Exploratory UI prototyping	Test-after	Design is unstable; tests would rewrite constantly
Third-party integration	Test-after	Need to understand the API behavior first
Complex data migration	Test-after with fixtures	Write sample data first, then test transformation
Performance optimization	Test-after with benchmarks	Need baseline before testing improvement
AI-generated implementation	TDD (test first)	LLMs happily produce passing-looking code; the failing test is the spec the agent must satisfy. Highest-leverage check on AI output.

TDD for Bugs (The Litmus Test)

Every bug fix should start with a failing test that reproduces the bug. This practice provides three guarantees:

You understand the bug. If you cannot write a test that fails, you do not understand the bug.
The fix actually works. The test turns green when the fix is applied.
The bug cannot return unnoticed. The test stays in the suite as a regression guard and fails if the defect is reintroduced.

See references/tdd-examples.md for a worked failing-test-first example (a JPY zero-decimal rounding bug).
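The shape of a failing-test-first fix can be sketched as follows, assuming Python and a hypothetical version of the bug in which every currency is scaled as if it had two decimal places. The function names and the exact nature of the defect are assumptions; the worked example in references/tdd-examples.md may differ. The one external fact relied on is that JPY has zero minor-unit digits under ISO 4217.

```python
from decimal import Decimal, ROUND_HALF_UP

# Minor-unit digits per ISO 4217 for the currencies used in this sketch.
MINOR_UNIT_DIGITS = {"USD": 2, "EUR": 2, "JPY": 0}


def to_minor_units_buggy(amount: Decimal, currency: str) -> int:
    """Hypothetical original code: assumes every currency has two decimals."""
    return int((amount * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def to_minor_units(amount: Decimal, currency: str) -> int:
    """The fix: scale by the currency's own number of minor-unit digits."""
    digits = MINOR_UNIT_DIGITS[currency]
    scaled = amount * (Decimal(10) ** digits)
    return int(scaled.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Step 1: written before the fix. Against the buggy version the same
# assertion fails, because 500 JPY comes back as 50000.
def test_jpy_amount_has_no_minor_unit_scaling():
    assert to_minor_units(Decimal("500"), "JPY") == 500


# Step 2: a neighbouring case confirms the fix did not break two-decimal currencies.
def test_usd_amount_is_scaled_to_cents():
    assert to_minor_units(Decimal("19.99"), "USD") == 1999
```

Keeping the buggy function around is only for demonstration here; in a real fix the failing test is committed first, then the production function is changed until it passes.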

Kata Exercises for Teams Learning TDD

Short exercises (30-60 min) to build TDD muscle memory:

Kata	Difficulty	Key lesson
FizzBuzz	Beginner	Basic Red-Green-Refactor cycle
String Calculator	Beginner	Incremental complexity, edge cases
Roman Numerals	Intermediate	Pattern recognition, refactoring
Bowling Game	Intermediate	State management, complex rules
Gilded Rose	Advanced	Refactoring legacy code under test harness

Format: Pair programming, 45 minutes, switch driver every 5 minutes. Debrief for 15 minutes: what was hard? What felt natural? What would you do differently?

PR Review Checklist: QA Perspective

Use this checklist when reviewing PRs for test quality and testability. Not every item applies to every PR -- use judgment based on the change scope.

Tests Exist and Are Meaningful

Tests accompany the code change. New feature? New tests. Bug fix? Regression test. Refactor? Existing tests still pass (and ideally improve). No-test PRs for behavioral changes need explicit justification.
Both happy path and edge cases are covered. At minimum: valid input, invalid input, empty/null input, boundary values. For user-facing features: error states, loading states, empty states.
Tests describe behavior, not implementation. Test names read as specifications: "rejects expired coupon with clear error message", not "test coupon validator function line 42".

Code Is Testable

Functions have clear inputs and outputs. Pure functions are trivially testable. Functions with side effects should isolate the side effect (dependency injection, wrapper functions).
Dependencies are injectable. Database clients, HTTP clients, clocks, and random number generators should be parameters or injected -- not imported directly inside business logic.
No hardcoded magic values. Constants are named and configurable. Tests can override them without modifying production code.
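One way to read "dependencies are injectable" is shown in the sketch below, which passes the clock in as a parameter. It assumes Python, and the names `Clock`, `system_clock`, and `is_coupon_expired` are hypothetical, as is the choice to treat the expiry instant itself as expired.

```python
from datetime import datetime, timezone
from typing import Callable

# A clock is any zero-argument callable returning the current time.
Clock = Callable[[], datetime]


def system_clock() -> datetime:
    """Production default: the real current time in UTC."""
    return datetime.now(timezone.utc)


def is_coupon_expired(expires_at: datetime, now: Clock = system_clock) -> bool:
    """Ask the injected clock for the time instead of reading it directly."""
    return now() >= expires_at


def test_coupon_is_expired_at_the_expiry_instant():
    expiry = datetime(2030, 1, 1, tzinfo=timezone.utc)
    # The test controls time completely; no patching of datetime is needed.
    assert is_coupon_expired(expiry, now=lambda: expiry) is True


def test_coupon_is_valid_one_day_before_expiry():
    expiry = datetime(2030, 1, 1, tzinfo=timezone.utc)
    day_before = datetime(2029, 12, 31, tzinfo=timezone.utc)
    assert is_coupon_expired(expiry, now=lambda: day_before) is False
```

Production callers omit the `now` argument and get the system clock; the same pattern applies to HTTP clients, database clients, and random number generators.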

Test Quality

Selectors use stable strategies. E2E tests use data-testid, getByRole, or getByLabel -- not CSS classes or XPath. See the selector stability scoring in test-reliability.
Assertions are specific. expect(result).toEqual({ status: 'expired', code: 'COUPON_EXPIRED' }) not expect(result).toBeTruthy().
Test data is deterministic. No dependency on current date, random values, or auto-increment IDs without explicit control. Use factories or fixtures.
Tests clean up after themselves. Created records are deleted. Modified state is restored. No test pollution.
Test names describe the scenario. A reader unfamiliar with the code should understand what is being tested from the test name alone.
No coverage-only tests. Tests that execute code without meaningful assertions inflate coverage without providing safety.
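The difference between a coverage-only assertion and a specific one, and the role of a deterministic factory, might look like this in Python. The `redeem` function, the `make_coupon` factory, and the `"OK"` code are hypothetical stand-ins; only the expired result mirrors the assertion quoted in the checklist.

```python
def redeem(coupon: dict, today: str) -> dict:
    """Toy redemption logic; dates are ISO strings, which compare correctly as text."""
    if coupon["expires_on"] < today:
        return {"status": "expired", "code": "COUPON_EXPIRED"}
    return {"status": "applied", "code": "OK"}


def make_coupon(expires_on: str = "2024-01-31") -> dict:
    """Factory with fixed defaults, so no test depends on the real current date."""
    return {"id": "coupon-1", "expires_on": expires_on}


def test_weak_assertion_executes_code_but_proves_little():
    # Any non-empty dict is truthy, so this would still pass if redeem
    # wrongly returned the "applied" result for an expired coupon.
    assert redeem(make_coupon(), today="2024-02-01")


def test_specific_assertion_pins_the_exact_result():
    result = redeem(make_coupon(), today="2024-02-01")
    assert result == {"status": "expired", "code": "COUPON_EXPIRED"}
```

Both tests produce the same line coverage; only the second one fails when the behavior regresses.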

When the Change Is AI-Generated or AI-Using

Apply these additional checks when a PR contains code authored by an AI agent or introduces an AI-powered feature.

AI provenance disclosed. PR description names the agent, model, and what it generated (so reviewers calibrate scrutiny appropriately).
Tests written first or by a human. AI-generated implementation paired with AI-generated tests is a closed loop — at least one side of the test/implementation pair should be authored or critically reviewed by a human (see TDD decision guide row above).
Prompt and model version pinned. For AI-using features (LLM calls, prompt templates), the prompt version and model ID live in a flag-based config store — LaunchDarkly AI Configs (GA 2025) or equivalent — not embedded as ad-hoc strings that drift. The PR should reference the config key, not paste the prompt inline.
Runtime kill switch wired. Any AI feature ships behind a feature flag that can disable it without redeploy. Pair shift-left prevention with shift-right containment.
Prompt eval test exists. At least one regression test for the prompt's behavior on representative inputs (see ai-system-testing).
No fabricated APIs or imports. Reviewer verifies every imported symbol exists — LLMs invent plausible-sounding APIs.
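The last check can be partly mechanized. The sketch below, assuming a Python codebase, parses a source string with the standard `ast` module and tries to resolve each absolute import with `importlib`. The helper name `missing_imports` is hypothetical. It only covers import statements, not calls to nonexistent attributes, and resolving an import executes that module's top-level code, so it should only be run against dependencies already trusted in the environment.

```python
import ast
import importlib
from typing import List


def missing_imports(source: str) -> List[str]:
    """Return 'module' or 'module.name' entries that cannot be resolved."""
    missing: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                try:
                    importlib.import_module(alias.name)
                except ImportError:
                    missing.append(alias.name)
        # Relative imports (level > 0) are skipped; they need package context.
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            try:
                module = importlib.import_module(node.module)
            except ImportError:
                missing.append(node.module)
                continue
            for alias in node.names:
                if alias.name == "*" or hasattr(module, alias.name):
                    continue
                # The name may be a submodule that is not loaded yet.
                try:
                    importlib.import_module(f"{node.module}.{alias.name}")
                except ImportError:
                    missing.append(f"{node.module}.{alias.name}")
    return missing
```

For example, `missing_imports("from json import dumps, made_up_encoder")` reports `json.made_up_encoder`, while the real `dumps` passes silently. A clean result does not replace human review; it only removes the cheapest class of fabricated symbols.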

Definition of Done Template

The Definition of Done (DoD) is the team's shared agreement on what "done" means. It applies to every story before it moves to "Done" on the board.

Recommended DoD with Quality Gates

A complete DoD groups its checks under Code Complete, Tested, Quality Gates Pass, Documentation, and Deployment Ready. The Tested group requires unit tests for business logic, integration tests for API/service changes, an E2E test for user-facing critical paths, edge cases and error states covered, and manual exploratory testing completed for medium/high risk changes. The Quality Gates group requires a green CI pipeline, no new lint/type errors, and code coverage not decreased from baseline (a baseline number, not an absolute threshold pulled from the air). See references/templates.md for the full copy-paste checklist.

Enforcing the DoD

The DoD is only effective if it is enforced. Three enforcement mechanisms:

Automated gates in CI. Tests must pass, coverage must not decrease, linting must pass. These cannot be bypassed without a team lead override.
PR template checklist. Include the DoD as a checklist in the PR template. Reviewers verify items are checked.
Sprint review validation. During sprint review, stories are accepted only if the DoD is met. "It works but tests are not written yet" means it is not done.
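The "coverage must not decrease" gate reduces to a comparison against a stored baseline. A minimal sketch, assuming Python and that the two percentages have already been extracted from the coverage tool's report; the function name `coverage_gate` and the message wording are hypothetical.

```python
from typing import Tuple


def coverage_gate(baseline_pct: float, current_pct: float) -> Tuple[bool, str]:
    """Pass when current coverage is at or above the stored baseline."""
    if current_pct < baseline_pct:
        drop = baseline_pct - current_pct
        return False, (
            f"coverage fell {drop:.2f} points "
            f"({baseline_pct:.2f}% -> {current_pct:.2f}%)"
        )
    return True, f"coverage {current_pct:.2f}% is at or above baseline {baseline_pct:.2f}%"


def exit_code_for(baseline_pct: float, current_pct: float) -> int:
    """Map the gate result to a process exit code a CI job can act on."""
    passed, message = coverage_gate(baseline_pct, current_pct)
    print(message)
    return 0 if passed else 1
```

A CI step would call something like `sys.exit(exit_code_for(baseline, current))`, so a drop turns the pipeline red. Because the comparison is against the baseline rather than a fixed threshold, the gate ratchets upward as the baseline is updated after each merge.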

Shift-Left Maturity Model

Assess where your team currently sits and identify the concrete next step to improve.

Level 1: Reactive

Symptoms:

QA tests only after development is complete
Bugs found in staging or production
No automated tests or minimal coverage
Requirements are ambiguous; QA discovers gaps during testing
"QA phase" is a distinct block at the end of the sprint

Next step: Introduce QA into sprint planning. Start with QA asking clarifying questions on each story before development begins. Measure: count of requirement gaps found in planning vs. found in testing.

Level 2: Gate

Symptoms:

QA reviews PRs but does not participate in design
Automated tests exist but are written after features are complete
Definition of Done exists but testing items are often skipped
QA is a checkpoint, not a collaborator
Test-after means bugs are found late; rework is common

Next step: Introduce Three Amigos for high-risk stories. QA, dev, and product discuss requirements, edge cases, and test approach before development starts. Measure: reduction in bugs found during QA testing (should decrease as upstream quality improves).

Level 3: Embedded

Symptoms:

QA participates in sprint planning and story refinement
Developers write unit and integration tests during development
PR review includes testability checks
QA and dev pair on test case design
DoD enforced with automated quality gates

Next step: Introduce test-first practices for bug fixes (every bug fix starts with a failing test). Extend to TDD for pure business logic. Measure: regression rate (should approach zero).

Level 4: Collaborative

Symptoms:

Three Amigos are standard for medium/high risk stories
Developers practice TDD for business logic and bug fixes
QA focuses on exploratory testing, strategy, and risk analysis
Quality metrics tracked and reviewed regularly
Cross-functional ownership of quality

Next step: Introduce shift-left to architecture and design reviews. QA reviews system design documents for testability before implementation begins. Measure: defect escape rate (consistently below 5%).

Level 5: Preventive

Symptoms:

Quality is built into every stage
Defect escape rate consistently below 3%
QA engineers focus on strategy, coaching, and systemic improvement
Production issues are rare and trigger root cause analysis
The team cannot imagine working without early quality practices

Maintaining this level: Quarterly maturity assessments. New team members onboarded with quality practices from day one. Retrospectives include quality metrics.
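The Level 4 and Level 5 measures rely on a defect escape rate, which this document does not define. The sketch below assumes one common definition: defects found in production divided by all defects found in the same period. That definition and the function names are assumptions; the 5% and 3% thresholds are the ones quoted in the levels above.

```python
def defect_escape_rate(found_before_release: int, found_in_production: int) -> float:
    """Share of all known defects that escaped to production (0.0 to 1.0)."""
    total = found_before_release + found_in_production
    if total == 0:
        return 0.0  # no defects recorded, so nothing escaped
    return found_in_production / total


def meets_level_4_target(rate: float) -> bool:
    """Level 4 target from the maturity model: below 5%."""
    return rate < 0.05


def meets_level_5_target(rate: float) -> bool:
    """Level 5 target from the maturity model: below 3%."""
    return rate < 0.03
```

With 96 defects caught before release and 4 in production, the rate is 0.04, which meets the Level 4 target but not the Level 5 one. The figure is only comparable across sprints if the counting window and the definition of "defect" stay fixed.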

Self-Assessment Worksheet

Score eight practices (QA in planning, Three Amigos, PR review, tests-during-dev, failing-test-first bug fixes, TDD for business logic, enforced DoD, metrics review) on a Never/Sometimes/Usually/Always scale to place the team on Levels 1–5. See references/templates.md for the printable worksheet with scoring bands.

Anti-Patterns

"Shift Left" as QA Layoff

Rebranding "developers write all the tests" as shift-left to justify eliminating QA roles. Shift-left changes WHEN quality happens, not WHO does it. QA engineers bring a testing mindset, risk analysis skills, and exploratory testing capabilities that developers typically do not develop. Removing QA and telling developers to "just test more" results in blind spots, not savings.

Ceremony Without Substance

Running Three Amigos meetings as a checkbox exercise where nobody asks hard questions. If the session does not produce at least one changed acceptance criterion or one new edge case, it was not a real discussion. Track "gaps found in Three Amigos" as a metric.

All TDD, All the Time

Forcing TDD on UI prototyping, exploratory spikes, or experimental features where the design is still fluid. TDD works best when the desired behavior is clear. For uncertain domains, spike first, then write tests around the design that emerges. Use the decision guide above.

Quality Gates Without Team Buy-In

Imposing strict quality gates (coverage thresholds, mandatory QA review) without explaining why they exist or involving the team in setting the thresholds. Gates perceived as imposed slow the team and get circumvented. Gates set collaboratively are defended by the team.

Testing Everything at the Wrong Level

Writing E2E tests for business logic that should be validated by unit tests. Writing unit tests for user flows that need E2E validation. Shift-left is not just "test earlier" -- it is "test at the right level, as early as possible." A calculation bug needs a unit test, not a browser test.

Measuring Activity Instead of Outcomes

Tracking "number of Three Amigos sessions held" instead of "defects found in planning vs. found in production." Activities are inputs; outcomes are outputs. Measure whether shift-left practices actually reduce escaped defects and rework.

Verification

Prove the gates actually bite — a gate that never fires proves nothing:

1. A no-test behavioral PR is blocked. Open a throwaway PR that changes behavior with no test (or drops coverage below baseline) and confirm CI goes red and the merge button is disabled. If it merges, the gate is decorative.
2. The DoD checklist renders in the PR template. Confirm the DoD checklist exists in .github/pull_request_template.md (or your platform's equivalent) so reviewers see it on every PR. A quick check: test -f .github/pull_request_template.md && grep -qi "unit test" .github/pull_request_template.md
3. The AI kill switch toggles off. For any AI-powered path, flip its feature flag to disabled in your flag platform and confirm the path goes dark without a redeploy.
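The kill-switch check can also be rehearsed in a unit test before it is tried against the real flag platform. The sketch below assumes Python and uses an in-memory `FlagStore` as a stand-in for whatever flag client the team uses; `FlagStore`, the `ai-ticket-summary` key, and `summarize_ticket` are all hypothetical names.

```python
from typing import Callable, Dict

AI_SUMMARY_FLAG = "ai-ticket-summary"  # hypothetical flag key


class FlagStore:
    """In-memory stand-in for a flag platform client."""

    def __init__(self, flags: Dict[str, bool]) -> None:
        self._flags = dict(flags)

    def is_enabled(self, key: str) -> bool:
        # Unknown flags default to off, so a missing flag fails safe.
        return self._flags.get(key, False)

    def set_flag(self, key: str, enabled: bool) -> None:
        self._flags[key] = enabled


def summarize_ticket(text: str, flags: FlagStore,
                     ai_summarize: Callable[[str], str]) -> str:
    """Use the AI path only while its flag is on; otherwise truncate the text."""
    if not flags.is_enabled(AI_SUMMARY_FLAG):
        return text[:40]  # non-AI fallback
    return ai_summarize(text)
```

A test flips the flag with `set_flag(AI_SUMMARY_FLAG, False)` and asserts that the injected `ai_summarize` callable is never invoked afterwards. Evaluating the flag on every call, rather than once at startup, is what allows the path to go dark without a redeploy.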

Done When

Definition of Done updated to include test criteria (unit, integration, and E2E gates) and committed to the repo (e.g. present in .github/pull_request_template.md)
PR review checklist with a test-coverage check is checked into the PR template and required by branch protection
At least one Three Amigos session run for an upcoming feature, with gaps documented and acceptance criteria updated in the ticket
A first dev/QA pairing session has happened, with the agreed test signatures committed to the repo
Pre-merge quality gates (test pass, coverage not decreased, linting) are active in CI and a no-test PR is observed to fail (see Verification step 1)
A feature flag exists for each risky or AI-powered code path and its disable toggle is verified in the flag platform (see Verification step 3), so prevention (shift-left) and containment (shift-right) ship together

Reference Files (in `references/`)

tdd-examples.md — Runnable code for dev/QA pairing test-first design, the Red-Green-Refactor password-validator walk-through, and the failing-test-first bug example.
templates.md — Copy-paste Definition of Done with quality gates and the shift-left maturity self-assessment worksheet.

Related Skills

unit-testing -- Detailed patterns for writing effective unit tests, the primary artifact of shift-left development practices.
ai-qa-review -- Automated PR review for test quality and testability, scaling the QA review patterns described here.
test-strategy -- The overall testing approach that shift-left practices implement at the daily level.
qa-project-context -- Project-specific context that determines which shift-left practices to introduce first.
quality-postmortem -- When shift-left fails and defects escape, postmortems identify which practice would have caught them.
qa-project-bootstrap -- Onboarding new team members includes introducing them to the team's shift-left practices.