name: e2e
description: Write and run web E2E tests (Playwright) using TDD (locations, patterns, commands, and debugging).
E2E Tests
Write E2E tests using TDD (Red-Green-Refactor). Always run the tests you create and watch them fail before implementing.
Available skills and subagents
- /tdd: Follow the Red-Green-Refactor cycle when writing tests.
- /verify: Run after completing tests to ensure everything passes across the monorepo.
- /playwright-cli: Interactive browser automation. Use it to validate features against the dev server before writing tests, and to debug failing tests with --debug=cli.
Location
apps/web/e2e/
apps/web/e2e/
├── fixtures/
│ ├── backend.ts # Worker-scoped backend + frontend process
│ ├── test-base.ts # Extended fixture (apiClient, seedData, testPage)
│ └── office-fixture.ts # Office fixtures (officeApi, officeSeed with workspace+agent)
├── helpers/
│ ├── api-client.ts # HTTP client for seeding data (read for available methods)
│ └── office-api-client.ts # Office-specific API client (onboarding, issues, agents)
├── pages/ # Page objects (read for available pages and methods)
└── tests/ # Spec files (*.spec.ts), grouped by feature
    ├── task/ # Task creation, deletion, archiving, environment, subtasks
    ├── kanban/ # Kanban board, mobile kanban, preview panel
    ├── session/ # Session lifecycle, resume, recovery, multi-session, layout
    ├── workflow/ # Workflow steps, settings, automation, import/export
    ├── git/ # Git changes panel, commits, diffs, symlinks
    ├── pr/ # PR detection, watchers, changes panel
    ├── terminal/ # Terminal agent, keyboard, settings
    ├── chat/ # Quick chat, message queue, clarification, markdown, toolbar
    ├── settings/ # Config management, agent profiles, editor integration
    └── review/ # Code review diffs
Each worker gets an isolated backend, frontend, database, and mock agent — no Docker, no API keys needed.
Run commands
Always run headless (make test-e2e). Never use --headed, e2e:headed, or test-e2e-headed — headed mode requires a display and will fail in agent environments.
Preferred: pnpm e2e:run (managed runner — builds, runs, tears down)
e2e/scripts/run-e2e.sh handles the build, the run, and cleanup in one command. Use it instead of stitching the steps together. It auto-selects docker vs host, runs N shards concurrently, enforces strict WS accounting by default (matching CI), and never leaves root-owned artifacts behind.
cd apps/web
pnpm e2e:run # auto: docker if daemon + CI image available, else host; builds first
pnpm e2e:run tests/task/my-test.spec.ts # single file (extra args pass through to Playwright)
pnpm e2e:run --shards 3 # 3 shards concurrently on this machine (isolated)
pnpm e2e:run --no-build -- --grep "task creation" # skip rebuild; forward flags after --
pnpm e2e:docker # force the docker CI image (full isolation from a host dev instance)
pnpm e2e:clean # remove build/test artifacts, incl. root-owned ones from prior docker runs
The runner handles the sharp edges that hand-rolled steps would hit. In docker mode it:
- builds the CGO backend on the host and runs it in the runtime image. This is forward-compatible when the host glibc is no newer than the image's (the usual case); the runner smoke-tests this and falls back to the build image only if the host glibc is newer
- builds the FE standalone on the host
- pre-creates the standalone symlinks as relative links so in-container global-setup doesn't recreate them as root
- keeps Playwright output container-local
See apps/web/e2e/README.md → "the managed runner".
Raw commands (when you need fine control)
make test-e2e # all tests, headless (host)
cd apps && pnpm --filter @kandev/web e2e -- tests/task/my-test.spec.ts # single file
cd apps && pnpm --filter @kandev/web e2e -- --grep "task creation" # by name
Flake reproduction
Start by matching CI as closely as possible, then add pressure deliberately:
- Run the exact failed shard in the CI runtime image with CI env enabled:
  docker run --rm --ipc=host -v "$PWD":/work -w /work/apps/web \
    -e CI=true -e GITHUB_ACTIONS=true -e GITHUB_WORKSPACE=/work \
    -e NODE_OPTIONS=--dns-result-order=ipv4first \
    -e PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
    ghcr.io/kdlbs/kandev-ci:runtime-latest \
    bash -lc 'git config --global --add safe.directory /work 2>/dev/null; npx playwright test --config e2e/playwright.config.ts --project=chromium --project=mobile-chrome --shard=10/10 --reporter=list'
- If the exact shard passes, constrain container resources and repeat the failing spec/test. GitHub-hosted runners can expose timing bugs that a roomy local machine hides:
  docker run --rm --ipc=host --cpus=2 --memory=4g --memory-swap=4g \
    -v "$PWD":/work -w /work/apps/web \
    -e CI=true -e GITHUB_ACTIONS=true -e GITHUB_WORKSPACE=/work \
    -e NODE_OPTIONS=--dns-result-order=ipv4first \
    -e PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
    ghcr.io/kdlbs/kandev-ci:runtime-latest \
    bash -lc 'git config --global --add safe.directory /work 2>/dev/null; npx playwright test --config e2e/playwright.config.ts --project=mobile-chrome e2e/tests/terminal/mobile-terminal-keybar.spec.ts --grep "user presses an OS-keyboard letter while no modifier is active" --repeat-each=30 --reporter=list'
- Preserve nearby test ordering when a single-test repeat stays green. Run the full spec file or full shard with the same resource limits before declaring a flake non-reproducible.
Record the exact command, resource limits, repeat number, and failure artifact
path. Always inspect error-context.md; mobile/terminal flakes often show
state that the stack trace alone hides, such as duplicate active terminals or a
terminal stuck on "Starting terminal...".
CRITICAL: E2E tests run against the production build (.next/standalone/), not dev mode. After any frontend code change, you must rebuild before running tests (pnpm e2e:run does this for you):
make build-web # ~30s, required after every frontend change
Without this, tests run against stale code and failures are misleading. make build-backend is also required after Go changes. make test-e2e and pnpm e2e:run handle both automatically.
Writing a test
- Read helpers/api-client.ts and pages/ to discover available seed methods and page objects
- Import fixtures from ../../fixtures/test-base, which provides testPage, apiClient, and seedData (pre-created workspace with default workflow). Pull backend from the fixture too when you need the backend URL: it's worker-scoped and dynamic, and process.env.KANDEV_API_BASE_URL is not set in the Playwright runner (only in the frontend SSR child process). Use backend.baseUrl.
- Use data-testid attributes for selectors; add them to components as needed
- Use page objects for common interactions; create new ones for new pages
- For GitHub features, use apiClient.mockGitHub*() methods to seed mock data
IDs and response shapes — common pitfalls
- apiClient.createTaskWithAgent(...) returns CreateTaskResponse, which is Task & { session_id?: string; agent_execution_id?: string }. Read created.session_id directly; don't call listTaskSessions(taskId) just to fetch the session that was auto-started by the same call.
- The URL /t/:id contains the TASK ID, not the session ID. Backend routes like /port-proxy/:sessionId/:port/*path expect the session ID. Don't extract IDs from window.location.pathname when you need a session ID; pull it from the API response.
- page.request shares cookies/storage with the page context. Fine for the current no-auth local backend; if auth ever lands, this is where you'd plug it in.
- Flows that go through a Next.js server action fetch the backend server-side (from the SSR process, not the browser). page.waitForResponse("**/api/v1/...") on the backend URL will never fire, because the browser only sees the POST to the server-action endpoint. Assert the user-visible outcome (redirect, toast, store change) and/or re-query state via apiClient instead. Client-side fetches (most lib/api/domains/* calls) are visible to waitForResponse; server actions in app/actions/* are not.
- Preview iframe tests: the seed repo has no dev_script configured, so the preview panel renders a placeholder ("Configure a dev script…") and the URL input never appears. Tests that try to drive it hang on the locator timeout. To use the preview iframe in a test, set one first: await apiClient.updateRepository(seedData.repositoryId, { dev_script: "echo dev" }). Then click the Preview dockview tab (await session.clickTab("Preview")); the toolbar will mount and the URL input becomes targetable.
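The ID pitfalls above can be made concrete with a small sketch. The Task interface here is a hypothetical stand-in (the real type lives in the project and has more fields), and sessionIdFrom, taskIdFromPath, and portProxyPath are illustrative helper names, not existing project functions.

```typescript
// Hypothetical minimal Task shape; the project's real type has more fields.
interface Task {
  id: string;
  title: string;
}

// Shape described above: Task plus optional IDs for the auto-started session.
type CreateTaskResponse = Task & {
  session_id?: string;
  agent_execution_id?: string;
};

// Read the session ID straight from the create response; fail loudly if absent.
function sessionIdFrom(created: CreateTaskResponse): string {
  if (!created.session_id) {
    throw new Error(`task ${created.id} was created without an auto-started session`);
  }
  return created.session_id;
}

// The /t/:id URL carries the TASK ID. Returns null for any other path.
function taskIdFromPath(pathname: string): string | null {
  const match = /^\/t\/([^/]+)\/?$/.exec(pathname);
  return match?.[1] ?? null;
}

// Build a session-scoped backend path; it takes the session ID, never the task ID.
function portProxyPath(sessionId: string, port: number, rest = ""): string {
  return `/port-proxy/${encodeURIComponent(sessionId)}/${port}/${rest}`;
}
```

Keeping the two ID kinds behind separately named helpers makes it harder to pass a task ID taken from the URL into a route that expects a session ID.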
Example:
import { test, expect } from "../../fixtures/test-base";
import { KanbanPage } from "../../pages/kanban-page";

test.describe("my feature", () => {
  test("does something", async ({ testPage, seedData, apiClient }) => {
    // Seed the precondition through the API.
    await apiClient.createTask(seedData.workspaceId, "Test Task", "Description");
    // Assert through the UI, using a page object.
    const kanban = new KanbanPage(testPage);
    await kanban.goto(seedData.workspaceId);
    await expect(kanban.taskCardByTitle("Test Task")).toBeVisible();
  });
});
Dev-first workflow
Before writing an E2E test, validate the feature works interactively using playwright-cli against a dev server. This gives a fast feedback loop — code changes are picked up by hot reload in ~1-2 seconds, no production rebuild needed. Once confirmed working, translate the interactions into a proper E2E test.
Start the dev environment
Multiple agents may run in parallel, so use random ports to avoid collisions. Fixture ports auto-offset from 18080 (backend) and 13000 (frontend) using E2E_PORT_OFFSET (derived from PID % 30 by default) — stay outside those ranges. Parallel E2E test runs are safe by default.
OFFSET=$((RANDOM % 100))
BACKEND_PORT=$((19000 + OFFSET))
FRONTEND_PORT=$((14000 + OFFSET))
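The port arithmetic can be sketched as follows. This is a simplified model under stated assumptions, not the fixture's actual code: fixturePorts, devPorts, and isOutsideDefaultFixtureRange are hypothetical names, and any per-worker increment the fixture may add on top of the offset is not modelled.

```typescript
const FIXTURE_BACKEND_BASE = 18080;
const FIXTURE_FRONTEND_BASE = 13000;
const FIXTURE_OFFSET_MODULUS = 30; // default offset is PID % 30

interface PortPair {
  backend: number;
  frontend: number;
}

// Assumed model of the fixture: an explicit E2E_PORT_OFFSET wins, else PID % 30.
function fixturePorts(pid: number, envOffset?: string): PortPair {
  const parsed = envOffset === undefined ? NaN : Number.parseInt(envOffset, 10);
  const offset = Number.isNaN(parsed) ? pid % FIXTURE_OFFSET_MODULUS : parsed;
  return {
    backend: FIXTURE_BACKEND_BASE + offset,
    frontend: FIXTURE_FRONTEND_BASE + offset,
  };
}

// Dev-server ports, mirroring the shell snippet: 19000+ and 14000+ with a 0..99 offset.
function devPorts(randomOffset: number): PortPair {
  const offset = Math.abs(Math.trunc(randomOffset)) % 100;
  return { backend: 19000 + offset, frontend: 14000 + offset };
}

// True when a pair avoids the default (PID-derived) fixture ranges only.
function isOutsideDefaultFixtureRange(ports: PortPair): boolean {
  const inRange = (port: number, base: number) =>
    port >= base && port < base + FIXTURE_OFFSET_MODULUS;
  return (
    !inRange(ports.backend, FIXTURE_BACKEND_BASE) &&
    !inRange(ports.frontend, FIXTURE_FRONTEND_BASE)
  );
}
```

For example, a runner with PID 12345 would land on offset 15 under this model, while the dev ports from the shell snippet stay in 19000-19099 and 14000-14099.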
Start the backend:
E2E_TMP=$(mktemp -d) && mkdir -p "$E2E_TMP/.kandev" && \
printf '[user]\n name = E2E Test\n email = e2e@test.local\n[commit]\n gpgsign = false\n' > "$E2E_TMP/.gitconfig" && \
HOME="$E2E_TMP" KANDEV_HOME_DIR="$E2E_TMP/.kandev" KANDEV_SERVER_PORT=$BACKEND_PORT \
KANDEV_DATABASE_PATH="$E2E_TMP/kandev.db" KANDEV_MOCK_AGENT=only \
KANDEV_MOCK_GITHUB=true KANDEV_DOCKER_ENABLED=false KANDEV_WORKTREE_ENABLED=false \
KANDEV_LOG_LEVEL=warn apps/backend/bin/kandev &
Start the dev frontend:
KANDEV_API_BASE_URL=http://localhost:$BACKEND_PORT NEXT_PUBLIC_KANDEV_API_PORT=$BACKEND_PORT \
pnpm --filter @kandev/web dev --port $FRONTEND_PORT &
Validate with playwright-cli
playwright-cli open http://localhost:$FRONTEND_PORT
playwright-cli snapshot # see page structure and element refs
playwright-cli click e5 # interact using refs from snapshot
playwright-cli fill e3 "test input"
playwright-cli snapshot # verify result
Fast iteration cycle
- Make a code change in apps/web/
- HMR picks it up in ~1-2 seconds
- Run playwright-cli snapshot or playwright-cli screenshot to verify
- Repeat until the flow works correctly
Translate to E2E test
Once validated, write the Playwright test using project fixtures and page objects. The playwright-cli interactions map directly to Playwright API calls:
| playwright-cli | Playwright API |
|---|---|
| playwright-cli click e5 | page.getByTestId('...').click() |
| playwright-cli fill e3 "text" | page.getByTestId('...').fill('text') |
| playwright-cli snapshot (verify element visible) | expect(page.getByTestId('...')).toBeVisible() |
Use data-testid selectors in the test (not snapshot refs), and wrap common flows in page objects.
Capture PR evidence
After confirming the feature works, capture screenshots or a video as proof for the PR:
# Screenshots of key states
playwright-cli screenshot --filename=apps/web/.pr-assets/feature-before.png
# ... interact to show the feature ...
playwright-cli screenshot --filename=apps/web/.pr-assets/feature-after.png
# Or record a video walkthrough
playwright-cli video-start apps/web/.pr-assets/feature-demo.webm
# ... perform the user flow ...
playwright-cli video-stop
Create apps/web/.pr-assets/manifest.json so the /pr skill picks them up:
{
"assets": [
{"name": "feature-demo", "file": "feature-demo.webm", "format": "gif", "caption": "Feature demo"},
{"name": "feature-after", "file": "feature-after.png", "format": "png", "caption": "Result"}
]
}
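A quick shape check before handing the manifest to /pr might look like the sketch below. The four required fields are inferred from the example manifest only; the schema the /pr skill actually accepts may differ, and validateManifest is a hypothetical helper.

```typescript
// Fields inferred from the example manifest; treat this list as an assumption.
const REQUIRED_FIELDS = ["name", "file", "format", "caption"] as const;

// Returns a list of human-readable problems; an empty list means the shape looks fine.
function validateManifest(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["manifest is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["manifest must be a JSON object"];
  }
  const assets = (parsed as { assets?: unknown }).assets;
  if (!Array.isArray(assets)) {
    return ['manifest must have an "assets" array'];
  }
  const problems: string[] = [];
  assets.forEach((asset: unknown, index: number) => {
    const record = (typeof asset === "object" && asset !== null ? asset : {}) as Record<string, unknown>;
    for (const field of REQUIRED_FIELDS) {
      const value = record[field];
      if (typeof value !== "string" || value.length === 0) {
        problems.push(`assets[${index}].${field} must be a non-empty string`);
      }
    }
  });
  return problems;
}
```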
Final verification
Always verify against the production build before finishing — dev mode can hide SSR/hydration issues:
playwright-cli close
# Kill dev server and backend
make build-web
cd apps && pnpm --filter @kandev/web e2e -- tests/path/to/test.spec.ts
Test organization
Tests are grouped by feature area in subdirectories under tests/. When creating a new test:
- Place it in the matching feature directory. A test for PR detection goes in pr/, a test for session resume goes in session/, etc.
- Merge related tests into the same file. Tests covering the same feature (e.g., git commit body and pre-hooks) belong in one file with separate test.describe blocks. Don't create a new file for each narrow scenario.
- Import paths from subdirectories use ../../ (e.g., from "../../fixtures/test-base").
- Standalone root files are allowed for truly cross-cutting tests that don't fit any group.
Test quality guidelines
- Test through the UI, not the API. E2E tests verify user-facing behavior. Don't write tests that only call the API and assert the response -- those are integration tests. Instead, navigate to the page, interact with UI elements, and assert what the user sees.
- Verify persistence with page reload. After changing a setting or creating data, reload the page (testPage.reload()) and assert the state is still correct. This catches hydration bugs and SSR/client mismatches.
- Seed via API, assert via UI. Use apiClient to set up preconditions quickly, but always verify the result by opening the page and checking the DOM.
- Scope terminal helpers to the active panel. Terminal/mobile helpers must avoid document-wide .xterm or terminal-xterm-host selectors because multiple terminal panels can be mounted at once. Scope locators through data-testid="terminal-panel" and prefer the visible or latest panel for page.evaluate helpers.
Debugging failures
Triage
When a test fails:
- Read the error output — the Playwright error message, expected vs. actual, and which locator timed out
- Read error-context.md from test-results/<test-name>/. It contains a YAML DOM snapshot showing exactly what was rendered. Search for expected elements, check if the page is in the right state (e.g., simple mode vs advanced mode). These files persist across runs, so always confirm timestamps (portable: ls -la e2e/test-results/.../error-context.md; or stat -c %y on Linux / stat -f %Sm on macOS) or rebuild + rerun the spec fresh before trusting the snapshot. A stale context from a previous failure mode will send you debugging the wrong bug.
- Read the failure screenshot from e2e/test-results/ to see what the page actually rendered
- Attach to the failure for deeper debugging using playwright-cli:
  cd apps && PLAYWRIGHT_HTML_OPEN=never pnpm --filter @kandev/web e2e -- tests/path.spec.ts --debug=cli &
  # Wait for "Debugging Instructions" with session name
  playwright-cli attach tw-<session>
  playwright-cli snapshot # inspect page state at failure point
  playwright-cli console # check for JS errors
  playwright-cli network # check API responses
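The stale-snapshot check from the triage list can be expressed as a small pure function. SnapshotFile and triageSnapshots are hypothetical names; the sketch assumes you record when the current run started and collect each snapshot's modification time yourself.

```typescript
// Hypothetical input shape: one entry per error-context.md found under test-results/.
interface SnapshotFile {
  path: string;
  modifiedAtMs: number; // e.g. taken from fs.statSync(path).mtimeMs
}

interface SnapshotTriage {
  fresh: string[];
  stale: string[];
}

// Split snapshots into those written during the current run and leftovers from earlier runs.
function triageSnapshots(files: SnapshotFile[], runStartedAtMs: number): SnapshotTriage {
  const result: SnapshotTriage = { fresh: [], stale: [] };
  for (const file of files) {
    const bucket = file.modifiedAtMs >= runStartedAtMs ? result.fresh : result.stale;
    bucket.push(file.path);
  }
  return result;
}
```

Anything that lands in the stale bucket predates the run being debugged and should not be trusted until the spec has been rerun.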
Classify and fix
| Category | Signals | Fast loop |
|---|---|---|
| Test logic | Wrong selector, wrong expected text, missing page object method | Fix test files, re-run immediately (no rebuild -- Playwright transpiles TS at runtime) |
| Frontend-only | Screenshot shows wrong UI, missing element, client error. API calls succeed. | Start dev server, fix with hot reload, verify with playwright-cli, then make build-web + re-run test |
| Backend | 500 errors, wrong API response, "Backend did not become healthy" | Fix Go code, make build-backend, re-run test |
Common issues
- "Backend did not become healthy": run make build-backend build-web, check with E2E_DEBUG=1
- "Cannot find module": run cd apps && pnpm install
- Port conflicts: backends use 18080+ and frontends use 13000+ (per worker), auto-offset by E2E_PORT_OFFSET (derived from PID). Set E2E_PORT_OFFSET=0 for deterministic ports
- Auto-started session never goes idle: for sessions started by the same call that creates them, the mock agent can finish before the client WS subscription registers, so a raw idleInput() visibility wait hangs. Use SessionPage.waitForChatIdle() instead; it reloads once and re-derives state from SSR.
- Flaky timeouts: never increase locator timeouts to fix flaky tests. If a locator times out, the root cause is almost always something else: a setup failure, missing navigation, race condition, or the element genuinely not rendering. Investigate why the element never appears instead of giving it more time. Note: infrastructure health timeouts (30s in fixtures/backend.ts) and overall test timeouts (60s in playwright.config.ts) are separate and should not be modified either.
- Screenshots on failure, video on first retry (CI)
Debugging CI shard failures
CI splits tests across 10 shards. To reproduce a specific shard locally:
# List which tests are in a shard
npx playwright test --config e2e/playwright.config.ts --shard=2/10 --list
# Run that shard locally (requires production build)
make build-backend build-web
cd apps/web && npx playwright test --config e2e/playwright.config.ts --shard=2/10
E2E tests run against the production build (next build), not dev mode. Always rebuild with make build-web (or pnpm --filter @kandev/web build) after code changes before running E2E tests locally.
# Unzip a shard's blob report from CI artifacts
unzip report-*.zip -d report-shard && cat report-shard/*.jsonl
When a CI shard fails, download its report-*.zip artifact and unzip it; the report is a *.jsonl event stream. Build a testId map by walking the events: test titles and locations come from the testBegin events, and final status plus duration come from the testEnd events. Match them by test id. This surfaces the slow-but-passing specs (the ones carrying timing markers in Playwright output), which never show up as outright failures but are latent flake risks. Specs whose duration approaches the 60s per-test timeout (defined in playwright.config.ts) are the flake candidates to harden, typically by converting raw chat-flow assertions to the waitForChatIdle() / expectChatResponseVisible() recovery helpers documented earlier in this file.
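The testId walk described above might be sketched like this. The event shape is an assumption made for illustration: real blob-report events carry more fields and may name or nest them differently, so adapt the field access to what the *.jsonl actually contains. The 0.75 "approaching the timeout" fraction is likewise an arbitrary illustrative threshold.

```typescript
// Hypothetical flattened event shape; verify against the real *.jsonl before relying on it.
interface ReportEvent {
  type: string; // "testBegin" or "testEnd" in this sketch
  testId: string;
  title?: string;
  location?: string;
  status?: string;
  durationMs?: number;
}

interface TestRecord {
  testId: string;
  title: string;
  location: string;
  status: string;
  durationMs: number;
}

const PER_TEST_TIMEOUT_MS = 60000; // 60s per-test timeout from playwright.config.ts

// Walk the event stream: testBegin supplies title/location, testEnd supplies status/duration.
function buildTestMap(jsonl: string): Map<string, TestRecord> {
  const records = new Map<string, TestRecord>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const event = JSON.parse(trimmed) as ReportEvent;
    if (event.type === "testBegin") {
      records.set(event.testId, {
        testId: event.testId,
        title: event.title ?? "",
        location: event.location ?? "",
        status: "unknown",
        durationMs: 0,
      });
    } else if (event.type === "testEnd") {
      const record = records.get(event.testId);
      if (record) {
        record.status = event.status ?? "unknown";
        record.durationMs = event.durationMs ?? 0;
      }
    }
  }
  return records;
}

// Passing tests whose duration is at least `fraction` of the per-test timeout, slowest first.
function slowPassingTests(records: Map<string, TestRecord>, fraction = 0.75): TestRecord[] {
  const threshold = PER_TEST_TIMEOUT_MS * fraction;
  return Array.from(records.values())
    .filter((r) => r.status === "passed" && r.durationMs >= threshold)
    .sort((a, b) => b.durationMs - a.durationMs);
}
```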
Flake triage: intrinsic race vs. contention
A test that flakes under parallel/sharded load is one of two things — decide which before touching it:
- Re-run it in a fresh, isolated container (or at minimum a single fresh worker), --retries=0, a few reps:
  pnpm e2e:docker --no-build -- --repeat-each=4 --workers=1 --retries=0 tests/path.spec.ts:LINE
  # or raw:
  pnpm exec playwright test --config e2e/playwright.config.ts --project=chromium --repeat-each=4 --workers=1 --retries=0 tests/path.spec.ts:LINE
  (On Apple Silicon, pnpm e2e:docker needs Colima + Rosetta: colima start --vz-rosetta; default QEMU segfaults the amd64 Go build. See apps/web/e2e/README.md.)
  - Flakes alone (fails some reps, fast): intrinsic race. Fix it (condition-correct wait, fix the actual race; not a timeout bump). E.g. a waitForRequest that times out the full window means the request never fired (a click swallowed during hydration); retry the action with await expect(async () => { ... }).toPass(), don't extend the timeout.
  - Passes clean AND fast alone (well under timeout): contention, not a defect. The wait is correct; the test just starved for CPU/IO under load. No code/test fix applies.
- Signature of contention, not a code path: two identical-config full runs giving different hard-fail counts (e.g. 0 vs 3). Same code + same config + different outcome ⇒ host oversubscription, not a bug. CI's isolated runners don't reproduce it; reduce local concurrency (2–3 shards, not 5+) for a clean signal.
- Caveat: don't flake-hunt with --repeat-each across many heavy specs in one long-lived worker. It exhausts per-worker resources (agentctl port range, memory) over a long run and manufactures false failures unrelated to the test. Use one fresh container per spec instead.
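The race-versus-contention decision above reduces to a small rule over the isolated reps. The sketch below is one way to encode it; classifyFlake is a hypothetical helper, the "well under timeout" cutoff of half the timeout is an arbitrary assumption, and the "inconclusive" verdict for slow-but-passing reps is an addition for cases the rule above does not cover.

```typescript
interface IsolatedRep {
  passed: boolean;
  durationMs: number;
}

type FlakeVerdict = "intrinsic-race" | "contention" | "inconclusive";

// Classify a test from its isolated, --retries=0 repetitions.
function classifyFlake(
  reps: IsolatedRep[],
  timeoutMs = 60000, // per-test timeout
  fastFraction = 0.5, // assumed meaning of "well under timeout"
): FlakeVerdict {
  if (reps.length === 0) return "inconclusive";
  // Any failure while running alone points at a real race in the test or the code.
  if (reps.some((rep) => !rep.passed)) return "intrinsic-race";
  // All green and fast alone: the earlier failures came from load, not from a defect.
  const allFast = reps.every((rep) => rep.durationMs < timeoutMs * fastFraction);
  return allFast ? "contention" : "inconclusive";
}
```

A verdict of "contention" means no test change applies; reduce local concurrency instead of editing the spec.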
Selector guidelines
- Prefer data-testid selectors over text-based locators. Text content can change when UI is updated (e.g., hiding a badge), breaking tests that match by text. Use getByTestId() or locator("[data-testid='...']") for stable targeting.
- Use page object methods like clickSessionChatTab() (stable data-testid) instead of sessionTabByText("1") (fragile text match) for session tabs.
- Dropdown menus can detach from the DOM when React re-renders the parent (e.g., WS events updating the sidebar). The openSidebarMenuAndClick() helper in session-page.ts retries the full open-click sequence on detachment; use this pattern for similar interactions.
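The retry-on-detachment pattern can be sketched independently of Playwright. This is not the implementation of openSidebarMenuAndClick(); retrySequence and its shouldRetry predicate are hypothetical, and how a detachment error is recognised is left to the caller.

```typescript
// Re-run the WHOLE sequence (open menu, then click) when a retryable error occurs,
// instead of retrying only the step that failed.
async function retrySequence<T>(
  sequence: () => Promise<T>,
  shouldRetry: (err: unknown) => boolean,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown = new Error("retrySequence: maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await sequence();
    } catch (err) {
      // Unrelated failures surface immediately; only retryable ones loop.
      if (!shouldRetry(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

Retrying the full sequence matters because a re-render invalidates the opened menu as well as the item inside it, so repeating only the click would act on a menu that no longer exists.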
TDD workflow
Follow /tdd when writing E2E tests:
- RED: Write the spec, run it, watch it fail (missing data-testid, feature not implemented, etc.)
- GREEN: Implement the feature/fix, add data-testid attributes, run the test until green
- REFACTOR: Extract page objects, clean up selectors, keep tests green
- Run /verify when done