Playwright e2e library — design
Companion to epic gm-5v8v. Records the architecture decisions made in conversation so future contributors can read this doc instead of re-deriving the answers.
The library lives at testing/e2e/ (filed by gm-5v8v.1). The
single-script driver that ran for gm-0i0d
(scripts/e2e/hello-world.test.mjs) was migrated into
testing/e2e/specs/integration/dispatch-chain.spec.ts by gm-5v8v.15;
the script directory has been removed.
1. Library, not monolith
Decision. Build the e2e suite as a library of small, opinionated specs grouped by tier and surface. Reject “one big spec file” or “one driver per ticket”.
Reasoning. The SPA exposes ~18 distinct routes / surfaces (Board, Backlog, Grid, Graph, Sessions, Agents, Escalations, Capabilities, Health, Mail, plus drawers and the global chrome). Each route has on the order of 10–30 distinct interactions worth pinning. Multiplied out — and again across the fake + deep backend axis — the suite trends to ~300 specs. At that size, copy-paste between specs (auth setup, fixture seeding, navigation, waits) calcifies faster than the surface evolves; a refactor of any shared step touches every file. The scaffolding wins are:
- Selector reuse through page object models (POMs) — see §8.
- Data setup reuse through builders — see §9.
- Backend-axis reuse through fixtures — see §4.
A library also lets us tag specs by tier and grep-filter at the project level (§7), which is the only sane way to keep the PR-fast lane fast.
2. Surface inventory — implemented vs spec’d-pending
Decision. Specs cover both the surfaces that exist today and
the surfaces the ui-spec ratifies but hasn’t shipped yet. The
latter live under testing/e2e/specs/pending/.
Why pending specs. Two reasons:
- Spec-first lock-in. Writing the e2e for a surface before it ships pins the contract — the implementer can't quietly rename a `data-testid` or change a hotkey without breaking the pending spec, which they must then fix.
- Visible work. A pending spec is a checklist of "this surface is on the roadmap and here's what 'done' looks like". Closing a feature bead means moving its spec out of `pending/` into the appropriate tier directory.
Mechanic. Pending specs are wrapped in test.fixme(...) with a
short rationale (fixme: gm-XXX — feature not shipped). Playwright
reports them but doesn’t fail. CI lanes can opt them in via
--grep-invert "@pending" (default) or --grep "@pending" (nightly
audit).
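A minimal pending spec might look like the following sketch. The bead id stays as the source's `gm-XXX` placeholder, and the route and `data-testid` are hypothetical — the real ones land with the feature:

```typescript
import { test, expect } from '@playwright/test';

// fixme: gm-XXX — feature not shipped. Moves out of pending/ into its tier
// directory when the surface lands. The @pending tag in the title is what the
// --grep / --grep-invert lane filters match on.
test.fixme('escalations route lists open escalations @pending', async ({ page }) => {
  await page.goto('/escalations');
  // Hypothetical testid; the implementer supplies the real one per §8.
  await expect(page.getByTestId('escalations-list')).toBeVisible();
});
```

Playwright reports the spec as "fixme" without executing the body, so the sketch can reference controls that don't exist yet.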
Implemented routes today (as of 2026-04-25):
| Route | Implementation file | Tier coverage |
|---|---|---|
| `/board` | `web/src/pages/BoardPage.tsx` | smoke / board |
| `/backlog` | `web/src/pages/BacklogPage.tsx` | smoke / route |
| `/grid` | `web/src/pages/GridPage.tsx` | smoke / grid |
| `/graph` | `web/src/pages/GraphPage.tsx` | smoke / graph |
| `/sessions` | `web/src/pages/SessionsPage.tsx` | smoke / sessions |
| `/agents` | `web/src/pages/AgentsPage.tsx` | smoke / sessions |
| `/escalations` | placeholder | smoke / pending |
| `/capabilities` | placeholder | smoke / pending |
| `/health` | `web/src/pages/placeholders.tsx` | smoke |
| `/mail` | placeholder (gated) | smoke / pending |
The drawer and dialog surfaces (WorkItemDrawer at ~70KB, EpicDrawer, AgentDetailDrawer, JsonlImportDialog, NewSessionDialog, EscalationPanel) are nested but tested as their own tier (§3). They don’t have routes; the test navigates to a parent surface and opens them.
3. Tier taxonomy
Decision. Group specs by tier, not by route. Each tier expresses a different kind of confidence the suite is buying.
| Tier | What it pins |
|---|---|
| smoke | Every route mounts, no console errors, axe accessibility scan passes. |
| chrome | Sidebar / Topbar / Command palette / Hotkeys / AdaptorBanner — global UI. |
| route | Per-page interactions: filter chips, search, drawers, detail drill-in. |
| realtime | SSE `/events` + `/api/adaptors/stream` drive UI updates without refresh. |
| modes | Workspace-mode (unsupervised / supervised / managed) confirmation UX. |
| error | 4xx / 5xx surfaces, adaptor-degraded banner, network-flake recovery. |
| integration | Multi-route flows: dispatch a session from the board, observe in sessions. |
Why these seven and not, say, ten. Each tier has a distinct CI lane policy (§7) and a distinct rate of churn: smoke is stable, realtime is flaky, integration is slow. Splitting a tier costs a CI lane; merging two tiers loses the policy distinction. The seven were chosen so each one has a self-evident “did this break?” signal.
Directory layout.
```
testing/e2e/specs/
  smoke/        # routes load, no console errors, axe
  chrome/       # global UI components
  board/        # /board specifics
  grid/         # /grid specifics
  graph/        # /graph specifics
  drawers/      # WorkItemDrawer / EpicDrawer / AgentDetailDrawer
  newproject/   # /onboard deterministic setup, conversation, ratify handoff
  sessions/     # /sessions + /agents
  realtime/     # SSE-driven UI invalidation
  modes/        # workspace-mode confirmation UX
  auth/         # cookie / bearer / login redirect
  error/        # 4xx / 5xx / degraded-adaptor surfaces
  integration/  # multi-route flows
  pending/      # spec'd but not shipped (test.fixme)
```

A spec lives in one tier folder. Cross-cutting concerns (auth, hotkeys) appear as helpers, not as a tier.
4. Backend axis: fake vs deep
Decision. Each spec runs against one of two backend modes, named after the fixture:
- `fake` — `gemba serve` mounted with the in-memory `testadaptors.FakeWorkPlane`. State resets sub-second between workers; parallelizes safely; runs on every PR.
- `deep` — real `gemba serve` against a real Dolt server, with the bd adaptor pointed at a real `.beads/` database. Serializes to ≤2 workers; runs on merge / nightly.
Both modes use the same spec source. A spec doesn’t care which
backend it runs against unless it asserts on a behaviour that only
exists in one mode (a real bd mutation, an SSE round-trip across
the bd→dolt→hub pipeline). Those specs are tagged @deep (§5).
Mechanic — Playwright project matrix. playwright.config.ts
declares one project per (tier × backend) pair. The fixture wired
into each project’s use: block selects the backend:
```typescript
type Backend = 'fake' | 'real';

export const test = base.extend<{ backend: Backend; baseURL: string }>({
  backend: ['fake', { option: true, scope: 'worker' }],
  baseURL: async ({ backend }, use, info) => {
    if (backend === 'fake') {
      // Spawn gemba serve --workplane=fake on a free port.
      await use(await spawnFakeServer(info.workerIndex));
    } else {
      // Lease a Dolt DB + worktree pair from the worker pool.
      await use(await leaseDeepServer(info.workerIndex));
    }
  },
});
```

Each project sets `backend: 'fake'` or `backend: 'real'`. The same test file mounts in two projects with no per-spec branching — the fixture switch carries it.
Why two modes and not one. Pure fake runs sub-second and is useful for tens of repetitions per change, but it doesn’t pin the real bd→dolt→hub pipeline. Pure deep is honest but takes minutes per spec and serializes hard. The split lets us pay for honesty only where the spec asserts on it (§5).
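A sketch of what the project-matrix declaration could look like in `playwright.config.ts`, abridged to two tiers. Project names follow the tier@backend convention; the directory paths and the `TestOptions` wiring are assumptions about the scaffold, not shipped code:

```typescript
// playwright.config.ts sketch (abridged): one project per (tier × backend)
// pair. `backend` is the custom worker option declared by the fixture file.
import { defineConfig } from '@playwright/test';

type TestOptions = { backend: 'fake' | 'real' };

export default defineConfig<TestOptions>({
  projects: [
    // Fake matrix: one project per tier, fully parallel, every PR.
    { name: 'smoke@fake', testDir: './specs/smoke', use: { backend: 'fake' } },
    { name: 'board@fake', testDir: './specs/board', use: { backend: 'fake' } },
    // Deep matrix: same spec directories, @deep-tagged specs only (§5),
    // run with the small serialized worker pool (§6).
    { name: 'smoke@deep', testDir: './specs/smoke', grep: /@deep/, use: { backend: 'real' } },
    { name: 'board@deep', testDir: './specs/board', grep: /@deep/, use: { backend: 'real' } },
  ],
});
```

Lanes then select projects by name (`--project=smoke@fake`), which is what makes the CI mapping in §7 a pure configuration concern.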
5. The @deep tagging rule
Rule. Tag a spec @deep if it asserts on a backend response
the fake doesn’t faithfully reproduce. Concretely:
- Spec issues a write (POST /api/work-items, PATCH, POST /sessions) AND inspects the persisted side-effect (re-read, list filter, SSE event landing) → `@deep`.
- Spec drives an SSE round-trip end-to-end (issue mutation in tab A, observe SSE invalidation in tab B) → `@deep`.
- Spec exercises adaptor-specific behaviour (`bd close` translates to `state_category=completed`; explicit Beads-read-only mode returns 405; Dolt URL writes persist through SQL) → `@deep`.
- Spec asserts an empty-state envelope when no adaptor is bound → not `@deep`. The fake handles that.
- Spec asserts on UI rendering of a fixture-supplied list → not `@deep`. The fake handles that.
Mechanic. Tags ride in the Playwright test() title:

```typescript
test('drag epic into In Progress dispatches a session @deep', async () => { … });
```

The deep projects' grep filter is /@deep/; the fake projects run the full suite with no exclusion filter. A spec without @deep therefore runs in the fake matrix only. A spec with @deep runs in both matrices — running it in fake ensures the test itself isn't broken before the deep matrix invests minutes proving the backend.
Anti-pattern. Don’t tag @deep defensively — the deep matrix
budget is small. If a spec doesn’t assert on a backend response
that fake misrepresents, dropping @deep is correct.
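The matrix-membership rule reduces to a small pure function over the spec title. A sketch (helper name hypothetical — the real mechanism is the project grep filters, not a runtime check):

```typescript
// Hypothetical restatement of the rule: every spec runs in the fake matrix;
// a spec additionally runs in the deep matrix iff its title carries @deep.
type Matrix = 'fake' | 'deep';

function matricesFor(title: string): Matrix[] {
  return /@deep\b/.test(title) ? ['fake', 'deep'] : ['fake'];
}
```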
6. Worker isolation in deep mode
Problem. Deep mode’s “real Dolt + real bd” stack is not a static fixture; each worker’s spec mutates the database. Two workers sharing one Dolt DB will trample each other.
Decision. In deep mode, each Playwright worker leases a private (Dolt database, beads worktree) pair from a small pool. The pool size matches the configured worker count (≤2 in CI today).
Mechanic.
- Dolt DB namespacing. The deep harness pre-creates databases `e2e_w0`, `e2e_w1`, … one per worker slot, in the local Dolt server (port 3307). Each worker's `gemba serve` is started with `--dolt-db=e2e_w<workerIndex>`.
- Worktree namespacing. Each worker also gets a private worktree at `/tmp/gemba-e2e/w<workerIndex>/` so a spec that spawns a session doesn't collide with a sibling worker's session.
- Reset between specs. Each worker truncates its private tables (or `dolt sql -q "DROP TABLE …; CREATE TABLE …"`) in a `test.beforeEach`. Between-spec reset is faster than between-worker tear-down and lets specs assume a clean slate.
- Cleanup. Worktrees are removed in `afterAll`; Dolt databases persist across CI runs (creation is the slow step) and are truncated, not dropped.
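The namespacing scheme above is mechanical enough to sketch as pure functions (helper names hypothetical; the real harness wiring belongs to the scaffold ticket):

```typescript
// Hypothetical helpers mirroring the per-worker lease naming:
// database e2e_w<i> on the local Dolt server, worktree /tmp/gemba-e2e/w<i>/.
function doltDbFor(workerIndex: number): string {
  return `e2e_w${workerIndex}`;
}

function worktreeFor(workerIndex: number): string {
  return `/tmp/gemba-e2e/w${workerIndex}/`;
}

// Argument sketch: each worker's gemba serve gets its private database.
function serveArgs(workerIndex: number): string[] {
  return ['serve', `--dolt-db=${doltDbFor(workerIndex)}`];
}
```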
Anti-pattern. Do not point all workers at the production
local Dolt database. The CLAUDE.md “test pollution” warning (orphan
testdb_* / beads_t* rows) is exactly this failure mode at scale.
The lease pool keeps test data namespace-isolated AND survivable
across runs.
7. CI lane → project-matrix mapping
Decision. Three CI lanes, each binding a different subset of the (tier × backend) project grid:
| Lane | Trigger | Projects | Budget |
|---|---|---|---|
| PR-fast | every push to a PR | smoke@fake, chrome@fake, route@fake, error@fake | <2 min |
| merge | post-merge to main | PR-fast + realtime@fake, modes@fake, integration@fake, smoke@deep, chrome@deep | <8 min |
| nightly | scheduled | every project — full fake matrix + full deep matrix + pending/ audit | unlimited |
Why the lanes are layered, not parallel. Each lane is a strict superset of the previous one. A spec that fails PR-fast will also fail merge and nightly, so we don’t need a separate “what lane do I go in” decision per spec — the project tag does it.
Why deep doesn’t ride PR-fast. Two reasons:
- Cost. Deep mode serializes and a spec that takes 30 seconds in fake takes 2–5 minutes in deep (real bd subprocess spawn, real Dolt commit, real hub fan-out). Multiplying by spec count blows the PR feedback loop.
- Flakiness budget. Real bd + real Dolt has a non-zero rate of transient failures (Dolt server hiccups, file-locking races on git worktrees). Putting that on PR-fast forces every contributor to learn the deep mode failure modes; putting it on merge means the merge-queue triage stays in one team’s head.
The escape hatch. A [deep] opt-in tag in a PR title forces
PR-fast to include the deep matrix. Used when a PR touches the bd
adaptor directly.
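The layering invariant — each lane a strict superset of the previous — can be stated directly. A sketch, with the project lists copied from the table above (nightly abridged to "everything", as the table says):

```typescript
// Lane → project mapping, from the CI lane table. The superset property is
// what makes per-spec lane assignment unnecessary: the project tag decides.
const lanes: Record<string, string[]> = {
  'pr-fast': ['smoke@fake', 'chrome@fake', 'route@fake', 'error@fake'],
  merge: [
    'smoke@fake', 'chrome@fake', 'route@fake', 'error@fake',
    'realtime@fake', 'modes@fake', 'integration@fake',
    'smoke@deep', 'chrome@deep',
  ],
};

function isSuperset(big: string[], small: string[]): boolean {
  return small.every((p) => big.includes(p));
}
```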
8. POM strategy + data-testid selectors
Decision. Each route gets a Page Object Model in pages/. The
POM exposes high-level methods (board.dragCardTo(...),
grid.openImportDialog()) that internally select on
data-testid attributes. Tests do not call page.locator(...)
on raw CSS selectors.
Why data-testid. Three reasons:
- Already there. The SPA already exposes ~160+ `data-testid` attributes (sample: `grep -rn "data-testid" web/src`). Every new surface lands one alongside the markup. The cost to switch to another scheme is ≥1 day of mechanical refactoring with zero product gain.
- Refactor-stable. A `data-testid` survives Tailwind class churn, role/aria refactors, and CSS-in-JS rewrites. A `getByRole('button', { name: 'Save' })` selector breaks the day someone renames the button.
- Searchable. `grep "grid-import-jsonl"` instantly answers "what test depends on this control?" — a property neither role-based nor structural selectors give cheaply.
The contract. New UI MUST land a data-testid on every control
the spec author would target: buttons, modals, dialog wrappers,
input fields, list rows, status pills. The naming convention is
<surface>-<purpose> (e.g. grid-import-jsonl, board-column-1,
drawer-close). Nested testids stack: grid-import-summary lives
inside grid-import-dialog.
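The `<surface>-<purpose>` convention is mechanical enough that POMs could build testids through a tiny helper rather than string literals. A sketch (the helper is hypothetical, not part of the contract):

```typescript
// Hypothetical helper for the <surface>-<purpose> naming convention.
// Extra parts handle parameterized ids like board-card-<id>.
function testId(surface: string, ...parts: string[]): string {
  return [surface, ...parts].join('-');
}
```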
POM shape (sketch).
```typescript
export class BoardPage {
  constructor(private page: Page) {}

  async goto() {
    await this.page.goto('/board');
  }

  column(name: 'Backlog' | 'In Progress' | 'Done') {
    return this.page.getByTestId(`board-column-${name}`);
  }

  card(id: string) {
    return this.page.getByTestId(`board-card-${id}`);
  }

  async dragCardTo(cardId: string, columnName: string) { … }
}
```

Specs read like prose, not Playwright:

```typescript
const board = new BoardPage(page);
await board.goto();
await board.dragCardTo('gm-foo', 'In Progress');
await expect(board.card('gm-foo')).toBeVisible();
```

When NOT to add a POM method. If a method would be called by exactly one spec, inline the testid in the spec. POMs exist to deduplicate, not to ceremonially wrap every selector.
9. Builders, not fixture-as-data
Decision. Test setup uses builder functions that produce WorkItems / Escalations / Agents / Sessions, not static JSON fixtures.
Why. Static JSON fixtures (a fixtures/work-items.json file
listing 50 beads) have three failure modes:
- Drift. Adding a required field to `WorkItem` breaks every fixture file at once, and the failure points at the JSON, not at the spec that produces the failing scenario.
- Inflexibility. A spec that needs "any bead in the `started` state with priority 2" picks one from the file and becomes coupled to that particular row's other fields.
- Discovery. `grep -l "gm-fixture-7"` doesn't tell you what that bead represents semantically. A builder call `bead({ state_category: 'started', priority: 2 })` does.
Builder shape.
```typescript
export function bead(patch: Partial<WorkItem> = {}): WorkItem {
  return {
    id: nextId(),
    kind: 'task',
    title: 'fixture',
    state_category: 'unstarted',
    status: 'open',
    created_at: nowISO(),
    updated_at: nowISO(),
    ...patch,
  };
}
```

Specs compose:

```typescript
const epic = bead({ kind: 'epic', state_category: 'started' });
const child = bead({
  kind: 'task',
  relationships: [{ kind: 'parent_child', from: epic.id, to: '' }],
});
await server.seed([epic, child]);
```

Anti-pattern. Don't add a `seedTen()` helper that returns ten
arbitrary beads. Specs that need “ten beads” either care about a
specific ten (use builders) or they’re really testing pagination
(use Array.from({ length: 10 }, () => bead()) inline). The
intermediate “named fixture set” surface gets stuck between the two.
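The pagination case can be made concrete. A self-contained sketch, with a minimal `WorkItem` and `bead` standing in for the real builder above (the `gm-t<n>` id scheme is invented for the example):

```typescript
// Minimal stand-ins for the builder sketch, so the example runs alone.
type WorkItem = { id: string; kind: string; state_category: string };

let counter = 0;
const bead = (patch: Partial<WorkItem> = {}): WorkItem => ({
  id: `gm-t${++counter}`, // hypothetical id scheme for the example
  kind: 'task',
  state_category: 'unstarted',
  ...patch,
});

// "Really testing pagination": ten anonymous beads, built inline.
// No named fixture set — the spec says exactly what it needs and no more.
const beads = Array.from({ length: 10 }, () => bead());
```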
Open questions — to be resolved in scaffold (gm-5v8v.1)
These are intentionally not pinned here because the scaffold ticket will inherit them. Listed for visibility:
- Dolt DB lease pool sizing in CI. Today’s plan is ≤2 deep workers. Whether to push this to 3–4 depends on Dolt server steady-state CPU on the CI runners, which we don’t have a measurement for yet.
- Auth fixture shape. Fake mode runs `auth=open` by default; deep mode needs the token + cookie paths tested. Whether the auth tier is its own project axis or a per-spec setup function is a judgement call the scaffold will make.
- Pending-spec discoverability. `test.fixme` works but doesn't link back to the bead that ungates it. We may want a custom reporter that prints "5 pending specs ungated by gm-XXX" alongside each lane's results. Out of scope for the design but cheap to add later.
References
- Epic: gm-5v8v
- Migrated dispatch driver: `testing/e2e/specs/integration/dispatch-chain.spec.ts` (gm-5v8v.15)
- Conformance audit pattern (mirrors the tier idea on the Go side): `testing/runner.go`
testing/runner.go - ui-spec / surface inventory bead: gm-p27 (closed gate)