Acceptance test — native + gastown end-to-end builds of a target SPA via beads
D15 — Acceptance: native + gastown end-to-end builds of a target SPA via beads
Status: Active implementation. This doc remains the contract for the gm-root.27 acceptance harness, but the code now includes the shared Playwright body, mock-backed CI default, real native-agent opt-in, demo-mode video capture, narration JSON, and screenshot artifacts. Keep this document in sync with testing/acceptance/temperature-spa/README.md and the shared spec.

Decision: gm-1avi (D15)

Implementation epic: gm-root.27

Sub-decisions: D16 (gm-lpcn), D17 (gm-xw8a), D18 (gm-bvlm), D19 (gm-zdik), D20 (gm-l38m)
1. Goal
Implement a headless, fully-autonomous end-to-end acceptance test that:
- Bootstraps a fresh Gemba project on disk (ephemeral, hermetic).
- Imports a milestone/epic/bead JSONL pack that drives the agent through three milestones: M1 (scaffolding), M2 (Hello world MVP), M3 (full conversion table).
- Configures a pool via the SPA’s /settings/pools editor (Playwright drives the UI).
- Lets the autodispatch daemon route work to a pool of acceptance-engineer agents, who claim beads and complete them autonomously.
- Injects a synthetic escalation between M2 and M3 to exercise the triage path; resolves it via the SPA escalation surface.
- Builds and serves the target SPA after each MVP; Playwright validates the rendered output.
- On any oracle failure, files a bug bead in the gemba rig (NOT the target rig) with full reproduction context.
- Writes a structured report (JSON + markdown) capturing the run.
Three variants ship together, plus an execution mode for demos:
- Native (default CI): runs through --orchestration=mock so the dispatch loop, escalation path, build/serve gates, and oracle run deterministically without model credentials.
- Native real-agent (opt-in): runs through the native orchestrator with GEMBA_ACCEPTANCE_REAL_AGENTS=1. The default real-agent path is Codex via gemba-codex-driver; set GEMBA_ACCEPTANCE_AGENT=claude to use Claude Code.
- Gastown (manual / scheduled; opt-in via env): pool reuse via gt sling across rigs.
- Demo mode (GEMBA_ACCEPTANCE_DEMO_MODE=1): records video, screenshots key moments, and emits narration JSON for the edited MP4.
The variants share the same Playwright spec body, target JSONL pack,
and oracle. They differ in pool scope (local vs <rig>), server flag
(--orchestration=mock, --orchestration=native, or
--orchestration=gastown), runner behavior (template mock vs real
agent), and bootstrap surfaces (none vs UI-driven gt rig create +
gt polecat create).
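The variant differences described above can be pictured as a small lookup. The following is only a sketch: the type names, function names, and the precedence between the two opt-in variables are assumptions for illustration, not the harness's actual code.

```typescript
// Sketch only: names here are hypothetical, not taken from the harness.
type Variant = "native-mock" | "native-real" | "gastown";
type Env = Record<string, string | undefined>;

interface VariantSettings {
  orchestrationFlag: string; // flag passed to the gemba server
  poolScope: string;         // "local", or the rig name on gastown
  uiBootstrap: string[];     // gt commands driven through the UI before the run
}

// Resolve the variant from the opt-in environment variables named in this doc.
// Letting the gastown variable win over the real-agent one is an assumption.
function selectVariant(env: Env): Variant {
  if (env["GEMBA_ACCEPTANCE_RUN_GASTOWN"] === "1") return "gastown";
  if (env["GEMBA_ACCEPTANCE_REAL_AGENTS"] === "1") return "native-real";
  return "native-mock";
}

function settingsFor(variant: Variant, rigName: string): VariantSettings {
  switch (variant) {
    case "native-mock":
      return { orchestrationFlag: "--orchestration=mock", poolScope: "local", uiBootstrap: [] };
    case "native-real":
      return { orchestrationFlag: "--orchestration=native", poolScope: "local", uiBootstrap: [] };
    case "gastown":
      return {
        orchestrationFlag: "--orchestration=gastown",
        poolScope: rigName,
        uiBootstrap: ["gt rig create", "gt polecat create"],
      };
  }
}
```

With an empty environment, selectVariant falls through to the mock-backed native variant, which matches the CI default.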
2. Why first-class
A real acceptance test that exercises the full propulsion loop (operator files beads → daemon dispatches → pool member claims → agent works → bead-done emit → SessionReady → next bead) is the only way to catch regressions in the orchestration layer that unit tests can’t see. Today there’s:
- A smoke E2E for /settings/pools (gm-s47n.18): UI only.
- Conformance tests for adaptors (Group A–F): adaptor surface only.
- Per-component vitest: per-component only.
What’s missing: a test that exercises a complete operator-driven build, end to end, with both native and gastown agent backends. This decision establishes that test.
3. Foundation (what exists today)
Recently shipped, all reused:
| Bead | What it ships |
|---|---|
| gm-root.24 | gemba newproject auto-seeds agents.toml + personas + CLAUDE.md. |
| gm-s47n.10 (D13) | Session-pool primitive. |
| gm-s47n.11 | Native idle lifecycle: SessionReady + RecycleSession. |
| gm-s47n.12 | Autodispatch daemon wired into gemba serve. |
| gm-s47n.14 | /done slash command emits gemba-state bead-done (slash-command path; direct gt done still pending gm-s47n.20). |
| gm-s47n.16 | Pool config editor at /settings/pools (adaptor-aware). |
| gm-s47n.17 | RunCommandModal + gt mutation buttons in pool editor (gastown rig/polecat creation via UI). |
| gm-s47n.18 | Smoke E2E spec for /settings/pools editor (selector reference). |
| gm-e7.12 + gm-e7.13 | gt-adaptor capability probe + RecycleSession via gt handoff. |
| gm-root.26 | FTUX trio: onboarder CTA gate, session toast + LiveSessionsBadge, pool empty-state helper. (Used as bonus oracle assertions.) |
| gm-e3.8 | ClaimModel manifest gate + soft-skip on inline races. |
Future enhancement (does not block):
| Bead | Effect when it lands |
|---|---|
| gm-s47n.20 | Direct gt done invocations also emit bead-done. Acceptance-engineer persona’s done-skill pin to slash command can drop. |
4. MockAgentRunner architecture (D16, gm-lpcn)
The CI default. Deterministic. Uses no API tokens, makes no network calls. Tests the orchestration layer, not code-generation quality.
4.1 Contract
    interface AgentRunner {
      run(ctx: Context, beadID: string): Promise<void>;
      recycle(): Promise<void>;
      close(): Promise<void>;
    }

    interface AgentRunnerFactory {
      create(env: Env): AgentRunner;
    }

    // Factory returns mock by default; real-claude when env.GEMBA_ACCEPTANCE_REAL_AGENTS=1

4.2 Templates
The mock matches each claimed bead against a small library of template handlers. Each handler reads bead description frontmatter (template:, testid:, files:) and performs the mechanical work:
| Template | What it does |
|---|---|
| init-repo | git init, write package.json, write minimal vite.config.ts. |
| npm-install | npm install in temp project (uses pre-warmed offline cache). |
| write-component | Writes a TSX component matching testid from frontmatter. |
| write-test | Writes a vitest spec asserting the bead’s DoD. |
| build | npm run build, verifies dist/index.html exists. |
| serve | vite preview on injected port, verifies HTTP 200. |
| error-then-recover | Fails on first attempt, succeeds on second (exercises retry). |
| noop | No-op for milestone-marker beads. |
After running the template, the mock:
- Closes the bead via bd close <id>.
- Emits gemba-state bead-done --bead <id> (the same emit a real polecat does).
4.3 Determinism
- Bounded sleeps (max 200ms per template).
- Pinned random seed via NewMockAgentRunner({seed: 0xACCEPT}).
- No network: npm install uses --offline against a pre-warmed cache; vite build is local; vite preview is local.
4.4 Real-vs-mock factory
A single AgentRunnerFactory interface; NewAgentRunner(env) returns
the deterministic mock runner by default and a real native-agent runner
when GEMBA_ACCEPTANCE_REAL_AGENTS=1. The real path defaults to Codex
through gemba-codex-driver / codex_exec; set
GEMBA_ACCEPTANCE_AGENT=claude to exercise the same flow through Claude.
Both branches wrap the native adaptor spawn path rather than duplicating
session orchestration.
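A minimal sketch of the selection logic described here, assuming the two environment variables behave as stated. RecordingRunner is a hypothetical stand-in for all three runner kinds; in the harness the real branches would wrap the native adaptor spawn path, which this sketch does not attempt. Context is also a placeholder type.

```typescript
type Env = Record<string, string | undefined>;
type RunnerKind = "mock" | "codex" | "claude";

// Placeholder: the real Context type is not defined in this doc.
interface Context {
  runId: string;
}

interface AgentRunner {
  run(ctx: Context, beadID: string): Promise<void>;
  recycle(): Promise<void>;
  close(): Promise<void>;
}

// Mock unless real agents are opted in; Codex unless Claude is requested.
function chooseRunnerKind(env: Env): RunnerKind {
  if (env["GEMBA_ACCEPTANCE_REAL_AGENTS"] !== "1") return "mock";
  return env["GEMBA_ACCEPTANCE_AGENT"] === "claude" ? "claude" : "codex";
}

// Hypothetical stub: records which beads it was asked to run, nothing more.
class RecordingRunner implements AgentRunner {
  readonly kind: RunnerKind;
  readonly handled: string[] = [];

  constructor(kind: RunnerKind) {
    this.kind = kind;
  }

  async run(_ctx: Context, beadID: string): Promise<void> {
    this.handled.push(beadID);
  }

  async recycle(): Promise<void> {}

  async close(): Promise<void> {}
}

function newAgentRunner(env: Env): RecordingRunner {
  return new RecordingRunner(chooseRunnerKind(env));
}
```

Keeping the kind decision in a pure function makes the default (mock) easy to assert in a unit test without spawning anything.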
5. Target JSONL pack (D17, gm-xw8a)
Three milestones × ~4 task beads each = 12 task beads + 3 epic beads + 3 milestone beads + 2 internal decisions = ~20 beads in the target project.
5.1 Pack files
- testing/acceptance/temperature-spa/shared/target-jsonl/m1.jsonl: scaffolding milestone, epic, 3 tasks.
- testing/acceptance/temperature-spa/shared/target-jsonl/m2.jsonl: Hello world MVP milestone, epic, 4 tasks.
- testing/acceptance/temperature-spa/shared/target-jsonl/m3.jsonl: conversion table milestone, epic, 5 tasks.
5.2 M1 — Scaffolding
| Bead | Template | Files |
|---|---|---|
| M1.1 | init-repo | package.json, vite.config.ts |
| M1.2 | noop | (config copy: tsconfig.json, index.html, src/main.tsx) |
| M1.3 | npm-install | node_modules/ |
5.3 M2 — Hello world MVP
| Bead | Template | testid | Files |
|---|---|---|---|
| M2.1 | write-component | app-root | src/App.tsx |
| M2.2 | write-test | app-root | src/App.test.tsx |
| M2.3 | build | — | dist/ |
| M2.4 | serve | — | (preview server) |
5.4 M3 — Conversion table
| Bead | Template | testid | Files |
|---|---|---|---|
| M3.1 | write-component (pure fn) | — | src/temperatureRows.ts |
| M3.2 | write-component | temperature-table, row-{c} | src/TemperatureTable.tsx |
| M3.3 | write-test | — | src/TemperatureTable.test.tsx |
| M3.4 | write-component (replaces App.tsx) | app-root, temperature-table | src/App.tsx |
| M3.5 | build + serve | — | (rebuild + re-serve) |
5.5 Bead description frontmatter
Every task bead description begins with:
    template: write-component
    testid: temperature-table
    files: src/TemperatureTable.tsx

MockAgentRunner reads this, falls back to a keyword match on the title, and on no match files a synthetic escalation template_unknown.
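The lookup order above could be sketched as follows. This is an illustration under assumptions: the exact frontmatter syntax, the comma-separated files list, and the "title contains the template name" keyword rule are guesses, and all function names are hypothetical. A declared but unknown template is mapped straight to template_unknown, following section 11.1.

```typescript
const KNOWN_TEMPLATES = [
  "init-repo", "npm-install", "write-component", "write-test",
  "build", "serve", "error-then-recover", "noop",
];

interface Directive {
  template: string; // a known template name, or "template_unknown"
  testid?: string;
  files: string[];
}

// Read leading "key: value" lines; stop at the first line that is not one.
function parseFrontmatter(description: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of description.split("\n")) {
    const m = /^(template|testid|files):\s*(.+)$/.exec(line.trim());
    if (!m) break;
    fields[m[1]] = m[2].trim();
  }
  return fields;
}

function resolveTemplate(title: string, description: string): Directive {
  const fm = parseFrontmatter(description);
  const files = fm["files"] ? fm["files"].split(",").map((f) => f.trim()) : [];
  const testid = fm["testid"];
  const declared = fm["template"];

  if (declared !== undefined) {
    // Declared but unknown: escalate rather than guess from the title.
    const known = KNOWN_TEMPLATES.indexOf(declared) !== -1;
    return { template: known ? declared : "template_unknown", testid, files };
  }

  // Assumed keyword rule: first template whose name appears in the title.
  const lowered = title.toLowerCase();
  const hits = KNOWN_TEMPLATES.filter((t) => lowered.indexOf(t) !== -1);
  return { template: hits.length > 0 ? hits[0] : "template_unknown", testid, files };
}
```

A caller would treat a template_unknown result as the trigger for the synthetic escalation rather than as a handler name.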
5.6 Edges
- Each task depends-on its predecessor in the same milestone.
- Each epic parent_child to its milestone.
- Each task parent_child to its epic.
- M2 epic depends-on M1 milestone (so the daemon doesn’t dispatch M2 work until M1 closes).
- M3 epic depends-on M2 milestone.
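One way to read these edges is that a task is dispatchable only when its own depends-on targets are closed and no ancestor (epic, milestone) is still waiting on a depends-on target. The sketch below encodes that reading. The Bead shape and the inheritance-through-parent rule are assumptions for illustration; the planner's actual dispatch policy may differ.

```typescript
type Kind = "milestone" | "epic" | "task";

// Hypothetical in-memory shape; the JSONL schema is not specified here.
interface Bead {
  id: string;
  kind: Kind;
  closed: boolean;
  parent?: string;     // parent_child edge target
  dependsOn: string[]; // depends-on edge targets
}

function readyTasks(beads: Bead[]): string[] {
  const byId: Record<string, Bead> = {};
  for (const b of beads) byId[b.id] = b;

  // True when every depends-on target exists and is closed.
  const depsClosed = (b: Bead): boolean =>
    b.dependsOn.every((id) => byId[id] !== undefined && byId[id].closed);

  // Blocked if the bead or any ancestor has an unclosed depends-on target.
  // Assumes parent links are acyclic.
  const blocked = (b: Bead): boolean => {
    let cur: Bead | undefined = b;
    while (cur !== undefined) {
      if (!depsClosed(cur)) return true;
      cur = cur.parent !== undefined ? byId[cur.parent] : undefined;
    }
    return false;
  };

  return beads
    .filter((b) => b.kind === "task" && !b.closed && !blocked(b))
    .map((b) => b.id);
}
```

Under this reading, an open M1 milestone keeps every M2 task out of the ready set through the M2 epic's depends-on edge, even if the task itself has no pending predecessor.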
6. Oracle (D18, gm-bvlm)
6.1 Per-milestone gates
M1 oracle:
- All M1 beads → closed.
- On disk: package.json, vite.config.ts, tsconfig.json, index.html, src/main.tsx, node_modules/.
- npm run dev exits 0 (or starts and listens; SIGTERM after liveness probe).
M2 oracle:
- All M2 beads → closed.
- npm run build exits 0; dist/index.html exists.
- vite preview (or equivalent) on injected port responds 200 at /.
- Playwright loads the URL, asserts [data-testid="app-root"] with textContent "Hello world".
- npm test (vitest) passes.
M3 oracle:
- All M3 beads → closed.
- Build + preview gates as M2.
- Playwright loads URL.
- [data-testid="temperature-table"] exists.
- Exactly 16 [data-testid^="row-"] elements.
- For each row, the °F cell equals (c * 9/5 + 32).toFixed(1) where c is the row’s celsius value parsed from the testid.
- Cherry-pick assertions:
  - row-0 → 32.0
  - row-100 → 212.0
  - row-300 → 572.0
- npm test (vitest) passes.
6.2 Numeric tolerance
Strict-numeric: exact match on integer-valued cells (0, 32, 212, 572). One decimal place tolerance for fractional cells (e.g., row-20 → 68.0). The oracle compares formatted strings, not floats — locale and rounding stay reproducible.
6.3 Failure paths
Each failure type files a separate bug bead via .19 (bug-filing helper):
| Failure | Bug title | Severity |
|---|---|---|
| npm run build non-zero | ”target build failed” + log tail | CRITICAL |
| Test runner failed | ”target tests failed: | HIGH |
| Numeric mismatch | ”oracle mismatch row-{c}: expected X, got Y” | HIGH |
| Row count mismatch | ”row-count: expected 16, got N” | HIGH |
| Missing testid | ”missing testid: {testid}“ | HIGH |
| Preview server unreachable | ”preview server unreachable on port N” | CRITICAL |
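The M3 numeric gate and the two row-related bug titles from the table can be sketched as a pure check over already-extracted rows. RenderedRow and checkRows are hypothetical names; in the real spec the rows would come from Playwright, and the wording used for an unparseable testid is this sketch's own.

```typescript
// Hypothetical shape for one row scraped from the rendered table.
interface RenderedRow {
  testid: string;         // e.g. "row-100"
  fahrenheitText: string; // text of the °F cell as rendered
}

const EXPECTED_ROW_COUNT = 16;

// Same formula as the oracle: compare formatted strings, not floats.
function expectedFahrenheit(celsius: number): string {
  return (celsius * 9 / 5 + 32).toFixed(1);
}

// Returns one failure title per problem; an empty array means the gate passes.
function checkRows(rows: RenderedRow[]): string[] {
  const failures: string[] = [];
  if (rows.length !== EXPECTED_ROW_COUNT) {
    failures.push(`row-count: expected ${EXPECTED_ROW_COUNT}, got ${rows.length}`);
  }
  for (const row of rows) {
    const celsius = Number(row.testid.replace(/^row-/, ""));
    if (isNaN(celsius)) {
      failures.push(`oracle mismatch ${row.testid}: celsius value not parseable`);
      continue;
    }
    const expected = expectedFahrenheit(celsius);
    if (row.fahrenheitText !== expected) {
      failures.push(`oracle mismatch ${row.testid}: expected ${expected}, got ${row.fahrenheitText}`);
    }
  }
  return failures;
}
```

Because the comparison is on strings, a cell rendered as 212 instead of 212.0 is reported as a mismatch, which is the strictness section 6.2 asks for.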
6.4 FTUX bonus assertions
The test also exercises gm-root.26 surfaces:
- Pool empty-state helper visible BEFORE pool configured (sanity check; the helper is the cue a fresh-project operator would see).
- Onboarder CTA gates correctly on a project without [llm] in ~/.gemba/config.toml.
- Session toast appears when first session dispatches.
- LiveSessionsBadge increments 0 → 1 on first dispatch and decrements back on completion.
7. Pool setup via UI (D19, gm-zdik)
The test does NOT write pool.toml directly. It drives Playwright through /settings/pools to exercise the editor itself.
7.1 Native flow
- Navigate to /settings/pools.
- Scope dropdown auto-selects local (editor hides scope axis on native).
- Persona dropdown: select acceptance-engineer.
- Set size = 1, floor = 0.5.
- Click Save → PUT /api/pool-config → server writes pool.toml.
- Restart prompt appears; harness restarts gemba server with new config.
7.2 Gastown flow
- Navigate to /settings/pools.
- Click + New rig → RunCommandModal opens.
- Type rig name (e.g., acceptance-{run-id}) → modal runs gt rig create.
- Refresh.
- Click + New polecat → modal runs gt polecat create.
- Scope dropdown: select new rig.
- Persona dropdown: select acceptance-engineer.
- Set size = 1, floor = 0.5.
- Save → restart server with new pool.toml.
7.3 Selector reuse
From gm-s47n.18 smoke spec:
- [data-testid="pools-page"]
- [data-testid="pool-scope-select"]
- [data-testid="pool-persona-select"]
- [data-testid="pool-size-input"]
- [data-testid="pool-floor-input"]
- [data-testid="pool-save-button"]
Plus gm-s47n.17’s additions:
- [data-testid="pool-new-rig-button"]
- [data-testid="pool-new-polecat-button"]
- [data-testid="run-command-modal"]
- [data-testid="run-command-input"]
- [data-testid="run-command-execute"]
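As an illustration of how the native flow in 7.1 maps onto these selectors, the steps can be written as data and replayed against a page object. Step, PageLike, nativePoolSteps, and applySteps are hypothetical names. PageLike is a structural subset of Playwright's Page (goto, selectOption, fill, click), so a real Page should be assignable to it; the relative URL assumes a configured baseURL. The gastown flow is omitted because the modal inputs are not fully specified here.

```typescript
type Step =
  | { action: "goto"; url: string }
  | { action: "select"; selector: string; value: string }
  | { action: "fill"; selector: string; value: string }
  | { action: "click"; selector: string };

const tid = (id: string): string => `[data-testid="${id}"]`;

// Native flow from section 7.1; scope is auto-selected, so it is not touched.
function nativePoolSteps(): Step[] {
  return [
    { action: "goto", url: "/settings/pools" },
    { action: "select", selector: tid("pool-persona-select"), value: "acceptance-engineer" },
    { action: "fill", selector: tid("pool-size-input"), value: "1" },
    { action: "fill", selector: tid("pool-floor-input"), value: "0.5" },
    { action: "click", selector: tid("pool-save-button") },
  ];
}

// Minimal structural view of the page methods this sketch needs.
interface PageLike {
  goto(url: string): Promise<unknown>;
  selectOption(selector: string, value: string): Promise<unknown>;
  fill(selector: string, value: string): Promise<unknown>;
  click(selector: string): Promise<unknown>;
}

// Replay the steps in order; any rejected call aborts the sequence.
async function applySteps(page: PageLike, steps: Step[]): Promise<void> {
  for (const step of steps) {
    switch (step.action) {
      case "goto":
        await page.goto(step.url);
        break;
      case "select":
        await page.selectOption(step.selector, step.value);
        break;
      case "fill":
        await page.fill(step.selector, step.value);
        break;
      case "click":
        await page.click(step.selector);
        break;
    }
  }
}
```

Keeping the steps as plain data means the selector list can be unit-tested without a browser, while the replay function stays a thin adapter.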
7.4 Side benefit
If the editor surface ever ships without these testids, the acceptance test fails fast with a “selector not found” error. The acceptance test is a regression net for the editor itself.
8. Native vs gastown variant differences (D20, gm-l38m)
8.1 Shared
- Target JSONL pack (D17).
- Playwright spec body (shared/spec.ts).
- Oracle (D18).
- MockAgentRunner (D16).
- Pool-via-UI path (D19).
8.2 Differences
| | Native mock | Native real-agent | Gastown |
|---|---|---|---|
| Server flag | --orchestration=mock | --orchestration=native | --orchestration=gastown |
| Pool TOML scope | local mock acceptance pool | [pool.local.acceptance-engineer] with Codex by default or Claude via --agent claude | [pool.<rig-name>.acceptance-engineer] |
| Pre-server bootstrap | none beyond gemba newproject | none beyond gemba newproject and local agent credentials/CLI availability | UI-driven gt rig create + gt polecat create |
| Capability probe | mock adaptor self-check | gemba-codex-driver / Claude driver and credentials checked before dispatch | gt binary version checked at startup; fail fast if too old |
| Teardown | rm temp dir, free ports | rm temp dir, free ports, terminate native sessions | rm temp dir, free ports, UI-driven gt rig remove |
| Time budget | <15 min | <90 min real agent generation | <30 min mocked / hours real |
| CI viability | default | manual / release evidence | manual / nightly only |
| Pool reuse path | simulated session lifecycle | Codex is one-shot per dispatch; Claude can reuse ready sessions through done-skill flow | /done slash → bead-done → SessionReady |
8.3 Done-skill pin
Both variants pin the acceptance-engineer persona’s done-skill to the /done slash command (per gm-s47n.14). Direct gt done would not emit bead-done until gm-s47n.20 lands. Pinning to slash command guarantees pool reuse on both backends today.
8.4 Concurrent runs
Each variant runs against an ephemeral Dolt server on a random port, so two
acceptance runs (native + gastown) can run concurrently on the same machine.
The CI default runs the mock-backed native wrapper; real native agents and
Gastown remain opt-in through GEMBA_ACCEPTANCE_REAL_AGENTS=1 and
GEMBA_ACCEPTANCE_RUN_GASTOWN=1.
9. Implementation epic — wave structure
The epic gm-root.27 has 20 implementation children:
Wave 1 — Shared core (gm-root.27.1 – .5)
- .1 Headless agent supervisor (factory: real claude vs mock)
- .2 MockAgentRunner with bead templates
- .3 Ephemeral Dolt + project bootstrap helper
- .4 Target JSONL pack (the actual files committed, per D17)
- .5 Synthetic escalation injector
Wave 2 — Shared Playwright spec (gm-root.27.6 – .11)
- .6 Spec scaffolding (server lifecycle, navigation helpers)
- .7 Pool config setup driver via /settings/pools
- .8 M1 step
- .9 M2 step
- .10 M3 step
- .11 Triage step (synthetic escalation + UI-resolve)
Wave 3a — Native variant (gm-root.27.12 – .14)
- .12 Native variant spec wrapper
- .13 Native pool fixture
- .14 Native cleanup
Wave 3b — Gastown variant (gm-root.27.15 – .17)
- .15 Gastown variant spec wrapper + UI-driven gt rig + polecat creation
- .16 Gastown pool fixture template
- .17 Gastown cleanup
Wave 4 — Reporting + bug-filing + cleanup orchestration (gm-root.27.18 – .20)
- .18 Test report writer
- .19 Bug-filing helper
- .20 Cleanup orchestration
10. Module layout
    testing/acceptance/temperature-spa/
      shared/
        spec.ts               # Playwright orchestration body
        helpers/
          bootstrap.ts        # ephemeral Dolt + gemba newproject
          pool-via-ui.ts      # drives /settings/pools
          escalation.ts       # synthetic escalation injector
          cleanup.ts          # generic cleanup utilities
          report.ts           # JSON + markdown report writer
          bug-filer.ts        # files beads in gemba rig
        runner/
          factory.ts          # AgentRunnerFactory
          mock.ts             # MockAgentRunner + templates
          real-claude.ts      # wraps native adapter spawn
        target-jsonl/
          m1.jsonl            # scaffolding pack
          m2.jsonl            # Hello world MVP pack
          m3.jsonl            # conversion table pack
        oracle/
          m1.ts               # M1 assertions
          m2.ts               # M2 assertions
          m3.ts               # M3 assertions
          ftux.ts             # FTUX bonus assertions
      variants/
        native/
          spec.ts             # native entry point
          fixtures/
            pool.toml         # [pool.local.acceptance-engineer]
        gastown/
          spec.ts             # gastown entry point
          fixtures/
            pool.toml.tmpl    # template with {{.Rig}} placeholder
          gt-bootstrap.ts     # UI-driven gt rig + polecat
          gt-teardown.ts      # UI-driven gt rig remove
      reports/                # historical run reports

11. Risks and open questions
11.1 MockAgentRunner template drift
If a target JSONL bead’s template: directive references a non-existent template, the runner files a template_unknown synthetic escalation. The implementer of .4 (target JSONL pack) and .2 (MockAgentRunner) must keep their template names in lockstep. Mitigation: a unit test in .4 that loads each JSONL file, extracts every template: value, and asserts the runner has a matching handler.
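The mitigation could look roughly like the following check. It assumes each JSONL line is an object with an optional description string whose frontmatter contains a template: line; that schema and the function names are assumptions for illustration.

```typescript
// Collect the distinct template names declared in one JSONL pack.
function templatesInPack(jsonl: string): string[] {
  const found: string[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const bead = JSON.parse(line) as { description?: unknown };
    const description = typeof bead.description === "string" ? bead.description : "";
    const m = /^template:\s*(\S+)/m.exec(description);
    if (m && found.indexOf(m[1]) === -1) found.push(m[1]);
  }
  return found;
}

// Templates referenced by the pack that the runner has no handler for.
function missingHandlers(jsonl: string, handlerNames: string[]): string[] {
  return templatesInPack(jsonl).filter((t) => handlerNames.indexOf(t) === -1);
}
```

A unit test would read m1.jsonl, m2.jsonl, and m3.jsonl, call missingHandlers with the runner's registered template names, and fail if the result is non-empty.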
11.2 Daemon polling cadence
The autodispatch daemon ticks every 10s. Each milestone has 3–5 beads with depends-on chains; each bead costs at minimum a 10s wait + ~200ms mock work + bead-close + recycle. M3’s 5 beads run serially, so M3 alone takes ~50s minimum. Across all three milestones (3 + 4 + 5 = 12 task beads) that is ~120s minimum, plus Playwright overhead and build/serve time. Total: ~5–10 min mocked, which fits the <15 min budget.
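A back-of-the-envelope version of this floor, assuming each task bead costs one full daemon tick plus the 200ms mock ceiling, and using the per-pack task counts from section 5.1. Bead-close, recycle, Playwright, and build/serve time are deliberately left out, so this is a lower bound only; the function name is hypothetical.

```typescript
const TICK_SECONDS = 10;               // daemon polling interval
const MOCK_WORK_SECONDS = 0.2;         // max bounded sleep per template
const TASKS_PER_MILESTONE = [3, 4, 5]; // M1, M2, M3 task counts

// Serial lower bound: every task waits one tick, then does its mock work.
function minimumDispatchSeconds(
  tasks: number[],
  tick: number = TICK_SECONDS,
  work: number = MOCK_WORK_SECONDS,
): number {
  let total = 0;
  for (const n of tasks) total += n * (tick + work);
  return total;
}

console.log(Math.round(minimumDispatchSeconds(TASKS_PER_MILESTONE)), "seconds minimum");
```

The result lands near two minutes, which leaves most of the <15 min budget for builds, preview servers, and browser steps.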
11.3 Real-claude path stability (deferred)
The real-claude path is opt-in and not validated in this epic. Expect rate-limit handling, retry on session crash, and longer timeouts in the follow-up epic.
11.4 Gastown rig name collisions
Each gastown run generates a unique rig name (e.g., acceptance-{run-id} where run-id is a ulid). Cleanup removes it. If cleanup fails (test process panic), an orphaned rig sits in gt. The .20 cleanup orchestrator includes a “purge stale acceptance rigs” step that runs at startup of every test run; rigs older than 24h get removed.
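The startup purge could select its targets with a filter like this one. RigInfo is a hypothetical shape (how the harness lists rigs and learns their creation time is not specified here); only the acceptance- prefix and the 24h threshold come from the text.

```typescript
// Hypothetical listing entry; the real source of this data is unspecified.
interface RigInfo {
  name: string;
  createdAtMs: number; // creation time as epoch milliseconds
}

const ACCEPTANCE_PREFIX = "acceptance-";
const STALE_AFTER_MS = 24 * 60 * 60 * 1000;

// Names of acceptance rigs older than 24h; other rigs are never touched.
function staleAcceptanceRigs(rigs: RigInfo[], nowMs: number): string[] {
  return rigs
    .filter((r) => r.name.indexOf(ACCEPTANCE_PREFIX) === 0)
    .filter((r) => nowMs - r.createdAtMs > STALE_AFTER_MS)
    .map((r) => r.name);
}
```

Filtering on the prefix first keeps the purge from ever considering rigs that an operator created for other purposes.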
11.5 Gemba-server restart cost
Pool config save requires a server restart (per gm-s47n.16 — hot-reload deferred). The harness orchestrates the restart cleanly. Restart adds ~5s per variant per run.
11.6 Escalation respond UI completeness
Per the survey, the escalation respond UI is partially shipped: backend handlers ready, frontend surface exists at /escalations but full approve/deny button wiring may be incomplete. The triage step (.11) probes for the UI; on missing button, falls back to direct API call and files a bead noting “escalation respond UI incomplete.” The acceptance test thus also serves as a regression net for that surface.
11.7 Concurrent test runs on shared Dolt
The deep-mode E2E gating issue (gm-h4n) — bd-init colliding on the shared Dolt server — is solved by the ephemeral-Dolt helper (.3). Each acceptance run gets its own Dolt instance on a random port. CI parallelism is supported.
11.8 npm offline cache
MockAgentRunner’s npm-install template uses --offline. The cache is pre-warmed at test setup time (one-time npm install run during the harness’s first invocation, cached at ~/.cache/gemba-acceptance-npm/). If the cache is missing or stale, the test falls back to online install (slower but correct).
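The fallback can be expressed as a choice of npm arguments. This sketch assumes the pre-warmed cache is an npm cache directory passed with --cache, and that staleness is judged by age with a hypothetical 30-day threshold; neither detail is stated in this doc. The --offline and --cache options themselves are standard npm configuration flags.

```typescript
// Hypothetical summary of what the harness knows about the cache.
interface CacheState {
  exists: boolean;
  ageDays: number;
}

const MAX_CACHE_AGE_DAYS = 30; // assumed staleness threshold

// Offline install when the cache is usable, online install otherwise.
function npmInstallArgs(cache: CacheState, cacheDir: string): string[] {
  const usable = cache.exists && cache.ageDays <= MAX_CACHE_AGE_DAYS;
  return usable
    ? ["install", "--offline", "--cache", cacheDir]
    : ["install", "--cache", cacheDir];
}
```

Pointing the online fallback at the same cache directory means a slow first run also warms the cache for the next one.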
12. Acceptance criteria for this decision
D15 is ratified when:
- This doc exists and links back to gm-1avi.
- Implementation epic gm-root.27 filed with all 20 children.
- Sub-decisions D16–D20 filed.
D15 is rejected when:
- The operator decides the acceptance test is not worth the maintenance cost (e.g., flake budget too high).
- A superseding decision is filed with a supersedes:gm-1avi edge.
Until either resolution, status is draft.
13. References
- D6 (gm-d1m1): Decision-capture convention
- D13 (gm-s47n.13): Session pool primitive
- D14 (gm-thq1): Channel-bridge architecture (reuses some patterns: identity, persistence, pluggability)
- DD-16 (gm-ege): External-consumer optionality
- internal/cli/newproject.go: project bootstrap
- internal/server/newproject_ratify.go: atomic ratification
- internal/adapter/native/: real-claude spawn path
- internal/adapter/gastown/: gt sling path
- internal/planner/: claim index + dispatch policy
- internal/server/escalations.go: escalation respond
- web/src/pages/PoolsPage.tsx: pool editor (reference)
- testing/e2e/: existing Playwright infrastructure to mirror
- gm-s47n.18: smoke E2E spec (selector reference)