name: c1-dev-stack-in-squire
description: >-
  Stand up a full c1 dev stack inside a Squire env (process-compose, postgres, envoy, pub-api, pub-auth, be-* services), wired so an external client can drive c1's gRPC surface end to end with TLS + OAuth2 client_credentials. Use when testing a Latchkey or other c1 client against a real (not stubbed) c1 backend, or when reproducing c1 server-side behavior locally. Triggers on: c1 dev env, squire c1 stack, pc/up, dev-util mint-test-client, test against c1, c1 OAuth client_credentials, run c1 integration tests in squire, repro buildkite integration test, TEST_LOCAL_EXEC, api_no_uplift.
Standing up a c1 dev stack in a Squire env
This is a runbook with all the friction points encoded. Treat it as a script — if you skip steps the stack flaps and you spend an hour debugging postgres unix sockets.
When to use
- Driving the Latchkey CLI (or any c1 client) end-to-end against a real c1 pub-api over TLS with a real OAuth-minted Bearer.
- Reproducing pub-api / be-session / be-innkeeper behavior locally.
- Producing a self-contained env you can hand to a teammate by SSH-forwarding envoy 2443.
Prerequisites
- squire CLI authenticated to the gateway (squire login if needed).
- An entry in /etc/hosts mapping 127.0.0.1 c1dev.c1.ductone.com (one-time; needed because c1's pub-auth resolves the tenant from the Host header and the dev tenant is c1dev on installation domain c1.ductone.com).
- The default squire image does not ship with c1 cloned, despite what the generic squire-env-management skill claims. We clone it manually.
Step 1 — create the env
squire new c1-dev --no-open
# wait until: squire env | grep c1-dev | awk '{print $4}' == "running"
Avoid --prompt / --open if you're driving the env from your laptop rather
than the in-env OpenCode agent.
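The "wait until running" comment above can be turned into a small polling helper. This is a sketch, not part of the squire CLI: wait_for_env_running is a hypothetical name, and it assumes, like the one-liner above, that the status is the 4th whitespace-separated column of squire env.

```shell
# Poll `squire env` until the named env reports "running".
# Assumption: status is column 4, as in the grep/awk one-liner above.
wait_for_env_running() {
  local env_name=$1 timeout=${2:-300} interval=${3:-5} waited=0 status
  while :; do
    # index() does a literal substring match, so "-" and "." in names are safe.
    status=$(squire env 2>/dev/null | awk -v n="$env_name" 'index($0, n) { print $4; exit }')
    [ "$status" = "running" ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out; last status: ${status:-unknown}" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage: wait_for_env_running c1-dev 300 5
```

If your squire CLI prints a different column layout, only the awk field needs to change.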
Step 2 — clone c1 with the envmgr git_token MCP tool
The default-image squire credential helper handles https://github.com/...
URLs after the initial clone, but you need a real token to bootstrap. Pull one
from the env's MCP gateway at localhost:9877:
squire ssh <env> -- 'set -e
init() {
curl -sf -i -X POST http://localhost:9877/mcp \
-H "content-type: application/json" \
-H "accept: application/json, text/event-stream" \
-d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"initialize\",\"params\":{\"protocolVersion\":\"2024-11-05\",\"capabilities\":{},\"clientInfo\":{\"name\":\"probe\",\"version\":\"1\"}}}" \
| grep -i ^mcp-session-id | tr -d "\r" | cut -d" " -f2
}
SID=$(init)
call() {
curl -sf -X POST http://localhost:9877/mcp \
-H "content-type: application/json" \
-H "accept: application/json, text/event-stream" \
-H "mcp-session-id: $SID" -d "$1"
}
call "{\"jsonrpc\":\"2.0\",\"method\":\"notifications/initialized\"}" >/dev/null
call "{\"jsonrpc\":\"2.0\",\"id\":5,\"method\":\"tools/call\",\"params\":{\"name\":\"git_token\",\"arguments\":{\"repo\":\"ductone/c1\"}}}" \
| jq -r ".result.content[0].text" | jq -r ".token"
'
Then clone (depth 50 is plenty; full clone is slow over ~3M files):
TOK=ghs_...
squire ssh <env> -- "git clone --depth 50 https://x-access-token:$TOK@github.com/ductone/c1 /data/squire/src/c1
git -C /data/squire/src/c1 config user.email squire@conductorone.com
git -C /data/squire/src/c1 config user.name 'Squire Agent'
git -C /data/squire/src/c1 remote set-url origin https://github.com/ductone/c1.git"
Reset the remote URL so the squire credential helper handles future git push
(don't bake the short-lived token into the remote URL — it expires in ~30 min).
Step 3 — pre-fix two known config bugs
Both are env-image quirks, not c1 bugs. Fix them before pc/up so services
don't burn max_restarts and get marked Skipped.
Postgres unix socket lock
Postgres tries to create /run/postgresql/.s.PGSQL.5432.lock, a path that is root-owned in
the squire image. Patch process-compose.yaml to point it at /tmp:
squire ssh <env> -- "mkdir -p /tmp/pg-socket
sed -i '/-c port=5432/a\\ -c unix_socket_directories=/tmp/pg-socket' \
/data/squire/src/c1/dev/process-compose/process-compose.yaml"
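Re-running the sed above appends a second unix_socket_directories flag each time. A guarded variant, sketched below, only patches when the flag is absent. patch_pg_socket_dir is a hypothetical helper, GNU sed is assumed (as in the original command), and the indentation of the appended line is a guess that should be matched to the surrounding YAML.

```shell
# Apply the unix_socket_directories patch only if it is not already present.
# Assumptions: GNU sed; the appended line's indentation is illustrative.
patch_pg_socket_dir() {
  local yaml=$1 dir=${2:-/tmp/pg-socket}
  mkdir -p "$dir"
  if grep -q 'unix_socket_directories' "$yaml"; then
    echo "already patched: $yaml"
    return 0
  fi
  # Append the flag right after the line that sets the port.
  sed -i "/-c port=5432/a\\      -c unix_socket_directories=$dir" "$yaml"
}

# Usage (inside the env):
#   patch_pg_socket_dir /data/squire/src/c1/dev/process-compose/process-compose.yaml
```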
Innkeeper Zoho client id / secret can't be empty
gen-env.sh writes empty strings for the Zoho Manage Engine OAuth provider,
but the runtime config validation requires min_len=3. The other OAuth
providers have placeholder abc1234 strings — the Zoho ones don't, so
innkeeper crashloops on startup. Run after make pc/init:
squire ssh <env> -- "sed -i \
's/^INNKEEPER_ZOHOMANAGEENGINEOAUTHPROVIDER_CLIENT_ID=\$/INNKEEPER_ZOHOMANAGEENGINEOAUTHPROVIDER_CLIENT_ID=abc1234/;
s/^INNKEEPER_ZOHOMANAGEENGINEOAUTHPROVIDER_CLIENT_SECRET=\$/INNKEEPER_ZOHOMANAGEENGINEOAUTHPROVIDER_CLIENT_SECRET=abc1234/' \
/data/squire/src/c1/.dev/env/be-innkeeper.env"
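To confirm the fix took, and to catch any other provider that gen-env.sh left short, a quick scan of the env file can help. The sketch below flags OAuth credentials shorter than the min_len=3 mentioned above. short_oauth_creds is a hypothetical name, and the assumption that every such key ends in _CLIENT_ID or _CLIENT_SECRET is inferred from the two Zoho keys, not from the c1 config schema.

```shell
# Print "line: KEY" for every *_CLIENT_ID / *_CLIENT_SECRET whose value is
# shorter than 3 characters. Exit status 0 means at least one was found.
short_oauth_creds() {
  awk '
    /^[A-Z0-9_]+_CLIENT_(ID|SECRET)=/ {
      key = substr($0, 1, index($0, "=") - 1)
      value = substr($0, index($0, "=") + 1)
      if (length(value) < 3) { print NR ": " key; found = 1 }
    }
    END { exit (found ? 0 : 1) }' "$1"
}

# Usage: short_oauth_creds /data/squire/src/c1/.dev/env/be-innkeeper.env
```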
Step 4 — build + bring up
# Cold build of all 21 binaries — about 11 minutes on small flavor.
squire ssh <env> -- "nohup nix develop /data/squire/src/c1#localdev \
--command bash -c 'export GOOS=linux GOARCH=arm64 && \
make -C /data/squire/src/c1 pc/build && \
make -C /data/squire/src/c1 pc/init' > /tmp/pc-build.log 2>&1 &"
# Once that's done (poll for a sentinel file, e.g. '&& touch /tmp/pc-build.done'
# appended to the build command), run pc/up via a launcher script. TWO traps:
# 1. process-compose.yaml uses $PWD-relative paths ($PWD/.dev/...), so the
# working directory MUST be the repo root — launched from $HOME, every
# service resolves /home/squire/.dev and postgres/dynamodb/temporal fail
# with missing dirs/jars.
# 2. `make pc/up` does NOT pass -t=false, so it dies on a non-interactive
# SSH with `terminal entry not found: term not set`. Use raw
# process-compose with -t=false.
# A launcher script satisfies both (write locally + scp if your harness blocks
# inline `cd`):
cat > /tmp/pc-launch.sh <<'EOF'
#!/bin/bash
cd /data/squire/src/c1 || exit 1
export GOOS=linux GOARCH=arm64
exec nix develop /data/squire/src/c1#localdev --command \
process-compose up -t=false -f /data/squire/src/c1/dev/process-compose/process-compose.yaml
EOF
scp /tmp/pc-launch.sh <env>.squire:/tmp/pc-launch.sh
squire ssh <env> -- "chmod +x /tmp/pc-launch.sh; setsid -f /tmp/pc-launch.sh > /tmp/pc-up.log 2>&1"
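A bare touch sentinel only appears on success, so a failed build leaves the poller waiting forever. One possible pattern, sketched here with hypothetical helper names, records the exit status in the sentinel file so the poller can distinguish "done" from "failed".

```shell
# Run a command, send its output to a log, and write its exit status to a
# sentinel file when it finishes (success or failure).
run_and_mark() {
  local log=$1 sentinel=$2 rc=0
  shift 2
  rm -f "$sentinel"
  "$@" > "$log" 2>&1 || rc=$?
  echo "$rc" > "$sentinel"
  return "$rc"
}

# Block until the sentinel is written, then return the recorded status.
# Returns 124 if the timeout (seconds) expires first.
wait_for_mark() {
  local sentinel=$1 timeout=${2:-1800} waited=0
  until [ -s "$sentinel" ]; do
    [ "$waited" -ge "$timeout" ] && return 124
    sleep 5
    waited=$((waited + 5))
  done
  return "$(cat "$sentinel")"
}

# Usage (inside the env; paths follow the build step above):
#   run_and_mark /tmp/pc-build.log /tmp/pc-build.done make -C /data/squire/src/c1 pc/build &
#   wait_for_mark /tmp/pc-build.done 1800 || tail -n 50 /tmp/pc-build.log
```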
If you have to kill and relaunch process-compose, kill the ORPHANS too:
killing the supervisor leaves postgres holding postmaster.pid + port 5432
(and valkey its port), so the next instance's copies crashloop to Skipped
while everything else flaps. Find PIDs via ss -tlnp on 5432/6379/8080/2443
and kill -9 them before relaunching. (pkill -f "postgres -D\|postgres:"
does NOT work — \| is not alternation in pkill.) Alternatively, if an
orphaned postgres is healthy and serving, adopt it: stop the flapping pc
copies via curl -X PATCH http://localhost:8080/process/stop/postgres (and
valkey) and carry on.
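Finding the orphan PIDs by eye in ss output is error-prone. The sketch below extracts them for a given set of ports. listener_pids is a hypothetical helper, and it assumes the usual ss -tlnp layout: state in column 1, local address:port in column 4, and pid=N inside the process column.

```shell
# Read `ss -tlnp` output on stdin and print the PIDs listening on the given
# ports, one per line. Ports are matched exactly (5432 does not match 54321).
listener_pids() {
  local ports_re
  ports_re=$(IFS='|'; echo "$*")
  awk -v re=":(${ports_re})\$" '$1 == "LISTEN" && $4 ~ re' \
    | grep -oE 'pid=[0-9]+' \
    | cut -d= -f2 \
    | sort -un
}

# Usage (inside the env), reviewing the list before killing anything:
#   ss -tlnp | listener_pids 5432 6379 8080 2443
#   ss -tlnp | listener_pids 5432 6379 | xargs -r kill -9
```

Leave 8080 out of the kill list if you are adopting an orphan and want to keep the running process-compose supervisor.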
Process-compose exposes a REST API on localhost:8080:
squire ssh <env> -- "curl -sf http://localhost:8080/processes | \
jq -r '.data[] | \"\(.name): \(.status)\"' | sort"
Wait until postgres / valkey / pub-api / pub-auth / be-session / be-vault / be-innkeeper are
all Running and ensure: Completed. The whole bringup takes 1-2 minutes.
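The readiness condition above can be checked mechanically. This sketch reads the "name: status" lines produced by the jq command above and succeeds only when every listed service is Running and ensure is Completed. stack_ready is a hypothetical name; the service list is copied from the sentence above.

```shell
# Read "name: status" lines on stdin; exit 0 only if the stack is ready.
stack_ready() {
  local required="postgres valkey pub-api pub-auth be-session be-vault be-innkeeper"
  awk -v req="$required" '
    { sub(/:$/, "", $1); status[$1] = $2 }
    END {
      bad = 0
      n = split(req, names, " ")
      for (i = 1; i <= n; i++) {
        s = status[names[i]]
        if (s != "Running") { print "not ready: " names[i] " (" (s == "" ? "missing" : s) ")"; bad = 1 }
      }
      s = status["ensure"]
      if (s != "Completed") { print "not ready: ensure (" (s == "" ? "missing" : s) ")"; bad = 1 }
      exit bad
    }'
}

# Usage (inside the env):
#   until curl -sf http://localhost:8080/processes \
#         | jq -r '.data[] | "\(.name): \(.status)"' | stack_ready; do sleep 5; done
```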
If be-innkeeper: Skipped
If you see this even after the Zoho fix, innkeeper hit max_restarts=30
during the early postgres-flapping period and process-compose gave up. Start
it manually:
squire ssh <env> -- "set -a; . /data/squire/src/c1/.dev/env/be-innkeeper.env; set +a;
nohup /data/squire/src/c1/build/linux_arm64/be-innkeeper/be-innkeeper \
> /tmp/innkeeper.log 2>&1 &"
Then re-run dev-util ensure to populate CrossTenantSettings (innkeeper's
init code creates this row on first start; without it, anything that calls
tenants.TenantDomain returns dynamo: no item found):
squire ssh <env> -- "set -a; . /data/squire/src/c1/.dev/env/dev-shell.env; set +a;
/data/squire/src/c1/build/linux_arm64/dev-util/dev-util ensure"
Step 4b — file gateway needs minio (Latchkey uploads)
The sealed-file gateway (PUT /file/:fileToken on pub-api — KeyPackage
publish, secret-value upload, protocol artifacts) writes to the tenant object
store, which gen-env.sh points at a REAL AWS bucket
(API_TENANTOBJECTSTORES3_BUCKET_NAME=dev-c1-tenant-objects-us-west-2) with
empty creds. Squire envs have no AWS credentials, so every upload 504s (hang)
or 500s. Run minio and point pub-api at it — the config has an Endpoint
field (proto field 5):
squire ssh <env> -- 'curl -sSL -o /tmp/minio https://dl.min.io/server/minio/release/linux-arm64/minio && chmod +x /tmp/minio
mkdir -p /tmp/minio-data/dev-c1-tenant-objects-us-west-2
MINIO_ROOT_USER=dummyaccesskey MINIO_ROOT_PASSWORD=dummysecret \
nohup /tmp/minio server /tmp/minio-data --address 127.0.0.1:34567 > /tmp/minio.log 2>&1 &
sleep 3; curl -sf http://127.0.0.1:34567/minio/health/live && echo MINIO_LIVE'
# NOTE: use nohup-in-the-ssh-command, and VERIFY MINIO_LIVE prints — a
# setsid -f launch in a one-shot ssh has died silently here before.
squire ssh <env> -- "sed -i \
's|^API_TENANTOBJECTSTORES3_ACCESS_KEY_ID=.*|API_TENANTOBJECTSTORES3_ACCESS_KEY_ID=dummyaccesskey|; \
s|^API_TENANTOBJECTSTORES3_SECRET_ACCESS_KEY=.*|API_TENANTOBJECTSTORES3_SECRET_ACCESS_KEY=dummysecret|; \
/^API_TENANTOBJECTSTORES3_OBJECT_PREFIX=/a\\API_TENANTOBJECTSTORES3_ENDPOINT=http://127.0.0.1:34567' \
/data/squire/src/c1/.dev/env/pub-api.env
curl -sf -X POST http://localhost:8080/process/restart/pub-api"
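The sed above is not safe to re-run: each pass appends another API_TENANTOBJECTSTORES3_ENDPOINT line. An upsert helper avoids that. This is a sketch under the assumption that key order inside pub-api.env does not matter; set_env_kv is a hypothetical name, and values containing backslashes would need extra care because awk -v interprets escapes.

```shell
# Set KEY=VALUE in an env file: replace the line if the key exists,
# append it otherwise. Re-running with the same arguments is a no-op.
set_env_kv() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp)
  awk -v k="$key" -v v="$value" '
    index($0, k "=") == 1 { print k "=" v; seen = 1; next }
    { print }
    END { if (!seen) print k "=" v }' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (inside the env), followed by the pub-api restart shown above:
#   f=/data/squire/src/c1/.dev/env/pub-api.env
#   set_env_kv "$f" API_TENANTOBJECTSTORES3_ACCESS_KEY_ID dummyaccesskey
#   set_env_kv "$f" API_TENANTOBJECTSTORES3_SECRET_ACCESS_KEY dummysecret
#   set_env_kv "$f" API_TENANTOBJECTSTORES3_ENDPOINT http://127.0.0.1:34567
```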
Step 5 — mint a client_credentials pair
The dev-util mint-test-client cmd (PR #17295 / merged) creates a user in the
target tenant, promotes them to SystemOwnerRoleId, and mints a personal
OAuth2 client. Without this cmd you'd be doing direct postgres inserts.
squire ssh <env> -- "set -a; . /data/squire/src/c1/.dev/env/dev-shell.env; set +a;
/data/squire/src/c1/build/linux_arm64/dev-util/dev-util mint-test-client \
--tenant-domain=c1dev --log_level=error" 2>&1 | grep -E '^(client_|user_|tenant_)'
Output is grep-able:
client_id=mellow-flatcar-10265@c1dev.c1.ductone.com/pcc
client_secret=secret-token:conductorone.com:v1:eyJrdHk...
user_id=3D5vAVJPtjmttwCTphpWsZ2uVav
tenant_id=3D5ijhr15puycSTgo0ol87hz4yE
tenant_domain=c1dev
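Since the output is plain key=value lines, it can be loaded straight into shell variables for the later steps. A minimal sketch follows; load_minted_client is a hypothetical helper, and the upper-cased variable names (CLIENT_ID, CLIENT_SECRET, ...) are a convention chosen here, not something dev-util defines.

```shell
# Read mint-test-client's key=value lines on stdin and export them as
# upper-case variables. Splits on the first "=" only, so a secret that
# itself contains "=" or ":" is preserved intact.
load_minted_client() {
  local line key value
  while IFS= read -r line; do
    case $line in
      client_*=*|user_*=*|tenant_*=*) ;;
      *) continue ;;
    esac
    key=${line%%=*}
    value=${line#*=}
    case $key in *[!a-z_]*) continue ;; esac   # skip anything not a plain identifier
    key=$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]')
    export "$key=$value"
  done
}

# Usage: redirect from a file; piping into the function would run it in a
# subshell and lose the exports.
#   squire ssh <env> -- "...dev-util mint-test-client ..." > /tmp/minted.env
#   load_minted_client < /tmp/minted.env
```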
Multi-principal tests: pass --user-email. The user is keyed by email
(default test-cli@dev.local), NOT by --display-name — re-running the cmd
without --user-email mints a new client for the SAME user, which silently
defeats any two-principal flow (share-to-self). For a second principal:
--user-email=test-cli-b@dev.local --display-name=test-cli-b.
The client_id encodes the tenant's installation domain (c1.ductone.com
in this default config). If your env has a different INNKEEPER_INSTALLATION_DOMAIN
(squire envs sometimes get squire-specific ones like
envoy--<env-id>.us-west-2.squire.ductone.com), the client_id will look
different and the laptop /etc/hosts entry won't apply.
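A quick way to see which host a minted client expects is to pull it out of the client_id. The sketch below assumes the name@host/suffix shape seen in the sample output; that shape is inferred from one example, not from a documented format, and client_id_host is a hypothetical name.

```shell
# Print the host part of a client_id shaped like name@host/suffix.
# Fails if there is no "@" in the input.
client_id_host() {
  local id=$1 host
  case $id in
    *@*) ;;
    *) return 1 ;;
  esac
  host=${id#*@}      # drop everything through the first "@"
  host=${host%%/*}   # drop the "/pcc"-style suffix
  [ -n "$host" ] || return 1
  printf '%s\n' "$host"
}

# Usage: compare against the host you put in /etc/hosts.
#   client_id_host "mellow-flatcar-10265@c1dev.c1.ductone.com/pcc"
```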
Step 6 — drive a client from your laptop
Three pieces of laptop setup:
# (a) tunnel envoy 2443 — squire's own `tunnel` mangles TLS bytes; use ssh -L
ssh -fN -L 12443:127.0.0.1:2443 <env>.squire
# (b) /etc/hosts (one-time, requires sudo)
echo "127.0.0.1 c1dev.c1.ductone.com" | sudo tee -a /etc/hosts
# (c) pull the dev CA fresh — it's regenerated by certgen on each pc/init
scp <env>.squire:/data/squire/src/c1/.dev/pki/service-ca.crt /tmp/c1-dev-ca.pem
Then drive the client. For Latchkey:
latchkey \
--c1-url https://c1dev.c1.ductone.com:12443 \
--tls-trust-cert /tmp/c1-dev-ca.pem \
--tls-server-name localhost \
--client-id "mellow-flatcar-10265@c1dev.c1.ductone.com/pcc" \
--client-secret "secret-token:..." \
vault list
Why these flags:
- URL host is the tenant subdomain so pub-auth's tenants.SplitDomain finds the c1dev tenant and pub-api's authn middleware accepts the request.
- --tls-server-name=localhost because the dev cert SAN is localhost plus internal-service DNS names; it doesn't include c1dev.c1.ductone.com. The override tells tonic + reqwest to validate against the localhost SAN while the URL host stays c1dev.c1.ductone.com for routing.
- The CLI exchanges client_credentials against https://c1dev.c1.ductone.com:12443/auth/v1/token (pub-auth, not the legacy /auth/token) on startup, then uses the access token as Bearer.
Smoke test (30s) — is this env still healthy?
Run this when picking up a paused / older env, or when something looks off mid-test, before spending 15 min re-bringing-up. Three layers: process-compose is alive, OAuth still mints, gRPC still answers.
ENV=<env-name> # e.g. lk-mint-client
CLIENT_ID="..." # cached from mint-test-client
CLIENT_SECRET="..."
# (1) Inside the env — pc states + critical service health.
squire ssh "$ENV" -- '
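# Process states via the process-compose REST API (same endpoint as Step 4).
curl -sf http://localhost:8080/processes \
| jq -r ".data[] | \"\(.name) \(.status)\"" \
| grep -E "^(envoy|pub-api|pub-auth|be-session|be-innkeeper|postgres|valkey) " \
| awk "{ printf \"%-20s %s\n\", \$1, \$2 }"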
echo "---"
curl -ksf https://localhost:2443/healthz/ready && echo "envoy: OK" || echo "envoy: FAIL"
'
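# (2) From the laptop: OAuth round-trip against the SSH-forwarded
# envoy. Returns the access_token if pub-auth + dev CA + tunnel
# all work end to end. No -f here: with -f curl discards the body on
# HTTP 4xx/5xx, so error / error_description would never reach jq.
curl -sS --cacert /tmp/c1-dev-ca.pem \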
--resolve c1dev.c1.ductone.com:12443:127.0.0.1 \
-d grant_type=client_credentials \
-d client_id="$CLIENT_ID" \
-d client_secret="$CLIENT_SECRET" \
https://c1dev.c1.ductone.com:12443/auth/v1/token \
| jq -r '.access_token // .error_description // .error' | head -c 80; echo
# (3) Trivial gRPC roundtrip via the CLI. Empty list = stack is
# healthy and your principal has Latchkey perms.
latchkey \
--c1-url https://c1dev.c1.ductone.com:12443 \
--tls-trust-cert /tmp/c1-dev-ca.pem \
--tls-server-name localhost \
--client-id "$CLIENT_ID" \
--client-secret "$CLIENT_SECRET" \
--format json-line \
vault list
# Expected: {"list":[],"next_page_token":""}
Failure mapping:
- (1) any of envoy/pub-api/pub-auth not in Running: process-compose has flapped. Open pc/attach, restart the failing service, and consult the Verification chain table below for the usual root causes (postgres unix-socket perms, innkeeper Zoho env, etc.).
- (2) returns error/error_description: pub-auth path is up but rejecting the credentials. Re-mint with dev-util mint-test-client and update CLIENT_ID/CLIENT_SECRET.
- (2) curl exits non-zero: SSH tunnel is dead or /etc/hosts lost the c1dev.c1.ductone.com mapping. Re-run the laptop setup one-liners above.
- (3) succeeds with {"list":[]} but you expected vaults: your principal mints but lacks Latchkey perms. Re-check the SystemOwner ServiceRoles + tenant Latchkey FF (Verification table).
- (3) fails with policy_denied (PermissionDenied: ...): same as the previous bullet; you reached pub-api but the role/FF chain is broken.
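For layer (2), the mapping above can be scripted if you capture curl's exit status and the response body separately. The sketch below is one way to do that; classify_token_result is a hypothetical helper, and it matches on the JSON field names as plain text rather than parsing the body.

```shell
# Map the outcome of the token request to the failure mapping above.
# $1 = curl exit status, $2 = response body.
classify_token_result() {
  local rc=$1 body=$2
  if [ "$rc" -ne 0 ]; then
    echo "transport: check the ssh -L tunnel and the /etc/hosts entry"
  elif printf '%s' "$body" | grep -q '"access_token"'; then
    echo "ok"
  elif printf '%s' "$body" | grep -q '"error'; then
    echo "rejected: re-mint with dev-util mint-test-client"
  else
    echo "unknown: inspect the raw body"
  fi
}

# Usage (laptop side). curl runs without -f so an HTTP error still
# returns its body:
#   body=$(curl -sS --cacert /tmp/c1-dev-ca.pem \
#     --resolve c1dev.c1.ductone.com:12443:127.0.0.1 \
#     -d grant_type=client_credentials \
#     -d client_id="$CLIENT_ID" -d client_secret="$CLIENT_SECRET" \
#     https://c1dev.c1.ductone.com:12443/auth/v1/token); rc=$?
#   classify_token_result "$rc" "$body"
```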
Use latchkey auth claims (no extra round-trip) to verify the
principal/tenant the CLI is scoped to before driving any
device-register or per-tenant flow.
Verification chain — what you should see at each step
| Symptom | Meaning |
|---|---|
| transport: error sending request | Stale CA cert. SCP /data/squire/src/c1/.dev/pki/service-ca.crt fresh. |
| Invalid input domain: 'localhost:12443' | Forgot the /etc/hosts entry; URL host needs to be the tenant subdomain. |
| dynamo: no item found (mint-test-client) | be-innkeeper never came up; CrossTenantSettings missing. Restart innkeeper + re-run ensure. |
| not_found (5) from /auth/v1/token | Client_id/secret don't match a row in postgres. Re-run mint-test-client. |
| oauth2 invalid_client (CLI) | Same as above; CLI maps OAuth invalid_client to Unauthenticated. |
| policy_denied (PermissionDenied: ...) | Auth chain works; the user just lacks permissions for the specific RPC. SystemOwnerRoleId's ServiceRoles list is a hand-rolled allowlist in pkg/builtin_roles/builtin_roles.go::GetSystemOwner, and newer services aren't in it by default (e.g. Latchkey). Add latchkey_v1.LatchkeyServiceOwnerRole (or whichever new service-role) to the slice and rebuild + restart pub-api and be-session (be-session is what builds the passport). The persisted role record in dynamo is overlaid by builtin_roles.ApplyBuiltinAttributes on every read, so just rebuilding the binaries is enough; no DB migration needed. |
| unauthenticated | Bearer token invalid or expired (default lifetime is 30 min). Re-run with fresh creds. |
Squire-env-specific caveats
- squire tunnel proxies as a websocket and corrupts TLS handshakes in both directions. Always use ssh -L for TLS-fronted services.
- The default OpenCode model whitelist on a fresh env may be claude-opus-4-7 only. If you spawn an in-env OpenCode agent and set a different model in prompt_async, the call returns silently with ProviderModelNotFoundError and the agent looks frozen. Always cat .config/opencode/opencode.json | jq '.provider.anthropic.whitelist' first.
- OpenCode + opus-4-7 will sometimes hit the Anthropic API assistant message prefill 400 error mid-session and stop streaming. The partial work it did is salvageable: check git log and git ls-remote origin from the env; if a branch is pushed, drive the rest from outside.
- Each squire env's INNKEEPER_INSTALLATION_DOMAIN is set per-env. In cloud-routed envs it's a squire subdomain. In non-cloud-tested envs the default is c1.ductone.com. Always check .dev/env/be-innkeeper.env before composing tenant URLs.
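The last check can be scripted. The sketch below reads the installation domain from the env file and composes the tenant host as tenant.domain, which matches the c1dev.c1.ductone.com default in this runbook. Whether the same composition holds for squire-specific domains is an assumption, and tenant_host is a hypothetical name.

```shell
# Print <tenant>.<INNKEEPER_INSTALLATION_DOMAIN> using the value from an
# env file. Assumes a plain KEY=value line, optionally quoted.
tenant_host() {
  local env_file=$1 tenant=${2:-c1dev} domain
  domain=$(sed -n 's/^INNKEEPER_INSTALLATION_DOMAIN=//p' "$env_file" | tail -n 1 | tr -d "\"'")
  if [ -z "$domain" ]; then
    echo "INNKEEPER_INSTALLATION_DOMAIN not set in $env_file" >&2
    return 1
  fi
  printf '%s.%s\n' "$tenant" "$domain"
}

# Usage (inside the env):
#   tenant_host /data/squire/src/c1/.dev/env/be-innkeeper.env
```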
Running c1 integration tests in a Squire env (no docker)
This is a different goal from standing up the running stack above. The
tests/... integration suites (e.g. tests/api_no_uplift, run in CI as the
buildkite go-...-testapinouplift-api-no-uplift shard) don't need the
process-compose services — the suite starts the c1 services in-process.
They only need the data backends: postgres, dynamodb-local, temporal, valkey,
and an S3 endpoint.
CI runs them with TEST_TEST_CONTAINER=true, which spins those up as
docker containers (ci/integration.sh). Squire envs have no docker
daemon — neither the base image nor the c1 image — so the testcontainer
path is dead. Use the docker-less path instead: TEST_LOCAL_EXEC=true
(pkg/utest/integration.go → newLocalResourceClient), which runs every
backend as a native binary on PATH. The c1 nix localdev devshell already
provides postgres, temporal, valkey, and java; three things it does not
set up for you:
- DynamoDBLocal.jar: newLocalResourceClient.allocateDynamoDB looks only in /home/dynamodblocal or /usr/local/dynamodblocal (both root-owned). The squire user has passwordless sudo:
  sudo mkdir -p /usr/local/dynamodblocal && sudo chown "$(id -un)" /usr/local/dynamodblocal
  curl -sSL https://d1ni2b6xgvw0s0.cloudfront.net/v2.x/dynamodb_local_latest.tar.gz \
  | tar xz -C /usr/local/dynamodblocal
- An S3 endpoint on 127.0.0.1:34567: allocateS3 connects to it (creds dummyaccesskey/dummysecret) but never starts it, and nothing in process-compose does either. Run minio (arm64 env):
  curl -sSL -o /tmp/minio https://dl.min.io/server/minio/release/linux-arm64/minio && chmod +x /tmp/minio
  MINIO_ROOT_USER=dummyaccesskey MINIO_ROOT_PASSWORD=dummysecret \
  setsid -f /tmp/minio server /tmp/minio-data --address 127.0.0.1:34567
- A UTF-8 locale: pgtest's initdb inherits the shell locale, which is unset on a fresh env. With LC_ALL=C the DB comes up SQL_ASCII and every query dies with simple protocol queries must be run with client_encoding=UTF8 (the config sets PreferSimpleProtocol: true). Use C.UTF-8 (present as C.utf8 in locale -a), not C.
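Two of the three prerequisites above can be verified offline before starting a 25-minute test run. This is a sketch with hypothetical helper names. The jar filename DynamoDBLocal.jar and the two default directories come from the text above. The S3 check is left out because minio's health endpoint is already shown in Step 4b.

```shell
# Print the first DynamoDBLocal.jar found in the given directories
# (defaults: the two locations the local resource client searches).
find_dynamodb_jar() {
  local dir
  [ "$#" -gt 0 ] || set -- /home/dynamodblocal /usr/local/dynamodblocal
  for dir in "$@"; do
    if [ -f "$dir/DynamoDBLocal.jar" ]; then
      echo "$dir/DynamoDBLocal.jar"
      return 0
    fi
  done
  return 1
}

# Read `locale -a` output on stdin; print C.UTF-8 if the C UTF-8 locale is
# installed (listed as C.utf8 or C.UTF-8), otherwise fail.
pick_utf8_locale() {
  if grep -qixE 'C\.utf-?8'; then
    echo "C.UTF-8"
  else
    return 1
  fi
}

# Usage (inside the env):
#   find_dynamodb_jar || echo "DynamoDBLocal.jar missing"
#   LC_ALL=$(locale -a | pick_utf8_locale) || echo "no C.UTF-8 locale"
```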
Then run the suite (use go -C since the block-cd hook rejects cd; the
first compile is slow, the binary caches after):
nix develop /data/squire/src/c1#localdev --command bash -c '
export TEST_LOCAL_EXEC=true LC_ALL=C.UTF-8 LANG=C.UTF-8 LC_CTYPE=C.UTF-8
go -C /data/squire/src/c1 test -vet=off -count=1 -v \
./tests/api_no_uplift/... -run TestAPINoUplift -timeout 25m
'
The buildkite shard name maps directly to this: TEST_CASE="TestAPINoUplift|api_no_uplift"
→ -run "TestAPINoUplift" ./tests/api_no_uplift/... (split on the last |,
see ci/integration.sh). Launch it detached (setsid -f ... > log; touch done)
and poll the log — squire ssh sessions drop on long holds.
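The shard-name mapping is a split on the last "|", which shell parameter expansion handles directly. The sketch below follows the single example above; the ./tests/NAME/... path shape is generalized from that one example and should be checked against ci/integration.sh. split_test_case is a hypothetical name.

```shell
# Split a buildkite TEST_CASE value on its LAST "|" and print the
# corresponding `go test` arguments.
split_test_case() {
  local tc=$1
  case $tc in
    *'|'*) ;;
    *) echo "no '|' in TEST_CASE: $tc" >&2; return 1 ;;
  esac
  # ${tc%|*} keeps everything before the last "|"; ${tc##*|} keeps what follows it.
  printf -- '-run %s ./tests/%s/...\n' "${tc%|*}" "${tc##*|}"
}

# Usage:
#   split_test_case "TestAPINoUplift|api_no_uplift"
```

Because only the last "|" is treated as the separator, a -run pattern that itself contains "|" alternation stays intact.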
Caveat — this does not match CI's postgres. local-exec uses the devshell's
native postgres (currently 18.3); buildkite's testcontainer pins ECR
postgres:2 (an older major). A test that passes here can still fail in
buildkite (and vice-versa) when the behavior is postgres-version-dependent —
e.g. partitioned-table schema handling. So a green local-exec run rules out
"broken on modern pg / general staleness," but it does not clear a
buildkite failure. To reproduce a version-specific failure you need docker +
the pinned image (TEST_TEST_CONTAINER=true with
TEST_TEST_CONTAINER_POSTGRES_IMAGE set), which means a docker host, not squire.
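Because the postgres major version is the variable that matters here, it is worth recording it next to each local-exec result. A minimal sketch, assuming the standard "postgres (PostgreSQL) X.Y" format printed by postgres --version; pg_major is a hypothetical name.

```shell
# Read `postgres --version` output on stdin and print the major version.
pg_major() {
  sed -nE 's/.*\(PostgreSQL\) ([0-9]+).*/\1/p'
}

# Usage (inside the devshell):
#   echo "local-exec run against postgres major $(postgres --version | pg_major)"
```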
Cleanup
# stop the env (preserves state — restart with `start_env`)
squire env <env-id> # selects
# or via MCP from inside another env: stop_env tool
# delete entirely
squire env delete <env-id>
State on EFS persists between stop/start; the dev CA + postgres data + minted clients all survive.