name: rbac-expert
description: "Authorization/RBAC navigation for cogni-template — points at the canon (OpenFGA model, AuthorizationPort, rbac.md invariants, the access-request flow, the hardening roadmap) and captures the durable mental model + hard-won gotchas that aren't obvious when you read it: OpenFGA is the sole authority, principal→role→capability, deny-by-default / fail-closed-with-distinction, why authorization is undefined, immutable hashed models, the request→approve→flight grant loop, and which checks aren't wired yet. Use when adding an authz check to a route/tool, designing a new protected action or role, debugging authz_denied vs authz_unavailable, deciding why authorization is undefined, granting/revoking node access, or touching packages/authorization-core / OpenFgaAuthorizationAdapter / infra/openfga/rbac-model.json / scripts/ci/bootstrap-openfga.sh / node_access_requests / POST /api/v1/nodes/{id}/{access-requests,developers} / POST /api/v1/vcs/flight. Triggers: 'OpenFGA', 'RBAC', 'ReBAC', 'authorization', 'AuthorizationPort', 'authz check', 'node.flight', 'can_flight', 'developer role', 'access request', 'approve agent', 'grant access to a node', 'tuple write', 'writeRelation', 'authz_denied', 'authz_unavailable', 'deny by default', 'fail closed', 'subjectId', 'on-behalf-of', 'delegation', 'OPENFGA_STORE_ID', 'authorization model', 'immutable model', 'bootstrap-openfga', 'rbac-model.json', 'add a role', 'add a permission', 'why is authz undefined', 'why 503 authz_unavailable', 'production_promoter', 'preview_promoter', 'can_promote_production', 'NODE_ACCESS_ROLES', 'validate an rbac extension', 'prove a grant works', 'candidate-flight-infra', 'two-lever bootstrap', 'role-grant workflow'."
RBAC Expert
Navigation for authorization in cogni-template. This file deliberately does NOT restate the model, the invariant text, the action map, or the roadmap — those live in canon and rot if copied. It points at canon and captures what isn't obvious once you're reading it.
Read the canon (don't duplicate it here)
| Source | Owns — go here for the current truth |
|---|---|
| docs/spec/rbac.md | Numbered invariants, the ReBAC model + DSL, the action→relation table, §6 Node Access Request Flow, the candidate-flight use case |
| infra/openfga/rbac-model.json | The authored, immutable model: the real SSOT for what's grantable |
| packages/authorization-core/src/index.ts | AuthorizationPort (check / writeRelation / deleteRelation), decision codes, relationForAuthzAction() (the action→relation SSOT; read it, don't memorize a copy), resource helpers |
| docs/spec/identity-model.md | Principals; actorId (runtime string) vs actor_id (economic-subject column) |
| work/projects/proj.rbac-hardening.md | Live roadmap + as-built status (what's wired vs pending); the authority on "is X enforced yet" |
| docs/spec/access-control-charter.md | Layer-cake framing (Identity → AuthN → AuthZ → Secrets → DAO) |
The mental model (the durable part)
- OpenFGA is the SOLE authority for permission + delegation. ToolPolicy + grant-intersection run before it as capability/safety gates ("does this capability exist?"), never as authz. Never add a second authority: no per-service role tables; tracking rows (`node_access_requests`) are display state, never read by a `check()`.
- Principal → role → capability. You grant a role (a directly-assignable relation, e.g. `developer`); OpenFGA derives the capability (a computed relation, e.g. `can_flight from developer`); a route checks the capability via `relationForAuthzAction()`. Adding an access level = add a role relation + its `can_X from <role>` in the model, then map the action. The principal (who: `user:` / `agent:` / `service:`) is orthogonal to the role.
- Dual-check on-behalf-of: when a `check()` carries `subjectId`, BOTH must pass: the subject has the permission AND the actor `delegates` for the subject. `subjectId` is bound server-side only (never from a body/arg).
- Two invariants bite hardest (full numbered set in rbac.md "Core Invariants"): deny-by-default (no tuple ⇒ deny) and fail-closed-with-distinction (infra failure ⇒ deny, coded `authz_unavailable` = 503, distinct from `authz_denied` = 403). Conflating those two hides outages. Also: check before the side effect, never after.
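
The invariants in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the real `AuthorizationPort`: the `AuthzDecision` shape and the `checkTuple`, `checkOnBehalfOf`, `httpStatus`, and `guardedAction` names are hypothetical. Only the `authz_denied` / `authz_unavailable` codes and their 403 / 503 statuses come from the text.

```typescript
// Hypothetical shapes for illustration only; the real port lives in
// packages/authorization-core/src/index.ts and may look different.
type AuthzDecision =
  | { allowed: true }
  | { allowed: false; code: "authz_denied" | "authz_unavailable" };

type Tuple = { user: string; relation: string; object: string };

// Deny-by-default: no matching tuple is a deny, never an allow.
// Fail-closed-with-distinction: a lookup failure is also a deny, but it is
// coded differently so an outage is not mistaken for a policy decision.
function checkTuple(lookup: (t: Tuple) => boolean, t: Tuple): AuthzDecision {
  try {
    return lookup(t) ? { allowed: true } : { allowed: false, code: "authz_denied" };
  } catch {
    return { allowed: false, code: "authz_unavailable" };
  }
}

// Dual-check on-behalf-of: the subject's permission and the actor's
// delegation must both be allowed; the first non-allow decision wins.
function checkOnBehalfOf(
  subjectHasPermission: AuthzDecision,
  actorDelegates: AuthzDecision,
): AuthzDecision {
  return subjectHasPermission.allowed ? actorDelegates : subjectHasPermission;
}

function httpStatus(d: AuthzDecision): number {
  if (d.allowed) return 200;
  return d.code === "authz_denied" ? 403 : 503;
}

// Check before the side effect: the effect runs only after an allow.
function guardedAction(d: AuthzDecision, sideEffect: () => void): number {
  if (d.allowed) sideEffect();
  return httpStatus(d);
}
```

Keeping the two deny codes as separate union members means a caller cannot collapse them by accident: mapping to a status forces an explicit choice between 403 and 503.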
The grant loop (node access — the product surface)
rbac.md §6. register → agent POST /nodes/{id}/access-requests {role} (files a tracking row; owner sees it in the Agents UI) → owner POST /nodes/{id}/developers {agentUserId, decision, role} (writes/deletes the OpenFGA role tuple — the authority; the row transition is best-effort) → the gated route enforces the capability. The flow is role-general: role ∈ NODE_ACCESS_ROLES (developer→can_flight, production_promoter→can_promote_production); the approve route writes relation:<role> (default developer). Two flight paths share the node.flight check: the direct route and the core__vcs_flight_candidate graph tool (gated as tool.execute).
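
The loop above can be modelled in memory to make the split between display state and authority concrete. This is a sketch under stated assumptions: `InMemoryGrantLoop`, its method names, and the tuple key format are hypothetical stand-ins, and only the role names, the role→capability pairs, and the default role come from the text.

```typescript
type NodeRole = "developer" | "production_promoter";
type OwnerDecision = "approve" | "reject";

// Role → derived capability, as named in the text.
const CAPABILITY_FOR_ROLE: Record<NodeRole, string> = {
  developer: "can_flight",
  production_promoter: "can_promote_production",
};

class InMemoryGrantLoop {
  // Tracking rows: display state only, never consulted by can().
  readonly requests: string[] = [];
  // Stand-in for OpenFGA role tuples (hypothetical key format).
  private tuples: string[] = [];

  private key(agentUserId: string, role: NodeRole, nodeId: string): string {
    return "user:" + agentUserId + "#" + role + "@node:" + nodeId;
  }

  // Agent files a request; this grants nothing.
  request(agentUserId: string, nodeId: string, role: NodeRole): void {
    this.requests.push(this.key(agentUserId, role, nodeId));
  }

  // Owner decision writes or deletes the role tuple (default role: developer).
  decide(
    agentUserId: string,
    nodeId: string,
    decision: OwnerDecision,
    role: NodeRole = "developer",
  ): void {
    const k = this.key(agentUserId, role, nodeId);
    this.tuples = this.tuples.filter((t) => t !== k);
    if (decision === "approve") this.tuples.push(k);
  }

  // Gated route: checks the capability, derived from role tuples only.
  can(agentUserId: string, nodeId: string, capability: string): boolean {
    const roles = (Object.keys(CAPABILITY_FOR_ROLE) as NodeRole[]).filter(
      (r) => CAPABILITY_FOR_ROLE[r] === capability,
    );
    return roles.some(
      (r) => this.tuples.indexOf(this.key(agentUserId, r, nodeId)) !== -1,
    );
  }
}
```

Note that `can()` never touches `requests`: a filed request with no owner decision leaves the capability denied.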
A capability relation with no grantable role is inert. Adding `can_X from <role>` to the model is only half the work: the role must ALSO be in `NODE_ACCESS_ROLES` + the access-request CHECK + writable by the approve route, or no principal can ever hold it. (This is why `production_promoter` shipped with the role-grant path, not after.)
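
One way to catch an inert capability early is a small consistency check between the model's derivations and the grantable-role list. The sketch below is hypothetical: `DERIVATIONS`, `GRANTABLE_ROLES`, and `inertCapabilities` are illustrative names with hand-copied data, not something the repo is known to contain. The real sources are `rbac-model.json` and `NODE_ACCESS_ROLES`.

```typescript
interface Derivation {
  capability: string; // computed relation, e.g. can_flight
  role: string;       // directly-assignable relation it derives from
}

// Hand-copied for illustration; the model file is the source of truth.
const DERIVATIONS: readonly Derivation[] = [
  { capability: "can_flight", role: "developer" },
  { capability: "can_promote_production", role: "production_promoter" },
];

// Illustrative copy of the roles the grant API accepts.
const GRANTABLE_ROLES: readonly string[] = ["developer", "production_promoter"];

// Returns every capability whose backing role cannot be granted via the API.
function inertCapabilities(
  derivations: readonly Derivation[],
  grantableRoles: readonly string[],
): string[] {
  return derivations
    .filter((d) => grantableRoles.indexOf(d.role) === -1)
    .map((d) => d.capability);
}
```

An empty result means every capability has a role some principal can actually hold.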
Proof a grant actually works: the gated route returns `403 authz_denied` before approval and flips to a downstream error (e.g. `catalog_missing` / preflight) after. RBAC passed; the failure moved past it.
`developer` grants TWO planes (rbac.md §6a, `PUSH_LOGIN_FROM_REQUEST`, proven on candidate-a 2026-06-24). A `developer` approval is not just the OpenFGA tuple; it ALSO provisions GitHub branch-push on the node's repo for the agent's GitHub identity. The agent declares its own `githubLogin` on the access REQUEST (`SELF_REQUEST_ONLY`); the owner's Approve click supplies NO login. The operator App (privilege bridge; agent holds no standing GitHub admin) resolves the node's own repo via `resolveNodeRepo` (catalog `source_repo`), NOT `nodes.repoOwner`/`repoName`, which is the submodule-parent monorepo (bug.5054, the cause of an `App not installed on Cogni-DAO/cogni (404)` mis-grant). The agent then auto-accepts the GitHub invite with its own token (no human). Branch-push is best-effort: failure (`branchPush: error` / `skipped:*`, observable via `routeId="nodes.developers"` + `githubStatus`) never reverses the authoritative tuple. Two contributor tiers: trusted = branch-push (this), anonymous = fork-PR.
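
The ordering of the two planes, and the rule that the second never undoes the first, can be sketched as follows. All names here (`approveDeveloper`, `ApprovalResult`, the callback parameters) are hypothetical, and treating a missing `githubLogin` as the reason for a skip is an assumption; the text only says that `skipped:*` outcomes exist.

```typescript
type BranchPushOutcome = "granted" | "error" | "skipped";

interface ApprovalResult {
  tupleWritten: boolean;
  branchPush: BranchPushOutcome;
}

function approveDeveloper(
  // Plane 1: writes the OpenFGA role tuple (the authority).
  writeTuple: () => void,
  // Plane 2: provisions GitHub branch-push. Assumption: null models the case
  // where the request declared no githubLogin.
  grantBranchPush: (() => void) | null,
): ApprovalResult {
  // A failure here propagates: with no tuple there is no approval.
  writeTuple();

  if (grantBranchPush === null) {
    return { tupleWritten: true, branchPush: "skipped" };
  }
  try {
    grantBranchPush();
    return { tupleWritten: true, branchPush: "granted" };
  } catch {
    // Best-effort: report the failure for observability, keep the tuple.
    return { tupleWritten: true, branchPush: "error" };
  }
}
```

The asymmetry is deliberate in this sketch: only the first plane can fail the approval, so a GitHub-side problem shows up as a status to observe rather than as a rollback.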
Validate an RBAC extension end-to-end (API + Grafana — NEVER SSH)
Every new role/capability is proven on candidate-a entirely over HTTP, observed in Loki. Do not SSH the VM to write tuples or read OpenFGA — the grant API is the surface. If a role can't be granted via API, that's the bug to fix (see grant loop), not an SSH workaround.
Setup — one owner session + one fresh requester agent:
- Owner session = captured `.local-auth/candidate-a-operator.storageState.json` (Bearer also works; the gated routes resolve Bearer→session).
- Requester = `POST /api/v1/agent/register {name}` → `{userId, apiKey}`.
- Billing-before-authz gotcha: the gated routes check a billing account before the authz check (mirrors flight). A fresh principal 403s `billing_account_missing` and never reaches the gate, masking it. Provision one by hitting any BYO-AI status route once with the principal's Bearer: `GET /api/v1/auth/openai-compatible/status` get-or-creates the billing account.
- Need a node you own → `POST /api/v1/nodes {slug, chainId}` returns its `id`.
The four-state proof (gated route = the one your action maps to, e.g. POST /api/v1/deploy/promote {nodeId, env:"production"} for can_promote_production):
- deny-by-default → `403 authz_denied` (billing present, no role tuple).
- grant → requester `POST /nodes/{id}/access-requests {role}`; owner `POST /nodes/{id}/developers {agentUserId, decision:"approve", role}`.
- flip → re-hit the gated route → flips off `authz_denied` to a downstream code (`catalog_missing`, preflight, 200). RBAC passed.
- revoke → owner `…{decision:"reject", role}` → back to `403 authz_denied`. Deny restored.
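
When scripting the proof, the pass criterion can be written down as a small predicate over the three gated-route responses (before grant, after grant, after revoke). This is a sketch: the `Observation` shape and the function names are hypothetical, and it assumes the error code can be read from the response as a string.

```typescript
// One observation per hit on the gated route. `code` is the error code from
// the response, or null when the call succeeded with no error code.
interface Observation {
  status: number;
  code: string | null;
}

const isDenied = (o: Observation): boolean =>
  o.status === 403 && o.code === "authz_denied";

// "RBAC passed" means the response carries neither authz code, whatever
// else went wrong (or right) downstream.
const isPastAuthz = (o: Observation): boolean =>
  o.code !== "authz_denied" && o.code !== "authz_unavailable";

function ladderHolds(
  beforeGrant: Observation,
  afterGrant: Observation,
  afterRevoke: Observation,
): boolean {
  return isDenied(beforeGrant) && isPastAuthz(afterGrant) && isDenied(afterRevoke);
}
```

A first response of `403 billing_account_missing` fails `isDenied`, which keeps the billing-before-authz mask from being read as a passing deny-by-default step.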
Observability (tier-1, ties to YOUR request): each route logs route="<routeId>" — deploy.promote, nodes.developers, nodes.access-requests, vcs.flight. Query {namespace="cogni-candidate-a", pod=~"operator-node-app-.*"} | json | route="deploy.promote" and match the status ladder (403→…→403) to your exercise window. scripts/loki-query.sh '<logql>' <mins> <limit> — export GRAFANA_URL+GRAFANA_SERVICE_ACCOUNT_TOKEN inline (.env.cogni has placeholder lines that break set -a; source).
The two-lever bootstrap trap (503-vs-403 tell) — the reason you'd be tempted to SSH: candidate-flight (app lever) deploys only the app image; it does NOT bootstrap the OpenFGA model. A PR that adds/renames a relation ships the app, but the deployed store still runs the old model → your gated route returns 503 authz_unavailable (the check resolves a relation the model lacks → fail-closed), NOT authz_denied. Fix is a second lever, not a hand-edit: gh workflow run candidate-flight-infra.yml --ref <your-branch> → deploy-infra.sh → bootstrap-openfga.sh mints the new model, repoints OPENFGA_AUTHORIZATION_MODEL_ID in the operator config, and rollout restarts the pods. Diagnostic: an existing-relation route (vcs/flight→can_flight) returning 403 authz_denied while your new-relation route returns 503 proves the adapter is healthy and only your relation is missing → model-bootstrap lever, not a code bug. candidate-a mirrors preview/prod only when both levers run (preview/prod get the model via promote-and-deploy's deploy-infra job on merge).
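
The diagnostic above reduces to a two-input decision. The sketch below encodes it with hypothetical names (`diagnose`, the `Diagnosis` labels). Only the first case (existing relation denied, new relation unavailable) is stated in the text; the other two branches are inferences and are labelled as such in the comments.

```typescript
type AuthzCode = "authz_denied" | "authz_unavailable";

type Diagnosis =
  | "run-infra-lever"            // deployed model lacks the new relation
  | "adapter-or-store-unhealthy" // inference: nothing resolves at all
  | "relation-present";          // inference: the model already knows the relation

function diagnose(
  existingRelationRoute: AuthzCode, // e.g. vcs/flight → can_flight
  newRelationRoute: AuthzCode,      // the route mapped to your new relation
): Diagnosis {
  // A plain deny on the new route means the check resolved, so the relation
  // exists in the deployed model and deny-by-default is doing its job.
  if (newRelationRoute === "authz_denied") return "relation-present";

  // New route is unavailable. If a known-good relation still resolves to a
  // deny, the adapter is healthy and only the new relation is missing.
  return existingRelationRoute === "authz_denied"
    ? "run-infra-lever"
    : "adapter-or-store-unhealthy";
}
```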
Gotchas (hard-won, not in the spec)
- `authorization` is `undefined` until `OPENFGA_STORE_ID` exists. `container.ts` (~L842) builds the adapter only when `OPENFGA_API_URL` and `OPENFGA_STORE_ID` are both set: reachability ships before policy. Prod has a LIVE OpenFGA store since 2026-06-14, so RBAC is enforced on prod (e.g. `production_promoter` was exercised end-to-end there), NOT candidate-a-only. Where a store is absent, `/developers` returns `503 authz_unavailable` and flight falls back to the V0 owner-only check. Verify per env before relying on it: candidate-a + prod have stores; preview's store status should be confirmed against the env, not assumed. (Corrects the prior "prod + preview have no store" note.)
- `authz_unavailable` (503) ≠ `authz_denied` (403). A timeout/outage is unavailable, not denied.
- Models are immutable + hashed. Editing `rbac-model.json` mints a new model version on next bootstrap; tuples reference relations by name, so renaming a live relation (e.g. `developer`) is a migration, not an edit. Add relations; don't rename live ones.
- The model is principal-agnostic: `node.developer: [user, agent]` accepts both today. V0 grants `user:{agent_user_id}` (agents register as users); an `agent:{actor_id}` form later is additive (new `@agent:` tuples; no model change, no tuple rewrite). Not debt, not split-brain. Never narrow `developer` to `[user]`.
- Not every action is enforced yet. `node.flight` + `tool.execute` are wired; `graph.invoke` and `connection.use` checks are still pending (see `proj.rbac-hardening.md`). Don't assume a capability is gated; verify in the route before relying on it.
- Don't overload reserved identity terms. `scope` → reserved for `scope_id` (governance); `actor` → reserved for `actor_id` (economic) + the `actorId` principal string. Principals are agent/user/service, which is why the access-request column is `role`, not `scope`.
- Never read `node_access_requests` to authorize: it's display/UX; the OpenFGA tuple is the authority.
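
The first gotcha is easier to keep in mind as code. The sketch below is not the contents of `container.ts`; `AuthzEnv`, `AuthorizationHandle`, `buildAuthorization`, `flightMode`, and `developersPrecheck` are hypothetical names that restate the behaviour described in the list: the adapter exists only when both variables are set, flight falls back to the V0 owner-only check without it, and `/developers` answers 503.

```typescript
interface AuthzEnv {
  OPENFGA_API_URL?: string;
  OPENFGA_STORE_ID?: string;
}

// Stand-in for the constructed adapter.
interface AuthorizationHandle {
  apiUrl: string;
  storeId: string;
}

// Both variables are required; either one missing yields undefined.
function buildAuthorization(env: AuthzEnv): AuthorizationHandle | undefined {
  const apiUrl = env.OPENFGA_API_URL;
  const storeId = env.OPENFGA_STORE_ID;
  if (!apiUrl || !storeId) return undefined;
  return { apiUrl: apiUrl, storeId: storeId };
}

type FlightMode = "openfga-check" | "v0-owner-only";

// Flight degrades to the V0 owner-only check when there is no adapter.
function flightMode(authorization: AuthorizationHandle | undefined): FlightMode {
  return authorization === undefined ? "v0-owner-only" : "openfga-check";
}

// The grant route cannot write a tuple without a store, so it reports
// 503 authz_unavailable; null means "proceed to the tuple write".
function developersPrecheck(authorization: AuthorizationHandle | undefined): 503 | null {
  return authorization === undefined ? 503 : null;
}
```

Returning `undefined` rather than a no-op adapter keeps the missing-store case visible at every call site, which is why each route has to decide what it does without authorization.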
Where each surface lives
| Surface | File |
|---|---|
| Authz construction (and when it's undefined) | nodes/operator/app/src/bootstrap/container.ts (~L842) |
| node.flight enforcement + V0 owner fallback | nodes/operator/app/src/app/api/v1/vcs/flight/route.ts |
| Role tuple write/delete (approve/deny/revoke), role-aware | nodes/operator/app/src/app/api/v1/nodes/[id]/developers/route.ts |
| Agent access request (role enum) | nodes/operator/app/src/app/api/v1/nodes/[id]/access-requests/route.ts |
| Tracking schema + NODE_ACCESS_ROLES + CHECK | nodes/operator/app/src/shared/db/node-access-requests.ts, features/nodes/access-requests.ts |
| OpenFGA adapter + deterministic fake | packages/authorization-core/src/adapters/, .../test/ |
| Per-env store/model bootstrap | scripts/ci/bootstrap-openfga.sh (via scripts/ci/deploy-infra.sh) |
| Re-bootstrap the model on candidate-a | gh workflow run candidate-flight-infra.yml --ref <branch> (the infra lever; app lever skips it) |