name: arcana-ai-agent-flow-skill
description: Build & operate an autonomous CI workflow platform on a SINGLE Kogito BPMN engine (SonataFlow retired 2026-06-09) running three processes — ci-flow (red-build remediation with human handoff — park at humanFixTask, resume the agent's Claude session via claude --resume <sid>), merge-flow (verified-green PR automerge + automatic release-please releases), ci-maintenance (hourly read-only health governance) — all feeding one Kogito Data Index, driven by a Rust task-worker dispatching to a Claude agent-task-node, monitored live on an Angular bpmn-js dashboard behind Authelia. Use when the user wants a workflow engine + real-time monitoring dashboard, autonomous CI remediation/merge/release as visible BPMN flows, or AI-to-human handoff with session continuity. Triggers "arcana-ai-agent-flow", "workflow monitor", "工作流監控", "BPMN dashboard", "Kogito Data Index", "green PR automerge", "human handoff", "流程引擎 + dashboard".
skill_version: 1.1.0
created_date: 2026-06-02
skill_type: complex
status: production (deployed to bluesea / workflow.arcana.boo)
arcana-ai-agent-flow skill
An autonomous CI workflow platform: a single Kogito BPMN engine runs three
processes (remediate red builds, automerge + release green PRs, hourly health
governance), an agent fleet drives them (Claude + Jenkins), humans take over
seamlessly when AI can't finish, and everything is watched live on a bpmn-js
dashboard. Built Mac-first, deployed to bluesea behind Authelia 2FA at
https://workflow.arcana.boo.
SonataFlow (the former second engine) was retired 2026-06-09 for three reasons: its only flow (ci-maintenance) was a heartbeat shell; BPMN is a superset of SWF for this platform; and SWF's real edge (Knative scale-to-zero) went unused because both engines ran as always-on containers. ci-maintenance was ported to BPMN, so one engine now runs everything.
What this builds
    Jenkins RunListener (ci-bpmn-trigger.groovy v7)          ci-scheduler (hourly)
      red build ──POST /ci-flow (6h cooldown)──────────┐     POST /ci-maintenance
      green PR build (fleet-wide) ─POST /merge-flow─┐  │              │
                                                    ▼  ▼              ▼
      Kogito BPMN engine (Quarkus, PG persistence, kafka events)
        ci-flow:        Triage(ai)→Build(jenkins)→Fix(ai)⟲→Decide(ai)
                        →endGate→ humanFixTask(human) | End
        merge-flow:     Start→Merge(ai)→Release(ai)→End
        ci-maintenance: Scan→Analyze(ai)→Remediate→Verify (scriptTasks
                        → ci-maint-endpoint, read-only, no docker sock)
          │ process/task events (kafka)
          ▼
      Kogito Data Index (PostgreSQL, GraphQL)  ← one queryable layer
          ▼
      arcana-cloud-rust /api/v1/workflows/* (Axum read-API, BPMN_DIR → bpmn-js XML)
          ▼ (/api proxy, single origin)
      Angular dashboard (bpmn-js diagrams, handoff banner w/ claude --resume cmd)
      +
      workflow-task-worker (RUST): dispatch by task name; group=human NEVER
        auto-completed (parked); reconciler repairs Data Index from engine truth
          ai      → agent-task-node (Claude CLI, persistent session via sid)
          jenkins → Jenkins rebuild
When to use
- User wants a workflow engine + real-time monitoring (task list + live bpmn-js flow diagram), with processes + instances stored in PostgreSQL.
- Orchestrate CI failure remediation as a visible role-based flow (red build → diagnose(ai) → rebuild(jenkins) → fix(ai) → decide(ai)) with human handoff instead of dead-ending: unfixable builds park at a human task and the human resumes the agent's exact Claude session.
- Autonomous green-PR merging + releases: any fleet PR that builds green is verified and squash-merged by the agent, then release-please runs on every merge — release PRs are themselves green → automerged → releases cut full-auto (requires conventional commits; Renovate PRs qualify).
- Hourly health governance as an auditable flow: read-only scan → AI analysis (severity + recommendation) → bounded remediation → verify, all process vars visible in the dashboard (KPI/audit).
The three processes (templates/kogito-bpmn/*.bpmn2 — production copies)
| Process | Shape | Notes |
|---|---|---|
| `ci-flow` | `Triage(ai)→Build(jenkins)→[fixable? Fix(ai)→Build ⟲3]→Decide(ai)→endGate` | endGate: green or AI-judged-merged → End; else → `humanFixTask` (group=human), parked until a human completes it with `out=verify` (re-Build) or `out=giveup` (→ failEnd). The `sid` process var threads ONE Claude conversation through triage/fix/decide and is what the human resumes. |
| `merge-flow` | `Start→Merge(ai)→Release(ai)→End` | Merge: the agent re-checks `gh pr view` / `gh pr checks` (open + green + no conflicts), then runs `gh pr merge --squash --delete-branch`. Release: via agent `/task/release`. FIRST, a scoped claude readme-sync pass: it syncs README version claims against the repo's dependency manifests via `gh api`, PLUS the dynamic Tests badge from the latest green main build's Jenkins console and the Coverage badge from the SonarQube measures API (`coverage` metric, projectKey read from the Jenkinsfile); it commits `docs: sync README versions + CI badges` if stale and leaves a badge unchanged if the number can't be determined reliably. THEN, deterministic `npx release-please@16 github-release` + `release-pr` (released detection = ground-truth latest-tag before/after, not output parsing); repos without release-please-config are skipped. `POST /task/readmesync {repo}` also works standalone. |
| `ci-maintenance` | `Scan→Analyze(ai)→Remediate→Verify` | scriptTasks call `boo.arcana.MaintHttp` → ci-maint-endpoint (`/scan`: disk + Jenkins + cron results; `/remediate`: only re-onlines Jenkins nodes; `/verify`). Analyze = AI severity/recommendation. Execution stays on host cron; the flow is read-only orchestration + record. |
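The ci-flow routing in the table can be read as a small decision function. The sketch below is illustrative only: the real routing lives in the `.bpmn2` gateways, Python is used purely for brevity, and the names `BuildState`, `after_build`, `end_gate` and `after_human` are hypothetical. It also assumes the Fix loop is entered only while the build is still red, which the table does not state explicitly.

```python
from dataclasses import dataclass

MAX_FIX_ATTEMPTS = 3  # the "⟲3" bound on the Fix→Build loop


@dataclass
class BuildState:
    green: bool                  # did the last Build(jenkins) pass?
    fixable: bool                # triage verdict (hypothetical field name)
    fix_attempts: int            # Fix(ai) passes used so far
    judged_merged: bool = False  # Decide(ai) concluded the work already landed


def after_build(state: BuildState) -> str:
    """Loop into Fix only while red, fixable, and under the attempt bound (assumption)."""
    if not state.green and state.fixable and state.fix_attempts < MAX_FIX_ATTEMPTS:
        return "Fix"
    return "Decide"


def end_gate(state: BuildState) -> str:
    """endGate: green or AI-judged-merged ends the run; anything else parks for a human."""
    if state.green or state.judged_merged:
        return "End"
    return "humanFixTask"


def after_human(out: str) -> str:
    """A human completes humanFixTask with out=verify (re-Build) or out=giveup (failEnd)."""
    if out == "verify":
        return "Build"
    if out == "giveup":
        return "failEnd"
    raise ValueError(f"unexpected human outcome: {out!r}")
```

Keeping the human outcome as an explicit two-value choice mirrors the table: `verify` re-enters the normal Build path rather than ending the run directly.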
Components (templates/)
| Path | What |
|---|---|
| `kogito-bpmn/` | Quarkus 3.8.4 + Kogito 10 BPMN engine (flattened standalone pom, PG persistence, kafka events addon). Ships all three `.bpmn2` (production copies). userTasks are GroupId-assigned (ai/jenkins/human). |
| `workflow-task-worker/` | Rust poller (`main.rs`, image `arcana/task-worker:1.3.0`): ready Data-Index tasks → dispatch by lowercased task name (triage/build/fix/decide/analyze/merge/release). Task-level tokio concurrency (fix=1, ai=2, jenkins=3). group=human is NEVER auto-completed: it is parked (stays Ready, logs `⏸ PARKED` once). `with_sid()`/`pick_sid()` thread the Claude session id through ai tasks. Reconciler (every `RECONCILE_SECS=300`, writes DI's PG directly) repairs Data-Index drift from engine truth both ways, so it survives kafka outages. `MODE=auto` (local synth) / `real` (prod). |
| `read-api/` | `workflow_controller.rs` (engine-agnostic endpoints incl. `/definitions/{id}/bpmn` → raw XML for bpmn-js), `data_index.rs` (GraphQL client), `bpmn.rs` (sequence-flow edges + GroupId roles), `Dockerfile.flow` (installs protobuf-compiler). Drop into a copy of the arcana-cloud-rust template; repository must be PostgreSQL. Reads `BPMN_DIR=/app/bpmn`. |
| `dashboard/` | `nginx.conf` (SPA + `/api` proxy via resolver+variable) + `Dockerfile` (node:24, `npm install`). Angular: multi-instance table + bpmn-js diagram (falls back to custom SVG only if no BPMN XML). Handoff banner: a run with a Ready human-group task shows an amber banner with the sid + a copyable `docker exec -it agent-task-node claude --resume <sid>`. `nodeStatus()` honors instance state (FaultNode→Failed when terminal, not perma-Running). |
| `bluesea-jenkins/ci-bpmn-trigger.groovy` | RunListener v7 (production copy): red build → `POST /ci-flow` (6h per-job cooldown); green PR build, fleet-wide (`.*-app(-pipeline)?-mb/.*` + `CHANGE_URL`) → `POST /merge-flow {job,prUrl}`. Install to `init.groovy.d`, hot-apply via `/scriptText`. |
| `docker-compose.bluesea.yml` (+ `.mac.yml`, `deploy-bluesea.sh`, `kogito-pg-init/`) | Production compose (synced): kogito-pg (3 DBs), kogito-bpmn, ci-maint-endpoint (Rust Axum, `/data` + `/var/log` read-only, Jenkins API, zero docker socket), data-index, read-API, dashboard, task-worker (`RECONCILE_GROUPS=ai,jenkins,human`), ci-scheduler (hourly ci-maintenance POST). |
| `docker-compose.mac.yml` | Local stack (adds its own kafka). |
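The worker's dispatch rule (route by lowercased task name, never auto-complete the human group) can be sketched as follows. This is not the production code, which is Rust in `main.rs`; Python is used only to keep the illustration short, and `ROUTES`, `dispatch` and the `"unknown"` fallback are hypothetical names and behaviour chosen for the sketch.

```python
from typing import Set

# Task name -> executor, following the roles in the flows above:
# every task is handled by the Claude agent except Build, which goes to Jenkins.
ROUTES = {
    "triage": "agent-task-node",
    "fix": "agent-task-node",
    "decide": "agent-task-node",
    "analyze": "agent-task-node",
    "merge": "agent-task-node",
    "release": "agent-task-node",
    "build": "jenkins",
}


def dispatch(task_id: str, task_name: str, group: str, parked_seen: Set[str]) -> str:
    """Return where a Ready task should go; human-group tasks are only ever parked."""
    if group == "human":
        # Leave the task Ready in the engine and log the park exactly once per task.
        if task_id not in parked_seen:
            parked_seen.add(task_id)
            print(f"⏸ PARKED {task_name} ({task_id})")
        return "parked"
    # Unknown names are reported rather than guessed at (assumption for this sketch).
    return ROUTES.get(task_name.lower(), "unknown")
```

Checking the group before the name means a human task can never be completed by accident, even if its name later collides with an automated one.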
Fix-node remediation strategy (ci-flow Fix(ai))
The Fix node (worker → agent-task-node /task/fix, prompt in server.py) is built to fix autonomously and only escalate when it genuinely can't:
- Archive-first: before reinventing, it `vsearch`/`csearch`es the shared session archive for a proven fix to the same root cause. Every past fix (any session) is recallable; a human's manual fix gets ingested (~15 min) and becomes the agent's playbook for the next occurrence.
- Dependency-major playbook (encoded in the prompt so it's recognised on first sight): renovate `chore(deps)` majors fail in patterned ways:
  - peer-dep coupling: a tooling major can't go alone (e.g. `typescript` 6 is locked to Angular 22; `npm ci` shows ERESOLVE `peer … from @angular/*`). Fix = bundle the framework major via its official codemod (`ng update @angular/core@N @angular/cli@N @angular/cdk@N`), which auto-applies migration schematics, then ONE combined PR.
  - quality-gate coverage drop after a test-runner major (vitest/jest): tests pass but SonarQube `coverage < 80`. The runner changed coverage scope (e.g. vitest v4 newly counts 0%-covered bootstrap/entry files). Fix = add them to `coverage.exclude` (same category as already-excluded `src/index.ts`); never pad fake tests or lower the gate.
  - lockfile out of sync (`renovate/artifacts` failed) → regenerate (`npm install`) + commit.
- Disposable-container build: the agent container has python/java/rust + the docker CLI but only node v22 and no go/gradle. When a fix needs a toolchain it lacks or a newer version (e.g. `ng update` to Angular 22 needs node ≥ 24.15), it builds in a throwaway official-image container exactly like CI (`docker run --rm -v $(pwd):/w -w /w node:24 …`) instead of parking with "can't build locally". (`permissions.allow` has bare `Bash` → docker runs headless.)
- Close the loop: applies the fix to the PR-head branch the failing pipeline will rebuild (feature branches aren't protected); only fixes that must target main open a PR + stop (review-gated).
Self-fixes vs parks: code-level API breaks (fixed go#31 mongo-driver coverage.out → merged), recurrences, and known patterns → self-fix; a genuinely-novel failure the first time → park for human handoff (below). Big framework majors stay review-gated even when buildable (pushed to the PR branch, never auto-merged to main).
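The dependency-major playbook is encoded in the agent prompt, not in code, but its three patterns can be illustrated as a tiny classifier. Everything here is a hypothetical sketch: the function name, its inputs, and the returned labels are assumptions made for illustration, and only the signals themselves (ERESOLVE peer conflicts, a failed `renovate/artifacts` check, green tests with coverage under 80) come from the playbook above.

```python
from typing import Iterable, Optional

COVERAGE_GATE = 80.0  # SonarQube quality-gate threshold named in the playbook


def classify_failure(
    build_log: str,
    failed_checks: Iterable[str],
    tests_passed: bool,
    coverage: Optional[float],
) -> Optional[str]:
    """Map a red dependency-major build to a known playbook entry, or None if novel."""
    if "ERESOLVE" in build_log and "peer" in build_log:
        # A tooling major that cannot go alone: bundle the framework major with it.
        return "peer-dep-coupling"
    if "renovate/artifacts" in set(failed_checks):
        # Lockfile out of sync: regenerate and commit.
        return "lockfile-out-of-sync"
    if tests_passed and coverage is not None and coverage < COVERAGE_GATE:
        # Tests are green but the gate is not: the runner changed coverage scope.
        return "coverage-drop"
    return None  # no known pattern: candidate for archive search, then human handoff
```

A `None` result corresponds to the "genuinely novel" case, where the archive search and, failing that, the human handoff take over.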
Human handoff (ci-flow)
- AI can't fix → endGate routes to `humanFixTask` (group=human); worker parks it.
- Dashboard shows the parked run (`currentNode=HumanFix`) + banner with the command: `docker exec -it agent-task-node claude --resume <sid>`, which re-attaches the same Claude conversation the agent used (agent-task-node runs `claude -p` WITH session persistence; `/root/.claude` is a host bind mount so sessions survive recreate).
- Human fixes, then completes the task `out=verify` (loops back to Build to confirm green) or `out=giveup` (→ failEnd).
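As a rough illustration of the banner logic, the sketch below builds the resume command for a run that has a Ready human-group task. The dashboard itself is Angular, so this Python version is only a stand-in; `resume_command` and `handoff_command` are hypothetical names, and the task dictionaries are assumed to carry `state` and `potentialGroups` fields as returned by the Data Index.

```python
import shlex
from typing import Iterable, Mapping, Optional


def resume_command(sid: str, container: str = "agent-task-node") -> str:
    """Build the command a human pastes to re-attach the agent's Claude session."""
    if not sid:
        raise ValueError("a session id is required to resume")
    parts = ["docker", "exec", "-it", container, "claude", "--resume", sid]
    # Quote each part so an unusual sid cannot break out of the command line.
    return " ".join(shlex.quote(p) for p in parts)


def handoff_command(tasks: Iterable[Mapping], sid: str) -> Optional[str]:
    """Return the banner command only when a human-group task is waiting in Ready."""
    for task in tasks:
        groups = task.get("potentialGroups") or []
        if task.get("state") == "Ready" and "human" in groups:
            return resume_command(sid)
    return None
```

For example, `resume_command("abc-123")` yields `docker exec -it agent-task-node claude --resume abc-123`.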
Build & deploy
- Mac-first: `docker compose -f docker-compose.mac.yml up -d --build`; verify Data Index GraphQL (`:8180/graphql`) returns ProcessInstances + UserTaskInstances with `potentialGroups`. (Engine needs `mvn clean package` first: central-only mirror, see gotchas.)
- read-API: copy the arcana-cloud-rust template (don't edit upstream), port repository MySQL→PostgreSQL, add `read-api/` files, nest `/workflows` after the auth layer (no token).
- bluesea: build arm64 images, `docker save | ssh | docker load` (worker + engine can also build ON bluesea: worker is self-contained Rust; engine via maven-docker jar then `docker build`), `./deploy-bluesea.sh [--with-worker]`, front with Authelia. Agent `/task/release` needs `GH_TOKEN` + node/npx in the agent container. See `references/deploy-bluesea.md`.
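The Data Index check in the Mac-first step could be scripted along these lines. This is a minimal sketch under assumptions: the host `localhost` and the exact field selection are guesses to be checked against the running Data Index schema, and `build_request` / `summarize` are hypothetical helper names. Only the port, the two query roots and `potentialGroups` come from the step above.

```python
import json
import urllib.request

# Field selection is an assumption; adjust it to the schema your Data Index exposes.
QUERY = """
{
  ProcessInstances { id processId state }
  UserTaskInstances { id name state potentialGroups }
}
"""


def build_request(url: str = "http://localhost:8180/graphql") -> urllib.request.Request:
    """Prepare the GraphQL POST without sending it."""
    body = json.dumps({"query": QUERY}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict) -> dict:
    """Count instances and tasks, and how many tasks carry potentialGroups."""
    data = response.get("data") or {}
    instances = data.get("ProcessInstances") or []
    tasks = data.get("UserTaskInstances") or []
    return {
        "instances": len(instances),
        "tasks": len(tasks),
        "tasks_with_groups": sum(1 for t in tasks if t.get("potentialGroups")),
    }
```

To run the check against a live stack, open the request with `urllib.request.urlopen(build_request())`, decode the body with `json.load`, and pass the result to `summarize`; a healthy stack with at least one started flow should report non-zero counts.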
Critical gotchas (full list in references/build-gotchas.md)
- Engine Maven build in containers: the jboss.org repo is flaky → use a central-only mirror + `-U` or the build stalls (see memory kogito-bpmn-maven-jboss-trap).
- BPMN XML comments must not contain `--` (e.g. `claude --resume`): Kogito codegen dies with SAXParseException "string -- not permitted".
- Changing a process's node structure without bumping its version leaves stale rows in Data Index `definitions_nodes` (old + new nodes overlay on the same `version=1.0`) → garbled diagram. Bump the version, or DELETE the stale node ids. Instances execute correctly either way.
- After engine `--force-recreate`, restart the task-worker: its in-memory ready cache goes stale (shows N ready while engine/DI have 0).
- Kafka outage ≠ lost instances: the engine is the source of truth; the worker re-checks `complete()` failures against the engine and the reconciler repairs Data Index both ways. Never abort instances off stale DI work-items.
- Events addon needs the kafka connector `quarkus-smallrye-reactive-messaging-kafka` (NOT `quarkus-messaging-kafka`) + the MetricDecorator ArC exclude.
- nginx `/api` proxy: `resolver 127.0.0.11` + `set $upstream …; proxy_pass $upstream;` (no URI) so it survives upstream restarts and forwards the full path.
- New/changed BPMN diagram on the dashboard: ship the `.bpmn2` to `./bpmn/`, restart the read-API, and hard-refresh the SPA (bpmnXml signal cache).
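The `--`-in-comments trap is cheap to catch before the engine build. The following is a small pre-build check one might run over the `.bpmn2` sources; it is a sketch, not part of the shipped templates, and `bad_comments` is a hypothetical helper name. It uses a regular expression instead of an XML parser because a parser would reject the offending file outright without saying which comment is at fault.

```python
import re
from typing import List

# Match each XML comment lazily; the body may span several lines.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def bad_comments(xml_text: str) -> List[str]:
    """Return the bodies of comments that contain a double hyphen."""
    return [
        match.group(1).strip()
        for match in COMMENT.finditer(xml_text)
        if "--" in match.group(1)
    ]
```

Running it over each file's text and failing the build on a non-empty result points directly at the comment to reword (for instance, writing "claude resume flag" instead of the literal flag).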
References
- `references/architecture.md`: single-engine design, the three flows, role model, decisions.
- `references/deploy-bluesea.md`: bluesea runbook (images, compose, worker, Authelia, B2 trigger).
- `references/build-gotchas.md`: every build/deploy trap hit + fix.