halfstack - SKILL.md Agent Skill

name: halfstack
description: "Diagnose and fix Docker Compose halfstack issues: config mapping, service health, DB/Valkey/etcd inspection, supergraph regeneration"
invoke_method: user
auto_execute: false
enabled: true
tags:
  - dev
  - docker
  - halfstack
  - troubleshooting

Halfstack Troubleshooting & Fix

Diagnose and directly fix issues with the Docker Compose halfstack development environment.

When to Use

Docker Compose services fail to start or keep restarting
Config files are missing, stale, or have wrong port/secret values
Supergraph schema needs regeneration after GQL changes
Need to inspect DB, Valkey, or etcd state directly
Halfstack needs to be brought up after a fresh clone or branch switch

Compose File

The runtime compose file is always docker-compose.halfstack.current.yml (project root). It is generated from docker-compose.halfstack-main.yml (or halfstack-ha.yml for HA mode).

Quick Reference Commands

# Check all halfstack services
docker compose -f docker-compose.halfstack.current.yml ps

# Check a specific service's logs
docker compose -f docker-compose.halfstack.current.yml logs <service-name>

# Restart a specific service
docker compose -f docker-compose.halfstack.current.yml restart <service-name>

# Bring everything up
docker compose -f docker-compose.halfstack.current.yml up -d --wait

Service Names & Profiles

Optional services are gated behind Docker Compose profiles. By default (docker compose up -d) only the required services start. To include optional ones, pass --profile <name> before the subcommand; the flag can be repeated to enable several profiles at once.

Service	Image	Purpose	Profile
`backendai-half-db`	postgres:16.3-alpine	Main database	(required)
`backendai-half-redis`	valkey/valkey:9.1.0-alpine	Cache / pub-sub	(required)
`backendai-half-etcd`	etcd v3.5	Config store	(required)
`backendai-half-apollo-router`	Hive Gateway	GraphQL federation (manager has 2 GQL servers federated through this)	(required)
`backendai-half-prometheus`	Prometheus	Metrics — manager queries it for deployment autoscale rule evaluation	(required)
`backendai-half-otel-collector`	OTel Collector	Trace / metric export	`telemetry`, `observability`
`backendai-half-loki`	Loki	Log aggregation	`telemetry`, `observability`
`backendai-half-grafana`	Grafana	Dashboards	`observability`
`backendai-half-tempo`	Tempo	Tracing	`observability`
`backendai-half-pyroscope`	Pyroscope	Profiling	`observability`
`backendai-half-db-exporter`	postgres-exporter	Postgres metrics	`observability`
`backendai-half-redis-exporter`	redis_exporter	Valkey metrics	`observability`
`backendai-half-minio`	MinIO	Object storage	`storage`

Profile semantics:

telemetry — service-level export only (otel-collector + loki). Visualisation (Grafana) and supporting backends (Tempo, Pyroscope, exporters) are typically managed centrally; this profile is a good default for dev installs that just want their logs and traces forwarded.
observability — superset of telemetry. Brings up the full local stack including Grafana / Tempo / Pyroscope / exporters.
storage — MinIO only.

Enabling optional profiles

# Required only (default)
docker compose -f docker-compose.halfstack.current.yml up -d --wait

# + telemetry export (OTel collector + Loki forwarding logs/traces to a central monitor)
docker compose -f docker-compose.halfstack.current.yml --profile telemetry up -d --wait

# + full observability stack (Grafana / Tempo / Pyroscope / exporters in addition to telemetry)
docker compose -f docker-compose.halfstack.current.yml --profile observability up -d --wait

# + object storage (MinIO)
docker compose -f docker-compose.halfstack.current.yml --profile storage up -d --wait

# Everything
docker compose -f docker-compose.halfstack.current.yml --profile observability --profile storage up -d --wait

When stopping/removing, profile flags must also be passed for those containers to be torn down:

docker compose -f docker-compose.halfstack.current.yml --profile observability --profile storage down

scripts/delete-dev.sh already passes both profiles so a clean wipe works regardless of what was enabled.

Docker Configs — Files That Must Exist in Project Root

The compose file declares a configs: section. Docker Compose reads these as files. If a file is missing when docker compose up runs, Docker creates a directory at that path instead. Once a directory exists where a file should be, even copying the correct file won't help — the directory must be removed first.
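
A read-only pre-flight check can catch this situation before docker compose up runs. The following is a minimal sketch, not something shipped in the repository: the function name check_config_paths is hypothetical, and the file list is the one used in the fix procedure in this document. It only reports problems and modifies nothing.

```shell
# Hypothetical helper: report config paths that are missing or are directories.
# Run from the project root. Nothing is created or deleted.
check_config_paths() {
  rc=0
  for f in prometheus.yaml otel-collector-config.yaml loki-config.yaml \
           tempo-config.yaml supergraph.graphql gateway.config.ts; do
    if [ -d "$f" ]; then
      echo "DIRECTORY (must be a file): $f"
      rc=1
    elif [ ! -f "$f" ]; then
      echo "MISSING: $f"
      rc=1
    fi
  done
  # Non-zero when at least one path needs attention.
  return "$rc"
}
```

Calling check_config_paths before bringing services up gives a non-zero exit status whenever a path needs fixing, so it can gate the up command in a wrapper script.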

Fix Procedure for Missing Config Files

Step 1: Stop affected services (or all services):

docker compose -f docker-compose.halfstack.current.yml down

Step 2: Check and remove any directories that should be files:

# These MUST be regular files, not directories
for f in prometheus.yaml otel-collector-config.yaml loki-config.yaml \
         tempo-config.yaml supergraph.graphql gateway.config.ts; do
  [ -d "$f" ] && rm -rf "$f" && echo "Removed directory: $f"
done

# These MUST be directories
for d in grafana-dashboards grafana-provisioning; do
  [ -f "$d" ] && rm -f "$d" && echo "Removed file: $d"
done

Step 3: Copy config files from source (same as scripts/install-dev.sh):

# Docker Compose configs (plain copy, no transformation)
cp configs/prometheus/prometheus.yaml ./prometheus.yaml
cp configs/otel/otel-collector-config.yaml ./otel-collector-config.yaml
cp configs/loki/loki-config.yaml ./loki-config.yaml
cp configs/tempo/tempo-config.yaml ./tempo-config.yaml
cp configs/graphql/gateway.config.ts ./gateway.config.ts

# Supergraph — generated, but can be copied from last known-good
cp docs/manager/graphql-reference/supergraph.graphql ./supergraph.graphql

# Grafana (recursive directory copy)
cp -r configs/grafana/dashboards ./grafana-dashboards
cp -r configs/grafana/provisioning ./grafana-provisioning

Step 4: Ensure volume directories exist:

mkdir -p volumes/postgres-data
mkdir -p volumes/etcd-data
mkdir -p volumes/redis-data

Step 5: Bring services back up:

docker compose -f docker-compose.halfstack.current.yml up -d --wait

Config Source Mapping Reference

File in project root	Source path	Used by service
`prometheus.yaml`	`configs/prometheus/prometheus.yaml`	backendai-half-prometheus
`otel-collector-config.yaml`	`configs/otel/otel-collector-config.yaml`	backendai-half-otel-collector
`loki-config.yaml`	`configs/loki/loki-config.yaml`	backendai-half-loki
`tempo-config.yaml`	`configs/tempo/tempo-config.yaml`	backendai-half-tempo
`supergraph.graphql`	`docs/manager/graphql-reference/supergraph.graphql`	backendai-half-apollo-router
`gateway.config.ts`	`configs/graphql/gateway.config.ts`	backendai-half-apollo-router
`grafana-dashboards/`	`configs/grafana/dashboards/`	backendai-half-grafana (volume mount)
`grafana-provisioning/`	`configs/grafana/provisioning/`	backendai-half-grafana (volume mount)

Missing or Stale Compose File

If docker-compose.halfstack.current.yml doesn't exist or is outdated:

cp docker-compose.halfstack-main.yml docker-compose.halfstack.current.yml

Then apply port substitutions with sed. Read the existing component TOML files to determine the ports currently in use, or fall back to the defaults from scripts/install-dev.sh:

Setting	Default	sed pattern
POSTGRES_PORT	8101	`s/8100:5432/${POSTGRES_PORT}:5432/`
REDIS_PORT	8111	`s/8110:6379/${REDIS_PORT}:6379/`
ETCD_PORT	8121	`s/8120:2379/${ETCD_PORT}:2379/`

Note: The source template has 8100/8110/8120 but install-dev.sh defaults are 8101/8111/8121. Always check existing config files first to determine the correct port.
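
The substitutions in the table can be applied in one pass. This is a minimal sketch under stated assumptions: the function name apply_port_substitutions is hypothetical, the template is assumed to contain the literal mappings 8100:5432, 8110:6379 and 8120:2379, and the fallback values are the install-dev.sh defaults listed above.

```shell
# Hypothetical helper: rewrite the published host ports in a compose file.
# Unset variables fall back to the install-dev.sh defaults from the table above.
apply_port_substitutions() {
  compose_file=$1
  POSTGRES_PORT=${POSTGRES_PORT:-8101}
  REDIS_PORT=${REDIS_PORT:-8111}
  ETCD_PORT=${ETCD_PORT:-8121}
  # -i.bak is accepted by both GNU and BSD sed; the backup is removed on success.
  sed -i.bak \
    -e "s/8100:5432/${POSTGRES_PORT}:5432/" \
    -e "s/8110:6379/${REDIS_PORT}:6379/" \
    -e "s/8120:2379/${ETCD_PORT}:2379/" \
    "$compose_file" && rm -f "${compose_file}.bak"
}
```

Run it against docker-compose.halfstack.current.yml right after copying the template. Set POSTGRES_PORT, REDIS_PORT or ETCD_PORT beforehand if the existing component configs use non-default ports.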

Supergraph / Hive Gateway

The Hive Gateway serves the federated GraphQL schema. Regenerate when:

GQL schema types or fields change
New GQL modules are added
v2 schema is modified

# 1. Generate new schemas and supergraph
./scripts/generate-graphql-schema.sh

# 2. Copy to project root (where compose expects it)
cp docs/manager/graphql-reference/supergraph.graphql ./supergraph.graphql
cp configs/graphql/gateway.config.ts ./gateway.config.ts

# 3. Restart the gateway
docker compose -f docker-compose.halfstack.current.yml restart backendai-half-apollo-router

If manager code is broken and generate-graphql-schema.sh fails, copy the last known-good supergraph from git:

git show main:docs/manager/graphql-reference/supergraph.graphql > ./supergraph.graphql
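
The regenerate-or-fall-back logic above can be combined into one function. This is a sketch rather than an existing script: the name refresh_supergraph is hypothetical, and it assumes the project root is the working directory and that main holds a usable supergraph, as the fallback command above does.

```shell
# Hypothetical helper: regenerate the supergraph, or fall back to the copy on main.
# Run from the project root; restart the gateway afterwards.
refresh_supergraph() {
  src=docs/manager/graphql-reference/supergraph.graphql
  if ./scripts/generate-graphql-schema.sh; then
    cp "$src" ./supergraph.graphql
    cp configs/graphql/gateway.config.ts ./gateway.config.ts
  else
    echo "Schema generation failed; using supergraph from main" >&2
    # Write to a temp file first so a failed git show cannot truncate the current file.
    if git show "main:$src" > ./supergraph.graphql.tmp; then
      mv ./supergraph.graphql.tmp ./supergraph.graphql
    else
      rm -f ./supergraph.graphql.tmp
      return 1
    fi
  fi
}
```

When the function returns successfully, restart backendai-half-apollo-router as in step 3 so the gateway picks up the new file.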

Direct Service Inspection

PostgreSQL

PGCONTAINER=$(docker compose -f docker-compose.halfstack.current.yml ps -q backendai-half-db)

# Interactive psql
docker exec -it -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -d backend

# Non-interactive query
docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -d backend -c "SELECT version();"

# Check databases
docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -tc "SELECT datname FROM pg_database;"

# Check alembic migration version (manager)
docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -d backend -c "SELECT * FROM alembic_version;"

# Check alembic migration version (appproxy)
docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -d appproxy -c "SELECT * FROM alembic_version;"

# List tables
docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -d backend -c "\dt"

Common fix — appproxy DB missing:

docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -c "CREATE DATABASE appproxy;"
docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -c "CREATE ROLE appproxy WITH LOGIN PASSWORD 'develove';"
docker exec -e PGPASSWORD=develove $PGCONTAINER psql -U postgres -d appproxy -c "GRANT ALL ON SCHEMA public TO appproxy;"
./py -m alembic -c alembic-appproxy.ini upgrade head
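
The CREATE statements above fail if the database or role already exists, for example when only the schema grant was missing. A guarded variant might look like the sketch below. The helper names pg and ensure_appproxy_db are hypothetical, and PGCONTAINER is assumed to be set as shown earlier in this section.

```shell
# Hypothetical helper: run psql in the DB container as the postgres superuser.
# Assumes PGCONTAINER is set as shown above.
pg() {
  docker exec -e PGPASSWORD=develove "$PGCONTAINER" psql -U postgres "$@"
}

# Create the appproxy database and role only when they do not exist yet,
# then (re)apply the schema grant, which is safe to repeat.
ensure_appproxy_db() {
  if [ "$(pg -tAc "SELECT 1 FROM pg_database WHERE datname = 'appproxy'")" != "1" ]; then
    pg -c "CREATE DATABASE appproxy;"
  fi
  if [ "$(pg -tAc "SELECT 1 FROM pg_roles WHERE rolname = 'appproxy'")" != "1" ]; then
    pg -c "CREATE ROLE appproxy WITH LOGIN PASSWORD 'develove';"
  fi
  pg -d appproxy -c "GRANT ALL ON SCHEMA public TO appproxy;"
}
```

After ensure_appproxy_db succeeds, run the alembic upgrade for appproxy as in the last command above.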

Valkey

REDIS_CONTAINER=$(docker compose -f docker-compose.halfstack.current.yml ps -q backendai-half-redis)

# Ping
docker exec $REDIS_CONTAINER valkey-cli ping

# Info
docker exec $REDIS_CONTAINER valkey-cli info server
docker exec $REDIS_CONTAINER valkey-cli dbsize

# List keys (dev only)
docker exec $REDIS_CONTAINER valkey-cli keys '*'

# Get/check specific key
docker exec $REDIS_CONTAINER valkey-cli get <key>
docker exec $REDIS_CONTAINER valkey-cli type <key>

# Flush all (destructive)
docker exec $REDIS_CONTAINER valkey-cli flushall

etcd

ETCD_CONTAINER=$(docker compose -f docker-compose.halfstack.current.yml ps -q backendai-half-etcd)

# List all keys
docker exec $ETCD_CONTAINER etcdctl get --prefix "" --keys-only

# Get specific key
docker exec $ETCD_CONTAINER etcdctl get <key>

# Common key prefixes
docker exec $ETCD_CONTAINER etcdctl get --prefix "config/redis"
docker exec $ETCD_CONTAINER etcdctl get --prefix "volumes"

# Health check
docker exec $ETCD_CONTAINER etcdctl endpoint health

Or via Backend.AI CLI:

./backend.ai mgr etcd get --prefix ''
./backend.ai mgr etcd get config/redis/addr
./backend.ai mgr etcd put config/redis/addr "127.0.0.1:8111"

MinIO

MINIO_CONTAINER=$(docker compose -f docker-compose.halfstack.current.yml ps -q backendai-half-minio)

# Health check
docker exec $MINIO_CONTAINER curl -sf http://localhost:9000/minio/health/live

# List buckets (set alias first)
docker exec $MINIO_CONTAINER mc alias set local http://localhost:9000 minioadmin minioadmin
docker exec $MINIO_CONTAINER mc ls local/

# Web console: http://127.0.0.1:9001 (minioadmin / minioadmin)

Component Config Files — Port/Secret Consistency

These config files live in the project root and are generated from configs/ templates.

Config file	Source template	Key transformations
`manager.toml`	`configs/manager/halfstack.toml`	etcd/PG/manager port, ipc-base-path
`alembic.ini`	`configs/manager/halfstack.alembic.ini`	PG connection string
`account-manager.toml`	`configs/account-manager/halfstack.toml`	etcd/PG/service port, ipc-base-path
`alembic-accountmgr.ini`	`configs/account-manager/halfstack.alembic.ini`	PG connection string
`agent.toml`	`configs/agent/halfstack.toml`	etcd/RPC/watcher port, ipc/var/mount paths, accelerator plugins
`storage-proxy.toml`	`configs/storage-proxy/halfstack.toml`	etcd port, 2 secrets, volume config, MinIO creds
`app-proxy-coordinator.toml`	`configs/app-proxy-coordinator/halfstack.toml`	PG/Valkey port, service port, 3 generated secrets
`alembic-appproxy.ini`	`configs/app-proxy-coordinator/halfstack.alembic.ini`	PG connection string
`app-proxy-worker.toml`	`configs/app-proxy-worker/halfstack.toml`	Valkey port, service port, same 3 secrets as coordinator
`webserver.conf`	`configs/webserver/halfstack.conf`	Manager endpoint URL, Valkey addr

Cross-Config Consistency Rules

PG port in compose must match manager.toml, alembic.ini, account-manager.toml, alembic-accountmgr.ini, app-proxy-coordinator.toml, alembic-appproxy.ini
Valkey port in compose must match app-proxy-coordinator.toml, app-proxy-worker.toml, webserver.conf
etcd port in compose must match manager.toml, account-manager.toml, agent.toml, storage-proxy.toml
App Proxy secrets: app-proxy-coordinator.toml and app-proxy-worker.toml must share identical api_secret, jwt_secret, permit_hash.secret
Manager ↔ Storage Proxy: the volume auth secret in etcd (set via dev.etcd.volumes.json) must match storage-proxy.toml's [api.manager] secret
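
The port rules above can be spot-checked with plain text tools. The sketch below is deliberately coarse and is not a TOML parser: the function names published_port and check_port_in_files are hypothetical, and it assumes port mappings appear literally as HOST:CONTAINER in the compose file.

```shell
# Hypothetical helper: print the host port published for a container port.
# $1 = compose file, $2 = container port (e.g. 5432).
published_port() {
  sed -n "s/.*[^0-9]\([0-9][0-9]*\):$2[^0-9]*\$/\1/p" "$1" | head -n 1
}

# Hypothetical helper: list config files that never mention the given port.
# $1 = port, remaining arguments = config files to scan.
check_port_in_files() {
  port=$1
  shift
  rc=0
  for f in "$@"; do
    if ! grep -qw -- "$port" "$f" 2>/dev/null; then
      echo "port $port not found in $f"
      rc=1
    fi
  done
  return "$rc"
}

# Example (PostgreSQL rule from the list above):
#   pg_port=$(published_port docker-compose.halfstack.current.yml 5432)
#   check_port_in_files "$pg_port" manager.toml alembic.ini app-proxy-coordinator.toml
```

A match only shows that the number occurs somewhere in the file, so a reported mismatch is a reliable signal while a clean result still deserves a quick manual look.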

Regenerating a Component Config

When regenerating, read existing secret values from the current config file and reuse them. Only generate new secrets (python -c 'import secrets; print(secrets.token_urlsafe(32))') when the config file doesn't exist at all.
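
The reuse-before-generate rule can be sketched as follows. The function name reuse_or_generate_secret is hypothetical, and the parsing assumes each secret sits on its own line in the form key = "value". Keys that repeat across TOML tables, such as the secret under permit_hash, need section-aware handling that this sketch does not attempt.

```shell
# Hypothetical helper: print the existing value of key = "value" from a config
# file, or a freshly generated secret when the file does not exist at all.
# $1 = config file, $2 = key name.
reuse_or_generate_secret() {
  file=$1
  key=$2
  if [ -f "$file" ]; then
    # Prints nothing if the key is absent; the caller should treat that as an error.
    sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" "$file" | head -n 1
  else
    python -c 'import secrets; print(secrets.token_urlsafe(32))'
  fi
}
```

Capturing the output once per secret and writing the same value into both app-proxy-coordinator.toml and app-proxy-worker.toml keeps the two files consistent.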

See scripts/install-dev.sh (lines 1016–1142 at the time of writing) for the exact sed substitution patterns per component.

Diagnostic Workflow

When halfstack issues are reported, follow this order:

1. Check compose file exists: ls -la docker-compose.halfstack.current.yml
2. Check service status: docker compose -f docker-compose.halfstack.current.yml ps
3. For exited/unhealthy services: read logs with docker compose ... logs <service>
4. For config-dependent services (prometheus, otel, loki, tempo, gateway):
   - Verify referenced files exist in project root and are files, not directories
   - If a directory exists where a file should be: stop service → rm -rf <dir> → copy correct file → restart
5. For Backend.AI components (manager, agent, etc.): verify .toml/.conf exists and ports match compose
6. For DB issues: connect to PostgreSQL directly and check schema/data
7. For Valkey/etcd issues: connect directly and inspect state
8. Fix the root cause directly; don't just report the problem.
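
The first three checks of this workflow can be scripted. The sketch below is one possible arrangement, not an existing script: the function name halfstack_triage is hypothetical, and the 50-line log tail is an arbitrary choice.

```shell
# Hypothetical helper: compose file check, service status, logs of exited services.
# $1 = compose file (defaults to the runtime compose file).
halfstack_triage() {
  compose_file=${1:-docker-compose.halfstack.current.yml}
  if [ ! -f "$compose_file" ]; then
    echo "Missing compose file: $compose_file" >&2
    return 1
  fi
  # Show every service, including stopped ones.
  docker compose -f "$compose_file" ps -a
  # Print recent logs for each service that has exited.
  for svc in $(docker compose -f "$compose_file" ps -a --status exited --services); do
    echo "=== $svc ==="
    docker compose -f "$compose_file" logs --tail 50 "$svc"
  done
}
```

The output narrows down which of the later checks (config files, component ports, DB, Valkey/etcd) is worth running next.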