deploy - SKILL.md Agent Skill

name: deploy description: > Deploy complete hackathon infrastructure from scratch including WorkAdventure, LiveKit, Coturn, and Jitsi. Use when setting up for a new event, recovering from accidental destruction, or first-time deployment. Covers pre-flight checks, Terraform bootstrap, map sync, and service verification.

Deploy Hackathon Infrastructure

Deploy the complete nf-core hackathon stack with validation at each step.

Reference files:

prerequisites.md - First-time setup, secrets, SSH keys
oauth-setup.md - GitHub OAuth configuration
service-reference.md - Per-service technical details

Pre-flight Checklist

Run all checks before deploying. Stop and fix any failures.

1. Environment Variables

./scripts/validate-env.sh

If fails: Run direnv allow, check 1Password Environment is mounted.

2. AWS Credentials

aws sts get-caller-identity --profile nf-core

If fails: Configure ~/.aws/credentials with nf-core profile.

3. EIP Quota (need 3 available)

aws ec2 describe-addresses --profile nf-core --region eu-west-1 --query 'length(Addresses)'

Must be 5 or less. NEVER release vpc-multi-runner-* EIPs.

4. Route53 Hosted Zone

aws route53 list-hosted-zones --profile nf-core --query "HostedZones[?Name=='hackathon.nf-co.re.'].Id" --output text

Should return zone ID. If missing, create zone and configure Netlify NS delegation (see prerequisites.md).

5. SSH Key

ssh-add -l | grep -i 1password

Should list keys. If empty, configure 1Password SSH agent (see prerequisites.md).

Deployment Steps

Step 1: Bootstrap Terraform Backend

./scripts/bootstrap.sh

Creates S3 bucket and DynamoDB table. Safe to re-run.

Step 2: Initialize Terraform

cd terraform/environments/hackathon
terraform init

Important: All terraform commands must be run from terraform/environments/hackathon/, not the root directory.

Step 3: Review Plan

terraform plan

For fresh deployment, expect ~50-60 resources created, 0 destroyed.

Stop if plan shows unexpected destroys - particularly anything with vpc-multi-runner.

Step 4: Apply Infrastructure

Only proceed after reviewing plan and confirming it looks correct.

terraform apply

Type yes when prompted. Takes 3-5 minutes.

Step 5: Wait for Services (5-15 minutes)

Services need time to initialize:

EC2 boot + cloud-init (2-3 min)
Docker pulls + container start (3-5 min)
TLS certificates from Let's Encrypt (2-5 min)

Monitor:

./scripts/status.sh

Run every 2-3 minutes. Initially unhealthy is normal.

Verification

Once status.sh shows all healthy:

WorkAdventure

curl -sI https://app.hackathon.nf-co.re | head -5

Expect: HTTP 302 (redirect to OAuth) or HTTP 200

LiveKit

curl -s https://livekit.hackathon.nf-co.re

Expect: OK

Jitsi

curl -sI https://jitsi.hackathon.nf-co.re | head -5

Expect: HTTP 200

Coturn

Test at https://webrtc.github.io/samples/src/content/peerconnection/trickle-ice/

Add server: turn:turn.hackathon.nf-co.re:3478
Look for "relay" candidates

For credential generation, see service-reference.md.

Full Stack Test

Open https://app.hackathon.nf-co.re
Authenticate with GitHub (requires public nf-core membership)
Enter virtual world
Test proximity video (walk near another user)
Test Jitsi (enter a meeting room zone)

Troubleshooting

Services not healthy after 15 minutes

SSH in and check logs:

./scripts/ssh.sh wa    # or lk, turn, jitsi
cloud-init status
docker ps
docker compose logs -f

OAuth shows template errors

OAuth templates are copied from the cloned hackathon-infra repo during deployment. If templates are missing, redeploy the WorkAdventure instance:

terraform apply -replace="module.workadventure.aws_instance.workadventure"

Let's Encrypt certificate errors

EIP wasn't assigned before cert request. For Jitsi:

./scripts/ssh.sh jitsi
sudo /usr/share/jitsi-meet/scripts/install-letsencrypt-cert.sh

For Coturn/LiveKit, restart Caddy:

docker restart caddy

Terraform state lock

Wait 15 minutes (locks auto-expire)
Check no other terraform process running
NEVER force-unlock without explicit user approval

user_data changes not detected

The EC2 instances use lifecycle { ignore_changes = [ami, user_data] } to prevent cascading destroys. If you need to apply user_data changes, there are two approaches:

Choosing between Terraform and SSH

Scenario	Preferred Approach
Development (no event, single dev)	Terraform force-replace. Downtime acceptable, reproducibility matters.
Live event (users online, urgent fix)	SSH if possible. Minimize disruption. Terraform only if SSH cannot achieve the fix cleanly.

Default to Terraform unless the user indicates there's an active event with users online.

Always ask the user before proceeding if downtime is involved:

"This change requires redeploying the WorkAdventure instance, which will cause ~2-3 minutes of downtime. Is that acceptable, or would you prefer I attempt an SSH fix?"

Option A: Force replace via Terraform (preferred for development)

terraform apply -replace="module.workadventure.aws_instance.workadventure"

Causes 2-3 minutes downtime
Clean, reproducible state
Changes persisted in git

Option B: SSH manual update (only during live events)

SSH to the instance and modify the running configuration directly.

No downtime (or minimal during service restart)
Changes NOT persisted - will be lost on next Terraform apply
Higher risk of configuration drift
Only use for urgent fixes during active events

SSH Notes

The ./scripts/ssh.sh script opens an interactive SSH session but does not accept commands as arguments. To execute commands:

# Get the IP first
./scripts/ssh.sh wa  # Note the IP shown

# Then run commands directly
ssh ec2-user@<IP> "your command here"

Or SSH interactively and run commands manually.