name: Troubleshoot DataSurface Yellow Kubernetes Job Failures
description: Diagnose and fix common failures in Yellow Kubernetes jobs (init, merge, reconcile).
Troubleshooting Yellow Kubernetes Job Failures
This guide helps diagnose and resolve failures in DataSurface Yellow Kubernetes jobs.
Quick Diagnosis
Check Job Status
# List all jobs and their status
kubectl get jobs -n $NAMESPACE
# Expected successful output:
# NAME COMPLETIONS DURATION AGE
# demo-psp-ring1-init 1/1 45s 10m
# demo-psp-model-merge-job 1/1 30s 5m
View Job Logs
# Get logs from a job's pod
kubectl logs job/demo-psp-ring1-init -n $NAMESPACE
kubectl logs job/demo-psp-model-merge-job -n $NAMESPACE
# If the job has made multiple attempts, get logs from a specific pod
kubectl get pods -n $NAMESPACE | grep demo-psp
kubectl logs <pod-name> -n $NAMESPACE
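When a job has retried, each attempt leaves its own pod, and checking them one at a time is tedious. The sketch below prints the logs of every attempt in one pass. It assumes NAMESPACE is set and that the pods carry the job-name label added by the Kubernetes Job controller; job_pod_logs is a hypothetical helper name, not something DataSurface ships.

```shell
# Hypothetical helper: print logs from every pod created by a job.
# Assumes NAMESPACE is set and pods carry the job-name label
# that the Kubernetes Job controller adds.
job_pod_logs() {
  job="$1"
  for pod in $(kubectl get pods -n "$NAMESPACE" -l "job-name=$job" -o name); do
    echo "=== $pod ==="
    kubectl logs "$pod" -n "$NAMESPACE"
  done
}

# Usage:
#   job_pod_logs demo-psp-model-merge-job
```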
Describe Job for Events
kubectl describe job demo-psp-model-merge-job -n $NAMESPACE
Common Failures and Solutions
1. Credential Not Found
Error Pattern:
Credential not found: user or password is None
ValueError: Credential 'postgres-demo-merge' not found or incomplete
Cause: The Kubernetes secret doesn't exist, has wrong key names, or the secret name doesn't match Yellow's naming convention.
Diagnosis:
# Check if secret exists
kubectl get secret postgres-demo-merge -n $NAMESPACE
# View secret keys (not values)
kubectl describe secret postgres-demo-merge -n $NAMESPACE
# Search the job logs for credential-related messages
kubectl logs job/demo-psp-model-merge-job -n $NAMESPACE | grep -i credential
Solution:
Yellow converts credential names using these rules:
- Lowercase
- Underscores (_) become hyphens (-)
- Spaces become hyphens
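To predict the secret name Yellow will look for, the three rules above can be applied in the shell. This is a minimal sketch that only restates those rules; to_secret_name is a hypothetical helper, and Yellow's own conversion may treat other characters differently.

```shell
# Hypothetical helper mirroring the naming rules listed above:
# lowercase first, then underscores and spaces become hyphens.
to_secret_name() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[_ ]/-/g'
}

# Usage:
#   kubectl get secret "$(to_secret_name 'postgres_demo_merge')" -n "$NAMESPACE"
```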
Create the secret with correct keys:
# For USER_PASSWORD credentials
kubectl create secret generic postgres-demo-merge \
--from-literal=USER=postgres \
--from-literal=PASSWORD=password \
-n $NAMESPACE
# For API_TOKEN credentials (e.g., git)
kubectl create secret generic git \
--from-literal=TOKEN=$GITHUB_TOKEN \
-n $NAMESPACE
Key names are case-sensitive: Use USER, PASSWORD, TOKEN (uppercase).
See credential creation guide for complete details.
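Because a misspelled or lowercase key produces the same error as a missing secret, it can help to check the keys explicitly. The sketch below reports which expected keys are absent from a secret. It assumes NAMESPACE is set; check_secret_keys is a hypothetical helper name.

```shell
# Hypothetical helper: list expected keys that are missing from a secret.
# Reads each key with a jsonpath query; an empty result means the key
# (or the whole secret) is absent. Values are never printed.
check_secret_keys() {
  secret="$1"
  shift
  missing=0
  for key in "$@"; do
    value=$(kubectl get secret "$secret" -n "$NAMESPACE" \
      -o "jsonpath={.data.$key}" 2>/dev/null || true)
    if [ -z "$value" ]; then
      echo "missing key: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_secret_keys postgres-demo-merge USER PASSWORD
#   check_secret_keys git TOKEN
```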
2. Database Does Not Exist
Error Pattern:
FATAL: database "merge_db" does not exist
psycopg2.OperationalError: connection to server failed
Cause: The PostgreSQL init scripts didn't run (they are skipped when the Docker volume already holds data), or the wrong PostgreSQL instance is being accessed.
Diagnosis:
# Connect to PostgreSQL and list databases
docker exec datasurface-postgres psql -U postgres -c "\l"
# Or use local psql if available
psql -h localhost -U postgres -c "\l"
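Scanning the \l output by eye works, but a scripted check is easier to repeat. The sketch below reads database names from standard input and prints the required ones that are missing. missing_databases is a hypothetical helper, and the two database names in the usage line are taken from Solution A in this section.

```shell
# Hypothetical helper: read existing database names on stdin (one per line)
# and print each required name that is not among them.
missing_databases() {
  existing=$(cat)
  for db in "$@"; do
    if ! printf '%s\n' "$existing" | grep -qxF "$db"; then
      echo "$db"
    fi
  done
}

# Usage:
#   docker exec datasurface-postgres psql -U postgres -tAc \
#     "SELECT datname FROM pg_database;" | missing_databases airflow_db merge_db
```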
Solution A - Create databases manually:
docker exec datasurface-postgres psql -U postgres \
-c "CREATE DATABASE airflow_db;" \
-c "CREATE DATABASE merge_db;"
Solution B - Reset Docker volume:
cd docker/postgres
docker compose down -v
docker compose up -d
3. PostgreSQL Port Conflict
Error Pattern:
FATAL: password authentication failed for user "postgres"
FATAL: database "merge_db" does not exist
(But you're sure the credentials and database are correct)
Cause: A local PostgreSQL (e.g., installed via Homebrew) is listening on port 5432, so pods that connect through host.docker.internal:5432 reach it instead of the Docker container.
Diagnosis:
# Check what's listening on 5432
lsof -i :5432
# Connect and check PostgreSQL version
psql -h localhost -U postgres -c "SELECT version();"
# If it shows "PostgreSQL 17.x (Homebrew)" instead of "16-alpine", wrong instance!
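The version check can be scripted so it is quick to rerun after stopping services. This sketch classifies a version() string by looking for the "Homebrew" marker mentioned above; postgres_flavor is a hypothetical helper, and other local installs (for example Postgres.app) would show up as "other".

```shell
# Hypothetical helper: classify a PostgreSQL version() string.
# Prints "homebrew" if the string carries the Homebrew marker, else "other".
postgres_flavor() {
  case "$1" in
    *Homebrew*) echo "homebrew" ;;
    *) echo "other" ;;
  esac
}

# Usage:
#   version=$(psql -h localhost -U postgres -tAc "SELECT version();")
#   if [ "$(postgres_flavor "$version")" = "homebrew" ]; then
#     echo "localhost:5432 is served by Homebrew PostgreSQL, not the Docker container"
#   fi
```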
Solution:
# Stop Homebrew PostgreSQL
brew services stop postgresql@17
# or
brew services stop postgresql@16
# or
brew services stop postgresql
# Verify Docker PostgreSQL is now accessible
psql -h localhost -U postgres -c "SELECT version();"
# Should show: PostgreSQL 16.x (Debian/Alpine)
4. ImagePullBackOff
Error Pattern:
Status: ImagePullBackOff
Failed to pull image "registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0"
Diagnosis:
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A10 Events
Solution:
- Verify registry secret exists:
kubectl get secret datasurface-registry -n $NAMESPACE
- Create registry secret if missing:
kubectl create secret docker-registry datasurface-registry \
--docker-server=registry.gitlab.com \
--docker-username="$GITLAB_CUSTOMER_USER" \
--docker-password="$GITLAB_CUSTOMER_TOKEN" \
-n $NAMESPACE
- Attach to default service account:
kubectl patch serviceaccount default -n $NAMESPACE \
-p '{"imagePullSecrets": [{"name": "datasurface-registry"}]}'
- Verify credentials work locally:
echo "$GITLAB_CUSTOMER_TOKEN" | docker login registry.gitlab.com -u "$GITLAB_CUSTOMER_USER" --password-stdin
docker pull registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0
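After working through the steps above, it is worth confirming that the service account patch took effect, since a pod created before the patch will not pick it up. The sketch below checks whether the default service account lists a given pull secret. It assumes NAMESPACE is set; sa_has_pull_secret is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the default service account
# references the given image pull secret.
sa_has_pull_secret() {
  secret="$1"
  names=$(kubectl get serviceaccount default -n "$NAMESPACE" \
    -o jsonpath='{.imagePullSecrets[*].name}')
  for name in $names; do
    if [ "$name" = "$secret" ]; then
      return 0
    fi
  done
  return 1
}

# Usage:
#   sa_has_pull_secret datasurface-registry || echo "pull secret not attached"
```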
5. CreateContainerConfigError
Error Pattern:
Status: CreateContainerConfigError
secret "git" not found
Diagnosis:
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A5 Events
Solution:
Create the missing secret. Common missing secrets:
# Git token for model repository
kubectl create secret generic git \
--from-literal=TOKEN=$GITHUB_TOKEN \
-n $NAMESPACE
# Merge database credentials
kubectl create secret generic postgres-demo-merge \
--from-literal=USER=postgres \
--from-literal=PASSWORD=password \
-n $NAMESPACE
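Pods report only the first secret they fail to find, so fixing one can reveal the next. A sketch like the following checks several secrets up front and prints the ones that are absent. It assumes NAMESPACE is set; missing_secrets is a hypothetical helper, and the secret names in the usage line are the ones used in this guide.

```shell
# Hypothetical helper: print each named secret that does not exist
# in the namespace.
missing_secrets() {
  for secret in "$@"; do
    if ! kubectl get secret "$secret" -n "$NAMESPACE" >/dev/null 2>&1; then
      echo "$secret"
    fi
  done
}

# Usage:
#   missing_secrets git postgres-demo-merge datasurface-registry
```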
6. Job Using Stale Docker Image
Symptom: You pulled a new image but the job still fails with the same error.
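Cause: Kubernetes reuses images already cached on the node. For a fixed tag such as v1.1.0, the default imagePullPolicy is IfNotPresent, so the kubelet does not pull again even if the registry now holds a newer image under that tag.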
Solution:
- Ensure the job YAML has imagePullPolicy: Always:
containers:
- name: model-merge-handler
image: registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0
imagePullPolicy: Always
- Delete completed job and reapply:
kubectl delete job demo-psp-model-merge-job -n $NAMESPACE
kubectl apply -f generated_output/Demo_PSP/demo_psp_model_merge_job.yaml
- Pull image locally to ensure Docker Desktop has latest:
docker pull registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0
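To confirm what the cluster actually received, rather than what the YAML file on disk says, the pull policy can be read back from the live Job object. The sketch below prints any container whose policy is not Always. It assumes NAMESPACE is set; containers_not_always is a hypothetical helper name.

```shell
# Hypothetical helper: print the name and image of each container in a job
# whose imagePullPolicy is something other than Always.
containers_not_always() {
  kubectl get job "$1" -n "$NAMESPACE" -o \
    jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\t"}{.imagePullPolicy}{"\n"}{end}' |
    awk -F'\t' '$3 != "Always" && NF { print $1 "\t" $2 }'
}

# Usage:
#   containers_not_always demo-psp-model-merge-job
```

Empty output means every container in the job pulls on each run.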
7. Git Repository Access Denied
Error Pattern:
fatal: Authentication failed for 'https://github.com/yourorg/demo1_actual.git'
remote: Repository not found
Diagnosis:
# Check git secret exists and has TOKEN key
kubectl describe secret git -n $NAMESPACE
Solution:
- Verify token has repo access permissions on GitHub
- Recreate secret with valid token:
kubectl delete secret git -n $NAMESPACE
kubectl create secret generic git \
--from-literal=TOKEN=$GITHUB_TOKEN \
-n $NAMESPACE
- Test token locally:
git ls-remote https://${GITHUB_TOKEN}@github.com/yourorg/demo1_actual.git
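The local token test can be wrapped so it never hangs on a credential prompt and gives a plain pass or fail. This is a sketch under the assumption that GITHUB_TOKEN is exported and the repository is hosted on github.com; check_repo_access is a hypothetical helper, and yourorg/demo1_actual is the placeholder path from the error above.

```shell
# Hypothetical helper: report whether GITHUB_TOKEN can read a GitHub repo.
# GIT_TERMINAL_PROMPT=0 stops git from prompting for credentials on failure.
check_repo_access() {
  repo_path="$1"   # e.g. yourorg/demo1_actual
  url="https://${GITHUB_TOKEN}@github.com/${repo_path}.git"
  if GIT_TERMINAL_PROMPT=0 git ls-remote "$url" >/dev/null 2>&1; then
    echo "token can read ${repo_path}"
  else
    echo "token cannot read ${repo_path}"
    return 1
  fi
}

# Usage:
#   check_repo_access yourorg/demo1_actual
```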
Rerunning Failed Jobs
A Job's pod template is immutable once created, and a finished Job does not run again. To rerun:
# Delete the failed job
kubectl delete job demo-psp-model-merge-job -n $NAMESPACE
# Reapply
kubectl apply -f generated_output/Demo_PSP/demo_psp_model_merge_job.yaml
# Watch logs
kubectl logs -f job/demo-psp-model-merge-job -n $NAMESPACE
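The delete, reapply, and watch sequence can be bundled into one function for repeated attempts. The sketch below assumes NAMESPACE is set and that the manifest declares its own namespace, as the apply command above implies; rerun_job and the 300s timeout are illustrative choices, not DataSurface defaults.

```shell
# Hypothetical helper: delete a job if present, reapply its manifest,
# then block until the job reports the Complete condition.
rerun_job() {
  job="$1"
  manifest="$2"
  kubectl delete job "$job" -n "$NAMESPACE" --ignore-not-found &&
    kubectl apply -f "$manifest" &&
    kubectl wait --for=condition=complete "job/$job" -n "$NAMESPACE" --timeout=300s
}

# Usage:
#   rerun_job demo-psp-model-merge-job \
#     generated_output/Demo_PSP/demo_psp_model_merge_job.yaml
```

If the job fails rather than completes, kubectl wait returns non-zero once the timeout expires, so check the logs at that point.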
Verifying Successful Completion
# All jobs should show COMPLETIONS as 1/1
kubectl get jobs -n $NAMESPACE
# Check pod status
kubectl get pods -n $NAMESPACE
# Expected:
# - demo-psp-ring1-init-xxxxx: Completed
# - demo-psp-model-merge-job-xxxxx: Completed
# - airflow-* pods: Running
# - demo-psp-mcp-server-*: Running
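For a scripted version of this check, the COMPLETIONS column can be parsed directly. The sketch below reads kubectl get jobs output on standard input and prints jobs whose completed count does not match the desired count. It locates the column by its header, since some kubectl versions print additional columns; incomplete_jobs is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get jobs` output on stdin and print
# the name of every job whose COMPLETIONS value is not n/n.
incomplete_jobs() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "COMPLETIONS") col = i; next }
    col     { split($col, c, "/"); if (c[1] != c[2]) print $1 }
  '
}

# Usage:
#   kubectl get jobs -n "$NAMESPACE" | incomplete_jobs
```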
Getting Help
If issues persist:
- Collect full logs:
kubectl logs job/demo-psp-model-merge-job -n $NAMESPACE > merge-job.log
kubectl describe job demo-psp-model-merge-job -n $NAMESPACE > merge-job-describe.log
- Check generated YAML for issues:
cat generated_output/Demo_PSP/demo_psp_model_merge_job.yaml
- Verify all secrets exist:
kubectl get secrets -n $NAMESPACE
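The collection steps above can be gathered into a single directory that is easy to attach to a support request. This is a sketch assuming NAMESPACE is set; collect_diagnostics and the default diagnostics directory name are hypothetical. The secrets listing contains names and types only, not values.

```shell
# Hypothetical helper: write logs, job description, and resource listings
# for one job into a directory, then print the directory path.
collect_diagnostics() {
  job="$1"
  out_dir="${2:-diagnostics}"
  mkdir -p "$out_dir"
  kubectl logs "job/${job}" -n "$NAMESPACE" > "${out_dir}/${job}.log" 2>&1
  kubectl describe job "$job" -n "$NAMESPACE" > "${out_dir}/${job}-describe.log" 2>&1
  kubectl get secrets -n "$NAMESPACE" > "${out_dir}/secrets.txt" 2>&1
  kubectl get pods -n "$NAMESPACE" > "${out_dir}/pods.txt" 2>&1
  echo "$out_dir"
}

# Usage:
#   collect_diagnostics demo-psp-model-merge-job
```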