troubleshoot-datasurface-yellow-kubernetes-job-failures

star 0

Diagnose and fix common failures in Yellow Kubernetes jobs (init, merge, reconcile).

billynewport By billynewport schedule Updated 2/1/2026

name: Troubleshoot DataSurface Yellow Kubernetes Job Failures description: Diagnose and fix common failures in Yellow Kubernetes jobs (init, merge, reconcile).

Troubleshooting Yellow Kubernetes Job Failures

This guide helps diagnose and resolve failures in DataSurface Yellow Kubernetes jobs.

Quick Diagnosis

Check Job Status

# List all jobs and their status
kubectl get jobs -n $NAMESPACE

# Expected successful output:
# NAME                        COMPLETIONS   DURATION   AGE
# demo-psp-ring1-init         1/1           45s        10m
# demo-psp-model-merge-job    1/1           30s        5m

View Job Logs

# Get logs from a job's pod
kubectl logs job/demo-psp-ring1-init -n $NAMESPACE
kubectl logs job/demo-psp-model-merge-job -n $NAMESPACE

# If the job has multiple attempts, get logs from specific pod
kubectl get pods -n $NAMESPACE | grep demo-psp
kubectl logs <pod-name> -n $NAMESPACE

Describe Job for Events

kubectl describe job demo-psp-model-merge-job -n $NAMESPACE

Common Failures and Solutions

1. Credential Not Found

Error Pattern:

Credential not found: user or password is None
ValueError: Credential 'postgres-demo-merge' not found or incomplete

Cause: The Kubernetes secret doesn't exist, has wrong key names, or the secret name doesn't match Yellow's naming convention.

Diagnosis:

# Check if secret exists
kubectl get secret postgres-demo-merge -n $NAMESPACE

# View secret keys (not values)
kubectl describe secret postgres-demo-merge -n $NAMESPACE

# Check environment variables in pod
kubectl logs job/demo-psp-model-merge-job -n $NAMESPACE | grep -i credential

Solution:

Yellow converts credential names using these rules:

  • Lowercase
  • Underscores (_) become hyphens (-)
  • Spaces become hyphens

Create the secret with correct keys:

# For USER_PASSWORD credentials
kubectl create secret generic postgres-demo-merge \
  --from-literal=USER=postgres \
  --from-literal=PASSWORD=password \
  -n $NAMESPACE

# For API_TOKEN credentials (e.g., git)
kubectl create secret generic git \
  --from-literal=TOKEN=$GITHUB_TOKEN \
  -n $NAMESPACE

Key names are case-sensitive: Use USER, PASSWORD, TOKEN (uppercase).

See credential creation guide for complete details.


2. Database Does Not Exist

Error Pattern:

FATAL: database "merge_db" does not exist
psycopg2.OperationalError: connection to server failed

Cause: PostgreSQL init scripts didn't run (existing Docker volume) or wrong PostgreSQL instance is being accessed.

Diagnosis:

# Connect to PostgreSQL and list databases
docker exec datasurface-postgres psql -U postgres -c "\l"

# Or use local psql if available
psql -h localhost -U postgres -c "\l"

Solution A - Create databases manually:

docker exec datasurface-postgres psql -U postgres \
  -c "CREATE DATABASE airflow_db;" \
  -c "CREATE DATABASE merge_db;"

Solution B - Reset Docker volume:

cd docker/postgres
docker compose down -v
docker compose up -d

3. PostgreSQL Port Conflict

Error Pattern:

FATAL: password authentication failed for user "postgres"
FATAL: database "merge_db" does not exist

(But you're sure the credentials and database are correct)

Cause: A local PostgreSQL (e.g., Homebrew) is running on port 5432, and Kubernetes pods connect to it instead of the Docker container via host.docker.internal:5432.

Diagnosis:

# Check what's listening on 5432
lsof -i :5432

# Connect and check PostgreSQL version
psql -h localhost -U postgres -c "SELECT version();"
# If it shows "PostgreSQL 17.x (Homebrew)" instead of "16-alpine", wrong instance!

Solution:

# Stop Homebrew PostgreSQL
brew services stop postgresql@17
# or
brew services stop postgresql@16
# or
brew services stop postgresql

# Verify Docker PostgreSQL is now accessible
psql -h localhost -U postgres -c "SELECT version();"
# Should show: PostgreSQL 16.x (Debian/Alpine)

4. ImagePullBackOff

Error Pattern:

Status: ImagePullBackOff
Failed to pull image "registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0"

Diagnosis:

kubectl describe pod <pod-name> -n $NAMESPACE | grep -A10 Events

Solution:

  1. Verify registry secret exists:
kubectl get secret datasurface-registry -n $NAMESPACE
  1. Create registry secret if missing:
kubectl create secret docker-registry datasurface-registry \
  --docker-server=registry.gitlab.com \
  --docker-username="$GITLAB_CUSTOMER_USER" \
  --docker-password="$GITLAB_CUSTOMER_TOKEN" \
  -n $NAMESPACE
  1. Attach to default service account:
kubectl patch serviceaccount default -n $NAMESPACE \
  -p '{"imagePullSecrets": [{"name": "datasurface-registry"}]}'
  1. Verify credentials work locally:
docker login registry.gitlab.com -u "$GITLAB_CUSTOMER_USER" -p "$GITLAB_CUSTOMER_TOKEN"
docker pull registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0

5. CreateContainerConfigError

Error Pattern:

Status: CreateContainerConfigError
secret "git" not found

Diagnosis:

kubectl describe pod <pod-name> -n $NAMESPACE | grep -A5 Events

Solution:

Create the missing secret. Common missing secrets:

# Git token for model repository
kubectl create secret generic git \
  --from-literal=TOKEN=$GITHUB_TOKEN \
  -n $NAMESPACE

# Merge database credentials
kubectl create secret generic postgres-demo-merge \
  --from-literal=USER=postgres \
  --from-literal=PASSWORD=password \
  -n $NAMESPACE

6. Job Using Stale Docker Image

Symptom: You pulled a new image but the job still fails with the same error.

Cause: Kubernetes caches images by tag. If the tag (e.g., v1.1.0) hasn't changed, K8s uses the cached image.

Solution:

  1. Ensure job YAML has imagePullPolicy: Always:
containers:
- name: model-merge-handler
  image: registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0
  imagePullPolicy: Always
  1. Delete completed job and reapply:
kubectl delete job demo-psp-model-merge-job -n $NAMESPACE
kubectl apply -f generated_output/Demo_PSP/demo_psp_model_merge_job.yaml
  1. Pull image locally to ensure Docker Desktop has latest:
docker pull registry.gitlab.com/datasurface-inc/datasurface/datasurface:v1.1.0

7. Git Repository Access Denied

Error Pattern:

fatal: Authentication failed for 'https://github.com/yourorg/demo1_actual.git'
remote: Repository not found

Diagnosis:

# Check git secret exists and has TOKEN key
kubectl describe secret git -n $NAMESPACE

Solution:

  1. Verify token has repo access permissions on GitHub

  2. Recreate secret with valid token:

kubectl delete secret git -n $NAMESPACE
kubectl create secret generic git \
  --from-literal=TOKEN=$GITHUB_TOKEN \
  -n $NAMESPACE
  1. Test token locally:
git ls-remote https://${GITHUB_TOKEN}@github.com/yourorg/demo1_actual.git

Rerunning Failed Jobs

Jobs are immutable once created. To rerun:

# Delete the failed job
kubectl delete job demo-psp-model-merge-job -n $NAMESPACE

# Reapply
kubectl apply -f generated_output/Demo_PSP/demo_psp_model_merge_job.yaml

# Watch logs
kubectl logs -f job/demo-psp-model-merge-job -n $NAMESPACE

Verifying Successful Completion

# All jobs should show COMPLETIONS as 1/1
kubectl get jobs -n $NAMESPACE

# Check pod status
kubectl get pods -n $NAMESPACE

# Expected:
# - demo-psp-ring1-init-xxxxx: Completed
# - demo-psp-model-merge-job-xxxxx: Completed
# - airflow-* pods: Running
# - demo-psp-mcp-server-*: Running

Getting Help

If issues persist:

  1. Collect full logs:
kubectl logs job/demo-psp-model-merge-job -n $NAMESPACE > merge-job.log
kubectl describe job demo-psp-model-merge-job -n $NAMESPACE > merge-job-describe.log
  1. Check generated YAML for issues:
cat generated_output/Demo_PSP/demo_psp_model_merge_job.yaml
  1. Verify all secrets exist:
kubectl get secrets -n $NAMESPACE
Install via CLI
npx skills add https://github.com/billynewport/demo_cokub_model --skill troubleshoot-datasurface-yellow-kubernetes-job-failures
Repository Details
star Stars 0
call_split Forks 1
navigation Branch main
article Path SKILL.md
More from Creator
billynewport
billynewport Explore all skills →