neo-cloud-abstraction - SKILL.md Agent Skill

name: neo-cloud-abstraction description: > ML training job dispatch abstraction for Praxion ML/AI projects: mode-invariant training_job_descriptor schema, four backends (local subprocess, SkyPilot 20+ providers, RunPod and Nebius direct adapters). Triggers: configuring a compute backend, dispatching via /run-experiment, reading/writing training_job_descriptor YAML or neo_cloud_backend.yaml, debugging backend dispatch errors, choosing local vs SkyPilot vs RunPod vs Nebius, subprocess training, GPU cloud dispatch, Nebius, backend: nebius-direct. Activate alongside ml-training and llm-training-eval for full pipeline work. allowed-tools: [Read, Write, Edit, Glob, Grep, Bash] compatibility: Claude Code staleness_sensitive_sections: - "Backend Version Reference" - "Tiered Backend Strategy"

Neo-Cloud Abstraction

Praxion's backend abstraction for ML training job dispatch. One training_job_descriptor schema — unchanged across operational modes A, B, and C. The backend implementation changes; the descriptor does not.

Satellite files (loaded on-demand):

references/local-backend.md -- subprocess semantics; serves modes A and B; pricing_query → 0.0; wall_clock enforcement
references/skypilot-backend.md -- SkyPilot ~=0.12 integration; mode C default; YAML translation; 20+ provider coverage
references/runpod-direct-adapter.md -- @runpod/mcp-server ~1.1; mode C opt-in; integration recipe; Praxion does not ship custom MCP
references/nebius-direct-adapter.md -- Nebius CLI; mode C opt-in; first v2 direct adapter; provision-VM + SSH lifecycle; InfiniBand gpu-cluster

training_job_descriptor — the invariant contract

This schema is identical across modes A, B, and C. The descriptor has NO mode: field and NO backend: field — the backend is a project-level configuration in .ai-state/neo_cloud_backend.yaml, never a descriptor concern. If any field in a descriptor requires knowing which mode is active, the abstraction is leaking; revise the schema.

# training_job_descriptor — NO mode: field. NO backend: field.
job_id: <string>                          # assigned by backend on create(); uuid or slug
run_tag: <string>                         # human-readable tag (e.g., "run-003-lr3e4")
gpu_type: H100 | H200 | B200 | A100_80GB | RTX4090 | RTX3090 | M2_Ultra | auto
gpu_count: <int>                          # 1, 2, 4, 8
container_image: <docker-uri>             # used by SkyPilot and RunPod; ignored by local backend
env_vars:
  KEY: VALUE                              # passed to the training process
volume_mounts:
  - host_path: /workspace/data
    container_path: /data
    read_only: true
wall_clock_seconds_max: <int>             # hard wall-clock cap; backend MUST enforce
gpu_hours_budget: <float>                 # cost budget; see deployment/gpu-compute-budgeting
artifact_paths:
  - /workspace/checkpoints               # paths to fetch after run completes
entry_command: "python train.py"         # entrypoint; ignored by local backend (uses Popen directly)

Validation rules:

gpu_hours_budget is REQUIRED for remote backends (SkyPilot, RunPod direct, Nebius direct). The local backend accepts 0.0 to signal "owned hardware, no $ cap".
container_image is REQUIRED for SkyPilot and RunPod direct. Local backend ignores it.
entry_command is REQUIRED for SkyPilot and RunPod direct. Local backend uses Popen with the command from WIP.md step context.
job_id is assigned by create(), not set by the caller.

Mode-invariance self-test (AC6): trace the descriptor through every backend. Does any field require if mode == C logic? If yes, the schema is wrong.

Lifecycle Operations

Eight operations form the invariant protocol. Every backend implements all eight. The local backend's implementations prove the abstraction's correctness (see note below).

Operation	Signature	Returns	Local backend	SkyPilot backend	RunPod direct
`create()`	descriptor → job_id	string	Popen spawn; returns PID as job_id	`sky launch` → cluster/job name	`pod_create` MCP tool
`start()`	job_id → void	—	No-op (Popen already running)	`sky start <cluster>`	`pod_start` MCP tool
`status()`	job_id → Status	enum	Poll subprocess returncode	`sky status <cluster>`	`pod_status` MCP tool
`log_stream()`	job_id → stream	iterator	Read stdout/stderr from Popen	`sky logs <cluster>`	`pod_logs` MCP tool
`cancel()`	job_id → void	—	`os.kill(pid, SIGTERM)`	`sky cancel <cluster>`	`pod_stop` MCP tool
`artifact_fetch()`	job_id, paths → local	list[Path]	Paths already local; return as-is	`sky rsync-down`	RunPod volume API
`list()`	→ list[JobSummary]	list	Read from local run registry	`sky status --all`	`pod_list` MCP tool
`pricing_query()`	gpu_type, gpu_count → $/hr	float	Returns 0.0 (no cloud cost)	`sky show-gpus <type>`	RunPod `pricing` MCP tool

The runpod-direct and nebius-direct opt-in adapters implement these same eight operations; their reference files carry the per-operation tool/CLI mapping (the table above shows the canonical local/SkyPilot/RunPod implementations).

The pricing_query() → 0.0 pattern (local backend): Returning 0.0 for local hardware proves the abstraction is correctly designed. If the abstraction leaked — if the caller had to handle "not applicable for local mode" as a special case — then the descriptor would need a mode: field, violating AC6. The clean return of 0.0 means /run-experiment records actual_cost_usd: 0.0 in TRAINING_RESULTS.md for local runs without any mode awareness.

Tiered Backend Strategy

Three tiers. Choose based on where the project is in the exploration → commitment lifecycle.

Tier	Config value	Operational modes	When to use
Local (default)	`backend: local`	A, B	Owned GPU; rented GPU box with Praxion installed; prototyping; cost-free runs
SkyPilot (default-remote)	`backend: skypilot`	C	Exploring cloud providers; multi-cloud; spot recovery; first remote run
RunPod direct (opt-in)	`backend: runpod-direct`	C	Committed to RunPod; want native MCP integration; avoiding SkyPilot indirection
Nebius direct (opt-in)	`backend: nebius-direct`	C	Committed to Nebius; native VM/cluster control; multi-node InfiniBand training

Start local. Move to SkyPilot when you need remote resources. Move to a direct adapter (RunPod or Nebius) only after validating on SkyPilot and deciding to commit to that provider.

Project config file:

# .ai-state/neo_cloud_backend.yaml
backend: local          # local | skypilot | runpod-direct | nebius-direct

/run-experiment reads this file to determine which backend reference to load.

Backend Version Reference

Backend	Package	Version	Source	Notes
Local	stdlib `subprocess`	Python 3.10+	stdlib	No install required
SkyPilot	`skypilot` (PyPI)	`~=0.12` (0.12.1 verified)	PyPI	Pin `~=0.12`; flag for refresh at 0.13+
RunPod direct	`@runpod/mcp-server` (npm)	`~1.1` (1.1.0 verified)	npm	Vendor-maintained; verify before using
Nebius direct	`nebius` CLI (+ `nebius` pysdk, PyPI)	CLI + pysdk latest (verified 2026-06-06)	Nebius install script / PyPI	Vendor-maintained; CLI is the dispatch path — see nebius-direct-adapter.md

Security

Never put secrets in the descriptor. Cloud credentials travel outside the descriptor:
- SkyPilot: ~/.aws/, ~/.gcp/, ~/.azure/ credential files or env vars
- RunPod: RUNPOD_API_KEY environment variable
- Nebius direct: ~/.nebius/credentials.json + ~/.nebius/NEBIUS_TENANT_ID.txt, or NEBIUS_IAM_TOKEN
- Local: no cloud credentials needed
The descriptor's env_vars block is for training hyperparameters and runtime config, not for credentials.

Performance

Local backend MUST enforce wall_clock_seconds_max via signal.alarm(seconds) or a watchdog thread — the OS will not kill a subprocess automatically at a time limit.
SkyPilot and RunPod support job-level timeouts natively; pass wall_clock_seconds_max in the YAML translation or pod config.
Set gpu_hours_budget to a non-zero value for all remote runs. A remote run without a budget cap is an open-ended billing commitment.

Related Skills

Skill	When to load it
`ml-training`	ML archetype vocabulary, operational modes, six artifact types
`llm-training-eval`	Metric thresholds, TRAINING_RESULTS.md schema, PASS/FAIL/WARN
`deployment` → `references/gpu-compute-budgeting.md`	Budget declaration patterns, cost estimation by backend
`experiment-tracking`	MLflow / W&B run logging; mapping run IDs to TRAINING_RESULTS.md