name: nvidia-nixl
description: |-
NVIDIA Inference Xfer Library (NIXL) operator + developer reference. Point-to-point KV-cache and tensor transport for distributed inference (Dynamo, vLLM, SGLang). Covers the agent API (full Python reference; C++/Rust via upstream pointers), all 13 backend plugins (UCX, GDS, GDS_MT, libfabric, mooncake, posix, hf3fs, obj/S3, azure_blob, gusli, uccl, gpunetio/DOCA, telemetry), build paths (pip nixl-cu12/cu13, meson+ninja from source), ETCD vs side-channel metadata, telemetry (Prometheus + cyclic shared-memory), NIXL-EP elastic MoE device kernels, and Dynamo / vLLM NixlConnector / SGLang integration patterns.
when_to_use: |-
Trigger on "NIXL", "ai-dynamo/nixl", "NVIDIA Inference Xfer Library", "nixl_agent", "nixl-cu12", "nixl-cu13", "nixlbench", "kvbench", "NIXL_PLUGIN_DIR", "NIXL_ETCD_ENDPOINTS", "NIXL_TELEMETRY_ENABLE", "VLLM_NIXL_SIDE_CHANNEL_HOST", "NIXL UCX/GDS/Mooncake/libfabric/HF3FS/S3/GUSLI/DOCA GPUNetIO/UCCL/Azure Blob backend", "NIXL telemetry", "NIXL ETCD", "side-channel metadata", "NIXL-EP", "elastic MoE", "nixlBackendH", "registerMem", "prepXfer", "createXferReq", "getNotifs", "loadRemoteMD", "fetchRemoteMD", "sendLocalMD", "South Bound API", "GPUDirect Storage cuFile", "RDMA write KV cache", "disaggregated prefill transport", "KV cache transfer engine", "NixlConnector", "Dynamo backend transfer", "nixlUcxSharedThread", "NIXL_ERR_REMOTE_DISCONNECT", "AWS EFA NIXL", "writing a NIXL plugin". For vLLM connector wiring (--kv-transfer-config, K8s pod shape, UCX_TLS) consult vllm-caching first.
NVIDIA Inference Xfer Library (NIXL)
Target audience: operators wiring NIXL into Dynamo/vLLM/SGLang clusters, plugin authors writing new backends, developers using the agent API directly from Python (references/python-api.md). C++/Rust developers: consult src/api/cpp/ headers and examples/{cpp,rust}/ upstream directly — this skill does not carry a C++/Rust API reference. Assumes datacenter-class GPUs (H100/H200/B200/B300) with NVIDIA driver, CUDA 12.8+, RDMA NIC (Mellanox/EFA) for cross-node, and Linux (Ubuntu 22.04/24.04 or Fedora). macOS and Windows are not supported.
What NIXL is — one paragraph
NIXL is a thin abstraction over heterogeneous transport backends. A nixlAgent registers memory regions (DRAM, VRAM, FILE, BLOCK, OBJ), exchanges metadata with peer agents via either ETCD or socket side-channel, then issues asynchronous one-sided READ/WRITE transfers between local and remote registered memory. The agent picks the best backend (UCX for network, GDS for storage, etc.) based on memory types and what both sides have loaded. Same-process loopback, intra-node GPU-to-GPU, and cross-node RDMA are all the same API. Two operations only — read and write — and both are non-blocking with optional notifications.
Version snapshot — verify before recommending
| Item | Value | Source |
|---|---|---|
| Latest release | v1.1.0 (2026-05-12) | gh release list --repo ai-dynamo/nixl |
| Previous releases | v1.0.1 (2026-04-14) — maintenance: NIXL-EP destruction/elastic-scale fixes, libfabric thread-safety + notif-on-repost; v1.0.0 (2026-03-13) — first stable | release notes |
| HEAD pyproject version | 1.2.0 | pyproject.toml |
| PyPI wheels | nixl-cu12, nixl-cu13 (auto-selects at runtime via PyTorch CUDA version since 1.0.1) | pip install nixl |
| Torch dep pin | torch==2.11.* | pyproject.toml (1.0.1+) |
| UCX version | 1.20.x tested | repo README.md |
| Plugins | 13: ucx, libfabric, mooncake, uccl, gpunetio, cuda_gds, gds_mt, posix, hf3fs, obj (S3), azure_blob, gusli, telemetry | src/plugins/ |
| Memory types | DRAM, VRAM, FILE, BLOCK, OBJ | src/api/python/_api.py |
| Operations | READ, WRITE | (no SEND/RECV; one-sided) |
For staleness — see references/sources.md for verified URLs. Run freshen mode of skill-improver to refresh.
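The wheel split in the table can be sanity-checked with a small helper. This is a minimal sketch, not part of NIXL: `expected_nixl_wheel` is a hypothetical name, and it assumes only that the CUDA major version decides between the two wheels listed above.

```python
def expected_nixl_wheel(cuda_version: str) -> str:
    """Map a CUDA version string such as "12.8" to the wheel name in the table.

    Hypothetical helper: only the two wheel names come from the table above.
    """
    major = int(cuda_version.split(".")[0])
    if major == 12:
        return "nixl-cu12"
    if major == 13:
        return "nixl-cu13"
    raise ValueError(f"no NIXL wheel listed for CUDA {cuda_version}")
```

With PyTorch installed, `expected_nixl_wheel(torch.version.cuda)` gives the wheel name to compare against `pip show` output; note that `torch.version.cuda` is `None` on CPU-only builds.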
Decision tree — picking a backend
Is the transfer across nodes?
├─ Yes → Network backend
│   ├─ Standard RDMA (RoCE / IB) on Mellanox NICs? → UCX
│   ├─ AWS EFA? → libfabric (validated) or UCX
│   ├─ Heterogeneous GPU/NIC, software transport (no RDMA)? → UCCL [Preview]
│   ├─ KVCache-centric workload, multi-protocol (TCP/RDMA/CXL/NVMe-oF)? → mooncake [Preview]
│   └─ GPU-driven RDMA (GDAKI, kernel-launched)? → gpunetio (DOCA)
└─ No → Local / storage backend
    ├─ NVMe / parallel FS via GPUDirect Storage? → cuda_gds (single-thread) or gds_mt (multi-thread)
    ├─ Plain POSIX file (libaio default, liburing opt)? → posix
    ├─ DeepSeek 3FS distributed FS? → hf3fs
    ├─ Block storage via GUSLI shared-mem client? → gusli
    ├─ S3 (or S3-compatible) object store? → obj (with optional cuobjclient accelerated engine)
    └─ Azure Blob? → azure_blob
A single agent can instantiate multiple backends; for each transfer the agent chooses one based on the memory types involved and what the remote side advertises. To constrain the candidates, pass backends=["UCX","GDS"] to nixl_agent_config (Python) or create only the backends you want via createBackend calls (C++).
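The decision tree above can be written down as a lookup table, which is handy in deployment tooling. This is a sketch only: the keys on the left (`"rdma"`, `"efa"`, and so on) are hypothetical labels invented for this example, and the values are the plugin names as they appear in the tree, which may differ from the exact strings a given NIXL build expects in `backends=`.

```python
# Keys are hypothetical labels; values are the plugin names from the tree.
NETWORK_BACKENDS = {
    "rdma": "UCX",            # RoCE / IB on Mellanox NICs
    "efa": "libfabric",       # AWS EFA (UCX is the listed alternative)
    "software": "UCCL",       # no RDMA, heterogeneous GPU/NIC [Preview]
    "kvcache": "mooncake",    # multi-protocol KVCache workloads [Preview]
    "gpu_driven": "gpunetio", # kernel-launched RDMA via DOCA
}
STORAGE_BACKENDS = {
    "gds": "cuda_gds",
    "gds_multithread": "gds_mt",
    "posix": "posix",
    "3fs": "hf3fs",
    "gusli": "gusli",
    "s3": "obj",
    "azure_blob": "azure_blob",
}


def pick_backend(cross_node: bool, kind: str) -> str:
    """Return the plugin name the decision tree suggests for this case."""
    table = NETWORK_BACKENDS if cross_node else STORAGE_BACKENDS
    try:
        return table[kind]
    except KeyError:
        side = "network" if cross_node else "storage"
        raise ValueError(f"unknown {side} kind: {kind!r}") from None
```

For example, `pick_backend(True, "efa")` follows the "Yes → AWS EFA" branch, and `pick_backend(False, "s3")` follows the "No → S3" branch.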
Quick start (Python)
import torch
from nixl import nixl_agent, nixl_agent_config

# Agent on each side
config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,  # socket side-channel
    listen_port=5555,
    capture_telemetry=False,
    backends=["UCX"],  # default; use ["UCX","GDS"] etc. to add more
)
agent = nixl_agent("agent-1", config)

# Register a tensor (auto-detects DRAM vs VRAM)
tensor = torch.zeros((10, 16), dtype=torch.float32)
reg = agent.register_memory(tensor)

# After exchanging metadata (see references/python-api.md), target_descs
# holds the peer's transfer descriptors for the matching remote regions:
local_descs = agent.get_xfer_descs([tensor[i, :] for i in range(10)])
xfer = agent.initialize_xfer("READ", local_descs, target_descs, "agent-2", b"done")
agent.transfer(xfer)
while agent.check_xfer_state(xfer) == "PROC":
    pass  # spin or do other work
End-to-end working programs in examples/python/basic_two_peers.py, expanded_two_peers.py, partial_md_example.py. Full surface in references/python-api.md.
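The quick start spins until the state leaves "PROC", which hangs forever if the peer never responds. A bounded wait might look like the sketch below. `wait_for_transfer` is a hypothetical helper, not part of the NIXL API; the only call it makes on the agent is `check_xfer_state`, as used in the quick start, and it returns whatever non-"PROC" state it sees (for example "ERR") for the caller to handle.

```python
import time


def wait_for_transfer(agent, handle, timeout_s: float = 30.0, poll_s: float = 0.001) -> str:
    """Poll a transfer handle until it leaves "PROC" or the timeout expires.

    Hypothetical helper: `agent` is any object exposing check_xfer_state(handle).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = agent.check_xfer_state(handle)
        if state != "PROC":
            return state  # terminal state; caller decides what "ERR" means
        if time.monotonic() >= deadline:
            raise TimeoutError(f"transfer still in PROC after {timeout_s} s")
        time.sleep(poll_s)  # yield instead of burning a core
```

The sleep interval trades latency for CPU use; for latency-critical paths a tighter loop, or the notification mechanism, may be a better fit.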
Metadata exchange — choose ONE of two modes
Side-channel (default). Each agent runs a TCP listener (enable_listen_thread=True, listen_port=N). One agent calls fetch_remote_metadata(remote_name, ip, port) to pull, or send_local_metadata(ip, port) to push. Good for fixed-pair setups, lab environments. Defaults to port 5555.
ETCD. Set NIXL_ETCD_ENDPOINTS=http://etcd:2379 (comma-separated for HA). Each agent calls sendLocalMD() / fetchRemoteMD(remote_name) (no IP/port args). Required for elastic / dynamic-scaling clusters where peers are not known upfront. ETCD is also how nixlbench discovers workers.
Both modes support send_partial_agent_metadata(descs, inc_conn_info, backends, label=...), which sends metadata only for the specified descriptor lists. This is useful when memory regions are registered dynamically, or when an agent should not advertise everything it has registered. Example: examples/python/partial_md_example.py.
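Application code often needs to know which of the two modes it is running in before deciding whether to pass an IP and port. A minimal sketch, assuming the application itself treats a non-empty `NIXL_ETCD_ENDPOINTS` as "ETCD mode" (how NIXL interprets the variable internally is not restated here, and `metadata_mode` is a hypothetical name):

```python
import os


def metadata_mode(env=None) -> tuple[str, list[str]]:
    """Return ("etcd", endpoints) or ("side-channel", []) from the environment.

    Hypothetical helper: splits NIXL_ETCD_ENDPOINTS on commas, as described
    for HA setups, and ignores empty entries.
    """
    env = os.environ if env is None else env
    raw = env.get("NIXL_ETCD_ENDPOINTS", "")
    endpoints = [e.strip() for e in raw.split(",") if e.strip()]
    if endpoints:
        return "etcd", endpoints
    return "side-channel", []
```

In side-channel mode the caller still has to supply the peer's IP and port; in ETCD mode only the remote agent name is needed.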
Plugin Manager + plugin search path
The plugin manager defers loading until first use (#1546, in v1.0.0+) and reads NIXL_PLUGIN_DIR to find dynamic plugins on disk. Static plugins are compiled in. Set this env var explicitly when running from non-system paths:
export NIXL_PLUGIN_DIR=/path/to/nixl/lib/x86_64-linux-gnu/plugins
agent.getAvailPlugins() lists what was discovered. agent.getPluginParams("UCX") returns the param schema + supported memory types for that plugin (use this to discover what to pass to createBackend).
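Before creating an agent, it can help to confirm that the directory actually contains plugin libraries. The sketch below mirrors the `ls $NIXL_PLUGIN_DIR/libplugin_*.so` check from the gotchas list; `find_plugin_libs` is a hypothetical helper and only inspects the filesystem, it does not load anything.

```python
import glob
import os


def find_plugin_libs(plugin_dir=None) -> list[str]:
    """List libplugin_*.so files in plugin_dir (default: $NIXL_PLUGIN_DIR)."""
    plugin_dir = plugin_dir or os.environ.get("NIXL_PLUGIN_DIR", "")
    if not plugin_dir or not os.path.isdir(plugin_dir):
        return []  # unset or wrong path: nothing for the plugin manager to find
    pattern = os.path.join(plugin_dir, "libplugin_*.so")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))
```

An empty result points at a wrong or unset `NIXL_PLUGIN_DIR` rather than at a backend problem.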
NIXL-EP — elastic Expert Parallel device kernels
examples/device/ep/csrc/ ships device-side CUDA kernels for MoE all-to-all dispatch — nixl_ep_ll.cu (low-latency) and nixl_ep_ht.cu (high-throughput). Two API surfaces, one mode per agent: mixing LL and HT calls on the same agent is a hard error (mode guards added in v1.0.1, #1538). NIXL-EP also supports elastic scale-up (new nodes joining a running deploy), with signaling-buffer fixes in v1.0.1 (#1453). GPU timeouts are configurable (#1520). NIXL-EP is the layer Dynamo's MoE plane will land on; for plain disaggregated prefill (single tensor transfer), use the regular agent API.
Telemetry — two exporters, environment-driven
| Env var | Purpose | Default |
|---|---|---|
| NIXL_TELEMETRY_ENABLE | Master switch (y/yes/1 to enable) | false |
| NIXL_TELEMETRY_EXPORTER | Plugin name; empty falls back to cyclic buffer if NIXL_TELEMETRY_DIR set | unset |
| NIXL_TELEMETRY_DIR | Output dir for cyclic buffer files (one file per agent) | unset |
| NIXL_TELEMETRY_BUFFER_SIZE | Events in cyclic buffer | 4096 |
| NIXL_TELEMETRY_RUN_INTERVAL | Flush interval (ms) | 100 |
Cyclic buffer = static plugin, shared-memory ring; readers in examples/python/telemetry_reader.py and examples/cpp/telemetry_reader.cpp. Prometheus exporter is dynamic + experimental (beta) as of v1.0.x — see src/plugins/telemetry/prometheus/README.md. The Prometheus Exposer is shared across agents in the same process (v1.0.0 PR #1470). Built-in metrics: agent_tx_bytes, agent_rx_bytes, agent_xfer_time (µs), agent_xfer_post_time (µs), agent_memory_registered, etc. — full table in references/architecture.md and per-transfer telemetry via agent.get_xfer_telemetry(handle).
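The cyclic-buffer setup described above amounts to a handful of environment variables. The sketch below builds them as a dict; `telemetry_env` is a hypothetical helper, the variable names and defaults come from the table, and it assumes the variables must be in place before the agent is created.

```python
def telemetry_env(out_dir: str, buffer_size: int = 4096, run_interval_ms: int = 100) -> dict[str, str]:
    """Environment for the cyclic-buffer exporter.

    NIXL_TELEMETRY_EXPORTER is deliberately left unset so that, per the
    table, the cyclic buffer is used because NIXL_TELEMETRY_DIR is set.
    """
    return {
        "NIXL_TELEMETRY_ENABLE": "y",
        "NIXL_TELEMETRY_DIR": out_dir,
        "NIXL_TELEMETRY_BUFFER_SIZE": str(buffer_size),
        "NIXL_TELEMETRY_RUN_INTERVAL": str(run_interval_ms),
    }
```

A typical use would be `os.environ.update(telemetry_env("/tmp/nixl-telemetry"))` early in the process; the directory path here is only an example.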
Common gotchas — read before deploying
- UCX_TLS is the #1 source of segfaults in vLLM+NIXL deploys. UCX_TLS=tcp alone segfaults nixlUcxSharedThread::run() after prefill on CUDA-capable images that haven't been told the GPU is reachable. Use UCX_TLS=cuda_copy,sm,tcp (or full cuda_copy,cuda_ipc,sm,tcp,rc for cross-node). Symptom: W ucx_utils.cpp:581: memory is detected as host. (Verified in vllm-caching skill 2026-04-25.)
- First transfer after pod-ready may fail with NIXL_ERR_REMOTE_DISCONNECT: handshake race. Retry the second request. Side-channel listener takes a few seconds to bind.
- Side-channel host MUST be a real interface, not a service VIP. In Kubernetes, set VLLM_NIXL_SIDE_CHANNEL_HOST=$(POD_IP) via downward API and use a headless Service (clusterIP: None, publishNotReadyAddresses: true). Same applies to ETCD-mode if pods advertise their own addr.
- NIXL_PLUGIN_DIR defaults aren't reliable in non-pip installs. If agent.getAvailPlugins() returns [] and an exception fires (No plugins available for NIXL), the env var is wrong. Verify with ls $NIXL_PLUGIN_DIR/libplugin_*.so.
- Mooncake plugin disables progress thread, has its own metadata system that bypasses NIXL's, and caps transfer requests per handle at 1024 (kMaxRequestCount). It's [Preview]. Don't compose with strict NIXL-only metadata flows.
- GDS requires cufile.json with allow_compat_mode: true unless full GDS is wired (kernel module + supported FS). The plugin README has the canonical config and CUFILE_ENV_PATH_JSON export.
- HF3FS needs page-aligned, page-size-multiple memory for the zero-copy mmap() shared-memory path; otherwise it copies. Pass mem_config=dram_zc to fail loud if alignment is wrong.
- gpunetio (DOCA) is single-NIC + single-GPU per backend. To use 2 NICs, instantiate 2 backends with different network_devices. nvshmem-aware bench mode supports VRAM-only transfers.
- EFA-only configs in UCX were gated to not poison non-EFA setups in v1.0.1 (#1527). Below that, UCX with EFA-tuned defaults could degrade other systems.
- Telemetry timestamps were removed from events in v1.0.0 (#1522); readers must derive ordering from ring-insertion order, not event timestamp fields.
Full debugging cookbook in references/gotchas.md.
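The UCX_TLS gotcha is easy to catch at startup. The check below is a rough heuristic, not a UCX parser: `check_ucx_tls` is a hypothetical helper that only looks for the literal `cuda_copy` token recommended above and does not understand UCX aliases or `^` negation.

```python
def check_ucx_tls(value: str) -> list[str]:
    """Return warnings for a UCX_TLS value; an empty list means no finding.

    Heuristic sketch: an unset/empty value is left alone (UCX then chooses
    transports itself), and "all" is treated as including cuda_copy.
    """
    transports = [t.strip() for t in value.split(",") if t.strip()]
    warnings = []
    if transports and "all" not in transports and "cuda_copy" not in transports:
        warnings.append(
            "UCX_TLS=" + ",".join(transports) + " lacks cuda_copy; "
            "GPU memory may be detected as host memory"
        )
    return warnings
```

Calling `check_ucx_tls(os.environ.get("UCX_TLS", ""))` before creating the agent turns a post-prefill segfault into an early, readable warning.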
What this skill does NOT cover (and where to go)
- vLLM NixlConnector configuration (--kv-transfer-config, kv_role, K8s pod shape, the proxy server): covered in the vllm-caching skill, especially the references/connectors.md "NixlConnector" section. The vllm-caching skill has live-lab-verified recipes for 1P1D Qwen3-4B on consumer hardware including the six non-obvious env vars.
- Dynamo deployment and disaggregation orchestration: Dynamo docs at https://docs.nvidia.com/dynamo/. NIXL is the data plane; Dynamo handles request routing, scheduling, and the rest of the control plane.
- vLLM-stack production deploy (LMCache, MooncakeConnector etc.): vllm-deployment and vllm-caching skills.
- SGLang KV transport: SGLang has its own connector; NIXL is one of the available backends. Consult SGLang docs.
When the user is stuck — diagnostic flow
Quick first pass: run python ${CLAUDE_SKILL_DIR}/scripts/check_install.py --backends UCX,LIBFABRIC,GDS (substitute backends used). Validates wheel/CUDA match, plugin discovery, UCX_TLS, ETCD reachability, cufile.json — covers gotchas #1, #2, #4, #5, #7 in references/gotchas.md.
- nixl_agent raises RuntimeError("No plugins available") → NIXL_PLUGIN_DIR is wrong, OR pip wheel mismatch (nixl-cu13 on CUDA 12 host etc.). Confirm pip show nixl-cu12 nixl-cu13 and which one PyTorch CUDA matches.
- Hangs on fetch_remote_metadata → side-channel listener didn't start. Check enable_listen_thread=True on the target, port not collided, firewall open. Add a 5 s sleep before fetching.
- NIXL_ERR_REMOTE_DISCONNECT → handshake race or network ACL. Retry. Check UCX_TLS. Confirm peer is listening with ss -tlnp | grep <port>.
- Posting transfer failed. (state == "ERR") → the agent rejected the request. Common causes: descriptor not within registered region, remote agent metadata not loaded yet (call check_remote_metadata(name) first), or backend not available on both sides. Check agent.query_xfer_backend(handle) if it's prepped.
- Throughput nowhere near line rate → run nixlbench --backend UCX --etcd-endpoints ... --initiator_seg_type=VRAM --target_seg_type=VRAM to isolate. Compare against AWS/Mellanox-published numbers. Check UCX_TLS, UCX_NET_DEVICES, cudaDeviceCanAccessPeer, GDS path with gdscheck.
- Correctness bug after upgrade → git log v1.0.0..v1.0.1 and search for the relevant subsystem; the libfabric / NIXL-EP subsystems had several correctness fixes in 1.0.1.
references/gotchas.md has the full diagnostic flow with commands.
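The "is the peer listening" step can also be done from the initiating side without shell access to the target. A minimal sketch using only the standard library follows; `peer_is_listening` is a hypothetical helper, and it opens and immediately closes a plain TCP connection, so how the side-channel listener reacts to that probe is an untested assumption.

```python
import socket


def peer_is_listening(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Polling this until it returns True is a more targeted alternative to a fixed sleep before `fetch_remote_metadata`.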
References
- references/architecture.md: Agent + Memory Section + South Bound API (SB API) + Plugin Manager + descriptor lists + NIXL-EP device kernels + telemetry event catalog.
- references/plugins.md: All 13 plugins with deps, parameters, capabilities, when-to-pick.
- references/python-api.md: nixl_agent Python surface with worked examples for every common operation.
- references/deployment.md: pip install, source build (meson+ninja), Docker, K8s, ETCD setup, env vars, nixlbench, kvbench.
- references/integrations.md: Dynamo, vLLM NixlConnector, SGLang, observability stack pointers.
- references/gotchas.md: Debugging cookbook, error catalog, threading/sync model, build pitfalls.
- references/sources.md: Verified URLs with Last verified dates.
- scripts/check_install.py: Sanity-check script: wheel/CUDA match, plugin discovery, UCX_TLS, ETCD reach, cufile.json. Run before debugging deeper.
Authoritative upstream sources
Primary: https://github.com/ai-dynamo/nixl (releases via gh release list --repo ai-dynamo/nixl). Canonical docs live in-tree under docs/ (nixl.md, BackendGuide.md, telemetry.md, python_api.md); Python API source src/api/python/_api.py; per-plugin docs src/plugins/<name>/README.md; examples examples/{python,cpp,rust,device/ep}/; benchmarks benchmark/{nixlbench,kvbench}/. Full URL list with verification dates in references/sources.md.
When asked about a specific backend, read the plugin's README.md under src/plugins/<name>/ first — every plugin documents its own deps, params, and gotchas there. Then map to references/plugins.md for the cross-cutting matrix.