vllm-omni-api - SKILL.md Agent Skill

name: vllm-omni-api
description: Integrate with vLLM-Omni using the OpenAI-compatible API for text, image, video, and audio generation. Use when building client applications, calling vllm-omni endpoints, sending requests to the API server, or integrating vllm-omni into an application.

vLLM-Omni API Integration

Overview

vLLM-Omni exposes OpenAI-compatible REST endpoints for all modalities, so existing OpenAI client libraries work with minimal changes. The server supports chat completions, image generation, image editing, video generation, and speech synthesis.

Starting the Server

vllm serve <model-name> --omni --port 8091

Diffusion models benefit from multi-thread weight loading (enabled by default), which parallelizes safetensors shard loading for faster startup. See vllm-omni-perf for details.

Core Endpoints

Endpoint	Method	Purpose
`/v1/chat/completions`	POST	Chat-based generation (text, image, audio)
`/v1/images/generations`	POST	Direct image generation
`/v1/images/edits`	POST	Image editing
`/v1/audio/speech`	POST	Text-to-speech (wav/mp3)
`/v1/audio/voice/upload`	POST	Upload custom voice for cloning
`/v1/videos/generations`	POST	Video generation (async poll)
`/health`	GET	Server health check
`/v1/models`	GET	List loaded models

The /v1/audio/voice/upload endpoint has been restored. /v1/audio/speech supports response_format: "wav" with streaming.

/v1/images/generations supports client-side request cancellation via AbortController (or client.cancel() in the openai Python SDK). --max-generated-image-size is enforced on both /v1/images/generations and /v1/images/edits (returns HTTP 400 for oversized requests).

Chat Completions (Universal)

The chat completions endpoint handles all modalities through the message format:

Python (openai SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="unused")

response = client.chat.completions.create(
    model="Tongyi-MAI/Z-Image-Turbo",
    messages=[{"role": "user", "content": "a sunset over mountains"}],
    extra_body={
        "height": 1024,
        "width": 1024,
        "num_inference_steps": 50,
        "guidance_scale": 4.0,
        "seed": 42,
    },
)

# The value is a data URL; the base64 payload follows the first comma (see the curl example).
image_data_url = response.choices[0].message.content[0].image_url.url

curl

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "a sunset over mountains"}],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 50,
      "guidance_scale": 4.0,
      "seed": 42
    }
  }' | jq -r '.choices[0].message.content[0].image_url.url' \
     | cut -d',' -f2 | base64 -d > sunset.png
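The curl pipeline above strips the data URL prefix and base64-decodes the remainder. A minimal Python sketch of the same step follows, assuming the response carries a URL of the form data:image/png;base64,<payload>. The helper names decode_data_url and save_data_url are hypothetical and not part of vLLM-Omni or the openai SDK.

```python
import base64


def decode_data_url(data_url: str) -> bytes:
    """Return the raw bytes encoded in a base64 data URL (hypothetical helper)."""
    # A data URL looks like "data:image/png;base64,<payload>"; split at the first comma.
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    return base64.b64decode(payload)


def save_data_url(data_url: str, path: str) -> int:
    """Write the decoded bytes to `path` and return the number of bytes written."""
    with open(path, "wb") as f:
        return f.write(decode_data_url(data_url))
```

Passing the URL string from the Python example to save_data_url(url, "sunset.png") would mirror what the curl pipeline writes to disk.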

Image Generation Endpoint

The endpoint accepts output_format (png, jpeg, or webp) and size in the request and returns both in the response:

curl -s http://localhost:8091/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a cup of coffee on a table",
    "size": "1024x1024",
    "n": 1,
    "output_format": "png"
  }' | jq '.data[0]'

The response includes output_format and size fields. When output_format is not specified, it defaults to png.
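The same request can be sent from Python. The sketch below uses only the standard library and checks the fields locally before sending. The validation rules (a WIDTHxHEIGHT size string, n of at least 1) are assumptions taken from the curl example, not a statement of what the server enforces, and build_image_request and generate_image are hypothetical helper names.

```python
import json
import re
import urllib.request

ALLOWED_FORMATS = {"png", "jpeg", "webp"}  # the formats listed in this section


def build_image_request(prompt: str, size: str = "1024x1024", n: int = 1,
                        output_format: str = "png") -> dict:
    """Validate the fields locally and return the JSON body (hypothetical helper)."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format}")
    # Assumption: size is "<width>x<height>" with positive integers, as in the curl example.
    if not re.fullmatch(r"[1-9]\d*x[1-9]\d*", size):
        raise ValueError(f"size must look like '1024x1024', got {size!r}")
    if n < 1:
        raise ValueError("n must be at least 1")
    return {"prompt": prompt, "size": size, "n": n, "output_format": output_format}


def generate_image(body: dict, base_url: str = "http://localhost:8091") -> dict:
    """POST the body to /v1/images/generations and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/images/generations",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Rejecting a malformed body before it leaves the client saves a round trip that would otherwise end in an HTTP 400.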

Streaming Responses

For models supporting streaming (text/audio outputs):

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{"role": "user", "content": "Tell me about AI"}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
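When the full text is needed rather than incremental printing, the chunks can be accumulated. The sketch below assumes each chunk has the shape used in the loop above (chunk.choices[0].delta.content). It also skips chunks whose choices list is empty, which some OpenAI-compatible servers emit; whether vLLM-Omni does so is an assumption. collect_stream_text is a hypothetical helper name.

```python
from typing import Iterable


def collect_stream_text(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Defensive: skip chunks that carry no choices (for example, usage-only chunks).
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:  # content may be None or empty on role-only or final chunks
            parts.append(text)
    return "".join(parts)
```

Calling collect_stream_text(response) on the streaming response above would return the complete reply once the stream ends.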

Multi-modal Input

Send images/audio as input to omni-modality models:

import base64

with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            {"type": "text", "text": "Describe this image"},
        ],
    }],
)
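The message structure above can be wrapped in small helpers so that the MIME type in the data URL always matches the file. This is a sketch under the assumption that the file extension identifies the image type. image_part and user_message are hypothetical names, not part of any SDK.

```python
import base64
import mimetypes


def image_part(path: str) -> dict:
    """Build an image_url content part from a local image file."""
    # Infer the MIME type from the extension, e.g. "photo.jpg" -> "image/jpeg".
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{payload}"}}


def user_message(text: str, *image_paths: str) -> dict:
    """Build a user message with the images first and the text prompt last."""
    content = [image_part(p) for p in image_paths]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}
```

The result can be passed directly as messages=[user_message("Describe this image", "photo.jpg")].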

Error Handling

Status Code	Meaning	Action
200	Success	Process response
400	Bad request	Check request body format
404	Model not found	Verify model name and server config
413	Input too large	Reduce input size or increase limits
500	Server error	Check server logs
503	Server overloaded	Retry with backoff
507	Insufficient storage (OOM)	Reduce resolution/batch or use quantization
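The "retry with backoff" action for 503 can be sketched as a small wrapper. The version below treats only 503 as retryable, following the table, and doubles the delay after each attempt. The attempt count and delays are illustrative assumptions, and call_with_backoff is a hypothetical helper that takes any zero-argument function returning a (status, body) pair.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {503}  # per the table, only 503 calls for a retry


def call_with_backoff(send: Callable[[], Tuple[int, object]],
                      max_attempts: int = 4,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, object]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    # Still overloaded after the last attempt: hand the final response to the caller.
    return status, body
```

The sleep parameter is injectable so the wrapper can be exercised without real delays. Other statuses such as 400 or 507 are returned immediately, since retrying an unchanged request would not help.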

Health Check

import requests

resp = requests.get("http://localhost:8091/health")
assert resp.status_code == 200
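Because model loading can take a while after vllm serve starts, a client may prefer to poll /health rather than assert once. The sketch below uses only the standard library. The attempt count and interval are arbitrary assumptions, and is_healthy and wait_until_healthy are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(base_url: str = "http://localhost:8091", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: treat as not ready.
        return False


def wait_until_healthy(probe: Callable[[], bool] = is_healthy,
                       attempts: int = 30,
                       interval: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` up to `attempts` times, sleeping `interval` seconds between tries."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(interval)
    return False
```

Calling wait_until_healthy() before the first request avoids connection errors during startup.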

References

For full endpoint specifications and parameters, see references/endpoints.md