aqua-deployment

Deploy LLM models on OCI using AI Quick Actions (AQUA) - single model, multi-model, stacked (LoRA), with GPU shape selection, vLLM configuration, streaming, and tool calling. Triggered when user wants to deploy, update, or manage model deployments.

By oracle. Updated 2/28/2026.

name: aqua-deployment
description: Deploy LLM models on OCI using AI Quick Actions (AQUA) - single model, multi-model, stacked (LoRA), with GPU shape selection, vLLM configuration, streaming, and tool calling. Triggered when user wants to deploy, update, or manage model deployments.
user-invocable: true
disable-model-invocation: false

AQUA Model Deployment

Use this skill when the user wants to deploy, manage, or configure LLM model deployments on OCI Data Science using AI Quick Actions.

Deployment Types

| Type | Description |
| --- | --- |
| Single Model | One model per deployment (most common) |
| Multi-Model | Multiple LLMs on one instance via LiteLLM routing |
| Stacked | Base model + multiple LoRA fine-tuned weights sharing inference |

Python SDK Usage

Import

from ads.aqua.modeldeployment import AquaDeploymentApp
deployment_app = AquaDeploymentApp()

Create Single Model Deployment

from ads.aqua.modeldeployment.entities import CreateModelDeploymentDetails

details = CreateModelDeploymentDetails(
    model_id="ocid1.datasciencemodel.oc1.iad.xxx",
    instance_shape="VM.GPU.A10.2",
    display_name="llama-3.1-8b-deployment",
    compartment_id="ocid1.compartment.oc1..xxx",
    project_id="ocid1.datascienceproject.oc1.iad.xxx",
    log_group_id="ocid1.loggroup.oc1.iad.xxx",
    log_id="ocid1.log.oc1.iad.xxx",
    env_var={
        "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/completions",
        "PARAMS": "--max-model-len 4096",
    },
)
deployment = deployment_app.create(create_deployment_details=details)
print(f"Deployment: {deployment.id} | State: {deployment.state}")

Create with Chat Completions Endpoint

details = CreateModelDeploymentDetails(
    model_id="ocid1.datasciencemodel.oc1.iad.xxx",
    instance_shape="VM.GPU.A10.2",
    display_name="llama-3.1-8b-chat",
    env_var={
        "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions",
        "PARAMS": "--max-model-len 4096",
    },
)

Create Multi-Model Deployment

from ads.aqua.common.entities import AquaMultiModelRef

details = CreateModelDeploymentDetails(
    models=[
        AquaMultiModelRef(
            model_id="ocid1.datasciencemodel.oc1.iad.model1",
            model_name="llama-3.1-8b",
            gpu_count=1,
        ),
        AquaMultiModelRef(
            model_id="ocid1.datasciencemodel.oc1.iad.model2",
            model_name="mistral-7b",
            gpu_count=1,
        ),
    ],
    instance_shape="VM.GPU.A10.2",
    display_name="multi-model-deployment",
    compartment_id="ocid1.compartment.oc1..xxx",
    project_id="ocid1.datascienceproject.oc1.iad.xxx",
)
deployment = deployment_app.create(create_deployment_details=details)
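
Each model in a multi-model deployment reserves its own GPUs, so the gpu_count values have to fit on the chosen shape. The following is a minimal pre-flight sketch. It assumes the only constraint is that the counts sum to no more than the shape's GPU count; the actual constraints are described in references/shapes.md. SHAPE_GPUS and check_gpu_allocation are hypothetical helpers, not part of ADS.

```python
# GPU counts copied from the GPU Shape Reference table in this document.
SHAPE_GPUS = {
    "VM.GPU.A10.1": 1,
    "VM.GPU.A10.2": 2,
    "BM.GPU.A10.4": 4,
    "BM.GPU.A100-v2.8": 8,
    "BM.GPU.H100.8": 8,
    "BM.GPU.H200.8": 8,
}


def check_gpu_allocation(instance_shape, models):
    """Return the number of unallocated GPUs, or raise if the models do not fit.

    `models` is a list of dicts with "model_name" and "gpu_count" keys,
    mirroring the fields passed to AquaMultiModelRef above.
    """
    available = SHAPE_GPUS[instance_shape]
    requested = sum(m["gpu_count"] for m in models)
    if requested > available:
        names = ", ".join(m["model_name"] for m in models)
        raise ValueError(
            f"{instance_shape} has {available} GPUs but {requested} were "
            f"requested for: {names}"
        )
    return available - requested


two_models = [
    {"model_name": "llama-3.1-8b", "gpu_count": 1},
    {"model_name": "mistral-7b", "gpu_count": 1},
]
print(check_gpu_allocation("VM.GPU.A10.2", two_models))
```

Running the check before calling create turns an over-subscribed shape into an immediate local error rather than a failed deployment.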

Create Stacked Deployment (Base + LoRA Fine-Tunes)

from ads.aqua.common.entities import AquaMultiModelRef, LoraModuleSpec

details = CreateModelDeploymentDetails(
    models=[
        AquaMultiModelRef(
            model_id="ocid1.datasciencemodel.oc1.iad.base_model",
            model_name="llama-3.1-8b",
            fine_tune_weights=[
                LoraModuleSpec(
                    model_id="ocid1.datasciencemodel.oc1.iad.ft1",
                    model_name="llama-3.1-8b-customer-support",
                ),
                LoraModuleSpec(
                    model_id="ocid1.datasciencemodel.oc1.iad.ft2",
                    model_name="llama-3.1-8b-summarization",
                ),
            ],
        ),
    ],
    instance_shape="VM.GPU.A10.2",
    display_name="stacked-llama-deployment",
    deployment_type="STACKED",
)
deployment = deployment_app.create(create_deployment_details=details)

List Deployments

deployments = deployment_app.list(compartment_id="ocid1.compartment.oc1..xxx")
for d in deployments:
    print(f"{d.display_name} | {d.state} | {d.endpoint}")

Get Deployment Details

deployment = deployment_app.get(model_deployment_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx")

Get Deployment Config (Recommended Shapes)

config = deployment_app.get_deployment_config(model_id="ocid1.datasciencemodel.oc1.iad.xxx")

List Available Shapes

shapes = deployment_app.list_shapes(compartment_id="ocid1.compartment.oc1..xxx")

Shape Recommendation

recommendation = deployment_app.recommend_shape(model_id="ocid1.datasciencemodel.oc1.iad.xxx")

CLI Usage

Create Deployment

ads aqua deployment create \
  --model_id "ocid1.datasciencemodel.oc1.iad.xxx" \
  --instance_shape "VM.GPU.A10.2" \
  --display_name "llama-3.1-8b-deployment" \
  --compartment_id "ocid1.compartment.oc1..xxx" \
  --project_id "ocid1.datascienceproject.oc1.iad.xxx" \
  --log_group_id "ocid1.loggroup.oc1.iad.xxx" \
  --log_id "ocid1.log.oc1.iad.xxx"

Create Multi-Model Deployment

ads aqua deployment create \
  --models '[{"model_id":"ocid1...model1","model_name":"llama-8b","gpu_count":1},{"model_id":"ocid1...model2","model_name":"mistral-7b","gpu_count":1}]' \
  --instance_shape "VM.GPU.A10.2" \
  --display_name "multi-model"

Create Stacked Deployment

ads aqua deployment create \
  --models '[{"model_id":"ocid1...base","model_name":"llama-8b","fine_tune_weights":[{"model_id":"ocid1...ft1","model_name":"ft-support"}]}]' \
  --instance_shape "VM.GPU.A10.2" \
  --display_name "stacked-deployment" \
  --deployment_type "STACKED"
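
Hand-writing the JSON for --models gets error-prone once nested fine_tune_weights and shell quoting are involved. One option is to generate the argument in Python. This sketch uses only the standard library; models_arg is a hypothetical helper name, and the OCIDs are the same placeholders used above.

```python
import json
import shlex


def models_arg(models):
    """Serialize a list of model dicts into a shell-safe value for --models."""
    return shlex.quote(json.dumps(models, separators=(",", ":")))


models = [
    {
        "model_id": "ocid1...base",
        "model_name": "llama-8b",
        "fine_tune_weights": [
            {"model_id": "ocid1...ft1", "model_name": "ft-support"},
        ],
    },
]

# Print the command rather than running it, so it can be reviewed first.
print(
    "ads aqua deployment create "
    f"--models {models_arg(models)} "
    '--instance_shape "VM.GPU.A10.2" '
    '--display_name "stacked-deployment" '
    '--deployment_type "STACKED"'
)
```

Because shlex.quote wraps the JSON in single quotes, the double quotes inside it reach the CLI unchanged.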

List / Get

ads aqua deployment list --compartment_id "ocid1.compartment.oc1..xxx"
ads aqua deployment get --model_deployment_id "ocid1.datasciencemodeldeployment.oc1.iad.xxx"

Invoking a Deployed Model

Python SDK (Signed HTTP Request)

import ads
import oci
import requests

ads.set_auth("resource_principal")
endpoint = "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.xxx"

# Non-streaming
response = requests.post(
    f"{endpoint}/predict",
    json={
        "model": "odsc-llm",
        "prompt": "Write a haiku about clouds",
        "max_tokens": 256,
        "temperature": 0.7,
    },
    auth=oci.auth.signers.get_resource_principals_signer(),
)
print(response.json())
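
The request above returns the whole completion in one response. For token-by-token output, OpenAI-compatible servers such as vLLM emit server-sent events when the payload sets "stream": true. The sketch below parses such an event stream. It assumes the OpenAI-style framing of `data: {...}` lines terminated by `data: [DONE]`; iter_stream_text is a hypothetical helper, and the sample lines stand in for a live response.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # Completions chunks carry "text"; chat chunks carry delta.content.
            text = choice.get("text") or choice.get("delta", {}).get("content")
            if text:
                yield text


# Simulated event lines, shaped like what response.iter_lines() would return.
sample = [
    b'data: {"choices": [{"text": "Soft "}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "clouds"}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

Against a live deployment, the lines would typically come from requests.post(..., stream=True).iter_lines() with "stream": True added to the JSON payload. OCI model deployments expose a /predictWithResponseStream path for this; confirm the path and payload against the OCI Data Science documentation before relying on it.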

OpenAI-Compatible Client (ADS)

import oci
from ads.aqua.client.openai_client import OpenAI

client = OpenAI(
    model_deployment_url="https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.xxx",
    auth={"signer": oci.auth.signers.get_resource_principals_signer()},
)
response = client.chat.completions.create(
    model="odsc-llm",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=500,
)
print(response.choices[0].message.content)

GPU Shape Reference

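Quick sizing rule: GPU_memory_GB ≈ num_params_billions × 2 for FP16/BF16 weights (2 bytes per parameter), plus ~20% for KV cache. For example, an 8B model needs about 16 GB for weights and roughly 19 GB in total.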

| Shape | GPUs | GPU Memory | Fits (FP16) |
| --- | --- | --- | --- |
| VM.GPU.A10.1 | 1 | 24 GB | ≤ 7B |
| VM.GPU.A10.2 | 2 | 48 GB | ≤ 13B |
| BM.GPU.A10.4 | 4 | 96 GB | ≤ 34B, or 70B quantized |
| BM.GPU.A100-v2.8 | 8 | 640 GB | ≤ 70B |
| BM.GPU.H100.8 | 8 | 640 GB | ≤ 70B (faster) |
| BM.GPU.H200.8 | 8 | 1128 GB | 405B+ |

For the full shape table, per-model recommendations, multi-model GPU count constraints, and quantization options, see references/shapes.md.
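
The quick sizing rule and the shape table above can be combined into a rough first-pass picker. This is only a sketch of the arithmetic: it ignores context length, batch size, and quantization, and the helper names are hypothetical. For real recommendations, use recommend_shape or references/shapes.md.

```python
# (shape, total GPU memory in GB), copied from the table above, smallest first.
SHAPES = [
    ("VM.GPU.A10.1", 24),
    ("VM.GPU.A10.2", 48),
    ("BM.GPU.A10.4", 96),
    ("BM.GPU.A100-v2.8", 640),
    ("BM.GPU.H100.8", 640),
    ("BM.GPU.H200.8", 1128),
]


def estimate_gpu_memory_gb(params_billions, bytes_per_param=2, kv_overhead=0.20):
    """Weights at `bytes_per_param` (2 for FP16/BF16) plus ~20% for KV cache."""
    return params_billions * bytes_per_param * (1 + kv_overhead)


def smallest_fitting_shape(params_billions):
    """Return the first shape whose total GPU memory covers the estimate."""
    needed = estimate_gpu_memory_gb(params_billions)
    for shape, memory_gb in SHAPES:
        if memory_gb >= needed:
            return shape
    return None  # nothing in the table is large enough


for size in (8, 13, 70, 405):
    needed = estimate_gpu_memory_gb(size)
    print(f"{size}B -> ~{needed:.1f} GB -> {smallest_fitting_shape(size)}")
```

The estimate is deliberately conservative about nothing except weights and a flat KV-cache margin, so treat a tight fit as a reason to move up one shape or lower --max-model-len.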

vLLM Configuration Parameters

Set via PARAMS environment variable or --params CLI flag:

| Parameter | Description | Example |
| --- | --- | --- |
| --max-model-len | Maximum context length | 4096, 8192, 32768 |
| --gpu-memory-utilization | Fraction of GPU memory for model | 0.9 (default), 0.95 |
| --max-num-seqs | Max concurrent sequences | 256 |
| --quantization | Quantization method | fp8, bitsandbytes |
| --tensor-parallel-size | Number of GPUs for tensor parallelism | 2, 4, 8 |
| --trust-remote-code | Allow custom model code from HF | (no value needed) |
| --enable-auto-tool-choice | Enable function/tool calling | (no value needed) |
| --tool-call-parser | Parser for tool calls | llama3_json, granite, hermes |
| --limit-mm-per-prompt | Limit multimodal inputs | '{"image": 1}' |
| --task | Model task override | embedding, transcribe |
| --enforce-eager | Disable CUDA graphs | (no value needed) |
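
Since PARAMS is a single space-separated string, it can help to assemble it from a dictionary so that bare flags and flag/value pairs are rendered consistently. A small sketch follows; build_params is a hypothetical helper, and it does not quote values, so an option such as --limit-mm-per-prompt with a JSON value would need extra handling.

```python
def build_params(options):
    """Render a dict of vLLM options as a PARAMS string.

    True renders a bare flag, False or None omits the flag,
    and any other value renders as "flag value".
    """
    parts = []
    for flag, value in options.items():
        if value is True:
            parts.append(flag)
        elif value is False or value is None:
            continue
        else:
            parts.append(f"{flag} {value}")
    return " ".join(parts)


params = build_params({
    "--max-model-len": 8192,
    "--gpu-memory-utilization": 0.95,
    "--enable-auto-tool-choice": True,
    "--tool-call-parser": "llama3_json",
    "--enforce-eager": False,
})
print(params)

# The result slots into the env_var dict used when creating a deployment.
env_var = {
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions",
    "PARAMS": params,
}
```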

Tool Calling / Function Calling

Enable during deployment:

env_var={
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions",
    "PARAMS": "--enable-auto-tool-choice --tool-call-parser llama3_json --max-model-len 4096",
}

Supported parsers: llama3_json, llama4_json, granite, hermes, mistral, jamba, pythonic, internlm.
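
Once a deployment is running with these flags, a chat request can carry tool definitions in the OpenAI function-calling format. The sketch below only builds the request body; get_weather is a hypothetical tool invented for illustration, build_tool_request is a hypothetical helper, and "odsc-llm" is the model name used in the invocation examples above.

```python
import json


def build_tool_request(user_message, tools, model="odsc-llm", max_tokens=500):
    """Build a chat completions payload that lets the model choose a tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        "tool_choice": "auto",
        "max_tokens": max_tokens,
    }


# Hypothetical tool definition, described with a JSON Schema for its arguments.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_tool_request("What is the weather in Austin?", [get_weather])
print(json.dumps(payload, indent=2))
```

If the model decides to call the tool, an OpenAI-compatible response carries the call under choices[0].message.tool_calls, with the arguments encoded as a JSON string that the caller parses and executes.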

Advanced Topics

| Topic | Reference |
| --- | --- |
| Shape recommender CLI + JSON output | references/shapes.md → Shape Recommendation Tool section |
| LMCache (KV cache persistence for multi-turn) | references/lmcache.md |
| Private endpoints (no public internet) | references/private-endpoints.md |
| Batch inferencing (offline Job-based) | references/batch-inferencing.md |

Key Source Files

  • ads/aqua/modeldeployment/deployment.py: AquaDeploymentApp (create, list, get, update)
  • ads/aqua/modeldeployment/entities.py: CreateModelDeploymentDetails, AquaDeployment
  • ads/aqua/common/entities.py: AquaMultiModelRef, LoraModuleSpec
  • ads/aqua/client/openai_client.py: OpenAI-compatible client
  • ads/aqua/shaperecommend/recommend.py: GPU shape recommendation engine

Install via CLI
npx skills add https://github.com/oracle/accelerated-data-science --skill aqua-deployment

Repository Details
Stars: 126
Forks: 65
Branch: main
Path: SKILL.md