name: aqua-troubleshooting
description: Diagnose and fix OCI AI Quick Actions (AQUA) issues including deployment failures, OOM errors, authorization problems, capacity issues, container errors, and policy misconfigurations. Triggered when user encounters errors or needs help debugging AQUA workflows.
user-invocable: true
disable-model-invocation: false
AQUA Troubleshooting Guide
Use this skill when the user encounters errors or needs help diagnosing issues with OCI AI Quick Actions deployments, fine-tuning, evaluation, or model registration.
Step 1: Check Logs
Always check logs first. Logs are available only if logging was enabled when the deployment was created.
# Watch live logs for a model deployment
ads opctl watch <model_deployment_ocid> --auth resource_principal
# Watch logs for a job run (fine-tuning/evaluation)
ads opctl watch <job_run_ocid> --auth resource_principal
To get the OCID: AQUA > Model Deployments tab > click deployment > copy OCID from details.
Common Deployment Errors
1. Service Timeout Error
Symptom: Model deployment fails during startup because the model could not be loaded in time.
Diagnosis: Check logs via ads opctl watch.
Solutions:
- The model may be too large for the selected shape
- Try a larger GPU shape
- Reduce --max-model-len to decrease memory requirements
2. Out of Memory (OOM) Error
Case A: Model Too Large for GPU
Symptom: CUDA OOM error during model loading.
Solutions (try in order):
- Use a bigger shape (more GPU memory)
- Try FP8 quantization: Add --quantization fp8 to PARAMS
- Try 4-bit quantization: Add --quantization bitsandbytes --load-format bitsandbytes to PARAMS
- Reduce context length: Add --max-model-len <smaller_value> to PARAMS
# Example: Deploy with quantization to fit on smaller GPU
env_var={
"PARAMS": "--quantization fp8 --max-model-len 4096",
}
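As a rough guide to why a bigger shape or quantization helps, the weights alone need about (parameter count × bytes per parameter) of GPU memory. The sketch below is back-of-the-envelope arithmetic, not an AQUA or vLLM API: the function name and the bytes-per-parameter table are illustrative assumptions, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Rough estimate of the GPU memory needed for model weights only.
# Assumption: nominal bytes per parameter (FP16/BF16 = 2, FP8 = 1, 4-bit = 0.5).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "4bit": 0.5}


def estimate_weight_memory_gb(num_params_billion: float, precision: str = "fp16") -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    # (billions of params * 1e9) * bytes / 1e9 simplifies to billions * bytes.
    return num_params_billion * BYTES_PER_PARAM[precision]


# Compare precisions for a hypothetical 7-billion-parameter model.
for precision in ("fp16", "fp8", "4bit"):
    print(precision, estimate_weight_memory_gb(7, precision), "GB")
```

If the estimate is already close to the GPU memory of the selected shape, expect an OOM once the KV cache is allocated, and move down the list of solutions above.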
Case B: KV Cache Too Small
Symptom: Error says "max seq len is larger than maximum tokens in KV cache".
Solution: The error log contains a hint for the max supported --max-model-len. Set it to that value:
env_var={
"PARAMS": "--max-model-len <value_from_log>",
}
3. Trust Remote Code Error
Symptom: Error mentions trust_remote_code=True is required.
Solution: Add --trust-remote-code to PARAMS (leave value blank):
env_var={
"PARAMS": "--trust-remote-code --max-model-len 4096",
}
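When several flags are combined, it can help to build the PARAMS string from a mapping so that valueless flags such as --trust-remote-code are not given a value by accident. This is a minimal sketch: build_params is a hypothetical helper, not part of ADS or AQUA.

```python
# Hypothetical helper (not part of ADS): build a PARAMS string from a mapping.
def build_params(flags: dict) -> str:
    """Join flags into one string; a value of None marks a valueless flag."""
    parts = []
    for name, value in flags.items():
        parts.append(name)
        if value is not None:
            parts.append(str(value))
    return " ".join(parts)


env_var = {
    "PARAMS": build_params({"--trust-remote-code": None, "--max-model-len": 4096}),
}
print(env_var["PARAMS"])
```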
4. Architecture Not Supported
Symptom: ValueError: Model architectures ['<NAME>'] are not supported for the current vLLM instance.
Solutions:
- Check vLLM supported models
- If not supported by vLLM, use the BYOC (Bring Your Own Container) approach
- For some models, add --trust-remote-code
5. Capacity Issues
Symptom: "No capacity for the specified shape" or "Out of host capacity".
Solutions:
- Try a different availability domain
- Try a different GPU shape
- Use capacity reservations
- Wait and retry (capacity is dynamic)
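The "different shape" and "wait and retry" options can be combined in a small loop. The sketch below assumes a hypothetical create_fn callable that takes a shape name and raises RuntimeError when capacity is unavailable; the real client call, its exception type, and sensible delays will differ, so treat all of these as placeholders.

```python
import time


def retry_over_shapes(create_fn, shapes, attempts_per_shape=3, base_delay=30.0, sleep=time.sleep):
    """Try each shape in order, retrying each one with exponential backoff.

    create_fn is a hypothetical callable: it takes a shape name and raises
    RuntimeError when the shape has no capacity.
    """
    last_error = None
    for shape in shapes:
        for attempt in range(attempts_per_shape):
            try:
                return shape, create_fn(shape)
            except RuntimeError as err:
                last_error = err
                # Delay doubles on every failed attempt for this shape.
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"No capacity on any candidate shape: {last_error}")


# Demo with a stand-in function and placeholder shape names (no real sleeping).
def fake_create(shape):
    if shape == "shape-a":
        raise RuntimeError("Out of host capacity")
    return f"deployment on {shape}"


print(retry_over_shapes(fake_create, ["shape-a", "shape-b"], sleep=lambda seconds: None))
```

Passing sleep as a parameter keeps the loop testable without waiting.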
Authorization Errors
Root Causes
Authorization errors arise from:
- Missing OCI IAM policies
- Object Storage bucket without versioning enabled
- Notebook session not in the same compartment as the dynamic group
Required Policies
Set up policies via Oracle Resource Manager (ORM) - recommended:
# Go to: AQUA > Policies > Setup via ORM
Or verify with the AQUA Policy Verification tool:
from ads.aqua.verify_policies import AquaVerifyPoliciesApp
verify_app = AquaVerifyPoliciesApp()
result = verify_app.verify()
Policy-to-Operation Mapping
| Operation | Required Policy |
|---|---|
| Create/List Models | manage data-science-models in compartment |
| Create/List Deployments | manage data-science-model-deployments in compartment |
| Create/List Model Version Sets | manage data-science-modelversionsets in compartment |
| Create/List Jobs (FT/Eval) | manage data-science-job-runs in compartment |
| Read Object Storage | read buckets + read objectstorage-namespaces in compartment |
| Write Object Storage | manage object-family in compartment |
| List Log Groups | use logging-family in compartment |
| Use Private Endpoints | use virtual-network-family in compartment |
| Tag Resources | use tag-namespaces in tenancy |
| Evaluation/Fine-Tuning | manage data-science-models + read resource-availability + use virtual-network-family |
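
To compare the table against an existing policy by eye, it can help to render some of the rows as full statements. The sketch below uses the general OCI statement form "Allow dynamic-group <group> to <verb> <resource-type> in compartment <compartment>"; the group and compartment names are placeholders, and the policies created by the ORM stack may include additional conditions, so this is a reading aid rather than a replacement for the ORM setup.

```python
# A subset of (verb, resource-type) pairs taken from the table above.
REQUIRED = [
    ("manage", "data-science-models"),
    ("manage", "data-science-model-deployments"),
    ("manage", "data-science-modelversionsets"),
    ("manage", "data-science-job-runs"),
    ("manage", "object-family"),
    ("use", "logging-family"),
    ("use", "virtual-network-family"),
]


def policy_statements(dynamic_group: str, compartment: str) -> list:
    """Render one policy statement per (verb, resource-type) pair."""
    return [
        f"Allow dynamic-group {dynamic_group} to {verb} {resource} in compartment {compartment}"
        for verb, resource in REQUIRED
    ]


# "aqua-dg" and "my-compartment" are placeholder names.
for statement in policy_statements("aqua-dg", "my-compartment"):
    print(statement)
```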
Bucket Versioning
Object Storage bucket must have versioning enabled:
# Check versioning status
oci os bucket get -bn <bucket-name> --auth resource_principal | jq ".data.versioning"
# Should return "Enabled"
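Where jq is not available, the same check can be made on the command's JSON output in Python. This sketch assumes only what the jq filter above implies, namely that the output is a JSON object with the status at data.versioning; the sample string is a trimmed illustration, not real CLI output.

```python
import json


def versioning_enabled(cli_output: str) -> bool:
    """Return True if the bucket JSON reports versioning as "Enabled"."""
    bucket = json.loads(cli_output)
    return bucket.get("data", {}).get("versioning") == "Enabled"


# Trimmed, illustrative sample of the JSON shape (real output has more fields).
sample = '{"data": {"name": "my-bucket", "versioning": "Enabled"}}'
print(versioning_enabled(sample))
```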
Environment Setup Issues
Authentication
import ads
# In OCI Notebook Sessions
ads.set_auth("resource_principal")
# Local development with API key
ads.set_auth("api_key")
# Local development with security token
ads.set_auth("security_token")
Required Environment Variables (for local/internal development)
export OCI_IAM_TYPE="security_token"
export OCI_CONFIG_PROFILE=<your-profile>
export OCI_ODSC_SERVICE_ENDPOINT="https://datascience.us-ashburn-1.oci.oraclecloud.com"
HuggingFace Gated Models
export HF_TOKEN=<your_hf_read_token>
# OR
huggingface-cli login
Fine-Tuning Specific Issues
Dataset Format Errors
- Ensure JSONL format (one valid JSON per line)
- All rows must have same schema
- For instruction format: prompt and completion keys required
- For conversational format: messages key with role/content objects
- Verify no trailing commas or invalid JSON
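The checks in this list can be run locally before submitting a job. The following is a minimal sketch of such a validator, not the validation AQUA itself performs; the function name and the wording of the problem messages are illustrative assumptions.

```python
import json


def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples; empty means no problems found."""
    problems = []
    first_keys = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            # Trailing commas and other syntax errors end up here.
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
            continue
        keys = set(row)
        # Every row must have the same top-level keys as the first valid row.
        if first_keys is None:
            first_keys = keys
        elif keys != first_keys:
            problems.append((number, "schema differs from first row"))
        if "messages" in row:
            msgs = row["messages"]
            ok = isinstance(msgs, list) and all(
                isinstance(m, dict) and {"role", "content"} <= set(m) for m in msgs
            )
            if not ok:
                problems.append((number, "messages entries need role and content"))
        elif not {"prompt", "completion"} <= keys:
            problems.append((number, "missing prompt/completion keys"))
    return problems


rows = [
    '{"prompt": "Translate: hello", "completion": "bonjour"}',
    '{"prompt": "Translate: bye", "completion": "au revoir",}',
]
for line_number, problem in validate_jsonl(rows):
    print(line_number, problem)
```

To check a file, pass the open file object directly, for example validate_jsonl(open(path, encoding="utf-8")).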
Distributed Training Failures
- VCN + Subnet required for replica > 1
- Logging required for distributed training
- Multi-node overhead is significant; single replica with multi-GPU shape is preferred
- Check that all nodes can communicate (security lists / NSGs allow traffic)
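The first two requirements can be checked before a job is submitted. This is a minimal sketch of such a pre-flight check; the function and its parameter names (subnet_id, log_group_id, log_id) are hypothetical and do not mirror the actual AQUA fine-tuning arguments.

```python
# Hypothetical pre-flight check; parameter names are illustrative only.
def check_distributed_config(replica, subnet_id=None, log_group_id=None, log_id=None):
    """Return a list of missing prerequisites for a fine-tuning run."""
    missing = []
    if replica > 1:
        if not subnet_id:
            missing.append("VCN subnet is required when replica > 1")
        if not (log_group_id and log_id):
            missing.append("logging is required for distributed training")
    return missing


# Two replicas with no networking or logging configured.
for item in check_distributed_config(replica=2):
    print(item)
```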
Evaluation Specific Issues
Evaluation Job Fails
- Ensure deployment is in ACTIVE state before running evaluation
- Dataset must be JSONL with prompt and completion keys
- Report path must be writable Object Storage location
- Block storage size must be sufficient (default: 50 GB)
BERTScore Issues
- BERTScore is not suitable for evaluating code generation tasks
- Consider ROUGE for summarization-focused evaluations
- The evaluation model endpoint must be reachable from the evaluation job
Diagnostic Commands
# Check deployment status
ads aqua deployment get --model_deployment_id <ocid>
# List all deployments (check for failed ones)
ads aqua deployment list --compartment_id <compartment_ocid>
# Check model details
ads aqua model get --model_id <model_ocid>
# Verify policies
ads aqua verify_policies
Key Source Files
- ads/aqua/verify_policies/ - Policy verification app
- ads/aqua/common/errors.py - Error hierarchy (AquaValueError, AquaRuntimeError, etc.)
- ads/aqua/training/exceptions.py - Training job exit code mappings
- ads/aqua/extension/errors.py - HTTP error message templates