aqua-evaluation


Evaluate LLM model quality using BERTScore, ROUGE, Perplexity, and Text Readability metrics on OCI AI Quick Actions (AQUA). Covers dataset preparation, evaluation job creation, and report interpretation. Triggered when user wants to evaluate or benchmark a model.

By oracle. Updated 2/28/2026

name: aqua-evaluation
description: Evaluate LLM model quality using BERTScore, ROUGE, Perplexity, and Text Readability metrics on OCI AI Quick Actions (AQUA). Covers dataset preparation, evaluation job creation, and report interpretation. Triggered when user wants to evaluate or benchmark a model.
user-invocable: true
disable-model-invocation: false

AQUA Model Evaluation

Use this skill when the user wants to evaluate LLM models on OCI Data Science using AI Quick Actions.

Supported Metrics

| Metric | Description | Best For |
| --- | --- | --- |
| BERTScore | Embedding-based semantic similarity (precision, recall, F1) | General text quality, aligns well with human judgement |
| ROUGE | N-gram overlap between generated and reference text | Summarization tasks |
| Perplexity | How well the model predicts the text | Language modeling quality |
| Text Readability | Reading level / complexity of generated text | Content accessibility |
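
To make the Perplexity row concrete: perplexity is commonly defined as the exponential of the average negative log-likelihood per token, so lower values mean the model found the text less surprising. The sketch below assumes you already have per-token natural-log probabilities. The function name and inputs are illustrative and not part of the AQUA API, and this document does not specify how AQUA computes the metric internally.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_negative_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_log_likelihood)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))
```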

Dataset Format

The dataset is a JSONL file: each line is a JSON object with required prompt and completion keys and an optional category key:

{"prompt": "Summarize this dialog:\nAmanda: I baked cookies...", "completion": "Amanda baked cookies and will bring some for Jerry tomorrow."}
{"prompt": "Translate to French: Hello world", "completion": "Bonjour le monde", "category": "translation"}
{"prompt": "What is 2+2?", "completion": "4", "category": "math"}

The category field groups evaluation metrics in the report, so scores can be compared per category. When omitted, it defaults to "_" (unknown).
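
Checking a dataset locally before uploading it can catch missing keys early. The following is a minimal sketch of such a check, based only on the format described above. The helper name `load_eval_records` is hypothetical and not part of ADS, and AQUA may apply further validation of its own.

```python
import json

REQUIRED_KEYS = ("prompt", "completion")
DEFAULT_CATEGORY = "_"  # default described above for records without "category"

def load_eval_records(lines):
    """Parse JSONL lines, check required keys, and fill in the default category."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing required key(s): {missing}")
        record.setdefault("category", DEFAULT_CATEGORY)
        records.append(record)
    return records

sample = [
    '{"prompt": "What is 2+2?", "completion": "4", "category": "math"}',
    '{"prompt": "Translate to French: Hello world", "completion": "Bonjour le monde"}',
]
records = load_eval_records(sample)
print([record["category"] for record in records])
```

To check a real file, pass an open file handle, for example `load_eval_records(open("eval_data.jsonl", encoding="utf-8"))`.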

Official sample datasets (10 prompts each, math + logic categories):

| File | Use Case |
| --- | --- |
| examples/evaluation-sample-no-sys-message.jsonl | Llama-style prompts without system message |
| examples/evaluation-sample-with-sys-message.jsonl | Llama-style prompts with <<SYS>> system message |
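
For readers unfamiliar with the <<SYS>> convention, the sketch below builds a dataset record using the Llama 2 chat template, in which the system message sits inside `<<SYS>>` markers within the first `[INST]` block. This is an assumption about what "Llama-style" means here: the exact template used in the sample files is not shown in this document, and the helper name `build_llama2_prompt` is hypothetical.

```python
import json

def build_llama2_prompt(user_message, system_message=None):
    """Wrap a message in Llama 2 chat markers, optionally with a <<SYS>> block."""
    if system_message:
        body = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message}"
    else:
        body = user_message
    return f"<s>[INST] {body} [/INST]"

# One JSONL record with a system message (assumed template, see note above)
record = {
    "prompt": build_llama2_prompt("What is 2+2?", "Answer with a number only."),
    "completion": "4",
    "category": "math",
}
print(json.dumps(record))
```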

Python SDK Usage

Import

from ads.aqua.evaluation import AquaEvaluationApp
eval_app = AquaEvaluationApp()

Create Evaluation

from ads.aqua.evaluation.entities import CreateAquaEvaluationDetails

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",  # Deployment OCID
    evaluation_name="llama-3.1-8b-eval-bertscore",
    dataset_path="oci://my-bucket@my-namespace/datasets/eval_data.jsonl",
    report_path="oci://my-bucket@my-namespace/eval-reports/",
    model_parameters={
        "max_tokens": 500,
        "temperature": 0.7,
        "top_p": 0.9,
    },
    shape_name="VM.Standard.E4.Flex",
    block_storage_size=50,
    compartment_id="ocid1.compartment.oc1..xxx",
    project_id="ocid1.datascienceproject.oc1.iad.xxx",
    log_group_id="ocid1.loggroup.oc1.iad.xxx",
    log_id="ocid1.log.oc1.iad.xxx",
    metrics=[
        {"name": "bertscore"},
        {"name": "rouge"},
    ],
)
evaluation = eval_app.create(create_evaluation_details=details)
print(f"Evaluation: {evaluation.id} | State: {evaluation.lifecycle_state}")
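
Because the create call returns while the evaluation is still running, a caller usually needs to poll for completion. Below is a minimal polling sketch that takes any callable returning the current state as a string. The helper `wait_for_evaluation` is hypothetical, and the terminal state names are assumptions that should be checked against the values your ADS version reports.

```python
import time

# Assumed terminal state names; verify against what your ADS version reports.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}

def wait_for_evaluation(fetch_state, poll_seconds=30, timeout_seconds=3600,
                        sleep=time.sleep):
    """Poll fetch_state() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"evaluation still {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds

# Demo with a fake state source and a no-op sleep, so it runs instantly
fake_states = iter(["ACCEPTED", "IN_PROGRESS", "SUCCEEDED"])
print(wait_for_evaluation(lambda: next(fake_states), sleep=lambda seconds: None))
```

With a real evaluation, `fetch_state` could be something like `lambda: eval_app.get(eval_id=evaluation.id).lifecycle_state`, assuming the object returned by `get` exposes `lifecycle_state` the same way the object returned by `create` does.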

Evaluate a Model (not deployment) Directly

You can pass a model OCID instead of a deployment OCID:

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodel.oc1.iad.xxx",  # Model OCID
    # ... rest of params same as above
)

Evaluate Stacked/Multi-Model Deployment

For stacked or multi-model deployments, specify which model to evaluate:

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",
    model_parameters={
        "max_tokens": 500,
        "temperature": 0.7,
        "model": "llama-3.1-8b-customer-support",  # Target specific model in deployment
    },
    # ... rest of params
)

With Experiment Tracking

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",
    evaluation_name="llama-eval-v2",
    dataset_path="oci://my-bucket@my-namespace/datasets/eval_data.jsonl",
    report_path="oci://my-bucket@my-namespace/eval-reports/",
    model_parameters={"max_tokens": 500, "temperature": 0.7},
    shape_name="VM.Standard.E4.Flex",
    block_storage_size=50,
    experiment_name="llama-evaluations",  # Groups evaluations together
    experiment_description="Llama 3.1 evaluation experiments",
    metrics=[{"name": "bertscore"}, {"name": "rouge"}],
)

List Evaluations

evaluations = eval_app.list(compartment_id="ocid1.compartment.oc1..xxx")
for e in evaluations:
    print(f"{e.display_name} | {e.lifecycle_state}")

Get Evaluation Details

# The evaluation is stored as a Model Catalog entry, so its OCID is a model OCID
evaluation = eval_app.get(eval_id="ocid1.datasciencemodel.oc1.iad.xxx")

CLI Usage

Create Evaluation

ads aqua evaluation create \
  --evaluation_source_id "ocid1.datasciencemodeldeployment.oc1.iad.xxx" \
  --evaluation_name "llama-eval-bertscore" \
  --dataset_path "oci://my-bucket@my-namespace/datasets/eval_data.jsonl" \
  --report_path "oci://my-bucket@my-namespace/eval-reports/" \
  --model_parameters '{"max_tokens": 500, "temperature": 0.7}' \
  --shape_name "VM.Standard.E4.Flex" \
  --block_storage_size 50 \
  --compartment_id "ocid1.compartment.oc1..xxx" \
  --project_id "ocid1.datascienceproject.oc1.iad.xxx" \
  --metrics '[{"name": "bertscore"}, {"name": "rouge"}]'
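
The JSON-valued options (--model_parameters and --metrics) are easy to mis-quote when typed by hand. One way to avoid that, sketched below, is to build the command from a Python dict and let `json.dumps` and `shlex.quote` handle encoding and quoting. The helper `build_eval_create_command` is hypothetical, and only option names that appear in the command above are used.

```python
import json
import shlex

def build_eval_create_command(options):
    """Render `ads aqua evaluation create` with JSON-encoded, shell-quoted values."""
    parts = ["ads", "aqua", "evaluation", "create"]
    for flag, value in options.items():
        if isinstance(value, (dict, list)):
            value = json.dumps(value)  # dict and list options are passed as JSON strings
        parts += [f"--{flag}", str(value)]
    return " ".join(shlex.quote(part) for part in parts)

command = build_eval_create_command({
    "evaluation_name": "llama-eval-bertscore",
    "model_parameters": {"max_tokens": 500, "temperature": 0.7},
    "block_storage_size": 50,
    "metrics": [{"name": "bertscore"}, {"name": "rouge"}],
})
print(command)
```

The resulting string can be pasted into a shell, or split back into an argument list with `shlex.split(command)` for use with `subprocess.run`.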

List / Get Evaluations

ads aqua evaluation list --compartment_id "ocid1.compartment.oc1..xxx"
ads aqua evaluation get --eval_id "ocid1.datasciencemodel.oc1.iad.xxx"

Interpreting Results

BERTScore

The evaluation produces a Model Catalog entry with:

  • Precision: How much of the generated text is semantically represented in the reference
  • Recall: How much of the reference text is captured by the generated text
  • F1: Harmonic mean of precision and recall

Higher scores indicate better quality. Per-sample scores clustered tightly around the mean indicate consistent performance across the dataset.
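
The three BERTScore numbers can be illustrated with a toy version of the greedy matching the metric is based on: each token is paired with its most similar token on the other side by cosine similarity, and the similarities are averaged. This is a simplified sketch using made-up 2-d vectors. Real BERTScore uses contextual embeddings from a BERT-like model and optional weighting, and nothing here reflects AQUA's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def greedy_bertscore(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each generated token takes its best match in the reference
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token takes its best match in the generated text
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy embeddings: the candidate covers two of the three reference "tokens" exactly
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
p, r, f1 = greedy_bertscore(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case precision is perfect because every generated token has an exact match, while recall is lower because one reference token is only partially covered.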

ROUGE

  • ROUGE-1: Unigram overlap
  • ROUGE-2: Bigram overlap
  • ROUGE-L: Longest common subsequence
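
The three ROUGE variants can be sketched in a few lines of plain Python. The version below computes recall only (the share of the reference that the generated text recovers) on whitespace tokens. ROUGE implementations typically also report precision and an F-measure and may apply stemming, so treat this as an illustration rather than a reproduction of the scores AQUA reports.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams also found in the candidate (clipped counts)."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    total = sum(ref.values())
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    return lcs_length(candidate.split(), ref_tokens) / len(ref_tokens)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 unigrams match
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 bigrams match
print(rouge_l_recall(candidate, reference))     # LCS has 5 of 6 tokens
```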

BERTScore Limitations

  • May favor models mirroring its own architecture
  • Lacks consideration for sentence-level syntax
  • Diminished effectiveness for context beyond word-level (idioms, cultural nuances)
  • Not suitable for evaluating coding models on programming tasks

Key Source Files

  • ads/aqua/evaluation/evaluation.py: AquaEvaluationApp (create, list, get, load_metrics)
  • ads/aqua/evaluation/entities.py: CreateAquaEvaluationDetails, AquaEvalMetrics
  • ads/aqua/config/evaluation/evaluation_service_config.py: Metric configuration
Install via CLI
npx skills add https://github.com/oracle/accelerated-data-science --skill aqua-evaluation