aqua-evaluation - SKILL.md Agent Skill

name: aqua-evaluation description: Evaluate LLM model quality using BERTScore, ROUGE, Perplexity, and Text Readability metrics on OCI AI Quick Actions (AQUA). Covers dataset preparation, evaluation job creation, and report interpretation. Triggered when user wants to evaluate or benchmark a model. user-invocable: true disable-model-invocation: false

AQUA Model Evaluation

Use this skill when the user wants to evaluate LLM models on OCI Data Science using AI Quick Actions.

Supported Metrics

Metric	Description	Best For
BERTScore	Embedding-based semantic similarity (precision, recall, F1)	General text quality, aligns well with human judgement
ROUGE	N-gram overlap between generated and reference text	Summarization tasks
Perplexity	How well the model predicts the text	Language modeling quality
Text Readability	Reading level / complexity of generated text	Content accessibility

Dataset Format

JSONL format with required prompt and completion keys, optional category:

{"prompt": "Summarize this dialog:\nAmanda: I baked cookies...", "completion": "Amanda baked cookies and will bring some for Jerry tomorrow."}
{"prompt": "Translate to French: Hello world", "completion": "Bonjour le monde", "category": "translation"}
{"prompt": "What is 2+2?", "completion": "4", "category": "math"}

The category field dimensions evaluation metrics in the report (e.g., see accuracy per category). When omitted, defaults to "_" (unknown).

Official sample datasets (10 prompts each, math + logic categories):

File	Use Case
`examples/evaluation-sample-no-sys-message.jsonl`	Llama-style prompts without system message
`examples/evaluation-sample-with-sys-message.jsonl`	Llama-style prompts with `<<SYS>>` system message

Python SDK Usage

Import

from ads.aqua.evaluation import AquaEvaluationApp
eval_app = AquaEvaluationApp()

Create Evaluation

from ads.aqua.evaluation.entities import CreateAquaEvaluationDetails

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",  # Deployment OCID
    evaluation_name="llama-3.1-8b-eval-bertscore",
    dataset_path="oci://my-bucket@my-namespace/datasets/eval_data.jsonl",
    report_path="oci://my-bucket@my-namespace/eval-reports/",
    model_parameters={
        "max_tokens": 500,
        "temperature": 0.7,
        "top_p": 0.9,
    },
    shape_name="VM.Standard.E4.Flex",
    block_storage_size=50,
    compartment_id="ocid1.compartment.oc1..xxx",
    project_id="ocid1.datascienceproject.oc1.iad.xxx",
    log_group_id="ocid1.loggroup.oc1.iad.xxx",
    log_id="ocid1.log.oc1.iad.xxx",
    metrics=[
        {"name": "bertscore"},
        {"name": "rouge"},
    ],
)
evaluation = eval_app.create(create_evaluation_details=details)
print(f"Evaluation: {evaluation.id} | State: {evaluation.lifecycle_state}")

Evaluate a Model (not deployment) Directly

You can pass a model OCID instead of a deployment OCID:

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodel.oc1.iad.xxx",  # Model OCID
    # ... rest of params same as above
)

Evaluate Stacked/Multi-Model Deployment

For stacked or multi-model deployments, specify which model to evaluate:

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",
    model_parameters={
        "max_tokens": 500,
        "temperature": 0.7,
        "model": "llama-3.1-8b-customer-support",  # Target specific model in deployment
    },
    # ... rest of params
)

With Experiment Tracking

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",
    evaluation_name="llama-eval-v2",
    dataset_path="oci://my-bucket@my-namespace/datasets/eval_data.jsonl",
    report_path="oci://my-bucket@my-namespace/eval-reports/",
    model_parameters={"max_tokens": 500, "temperature": 0.7},
    shape_name="VM.Standard.E4.Flex",
    block_storage_size=50,
    experiment_name="llama-evaluations",  # Groups evaluations together
    experiment_description="Llama 3.1 evaluation experiments",
    metrics=[{"name": "bertscore"}, {"name": "rouge"}],
)

List Evaluations

evaluations = eval_app.list(compartment_id="ocid1.compartment.oc1..xxx")
for e in evaluations:
    print(f"{e.display_name} | {e.lifecycle_state}")

Get Evaluation Details

evaluation = eval_app.get(eval_id="ocid1.datasciencemodel.oc1.iad.xxx")

CLI Usage

Create Evaluation

ads aqua evaluation create \
  --evaluation_source_id "ocid1.datasciencemodeldeployment.oc1.iad.xxx" \
  --evaluation_name "llama-eval-bertscore" \
  --dataset_path "oci://my-bucket@my-namespace/datasets/eval_data.jsonl" \
  --report_path "oci://my-bucket@my-namespace/eval-reports/" \
  --model_parameters '{"max_tokens": 500, "temperature": 0.7}' \
  --shape_name "VM.Standard.E4.Flex" \
  --block_storage_size 50 \
  --compartment_id "ocid1.compartment.oc1..xxx" \
  --project_id "ocid1.datascienceproject.oc1.iad.xxx" \
  --metrics '[{"name": "bertscore"}, {"name": "rouge"}]'

List / Get Evaluations

ads aqua evaluation list --compartment_id "ocid1.compartment.oc1..xxx"
ads aqua evaluation get --eval_id "ocid1.datasciencemodel.oc1.iad.xxx"

Interpreting Results

BERTScore

The evaluation produces a Model Catalog entry with:

Precision: How much of the generated text is semantically represented in the reference
Recall: How much of the reference text is captured by the generated text
F1: Harmonic mean of precision and recall

Higher scores = better quality. Scores clustered around the mean indicate consistent performance.

ROUGE

ROUGE-1: Unigram overlap
ROUGE-2: Bigram overlap
ROUGE-L: Longest common subsequence

BERTScore Limitations

May favor models mirroring its own architecture
Lacks consideration for sentence-level syntax
Diminished effectiveness for context beyond word-level (idioms, cultural nuances)
Not suitable for evaluating coding models on programming tasks

Key Source Files

ads/aqua/evaluation/evaluation.py — AquaEvaluationApp (create, list, get, load_metrics)
ads/aqua/evaluation/entities.py — CreateAquaEvaluationDetails, AquaEvalMetrics
ads/aqua/config/evaluation/evaluation_service_config.py — Metric configuration