xr-createshadow

name: xr-createShadow
description: Use when creating, updating, pausing, resuming, or deleting an AzureOpenAI model-pool shadow deployment, especially when the user gives prod and shadow engine definitions and wants the minimum additional inputs needed to keep everything else aligned with prod, reuse or publish the correct shadow model definition, build the shadow payload, and verify the shadow reaches a useful live state.
compatibility: Requires Azure CLI login with model-pool Shadow API access and local repo context under `/home/xiaoranli/repo`.

Use this skill to turn a prod engine definition plus a shadow engine definition into a model-pool shadow deployment with the minimum extra operator input.

Default behavior:

infer control-side fields from repo templates first,
ask only for the missing deltas,
write request payloads under AI-gen/,
and execute the model-definition plus shadow-create flow instead of stopping at a draft.

Source of truth

/home/xiaoranli/repo/shadowDeployments.md
/home/xiaoranli/repo/AMLRunbook/AMLDocumentation/DRI Handbook files/TeamDRIs/MPCM/ShadowTestTSG.md
/home/xiaoranli/repo/Mumford.wiki/Mumford/Observability/How-to-setup-Shadow-Model-Deployment.md
/home/xiaoranli/repo/vienna/src/azureml-api/deploy/templates/ev2/model-pool/AzureOpenAI/<region>/
/home/xiaoranli/repo/vienna/src/azureml-api/deploy/templates/ev2/model-pool/ModelDefinitionV2/Request/
/home/xiaoranli/repo/adm-engine-configs/scripts/python/engine/publish_model_definition.py
/home/xiaoranli/repo/xr-request.sh

Required user inputs

Use these normalized field names when asking the user for inputs. The first two are the usual minimum initial input set:

prod_engine_definition: Prefer one of:
- checked-in file path,
- full engine-definition JSON,
- or a stable name plus version that can be resolved from local context.
shadow_engine_definition: Prefer one of:
- checked-in file path,
- full engine-definition JSON,
- or a stable name plus version that can be resolved from local context.

Then infer or collect the execution-critical fields in this order:

region: Prefer explicit input. If missing, infer it only when the prod route or template context is unambiguous.
one production target selector: Prefer deployment_group_name when the user wants to pin a known DG. Otherwise accept one of:
- model_pool_name,
- control_model_definition_id,
- or a concrete region-specific file under AzureOpenAI/<region>/.
allotment_id: Prefer the full value such as /stamps/AOAI/allotmentgroups/AOAI-3P/allotments/AOAI-3P.Default.
sku: This must be the exact shadow deployment SKU.
traffic_percentage: If the user does not care, default to 10 only when deployment_group_name is explicit. If deployment_group_name is omitted and planner resolves the active DG, expect the service to normalize the percentage to max(1, floor(100 / deploymentCount)) for the selected DG.
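
The traffic_percentage rule above can be sketched as follows. This is a minimal illustration, assuming deploymentCount is a positive integer already read from the selected DG. The function names are hypothetical and are not part of the Shadow API, and the no-DG result is only what to expect from the service, not a value to force.

```python
def normalized_traffic_percentage(deployment_count: int) -> int:
    """max(1, floor(100 / deploymentCount)) for a positive deployment count."""
    if deployment_count < 1:
        raise ValueError("deployment_count must be at least 1")
    # Integer division equals floor() for positive integers.
    return max(1, 100 // deployment_count)


def expected_traffic_percentage(requested, deployment_group_name, deployment_count):
    """Percentage to expect on the created shadow (hypothetical helper).

    - Explicit DG: keep the requested value, defaulting to 10 when the user
      does not care.
    - DG omitted: expect the planner-normalized value for the selected DG.
    """
    if deployment_group_name:
        return 10 if requested is None else requested
    return normalized_traffic_percentage(deployment_count)
```

For example, a DG with 4 deployments normalizes to 25, and a DG with more than 100 deployments still yields 1 rather than 0.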

Collect these next only if the repo cannot infer them, the user wants non-default behavior, or the live create path needs them:

shadow_test_id
traffic_group
instance_count
auto_pause_after_days
shadow_model_definition_id
shadow_model_definition_version
header_override_notes (only when the shadow engine needs different request headers from prod)

Critical nuance

Two engine definitions are a strong primary input set, but they are not always sufficient by themselves.

If the target shadow model definition does not already exist, the skill also needs one of these:

a reusable request template from ModelDefinitionV2/Request/,
a shadow engine-definition JSON that already carries the snapshot or model-feed data needed to build a model definition,
or an explicit ShadowModelDefinitionId plus ShadowModelDefinitionVersion that already exists.

Do not pretend that two display-name-only engine definitions are enough to publish a model definition.

Also keep these distinctions straight:

engine definition id/name is not the same thing as model definition id.
ED onboarding may create assets that let you publish a model definition, but it does not guarantee the correct shadow model definition for this test is the prod MD id.
Before publishing a new version under the prod MD id, first check whether a shadow model definition already exists in model-pool under the shadow ED naming pattern.

Resolution rules

ControlModelDefinitionId: Prefer reading it from the closest matching file under AzureOpenAI/<region>/. If the prod engine definition clearly maps to a unique prod model definition in local context, use that as a cross-check, not as the sole source of truth.
DeploymentGroupName: Prefer explicit user input. If missing, resolve it from the production template, Fleet Scheduler context, active model-pool data, or model-pool name. If an explicit DeploymentGroupName is rejected with InvalidShadowTestDeploymentGroupName, retry without DeploymentGroupName and TrafficGroup while keeping InstanceCount; planner can resolve the active DG from ControlModelDefinitionId.
ShadowModelDefinitionId: First check whether a model definition already exists in model-pool for the shadow engine-definition naming pattern. If it does, prefer that existing shadow MD id and version. If only the engine changes, the GPU footprint stays compatible, and no separate shadow MD exists, reusing the prod model-definition id with a new version is valid. If the model changes or the engine change alters GPU footprint, prefer a new model-definition id and keep the existing naming style, including GPU count when the nearby templates do that.
Sku: If the user says the goal is engine-only comparison, prefer the prod SKU. Only diverge when the shadow engine's GPU footprint forces it.
TrafficGroup: API-optional, but prefer explicit input when the deployment group has several traffic groups.
InstanceCount: API-optional. If omitted, the service uses the model pool's instancePerDeployment value.
OverrideShadowTrafficPercentage: Mutable after creation. It can be updated later while the shadow is Active. If DeploymentGroupName is omitted, planner may rewrite the requested percentage to max(1, floor(100 / deploymentCount)) for the DG it selects, which is the largest active DG.
ShadowTestId: Must be unique. Deleted ids remain reserved for 30 days, so do not try to reuse them immediately.
AllotmentId: Treat this as both the quota selector and the pre-create capacity-check key. If the user gives only a friendly alias or portal label, ask for the full allotment path before executing live calls.
AutoPauseAfterDays: Treat this as optional and example-backed. It appears in current examples and TSG responses, but the core parameter table in shadowDeployments.md does not document it as part of the primary contract.
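
The InvalidShadowTestDeploymentGroupName retry described under DeploymentGroupName amounts to dropping two selector fields and leaving everything else untouched. A minimal sketch, assuming the create payload is a flat JSON object keyed by the field names used in this section; the real payload shape in shadowDeployments.md may differ, and the helper name is hypothetical.

```python
# Selector fields removed on the no-DG retry; every other field is preserved.
DG_SELECTOR_FIELDS = ("DeploymentGroupName", "TrafficGroup")


def without_deployment_group(payload: dict) -> dict:
    """Return a copy of the payload for the no-DG retry (hypothetical helper).

    The input dict is not modified, so the original request can still be
    written to AI-gen/ for comparison.
    """
    return {
        key: value
        for key, value in payload.items()
        if key not in DG_SELECTOR_FIELDS
    }
```

Because only the two selector keys are filtered, ControlModelDefinitionId, InstanceCount, Sku, and AllotmentId pass through unchanged.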

Capacity precheck

Before any live shadow PUT, run a quota gate using resolved region, AllotmentId, Sku, and InstanceCount.

Parse AllotmentId as /stamps/{stamp}/allotmentgroups/{group}/allotments/{name}.
Map the requested SKU to the MPCM VM family at minimum as:
- H100 -> NDH100V5
- H200 -> NDH200V5
- MI300 -> NDMI300XV5
- A100 -> NDAMV4
Query MPCM before creating the shadow: https://westcentralus.api.azureml.ms/model-pool-capacity-manager/v1.0/stamps/{stamp}/allotmentGroups/{group}/allotments/{name}?regions={region}&vmFamilies={vmFamily}&includeRegion=true
If MPCM returns 404, stop. The allotment is missing, stale, or not onboarded for the requested scope.
If available capacity is below the requested InstanceCount, stop and surface the shortfall instead of attempting the shadow create call.
If InstanceCount is omitted, first resolve the model-pool instancePerDeployment value from deployment-group or model-pool context. If it is still unknown, use a conservative one-instance precheck and say that the live create path may still require more capacity.
Optionally list existing shadows on the same AllotmentId as a cross-check for competing usage.
Treat this precheck as mandatory for live execution. Skip it only when the user explicitly wants a draft payload without API calls.
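
The parsing, SKU mapping, and query-URL steps of the precheck can be sketched as below. Several details are assumptions: the SKU is matched by a case-insensitive substring (the exact SKU string format is not specified here), the MPCM response schema is not shown, so the capacity comparison takes plain integers, and all function names are hypothetical. The region in the usage note is only a placeholder.

```python
import re
from urllib.parse import urlencode

MPCM_BASE = "https://westcentralus.api.azureml.ms/model-pool-capacity-manager/v1.0"

# /stamps/{stamp}/allotmentgroups/{group}/allotments/{name}
ALLOTMENT_RE = re.compile(
    r"/stamps/(?P<stamp>[^/]+)/allotmentgroups/(?P<group>[^/]+)"
    r"/allotments/(?P<name>[^/]+)",
    re.IGNORECASE,
)

# Minimum SKU -> MPCM VM family mapping listed above.
SKU_TO_VM_FAMILY = {
    "H100": "NDH100V5",
    "H200": "NDH200V5",
    "MI300": "NDMI300XV5",
    "A100": "NDAMV4",
}


def parse_allotment_id(allotment_id: str) -> dict:
    """Split a full allotment path into stamp, group, and name."""
    match = ALLOTMENT_RE.fullmatch(allotment_id.strip())
    if match is None:
        raise ValueError(f"not a full allotment path: {allotment_id!r}")
    return match.groupdict()


def vm_family_for_sku(sku: str) -> str:
    """Map a SKU to its VM family (assumption: substring match on the SKU)."""
    upper = sku.upper()
    for marker, family in SKU_TO_VM_FAMILY.items():
        if marker in upper:
            return family
    raise ValueError(f"no VM family mapping for SKU {sku!r}")


def build_mpcm_url(allotment_id: str, region: str, sku: str) -> str:
    """Build the MPCM query URL for the capacity precheck."""
    parts = parse_allotment_id(allotment_id)
    query = urlencode(
        {"regions": region, "vmFamilies": vm_family_for_sku(sku), "includeRegion": "true"}
    )
    return (
        f"{MPCM_BASE}/stamps/{parts['stamp']}/allotmentGroups/{parts['group']}"
        f"/allotments/{parts['name']}?{query}"
    )


def capacity_shortfall(available: int, requested_instance_count: int) -> int:
    """Instances missing for the request; 0 means the gate passes."""
    return max(0, requested_instance_count - available)
```

Calling `build_mpcm_url("/stamps/AOAI/allotmentgroups/AOAI-3P/allotments/AOAI-3P.Default", "eastus", "H100")` yields the query URL for the NDH100V5 family. A nonzero `capacity_shortfall` is the signal to stop and surface the gap instead of issuing the shadow PUT.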

Context gathering

Follow this order:

Read the repo AGENTS.md.
Read the shadow docs listed in Source of truth.
Search AzureOpenAI/<region>/ for the closest production JSON and extract:
- modelDefinitionId,
- the model-pool naming pattern,
- and any obvious traffic-group hint.
Diff the prod and shadow engine definitions:
- engine.engineId
- engine.skus
- pipeReplicaGroups
- deploymentProperties
- any request-header-relevant envs
Search ModelDefinitionV2/Request/ for the closest request template if a new shadow model definition must be published.
Query model-pool for an existing shadow model definition before publishing anything:
- first try the shadow engine-definition name as a model-definition id,
- then try lkgOrLatest or the explicit version if known.
Reuse an existing active shadow MD when it already points at the intended engine version.
If the user supplied full engine-definition JSON, check whether the shadow one already contains snapshot, engine, and SKU data before asking for more. Prefer inheriting unchanged fields from the prod side when the user's intent is engine-only testing.
Prefer /home/xiaoranli/repo/adm-engine-configs/scripts/python/engine/publish_model_definition.py over inventing a new publisher from scratch.
Run the MPCM capacity precheck with resolved AllotmentId, Sku, region, and InstanceCount before the shadow PUT.
If Preparing persists, check:

shadow GET,
deployment-group shadowState,
and deployment-group admin oplog filtered by shadowTestId or deploymentName before blaming the payload.

If the user wants a post-create sanity check, prefer /home/xiaoranli/repo/xr-request.sh or a user-supplied payload.
If the user asks whether a newly created deployment or endpoint is actually receiving traffic, wants return-code distribution, or wants direct ADX links, hand off to xiaoranli-kuda's endpoint/deployment quick-check path. Treat that as the default deployment-verification workflow, with shadow as just one example.

Execution flow

Resolve the control-side fields from repo context first.
Diff prod and shadow engine definitions and keep all non-engine fields aligned with prod unless the shadow engine forces a change.
Decide whether an existing shadow model definition already exists in model-pool.
If it exists, prefer reusing that shadow MD id and version instead of republishing under the prod MD id.
If it does not exist, build and publish the model definition. Default behavior for engine-only tests:
- inherit prod model-definition shape,
- replace only the engine-side field or version that must change,
- preserve prod-compatible SKU unless footprint changed.
If DeploymentGroupName is unknown or stale, prefer a create request without DeploymentGroupName and TrafficGroup. Keep InstanceCount in the request; planner can then resolve the active DG from ControlModelDefinitionId.
Run the MPCM capacity precheck. If it fails, stop before the shadow create call and report the exact allotment, VM family, and available-vs-requested capacity.
Write generated payloads under:
- AI-gen/<shadow-test-id>-model-definition-request.json
- AI-gen/<shadow-test-id>-shadow-create-request.json
Call the model-definition PUT only when required.
Call the shadow PUT.
Poll the shadow GET until it reaches Active, Failed, or a clearly stable nonterminal state.
If Active, optionally run a sanity request or guide the user to compare shadow logs against prod. For deployment-level traffic or return-code verification, prefer the xiaoranli-kuda quick-check helper using the exact endpoint/deployment pair and a start time anchored to creation time. If the next question is exact shadow ITL, TPOT, or "is any traffic coming in" on a mirrored rollout, hand off to xiaoranli-kuda and follow its Fixed Shadow ITL Workflow instead of guessing from FrontDoor or Nexus requests.
If it remains nonterminal, report:
- deploymentName,
- resolved DeploymentGroupName,
- resolved traffic percentage,
- deployment-group shadowState,
- and admin oplog state history.
If Failed, lead with capacityDiagnostics, deploymentName, and the TSG branch that matches the failure class.
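
The polling step in this flow can be sketched as a small loop. This is an illustration under assumptions: the shadow GET is injected as a callable that returns the current status string, so no HTTP client or response schema is implied, and "clearly stable nonterminal" is approximated as the same status seen on a configurable number of consecutive polls. The thresholds and names are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

# Statuses that end the polling loop.
TERMINAL_STATUSES = {"Active", "Failed"}


def poll_shadow_status(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    stable_after: int = 10,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[str], str]:
    """Poll until Active, Failed, or a stable nonterminal status.

    Returns (status, reason) where reason is "terminal",
    "stable-nonterminal", or "timeout".
    """
    last = None
    repeats = 0
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status, "terminal"
        # Count consecutive polls that returned the same nonterminal status.
        repeats = repeats + 1 if status == last else 1
        last = status
        if repeats >= stable_after:
            return status, "stable-nonterminal"
        sleep(interval_s)
    return last, "timeout"
```

A "stable-nonterminal" result is the cue to collect deploymentName, the resolved DG, deployment-group shadowState, and the admin oplog history rather than to keep waiting.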

Status and observability

When checking or debugging a shadow, always start with a compact status summary:

shadowTestId
status
shadowModelDefinitionId and shadowModelDefinitionVersion
controlModelDefinitionId
deploymentGroupName
deploymentName
instanceCount
overrideShadowTrafficPercentage

Always include the Shadow Test Status dashboard link using the resolved shadowTestId and deploymentGroupName:

https://dataexplorer.azure.com/dashboards/0f7089f3-5ba9-4128-ae02-155e4887c610?p-_startTime=3hours&p-_endTime=now&p-Region=all&p-_ShadowTestId=v-{shadowTestId}&p-_DeploymentGroupName=v-{deploymentGroupName}&p-_Status=all&p-Stamp=all&p-Allotment+Group=all&p-Allotment=all&p-ShadowModelPoolName=all&p-_operationId=all&p-_Endpoint=all&p-_Deployment=all#6144fdd7-0180-4449-ba88-ca9771186679
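
Filling the two placeholders in that link can be sketched as below. The template string is the link above, split across lines only for readability. Percent-encoding the substituted values is an added precaution and an assumption, not documented dashboard behavior, and the function name is hypothetical.

```python
from urllib.parse import quote

# Shadow Test Status dashboard link with its two placeholders.
DASHBOARD_TEMPLATE = (
    "https://dataexplorer.azure.com/dashboards/0f7089f3-5ba9-4128-ae02-155e4887c610"
    "?p-_startTime=3hours&p-_endTime=now&p-Region=all"
    "&p-_ShadowTestId=v-{shadowTestId}"
    "&p-_DeploymentGroupName=v-{deploymentGroupName}"
    "&p-_Status=all&p-Stamp=all&p-Allotment+Group=all&p-Allotment=all"
    "&p-ShadowModelPoolName=all&p-_operationId=all&p-_Endpoint=all&p-_Deployment=all"
    "#6144fdd7-0180-4449-ba88-ca9771186679"
)


def shadow_dashboard_link(shadow_test_id: str, deployment_group_name: str) -> str:
    """Substitute the resolved ids into the dashboard link."""
    return DASHBOARD_TEMPLATE.format(
        shadowTestId=quote(shadow_test_id, safe=""),
        deploymentGroupName=quote(deployment_group_name, safe=""),
    )
```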

If deploymentName is empty, do not jump straight to container logs. Stay on planner-facing checks first:

shadow GET
deployment-group capacity.shadowState
planner/admin oplog

If deploymentName is present, derive these identifiers before changing payloads:

endpointName: replace -dg- with -oe- in deploymentGroupName
resourceInstanceId: get it from scheduler traces or the first matching state-transition record
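
The endpointName derivation is a plain string substitution. A minimal sketch, assuming the deployment group name contains the `-dg-` segment exactly as described; the function name is hypothetical, and it replaces every `-dg-` occurrence, which is an assumption for names that contain the segment more than once.

```python
def endpoint_name_from_deployment_group(deployment_group_name: str) -> str:
    """Derive endpointName by replacing -dg- with -oe-."""
    if "-dg-" not in deployment_group_name:
        # Fail loudly rather than return a name that was never transformed.
        raise ValueError(
            f"deploymentGroupName has no '-dg-' segment: {deployment_group_name!r}"
        )
    return deployment_group_name.replace("-dg-", "-oe-")
```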

Use ADX or kusto-local against cluster https://aiscprodkusto.westus2.kusto.windows.net, database logs, when direct log queries are needed.

Use scheduler traces to separate planner/scheduling issues from container startup issues:

SchedulerStateTransitions
| where TIMESTAMP > ago(1h)
| where message has "{deploymentName}"
| project TIMESTAMP, CurrentState, ToState, ApplicationName, message
| order by TIMESTAMP asc
| take 50

Then use container traces for engine startup and health details:

ContainerTraces
| where TIMESTAMP > ago(1h)
| where ApplicationName == "{resourceInstanceId}"
| where Level >= 3 or RawMessage has "model" or RawMessage has "engine" or RawMessage has "ready" or RawMessage has "error" or RawMessage has "loading" or RawMessage has "started"
| project TIMESTAMP, CodePackageName, RawMessage, Level
| order by TIMESTAMP asc
| take 100

If there are no container logs yet, treat that as "container has not started" rather than proof that the payload is wrong.

Failure handling

Failed: Start with capacityDiagnostics. Distinguish capacity or user error from engine or container failure before changing the payload. Bucket the failure class before recommending a fix:
- SCHEDULING_FAILURE when scheduler traces show capacity, quota, or unschedulable placement issues.
- CONTAINER_CRASH or HEALTH_CHECK_TIMEOUT when deploymentName exists but the deployment never becomes healthy.
- OOM, CUDA_ERROR, or MODEL_LOAD_FAILURE when container traces show engine startup failure signatures.
- UNKNOWN only when neither capacityDiagnostics nor traces give a usable cause.
Preparing or Pausing stuck for a long time: Check deployment-group shadowState, planner behavior, and the admin oplog before retrying blindly. If oplog only shows Creating and deployment-group shadowState.currentStateInstanceCount is still below goal, treat it as underlying compute still provisioning, not as payload validation failure. If deploymentName is already set, inspect scheduler and container traces before retrying the API call.
InvalidShadowTestDeploymentGroupName: Treat this as a selector problem, not an engine problem. Retry without DeploymentGroupName and TrafficGroup while preserving ControlModelDefinitionId, InstanceCount, Sku, and AllotmentId.
delete workflow: Pause first, wait for Paused, then delete.
admin cleanup: Use the admin forceTerminate path only for explicit DRI or livesite cleanup, not as the default delete path.

Ask strategy

Do not ask for every payload field at once.

When the user asks what else is needed, reply with the unresolved subset from Required user inputs using those exact normalized field names. Keep the reply to the smallest sufficient subset.

Start with:

prod_engine_definition
shadow_engine_definition
region
deployment_group_name or model_pool_name
allotment_id
sku
traffic_percentage
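
Computing the smallest unresolved subset from this starting list can be sketched as follows. This is an illustration only: it assumes resolved values are collected in a dict keyed by the normalized field names, treats `None` and empty strings as unresolved, and uses a hypothetical function name.

```python
# Starting inputs in ask order; a tuple lists acceptable alternatives.
START_WITH = [
    ("prod_engine_definition",),
    ("shadow_engine_definition",),
    ("region",),
    ("deployment_group_name", "model_pool_name"),
    ("allotment_id",),
    ("sku",),
    ("traffic_percentage",),
]


def unresolved_inputs(resolved: dict) -> list:
    """Return the still-missing starting inputs, in ask order."""
    missing = []
    for alternatives in START_WITH:
        satisfied = any(
            resolved.get(name) not in (None, "") for name in alternatives
        )
        if not satisfied:
            missing.append(" or ".join(alternatives))
    return missing
```

Given both engine definitions and a model_pool_name, the result is `["region", "allotment_id", "sku", "traffic_percentage"]`, which is the only set worth asking about at that point.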

Then mine the repo for control_model_definition_id, request-template candidates, and naming patterns. If the user's stated goal is "only test engine definition", default to preserving prod-aligned sku, traffic_group, instance_count, and request-header behavior unless the diff proves they must change.

Before asking the user for a new shadow MD id, check whether the shadow ED name already exists as a model definition in model-pool. If the explicit deployment group is missing or rejected, prefer the no-DG fallback before asking the user for a replacement DG.

Ask for shadow_test_id, traffic_group, shadow_model_definition_id, shadow_model_definition_version, or header_override_notes only when they remain unresolved.