name: npu-torchair-infer description: Migrate any HuggingFace model to Ascend NPU torchair graph mode (torch.compile) and benchmark it for accuracy and performance against NPU eager and CPU eager. Use when running, compiling, or benchmarking HF models (vision, text, image-text encoders such as SigLIP2, DINOv3, ViT, CLIP, Qwen-VL, SAM) on Ascend 910B/CANN with torch_npu and torchair; when a torch.compile graph-mode run on NPU fails (Dynamo TorchRuntimeError, unsupported op, interpolate/contiguous errors); or when comparing torchair vs npu_eager vs cpu with cosine similarity, max abs diff, and p50/p95/p99 latency.
NPU TorchAir Inference & Migration (主流程)
Overview
Bring an arbitrary HuggingFace model up on Ascend NPU torchair graph mode,
verify it numerically against CPU/NPU-eager, and measure its speedup. The bundled
scripts are model-agnostic (config-driven inputs, recursive output comparison),
so the same flow applies to the next model with only a changed
--model_name_or_path.
Bundled assets:
scripts/torch_air_infer.py— minimal single-backend inference + latency stats.scripts/benchmark.py— 3-backend (torchair / npu_eager / cpu) accuracy + perf → JSON + Markdown.scripts/run_benchmark.sh— generic driver: prepare model → sanity infer → benchmark.
The one rule you cannot break
torch_npu MUST be imported before torchair, or graph mode does not engage.
import torch
import torch_npu # before torchair
import torchair
config = torchair.CompilerConfig()
npu_backend = torchair.get_npu_backend(compiler_config=config)
model = torch.compile(model, backend=npu_backend, dynamic=False)
Details, version matrix, and CompilerConfig knobs: see references/environment.md.
Main flow (run in order)
source $ASCEND_HOME/ascend-toolkit/set_env.sh # CANN runtime first
export HF_ENDPOINT=https://hf-mirror.com # China-friendly mirror
# 0) Sanity on CPU (fast) — confirms inputs/outputs resolve for this model
python scripts/torch_air_infer.py --model_name_or_path <path-or-id> --device cpu --backend eager
# 1) NPU eager — confirms the model runs on NPU at all
python scripts/torch_air_infer.py --model_name_or_path <path-or-id> --device npu --backend eager --dtype float16
# 2) NPU torchair graph mode — the migration target
python scripts/torch_air_infer.py --model_name_or_path <path-or-id> --device npu --backend torchair --dtype float16
# 3) Full 3-backend accuracy + performance comparison (writes JSON + Markdown)
python scripts/benchmark.py --model_name_or_path <path-or-id> \
--backends torchair,npu_eager,cpu --device npu \
--dtype float16 --batch_size 1 --warmup 10 --iterations 100 \
--output_dir benchmark_results/<model>
# strict <1e-3 parity check (fits-in-memory models): re-run step 3 with --dtype float32
Or do steps 2+3 in one shot: bash scripts/run_benchmark.sh <path-or-id> [out_subdir].
Decision tree
Step 0 (CPU) fails to build inputs? → model needs a custom input builder (see references/methodology.md, BaseModelLoader.build_inputs).
Step 1 (NPU eager) fails? → environment problem: re-check import order + set_env.sh (references/environment.md).
Step 2 (torchair) fails to compile? → walk references/troubleshooting.md; fall back to npu_eager and keep going.
Step 3 accuracy not within gate? → compare torchair vs npu_eager first (references/methodology.md): same ⇒ fp16 rounding (PASS_COSINE); differ ⇒ graph bug.
Weights gated / undownloadable? → build a config-init random stand-in (references/methodology.md); valid for backend comparison, not task accuracy.
Accuracy & performance gates (summary)
- Golden = CPU fp32. Metrics: cosine sim, max abs diff, max rel err.
PASS=max_abs_diff ≤ 1e-3andcosine ≥ 0.999;PASS_COSINE= cosine passes (expected in fp16); elseFAIL.- Perf: first iteration = compile cost (reported separately); steady p50/p95/p99 over ≥100 iters with
torch.npu.synchronize()before each timer stop; report throughput + peak HBM.
Full methodology, the model interface, gated-weights handling, and expected
output: references/methodology.md.
References (load as needed)
references/environment.md— import order, CANN/torch_npu/torchair/transformers compatibility matrix, pre-flight checklist, CompilerConfig.references/troubleshooting.md— compile-failure decision tree, known op failures (interpolate/contiguous), fallback order, bug localization.references/methodology.md—BaseModelLoaderinterface, input synthesis, accuracy/perf methodology, config-init stand-ins, expected output.
Reuse checklist for the next model
- Step 0 on CPU — inputs/outputs resolve (fast).
- Step 1 NPU eager works.
- Step 2 torchair — compiles ⇒ done; else walk
references/troubleshooting.md. - Step 3 benchmark — gate on cosine (fp16) or
<1e-3(fp32). - If inputs could not be inferred, add a small
build_inputsoverride (references/methodology.md).