name: ascend-computation-analysis description: Analyze Ascend NPU computation-side profiling data for single-card runs or a selected rank from multi-card runs. Use this skill when the user asks to diagnose computation bottlenecks, AI Core / AI Vector / AICPU hotspots, dynamic shape overhead, block dim issues, redundant TransData/Transpose/Cast, cross-stream waits, frequency/runtime-state issues, or fusion opportunities.
Ascend Computation Analysis
This skill diagnoses Ascend NPU computation-side performance problems and turns profiling evidence into prioritized optimization directions.
Keep the main workflow evidence-first. Computation problems often have multiple contributing factors, so do not force every case into one root-cause category. Use the failure handbook as a reference after evidence has been collected.
Scope
Use this skill for:
- Single-card computation profiling analysis.
- Multi-card profiling when the task is to select one computation-heavy rank and analyze it.
- AI Core, AI Vector, AICPU, dynamic shape, block dim, frequency, fusion, data movement, and cross-stream wait analysis.
- Interpreting computation-related
msprof-analyze advisorfindings.
Do not use this skill as the primary workflow for:
- Communication matrix, slow link, bandwidth, or HCCL root-cause analysis.
- Low-level DB schema exploration or ad hoc SQL design.
- Full cluster fast/slow rank diagnosis beyond selecting the computation-heavy rank for local drill-down.
Input Priority
Prefer evidence in this order:
ascend_pytorch_profiler_{rank_id}.dbkernel_details.csvop_statistic.csvprofiler_info.jsonstep_trace_time.csvoranalysis.db/StepTraceTimemsprof-analyze advisorstdout, HTML, or XLSX artifacts
Use a DB-first, CSV-fallback, advisor-cross-check approach.
If both DB and CSV are present, prefer DB for flexible queries and use CSV for quick summaries or schema-compatible fallback. If only CSV is present, continue with degraded analysis instead of blocking.
Tooling Boundary
This skill owns the reasoning workflow:
- Input routing.
- Evidence selection.
- Hypothesis building.
- Contributing-factor analysis.
- Recommendation priority.
Use msprof_mcp for stable data access when available, especially:
get_profiler_configanalyze_kernel_detailsanalyze_op_statisticget_operator_detailsexecute_sqlexecute_sql_to_csvmsprof_analyze_advisor
Do not create narrowly scoped scripts for every issue type. If a reusable capability is missing, prefer direct DB/CSV analysis in the current task and note the missing reusable capability as future msprof_mcp work.
Workflow Overview
Follow this order:
- Validate and route input.
- Select the analysis target.
- Build the computation overview.
- Drill down into operators and sequences.
- Use advisor as a cross-check.
- Consult the failure handbook and fusion rules.
- Prioritize findings and produce recommendations.
Do not jump directly to a single Top operator before building the overview. Do not force a single root cause when evidence points to multiple contributing factors.
Step 1: Validate And Route Input
Identify the available data and decide the analysis mode.
Check:
- Whether the input is single-card or multi-card.
- Whether
ascend_pytorch_profiler_{rank_id}.dbexists. - Whether
kernel_details.csvorop_statistic.csvexists. - Whether
profiler_info.jsonexists. - Whether advisor artifacts already exist.
Read profiler_info.json when available and determine profiler level.
Interpret profiler level this way:
level1or higher: shape, dtype, format, op type, block dim, and ratio-style fields may be available; detailed evidence review is possible.level0: analysis is limited. Op type, shape, dtype, format, block dim, and ratio fields may be missing. State which checks are degraded.
Output of this step:
- Data source selected.
- Analysis mode selected.
- Field limitations.
- Whether conclusions need lower confidence because of missing fields.
Step 2: Select The Analysis Target
For single-card input, analyze that card directly.
For multi-card input, keep the selection lightweight:
- Use cluster-level
ClusterStepTraceTimetable incluster_analysis_output/cluster_analysis.db, or aggregate rank-levelstep_trace_time.csvorStepTraceTimetable inanalysis.dbwhen available. - Select the rank with the longest computation time as the main analysis target.
- If computation time is unavailable, select the rank with the largest device compute operator time.
Do not perform full cluster fast/slow rank diagnosis in this skill. Once the target rank is selected, continue as a single-rank computation analysis.
Output of this step:
- Target rank.
- Selection evidence.
- Missing rank-selection evidence, if any.
Step 3: Build The Computation Overview
Before drilling into any operator, build an overview from the selected data source.
Required summaries:
- Operator type summary: total duration, count, average duration, and Top types.
- Accelerator core summary:
time share of
AI_CORE,AI_VECTOR,AI_CPU,MIX_AIC,MIX_AIV, or equivalent categories. - Top operator summary: total duration, count, average duration, and max duration.
- Stream or task category summary when available:
compute streams, communication streams, auxiliary streams, waits, and
OTHERtasks.
The overview should produce observations, not final classification.
Output of this step:
- Computation time distribution.
- Initial observations.
- Drill-down candidates.
Step 4: Drill Down Into Operators And Sequences
Use two complementary drill-down modes.
Operator Drill-Down
Use this for dominant single operators or operator types.
Inspect when available:
- Name and op type.
- Total, count, average, max duration.
- Shape clusters.
- Input/output dtype and format.
- Accelerator core.
- Block dim.
- Ratio fields such as
mac_ratioandmte2_ratio. - PMU-like metrics if already available.
For AI Core operators:
- A high
mac_ratiomay suggest compute-side saturation. - A high
mte2_ratiomay suggest data movement pressure. - Low ratios across the board may indicate shape, launch, fusion, or sequence-level problems.
For small-M MatMul/Cube cases:
- Do not reason only from FLOPs.
- Weight movement and repeated loading may dominate.
- Consider micro-batch, partitioning, and repeated-weight-movement changes.
Sequence Drill-Down
Use this for repeated small operators, no-bound operators, conversion-heavy paths, or fusion opportunities.
For computation-side operator sequences, first group by the same stream_id. Do not infer a compute sequence by globally sorting timestamps across multiple streams; operators that are close in time but run on different streams may be parallel, independent, or connected only through explicit wait/sync edges.
Look for:
Cast -> TransData -> ComputeTranspose -> Compute- repeated
Cast - many short vector operators
- format conversion before and after AI Core kernels
- synchronization or wait around auxiliary streams
Sequence-level findings often produce better recommendations than isolated single-op tuning.
Output of this step:
- Operator-level findings.
- Sequence-level findings.
- Candidate contributing factors.
- Confidence notes.
Step 5: Use Advisor As Cross-Check
If advisor artifacts are not already available and msprof-analyze can be executed, run it on the selected single-card profiling directory:
msprof-analyze advisor all -d <single_card_profiling_path> -o <advisor_output_path>
Advisor execution rules:
- Always pass
-o. Advisor creates new deliverables on every execution, so do not rely on the default output location. - Make
<advisor_output_path>descriptive enough to identify the analyzed data. Include useful features such as rank id, scenario, for exampleadvisor_rank3_compute,advisor_rank0_prefill, oradvisor_rank7_decode. - For multi-card input,
<single_card_profiling_path>must be the selected target-rank profiling directory, not the cluster root. - Record the exact command and output path used.
Advisor commonly emits both HTML and XLSX artifacts. Prefer XLSX as the primary machine-readable source. Use HTML only when the XLSX file is missing, incomplete, or unreadable.
Use advisor to:
- Find known rule-based problems.
- Provide additional evidence for optimization direction.
Do not use advisor to:
- Replace the overview.
- Decide priority without checking time impact.
- Copy generated text into the final answer.
Relevant computation-related advisor sheets or issue names include:
AICPU IssuesOperator Dynamic Shape IssuesAI Core Performance AnalysisCube Operator Perf AnalysisFA Operator Perf AnalysisVector Operator Perf AnalysisBlock Dim IssuesOperator No Bound IssuesAI Core Frequency IssuesAffinity API Issues
Advisor reading rules:
- Read the overview or problem summary first, the sheet name should be
问题综述orproblems. It also contains suggestions for every problem. - List actual sheet names before assuming a sheet exists.
- Read only triggered computation-related sheets by default.
- Re-rank advisor findings by measured time impact.
Output of this step:
- Advisor findings that support existing evidence.
- Advisor findings that add new hypotheses.
- Advisor findings that appear low-impact or unsupported.
Step 6: Consult References
After operator and sequence drill-down, consult references as needed:
- Read
references/failure_handbook.mdto map evidence to likely contributing factors and to avoid common false positives. - Read
references/fusion_rules.mdonly when repeated operator sequences, small-operator overhead, vector-heavy sequences, or fusion opportunities are observed.
Use them as a handbook for interpreting evidence. A finding may match multiple patterns, and some cases may remain inconclusive.
Step 7: Prioritize Findings And Recommend Next Actions
Prioritize by:
- Time impact.
- Evidence strength.
- Likelihood that the proposed change is actionable.
- Risk or invasiveness of the change.
Prefer recommendations framed as next experiments:
- What to change.
- What metric should improve.
- What evidence would confirm or reject the hypothesis.
Output Format
The final answer should follow this shape:
- Summary: main performance story and confidence.
- Evidence: key timings, counts, ratios, ranks, streams, operators, or advisor findings.
- Likely contributing factors: multiple factors are allowed; avoid forcing one root cause.
- Recommendations: concrete next experiments or code/model changes, ordered by expected impact.
- Limitations: missing fields, degraded profiler level, unavailable advisor sheets, or incomplete rank data.
Avoid:
- Full raw tables.
- SQL dumps.
- Advisor text replay.
- Recommendations without time-impact evidence.
- Overstating conclusions when fields are missing.
Degraded Modes
If ideal data is missing, continue with a clear limitation.
- No DB but CSV exists: use CSV summaries.
level0data: avoid strong shape, dtype, format, ratio, and block-dim conclusions.- No advisor output: continue with raw profiling evidence.
Never block the analysis solely because one preferred artifact is missing.