xevent-analysis - SKILL.md Agent Skill

name: xevent-analysis description: >- Parse SQL Server system_health XEvent (.xel) files to extract waits, errors, scheduler pressure, and sp_server_diagnostics alerts. Cross-correlate with ERRORLOG findings. Use when the user says "analyze xevent", "parse xel", "分析 XEvent", provides .xel file paths, or when system_health*.xel files are found alongside ERRORLOG. context: fork

XEvent Analysis (system_health)

Overview

Two analysis paths are available. Choose based on environment:

Path	Method	Speed	When to use
A — SQL Server import	`fn_xe_file_target_read_file` → physical tables → T-SQL queries	Seconds (200 MB XEL → ~30s)	Local SQL Server available (`sqlcmd`)
B — PowerShell + Node.js	`extract_xel.ps1` → JSON → `parse_xevent.js`	Minutes (200 MB XEL → 5-15 min)	No local SQL Server

Always prefer Path A when a SQL Server instance is accessible.

Step 0: Check Inputs

Detect XEL session type by filename:
- system_health*.xel → system_health (default, richest data)
- AlwaysOn_health*.xel → AG health session
- *SQLDIAG*.xel → sqldiag / component_health (different event names — see Known Limitations)
- Other *.xel → custom session
Determine if a local SQL Server is available:
- Test sqlcmd -S localhost -E -Q "SELECT 1"
- If yes → Path A. If no → Path B.

Path A: SQL Server Import (Recommended)

A.1 Run Import Script

scripts/import_xel_to_sql.sql

The script creates database [xevent_analyze] with schema [xe] and these physical tables:

Table	Contents	Key columns
`xe.raw_events`	All events with full XML	`case_id`, `event_name`, `event_time`, `event_data` (XML)
`xe.errors`	Shredded `error_reported`	`error_number`, `severity`, `state`, `message`, `database_name`, `session_id`
`xe.waits`	Shredded `wait_info` / `wait_info_external`	`wait_type`, `duration_ms`, `signal_duration_ms`, `wait_resource`
`xe.diagnostics`	Shredded `sp_server_diagnostics_component_result`	`component`, `state_desc`, `data_xml`
`xe.scheduler`	Shredded `scheduler_monitor_*`	`sql_cpu_pct`, `system_idle_pct`, `nonyielding_count`
`xe.deadlocks`	Shredded `xml_deadlock_report`	`deadlock_xml`
`xe.connectivity`	Shredded `connectivity_ring_buffer_recorded`	`conn_type`, `total_login_time_ms`, `sspi_processing_ms`, `ssl_processing_ms`, + 15 sub-timer fields
`xe.login_timers`	Shredded `process_login_finish` (custom session)	`is_success`, `total_login_time_ms`, `sspi_`, `ssl_`, `fedauth_*`, `application_name`

Usage:

# Edit @xel_path, @days, @case_id at top of script, then:
sqlcmd -S localhost -E -i scripts/import_xel_to_sql.sql

Configuration variables (lines 27-29 of script):

DECLARE @xel_path NVARCHAR(500) = N'C:\Temp\system_health_0_*.xel';
DECLARE @days INT = 7;      -- 0 = all data
DECLARE @case_id NVARCHAR(50) = N'{case_id}';

The script is idempotent — re-running with the same case_id deletes old rows first. Multiple cases can coexist in the same database (filtered by case_id).

⚠ CRITICAL: UTC → Local Time Conversion (During Analysis)

XEL files store @timestamp in UTC. ERRORLOG uses source SQL Server local time. The import script stores timestamps as-is in UTC. You MUST convert to the source server's local time during analysis queries to align with ERRORLOG.

How to determine the offset:

Check the source server's timezone from ERRORLOG server info or SELECT SYSDATETIMEOFFSET().
Common offsets: China (UTC+8), US Eastern (UTC-5/-4), US Pacific (UTC-8/-7).
If unknown, compare a known event that appears in both ERRORLOG and XEL to calculate the delta.

Apply offset in every analysis query:

-- Example: source server is UTC+8 (China)
SELECT DATEADD(HOUR, 8, event_time) AS local_time, ...
FROM xe.errors WHERE ...

-- Or create a view for convenience:
CREATE OR ALTER VIEW xe.v_errors AS
SELECT *, DATEADD(HOUR, 8, event_time) AS local_time FROM xe.errors;

Never compare raw XEL UTC timestamps directly with ERRORLOG local times. Always apply the offset first, or you will get wrong cross-correlation results.

A.2 Run Analysis Queries

After import, query via sqlcmd. Since database is per-case ([xevent_{case_id}]), use -d xevent_{case_id} or USE [xevent_{case_id}] — no three-part names needed.

Top errors:

SELECT error_number, severity, COUNT(*) AS cnt,
       MIN(event_time) AS first_seen, MAX(event_time) AS last_seen
FROM xe.errors WHERE case_id = '{case_id}'
GROUP BY error_number, severity ORDER BY cnt DESC

Top waits by total duration:

SELECT wait_type, COUNT(*) AS cnt,
       SUM(duration_ms) AS total_ms, AVG(duration_ms) AS avg_ms, MAX(duration_ms) AS max_ms,
       SUM(signal_duration_ms) AS total_signal_ms
FROM xe.waits WHERE case_id = '{case_id}'
GROUP BY wait_type ORDER BY total_ms DESC

sp_server_diagnostics WARNING/ERROR states:

SELECT component, state_desc, COUNT(*) AS cnt,
       MIN(event_time) AS first_seen, MAX(event_time) AS last_seen
FROM xe.diagnostics
WHERE case_id = '{case_id}' AND state_desc IN ('warning','error')
GROUP BY component, state_desc ORDER BY cnt DESC

Non-yielding and high-CPU scheduler events:

SELECT event_name, event_time, sql_cpu_pct, system_idle_pct, scheduler_id, nonyielding_count
FROM xe.scheduler
WHERE case_id = '{case_id}'
  AND (event_name LIKE '%non_yielding%' OR sql_cpu_pct > 75 OR nonyielding_count > 0)
ORDER BY event_time

Deadlocks:

SELECT event_time, deadlock_xml FROM xe.deadlocks
WHERE case_id = '{case_id}' ORDER BY event_time

Connectivity summary:

SELECT sni_consumer_error, os_error, conn_type, tds_flags, COUNT(*) AS cnt
FROM xe.connectivity WHERE case_id = '{case_id}'
GROUP BY sni_consumer_error, os_error, conn_type, tds_flags ORDER BY cnt DESC

Login timer summary (slow logins from connectivity ring buffer):

SELECT TOP 20
    event_time, state, total_login_time_ms,
    login_task_enqueued_ms AS enqueue, network_reads_ms AS net_read,
    ssl_processing_ms AS ssl, ssl_secure_calls_ms AS ssl_api,
    sspi_processing_ms AS sspi, sspi_secure_calls_ms AS sspi_api,
    find_login_ms, logon_triggers_ms, exec_classifier_ms,
    sni_consumer_error AS error, remote_host
FROM xe.connectivity
WHERE case_id = '{case_id}' AND total_login_time_ms > 0
ORDER BY total_login_time_ms DESC

Login timer bottleneck distribution:

SELECT
    CASE
        WHEN sspi_processing_ms > total_login_time_ms * 0.5 THEN 'SSPI/AD Auth'
        WHEN ssl_processing_ms > total_login_time_ms * 0.5 THEN 'SSL/TLS'
        WHEN network_reads_ms > total_login_time_ms * 0.5 THEN 'Network Read'
        WHEN login_task_enqueued_ms > total_login_time_ms * 0.5 THEN 'Thread Starvation'
        WHEN login_trigger_and_rg_ms > total_login_time_ms * 0.5 THEN 'Login Trigger/RG'
        ELSE 'Mixed/Other'
    END AS bottleneck,
    COUNT(*) AS cnt, AVG(total_login_time_ms) AS avg_ms, MAX(total_login_time_ms) AS max_ms
FROM xe.connectivity
WHERE case_id = '{case_id}' AND total_login_time_ms > 100
GROUP BY CASE
    WHEN sspi_processing_ms > total_login_time_ms * 0.5 THEN 'SSPI/AD Auth'
    WHEN ssl_processing_ms > total_login_time_ms * 0.5 THEN 'SSL/TLS'
    WHEN network_reads_ms > total_login_time_ms * 0.5 THEN 'Network Read'
    WHEN login_task_enqueued_ms > total_login_time_ms * 0.5 THEN 'Thread Starvation'
    WHEN login_trigger_and_rg_ms > total_login_time_ms * 0.5 THEN 'Login Trigger/RG'
    ELSE 'Mixed/Other'
END ORDER BY cnt DESC

process_login_finish analysis (custom session — includes successful logins):

-- Success/failure distribution
SELECT is_success, COUNT(*) AS cnt,
       AVG(total_login_time_ms) AS avg_ms, MAX(total_login_time_ms) AS max_ms
FROM xe.login_timers WHERE case_id = '{case_id}'
GROUP BY is_success

-- Top 20 slowest logins (all, including successful)
SELECT TOP 20
    event_time, is_success, error, spid, total_login_time_ms,
    login_task_enqueued_ms AS enqueue, network_reads_ms AS net_read,
    ssl_processing_ms AS ssl, sspi_processing_ms AS sspi,
    sspi_secure_calls_ms AS sspi_api, find_login_ms,
    application_name, driver_name, client_hostname
FROM xe.login_timers
WHERE case_id = '{case_id}' AND total_login_time_ms > 0
ORDER BY total_login_time_ms DESC

-- Bottleneck distribution for successful logins
SELECT
    CASE
        WHEN sspi_processing_ms > total_login_time_ms * 0.5 THEN 'SSPI/AD Auth'
        WHEN ssl_processing_ms > total_login_time_ms * 0.5 THEN 'SSL/TLS'
        WHEN network_reads_ms > total_login_time_ms * 0.5 THEN 'Network Read'
        WHEN login_task_enqueued_ms > total_login_time_ms * 0.5 THEN 'Thread Starvation'
        WHEN fedauth_processing_ms > total_login_time_ms * 0.5 THEN 'FedAuth/AAD'
        ELSE 'Mixed/Other'
    END AS bottleneck,
    COUNT(*) AS cnt, AVG(total_login_time_ms) AS avg_ms
FROM xe.login_timers
WHERE case_id = '{case_id}' AND is_success = 1 AND total_login_time_ms > 100
GROUP BY CASE
    WHEN sspi_processing_ms > total_login_time_ms * 0.5 THEN 'SSPI/AD Auth'
    WHEN ssl_processing_ms > total_login_time_ms * 0.5 THEN 'SSL/TLS'
    WHEN network_reads_ms > total_login_time_ms * 0.5 THEN 'Network Read'
    WHEN login_task_enqueued_ms > total_login_time_ms * 0.5 THEN 'Thread Starvation'
    WHEN fedauth_processing_ms > total_login_time_ms * 0.5 THEN 'FedAuth/AAD'
    ELSE 'Mixed/Other'
END ORDER BY cnt DESC

Ad-hoc: shred any event from raw XML (example — security_error_ring_buffer_recorded):

SELECT event_time,
       event_data.value('(event/data[@name="error_code"]/value)[1]', 'int') AS error_code,
       event_data.value('(event/data[@name="api_name"]/value)[1]', 'nvarchar(100)') AS api_name
FROM xe.raw_events
WHERE case_id = '{case_id}' AND event_name = 'security_error_ring_buffer_recorded'
ORDER BY event_time DESC

A.3 Interpret Results

Wait Analysis — Red Flag Thresholds

See full reference: .github/references/wait-types.md for benign/ignorable waits and the decision tree.

Wait type	Red flag	Root cause
`RESOURCE_SEMAPHORE`	avg > 5000 ms or any occurrence	Queries waiting for memory grant — server memory pressure
`LATCH_EX` / `LATCH_SH`	total > 10M ms	Non-buffer latch contention (ACCESS_METHODS_*, HOBT, etc.)
`LCK_M_*`	max > 60000 ms	Long lock waits — blocking chains
`PAGEIOLATCH_*`	avg > 15 ms	Storage I/O latency
`ASYNC_IO_COMPLETION`	max > 60000 ms	Severe storage stall
`WRITELOG`	avg > 5 ms	Transaction log write latency
`THREADPOOL`	any occurrence	Worker thread exhaustion
`PREEMPTIVE_OS_*`	avg > 5000 ms	External OS calls slow (AD, file system, etc.)

sp_server_diagnostics States

State	Meaning
clean	Healthy
warning	Component under pressure — investigate
error	Component failing — critical, usually triggers AG failover

Non-yielding Scheduler Events

Non-yielding with low CPU (< 30%) → thread stuck on I/O, latch, or memory grant (not CPU starvation)
Non-yielding with high CPU (> 75%) → CPU starvation, runaway query, or spinlock contention
Multiple non-yielding in a short window (< 5 min) → severe systemic issue

Analysis Methodology (Path A)

After importing data to [xevent_analyze].[xe].*, follow this structured analysis path.

Phase 1: Overview — What event types exist and how many?

SELECT event_name, COUNT(*) AS cnt FROM xe.raw_events
WHERE case_id = '{case_id}' GROUP BY event_name ORDER BY cnt DESC

This tells you which tables will have data and which will be empty. If security_error_ring_buffer_recorded or memory_broker_ring_buffer_recorded dominate, they may be the key signal (not just noise).

Phase 2: Per-Table Analysis

Analyze each shredded table in this order (highest signal first):

2.1 xe.errors — Error Landscape

-- Top errors by count
SELECT error_number, severity, COUNT(*) AS cnt,
       MIN(event_time) AS first_seen, MAX(event_time) AS last_seen
FROM xe.errors WHERE case_id = '{case_id}'
GROUP BY error_number, severity ORDER BY cnt DESC

-- Hourly distribution (baseline vs event?)
SELECT CAST(event_time AS DATE) AS day, DATEPART(HOUR, event_time) AS hr,
       COUNT(*) AS cnt, SUM(CASE WHEN severity >= 20 THEN 1 ELSE 0 END) AS fatal
FROM xe.errors WHERE case_id = '{case_id}'
GROUP BY CAST(event_time AS DATE), DATEPART(HOUR, event_time) ORDER BY day, hr

Look for: Is error rate constant (baseline problem) or spiking at specific hours (event-driven)?

2.2 xe.waits — Wait Profile

-- Top waits by total duration (NOT by count)
SELECT wait_type, COUNT(*) AS cnt,
       SUM(duration_ms) AS total_ms, AVG(duration_ms) AS avg_ms,
       MAX(duration_ms) AS max_ms, SUM(signal_duration_ms) AS total_signal_ms
FROM xe.waits WHERE case_id = '{case_id}'
GROUP BY wait_type ORDER BY total_ms DESC

-- Hourly pattern for top wait type
SELECT CAST(event_time AS DATE) AS day, DATEPART(HOUR, event_time) AS hr,
       COUNT(*) AS cnt, SUM(duration_ms) AS total_ms
FROM xe.waits WHERE case_id = '{case_id}' AND wait_type = '{top_wait}'
GROUP BY CAST(event_time AS DATE), DATEPART(HOUR, event_time) ORDER BY day, hr

-- Per-session distribution (1 big query or many concurrent?)
SELECT session_id, COUNT(*) AS cnt, SUM(duration_ms) AS total_ms,
       AVG(duration_ms) AS avg_ms, MAX(duration_ms) AS max_ms
FROM xe.waits WHERE case_id = '{case_id}' AND wait_type = '{top_wait}'
GROUP BY session_id ORDER BY total_ms DESC

Look for:

Total duration matters more than count (1 wait of 1M ms > 1000 waits of 100 ms)
signal_duration_ms / total_ms ratio — if signal is high proportion, CPU scheduling is slow
Hourly spikes vs flat → periodic job vs persistent issue
Session distribution → 1-2 sessions = single bad query; 20+ sessions = systemic pressure

2.3 xe.diagnostics — Health State + Embedded Data

-- State distribution
SELECT component, state_desc, COUNT(*) AS cnt,
       MIN(event_time) AS first_seen, MAX(event_time) AS last_seen
FROM xe.diagnostics WHERE case_id = '{case_id}'
GROUP BY component, state_desc ORDER BY component, state_desc

-- WARNING/ERROR records with data_xml
SELECT event_time, component, state_desc, LEFT(data_xml, 3000) AS data_preview
FROM xe.diagnostics
WHERE case_id = '{case_id}' AND state_desc IN ('warning','error')
ORDER BY event_time

For QUERY_PROCESSING WARNING: The data_xml contains <blockingTasks> with actual SQL text of blocking queries. This is direct evidence of what's causing pressure.

For RESOURCE WARNING: See Memory Verification below.

For SYSTEM WARNING: Check timing — if it starts after an AG failure, it's a symptom, not a cause.

2.4 xe.scheduler — CPU and Non-Yielding

SELECT event_name, event_time, sql_cpu_pct, system_idle_pct, nonyielding_count
FROM xe.scheduler
WHERE case_id = '{case_id}'
  AND (event_name LIKE '%non_yielding%' OR sql_cpu_pct > 75 OR nonyielding_count > 0)
ORDER BY event_time

-- Hourly CPU trend
SELECT CAST(event_time AS DATE) AS day, DATEPART(HOUR, event_time) AS hr,
       AVG(sql_cpu_pct) AS avg_cpu, MAX(sql_cpu_pct) AS max_cpu,
       SUM(CASE WHEN event_name LIKE '%non_yielding%' THEN 1 ELSE 0 END) AS ny_cnt
FROM xe.scheduler WHERE case_id = '{case_id}'
GROUP BY CAST(event_time AS DATE), DATEPART(HOUR, event_time) ORDER BY day, hr

Look for: Non-yielding events with low CPU → thread stuck (I/O, memory grant, latch), not CPU starvation.

2.5 xe.connectivity — Connection Errors + Login Timers

-- Error pattern
SELECT sni_consumer_error, os_error, conn_type, tds_flags, COUNT(*) AS cnt
FROM xe.connectivity WHERE case_id = '{case_id}'
GROUP BY sni_consumer_error, os_error, conn_type, tds_flags ORDER BY cnt DESC

-- Top source IPs
SELECT remote_host, COUNT(*) AS cnt,
       MIN(event_time) AS first_seen, MAX(event_time) AS last_seen
FROM xe.connectivity WHERE case_id = '{case_id}' AND sni_consumer_error > 0
GROUP BY remote_host ORDER BY cnt DESC

-- Slow login analysis (full timer breakdown)
SELECT TOP 20
    event_time, total_login_time_ms,
    login_task_enqueued_ms AS enqueue, network_reads_ms AS net_read,
    ssl_processing_ms AS ssl, ssl_secure_calls_ms AS ssl_api,
    sspi_processing_ms AS sspi, sspi_secure_calls_ms AS sspi_api,
    find_login_ms, logon_triggers_ms, remote_host
FROM xe.connectivity
WHERE case_id = '{case_id}' AND total_login_time_ms > 100
ORDER BY total_login_time_ms DESC

Login Timer Interpretation:

Field	High value indicates
`login_task_enqueued_ms`	SQL Server thread starvation (no worker threads)
`network_reads_ms`	Network latency or client not responding
`ssl_processing_ms`	TLS issues (cert chain, CRL check)
`ssl_secure_calls_ms`	SSL API slow
`sspi_processing_ms`	AD/Kerberos/NTLM slow
`sspi_secure_calls_ms`	Domain controller slow/unreachable
`find_login_ms`	Login validation slow
`logon_triggers_ms`	Logon trigger overhead
`exec_classifier_ms`	Resource Governor classifier slow

Note: connectivity_ring_buffer_recorded only fires on connection failure/abnormal close. For successful login timing, need process_login_finish from custom XEvent session → xe.login_timers.

Look for:

os_error = 10054 (connection reset by peer) → network/client issue
os_error = 10060 (connection timeout) → network latency
os_error = 109 (pipe ended) → named pipe disconnect
Concentrated on 1-2 IPs → client-specific problem; many IPs → server-side cause

2.6 xe.security_errors — Security API Failures

SELECT error_code, api_name, calling_api, COUNT(*) AS cnt
FROM xe.security_errors WHERE case_id = '{case_id}'
GROUP BY error_code, api_name, calling_api ORDER BY cnt DESC

⚠️ CRITICAL: Always verify error_code meaning before assuming OOM:

[ComponentModel.Win32Exception]::new(<error_code>).Message

Common error codes in this table:

error_code	Actual meaning	NOT
5023	`ERROR_INVALID_STATE` — resource not in correct state	~~ERROR_OUTOFMEMORY~~
1332	`ERROR_NO_SUCH_MEMBER` — AD account/group not found	—
0x8009030C	`SEC_E_LOGON_DENIED` — SSPI logon denied	—
0x80090301	`SEC_E_INSUFFICIENT_MEMORY` — this one IS memory	—

2.7 xe.memory_broker — Memory Trend

SELECT CAST(event_time AS DATE) AS day, DATEPART(HOUR, event_time) AS hr,
       AVG(memory_ratio) AS avg_ratio, MIN(memory_ratio) AS min_ratio, COUNT(*) AS samples
FROM xe.memory_broker WHERE case_id = '{case_id}'
GROUP BY CAST(event_time AS DATE), DATEPART(HOUR, event_time)
HAVING MIN(memory_ratio) < 90 ORDER BY day, hr

Look for: Periodic dips in memory_ratio correlating with RESOURCE_SEMAPHORE spikes.

2.8 xe.ag_events — AG State Changes (if sqldiag imported)

SELECT event_name, event_time, ag_name, reason, target_state, failure_condition
FROM xe.ag_events WHERE case_id = '{case_id}' ORDER BY event_time

Look for: reason = LEASEEXPIRED → AG lease worker couldn't be scheduled in time.

Phase 3: Memory Verification (RESOURCE WARNING Deep-Dive)

When xe.diagnostics shows RESOURCE WARNING, extract the embedded memory report from xe.raw_events XML to verify whether real OOM occurred.

Step 1 — Extract memory report:

SELECT event_time,
    event_data.value('(event/data[@name="data"]/value/resource/@lastNotification)[1]', 'nvarchar(50)') AS last_notification,
    event_data.value('(event/data[@name="data"]/value/resource/@outOfMemoryExceptions)[1]', 'int') AS oom_exceptions,
    event_data.value('(event/data[@name="data"]/value/resource/@isAnyPoolOutOfMemory)[1]', 'int') AS pool_oom,
    event_data.value('(event/data[@name="data"]/value/resource/@processOutOfMemoryPeriod)[1]', 'int') AS oom_period
FROM xe.raw_events
WHERE case_id = '{case_id}' AND event_name LIKE '%diagnostics_component_result'
  AND event_data.value('(event/data[@name="component"]/text)[1]', 'nvarchar(50)') = 'RESOURCE'
  AND event_data.value('(event/data[@name="state"]/text)[1]', 'nvarchar(20)') = 'WARNING'
ORDER BY event_time

If XPath extraction returns NULL (sqldiag uses /value instead of /text), read full XML:

SELECT event_time, CAST(event_data AS NVARCHAR(MAX)) AS full_xml
FROM xe.raw_events
WHERE case_id = '{case_id}' AND event_name = 'component_health_result'
  AND event_data.value('(event/data[@name="component"]/value)[1]', 'nvarchar(50)') = 'resource'
  AND event_data.value('(event/data[@name="state_desc"]/value)[1]', 'nvarchar(20)') = 'warning'
ORDER BY event_time ASC

Step 2 — Verify: OS-level or SQL-level pressure?

Indicator	Value=1 means	Check
`System physical memory low`	OS has low physical memory	OS-level problem
`System physical memory high`	OS has plenty of memory	OS is fine
`Process physical memory low`	SQL Server process sees low memory	SQL-internal pressure
`Process virtual memory low`	VA space low (rare on 64-bit)	Usually not relevant

Step 3 — Verify: SQL Server internal state

Indicator	What to check
`Target Committed` vs `Current Committed`	Equal → SQL used all of max server memory
`Locked Pages Allocated`	Large value → LPIM in use
`Pages Free`	Trend over time: declining → pressure worsening
`Available Physical Memory`	OS-level free — if large but Process low=1 → SQL hogging

Step 4 — Verify: Did actual OOM happen?

Indicator	If 0	If > 0
`outOfMemoryExceptions`	No actual OOM in SQL process	Confirmed OOM
`isAnyPoolOutOfMemory`	No pool exhausted	A memory pool ran out
`processOutOfMemoryPeriod`	Not in OOM state	Duration of OOM state
`Last OOM Factor`	No recent OOM	OOM occurred (check factor value)

Step 5 — Decision:

outOfMemoryExceptions = 0 AND isAnyPoolOutOfMemory = 0?
├── YES → Memory pressure exists but NO actual OOM failure
│   └── Look for RESOURCE_SEMAPHORE waits as the real symptom of memory pressure
└── NO  → Confirmed OOM
    └── Check which clerk/pool ran out, check max server memory vs physical RAM

⚠️ Common Pitfall: Do NOT assume lastNotification = RESOURCE_MEMPHYSICAL_LOW means "out of memory". It means the memory broker detected pressure, but the OOM counters tell you whether allocations actually failed. Also verify any security_error error codes — they may not be OOM-related (e.g. 5023 = invalid state).

Phase 4: Time-Axis Cross-Correlation

Overlay all event types on the same hourly time axis:

SELECT day, hr,
       MAX(rs_cnt) AS resource_sem, MAX(err_cnt) AS errors,
       MAX(conn_cnt) AS conn_errors, MAX(sec_cnt) AS sec_errors,
       MAX(avg_cpu) AS avg_cpu, MAX(ny_cnt) AS non_yielding,
       MAX(warn_cnt) AS diag_warnings, MAX(min_ratio) AS mem_ratio_min
FROM (
    SELECT CAST(event_time AS DATE) day, DATEPART(HOUR,event_time) hr,
           COUNT(*) rs_cnt, NULL err_cnt, NULL conn_cnt, NULL sec_cnt,
           NULL avg_cpu, NULL ny_cnt, NULL warn_cnt, NULL min_ratio
    FROM xe.waits WHERE case_id='{case_id}' AND wait_type='RESOURCE_SEMAPHORE'
    GROUP BY CAST(event_time AS DATE), DATEPART(HOUR,event_time)
    UNION ALL
    SELECT CAST(event_time AS DATE), DATEPART(HOUR,event_time),
           NULL, COUNT(*), NULL, NULL, NULL, NULL, NULL, NULL
    FROM xe.errors WHERE case_id='{case_id}'
    GROUP BY CAST(event_time AS DATE), DATEPART(HOUR,event_time)
    -- ... add more UNION ALL for each table
) x GROUP BY day, hr ORDER BY day, hr

Look for:

Patterns that correlate in time → likely same root cause
Patterns that are constant while others spike → independent baseline issue
Events that start after a failure → symptoms, not causes

Phase 5: Causal Chain Construction

Work backwards from the final symptom:

1. What was the final symptom? (e.g. AG LEASEEXPIRED)
2. What could cause lease expiry? → Lease worker not scheduled
3. What blocks scheduling? → Non-yielding? CPU starvation? Memory grant wait?
4. Is there evidence of that blocker? → Check xe.scheduler, xe.waits
5. What caused the blocker? → Check xe.diagnostics data_xml for blocking queries
6. Is it periodic or constant? → Check hourly distribution
7. Are there independent baseline issues? → Check constant-rate events separately

For each conclusion, record:

Evidence source (which table, which query)
Strength: ✅ Direct evidence / ⚠️ Inference / ❌ Assumption
If inference, note what data would confirm or refute it

Path B: PowerShell + Node.js (Fallback)

Use when no local SQL Server is available.

B.1 Extract XEL → JSON (PowerShell)

scripts/extract_xel.ps1

Supports -EventName filter to skip irrelevant events (major performance improvement):

# Filtered extraction (recommended — 3-5x faster)
powershell -File scripts/extract_xel.ps1 `
  -Path "C:\Temp\system_health*.xel" `
  -Days 7 `
  -EventName error_reported,wait_info,wait_info_external,sp_server_diagnostics_component_result,scheduler_monitor_system_health_ring_buffer_recorded,scheduler_monitor_non_yielding_ring_buffer_recorded,xml_deadlock_report,connectivity_ring_buffer_recorded `
  -Output reports/{case_id}_xevent_extract.json

# Unfiltered (all events — slower, larger JSON)
powershell -File scripts/extract_xel.ps1 -Path "*.xel" -Days 7 -Output reports/{case_id}_xevent_extract.json

Note on -EventName: When invoked via powershell -File, comma-separated values are passed as a single string. The script splits on commas internally — this is by design.

B.2 Analyze XEvent JSON (Node.js)

scripts/parse_xevent.js

# With ERRORLOG cross-correlation
node scripts/parse_xevent.js reports/{case_id}_xevent_extract.json \
  --errorlog reports/{case_id}_errorlog_findings.json \
  --json --output reports/{case_id}_xevent_findings.json

Output JSON contains: errors, wait_analysis, scheduler_events, deadlocks, patterns, timeline, correlation.

B.3 Cross-Correlation

The --errorlog flag produces:

errors_in_both — confirmed by two independent sources (high confidence)
xevent_only_errors — may be lower severity or suppressed from ERRORLOG
errorlog_only_errors — not captured by XEvent session configuration

Known Limitations

sqldiag / component_health sessions

Sessions named *SQLDIAG* or *hadr_health* use different event names than system_health. Notably:

component_health_result instead of sp_server_diagnostics_component_result
No error_reported, wait_info, scheduler_monitor_* events
Rich data (topWaits, blockingTasks, AG state) is embedded as XML inside the data field of component_health_result events

For these sessions:

Path A with xe.raw_events is superior — you can write custom XQuery to shred the embedded XML
Path B's parse_xevent.js will return empty results for waits/errors/scheduler (parser does not parse embedded XML payloads)

Event name inventory

Before filtering, discover what events exist in the XEL:

-- Path A
SELECT event_name, COUNT(*) AS cnt FROM xe.raw_events
WHERE case_id = '{case_id}' GROUP BY event_name ORDER BY cnt DESC

# Path B (quick sample)
powershell -Command "Import-Module SqlServer; Read-SqlXEvent -FileName 'file.xel' | Select-Object -First 500 | Group-Object Name | Sort-Object Count -Descending | Select-Object Name, Count"

Auto-Detect XEL Files

When the user provides an ERRORLOG directory, automatically probe for:

{errorlog_dir}/system_health*.xel
{errorlog_dir}/AlwaysOn_health*.xel
{errorlog_dir}/*SQLDIAG*.xel

If found, inform the user and include in analysis.

Output to Orchestrator

When called from the sql-csi full-analysis pipeline, return:

Absolute paths of output files (JSON findings, or note that data is in [xevent_analyze] database)
Top errors with cross-correlation flags
Top waits by total duration with red-flag annotations
sp_server_diagnostics WARNING/ERROR summary
Non-yielding event count + cluster timestamps
Deadlock count
Anomalies (session type mismatch, empty event categories, etc.)