name: sqldiskio-review description: Analyze SQL Server file-level I/O latency and auto-growth events using sys.dm_io_virtual_file_stats, sys.master_files, and default trace auto-growth records. Applies 15 checks (Z1–Z15) covering data and log file latency thresholds, hot file detection, stall ratio analysis, data and log placement on the same volume, TempDB co-location with user databases, auto-growth event frequency and sizing, file growth during production hours, system drive file placement, and multi-snapshot I/O trend analysis. Use this skill whenever a DBA suspects slow I/O, queries show PAGEIOLATCH or WRITELOG waits, or a file grew unexpectedly. Trigger when pasting output from sys.dm_io_virtual_file_stats or sys.master_files. triggers: - /sqldiskio-review - /diskio-review - /io-latency
SQL Server Disk I/O Review Skill
Purpose
Analyze SQL Server file-level I/O performance and storage configuration issues. Applies 15 checks (Z1–Z15) across three categories:
- Z1–Z5 — Latency and stall analysis: data file read/write latency, log file write latency, stall ratio per file, and hot file detection
- Z6–Z10 — Storage placement and configuration: data and log on the same volume, TempDB co-location, system drive placement, file count imbalance, and TempDB log latency
- Z11–Z15 — Auto-growth patterns: auto-growth events in recent hours, fixed-MB growth on data files, log file growth too small, growth events during peak hours, and multi-snapshot I/O trend worsening
Input
Accept any of:
- A snapshot pair from
sys.dm_io_virtual_file_statscapture query below — two captures taken seconds/minutes apart with the delta calculated (preferred) - A single raw output from
sys.dm_io_virtual_file_stats(cumulative since startup); note that single captures reflect all-time averages since the last SQL Server restart - Output from
sys.master_filesfor file placement and auto-growth configuration - Default trace query output showing recent auto-growth events
- A natural language description of symptoms ("log file grew three times today, data drive showing 80ms reads")
Recommended capture queries
-- 1. I/O latency snapshot (cumulative since restart or since last baseline)
-- Best practice: capture twice 60 seconds apart and subtract to get interval stats
SELECT
DB_NAME(vfs.database_id) AS database_name,
mf.physical_name,
mf.type_desc,
vfs.io_stall_read_ms,
vfs.num_of_reads,
vfs.io_stall_write_ms,
vfs.num_of_writes,
vfs.io_stall,
vfs.num_of_bytes_read / 1048576 AS mb_read,
vfs.num_of_bytes_written / 1048576 AS mb_written,
CASE WHEN vfs.num_of_reads > 0
THEN vfs.io_stall_read_ms / vfs.num_of_reads ELSE 0 END AS avg_read_ms,
CASE WHEN vfs.num_of_writes > 0
THEN vfs.io_stall_write_ms / vfs.num_of_writes ELSE 0 END AS avg_write_ms,
CASE WHEN (vfs.num_of_reads + vfs.num_of_writes) > 0
THEN vfs.io_stall / (vfs.num_of_reads + vfs.num_of_writes)
ELSE 0 END AS avg_stall_per_io_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
ON vfs.database_id = mf.database_id
AND vfs.file_id = mf.file_id
ORDER BY vfs.io_stall DESC;
-- 2. File configuration (placement and auto-growth settings)
SELECT
DB_NAME(database_id) AS database_name,
name AS logical_name,
physical_name,
type_desc,
size * 8 / 1024 AS size_mb,
CASE is_percent_growth
WHEN 1 THEN CAST(growth AS VARCHAR) + '%'
ELSE CAST(growth * 8 / 1024 AS VARCHAR) + ' MB'
END AS growth_setting,
max_size,
is_read_only
FROM sys.master_files
ORDER BY database_id, type;
-- 3. Auto-growth events from default trace (last 24 hours)
DECLARE @tracefile NVARCHAR(500);
SELECT @tracefile = REVERSE(SUBSTRING(REVERSE(path), CHARINDEX('\', REVERSE(path)), 500))
+ N'log.trc'
FROM sys.traces WHERE is_default = 1;
SELECT
DatabaseName,
FileName,
CASE EventClass WHEN 92 THEN 'Data Auto-Grow'
WHEN 93 THEN 'Log Auto-Grow' END AS event_type,
Duration AS duration_ms, -- Duration is reported in milliseconds for event classes 92/93
IntegerData * 8 / 1024 AS growth_mb,
StartTime
FROM fn_trace_gettable(@tracefile, DEFAULT)
WHERE EventClass IN (92, 93)
AND StartTime >= DATEADD(HOUR, -24, GETDATE())
ORDER BY StartTime DESC;
Thresholds Reference
| Metric | Good | Warning | Critical |
|---|---|---|---|
| Data file avg read latency | < 10 ms | 10–20 ms | > 20 ms |
| Data file avg write latency | < 10 ms | 10–20 ms | > 20 ms |
| Log file avg write latency | < 5 ms | 5–10 ms | > 10 ms |
| Stall ratio (avg stall ms per I/O operation) | < 5 ms | 5–15 ms | > 15 ms |
| Hot file: single file share of total I/O | < 60% | 60–80% | > 80% |
| Auto-growth events in last 24 h | 0 | 1–3 | > 3 |
| Auto-growth fixed-size increment | ≥ 256 MB | 64–255 MB | < 64 MB |
Threshold provenance: The data/log avg-latency bands (Z1–Z3) reflect widely-cited storage-latency guidance. The remaining cutoffs — Z4 (stall ratio), Z5 (hot-file share), and Z11–Z15 (auto-growth event count and increment sizing) — are reasonable operational heuristics, not Microsoft-documented values. Use them to prioritize where to look; confirm an actual problem against the workload's own baseline rather than the number alone.
Latency and Stall Checks (Z1–Z5)
Run these first — latency is the direct measure of disk I/O quality.
Z1 — Data File Read Latency
- Trigger:
avg_read_msfor any data file (type_desc =ROWS) > 20 ms - Severity: Warning if 10–20 ms; Critical if > 20 ms
- Fix: Data read latency above 20 ms indicates the storage subsystem is not keeping up with SQL Server's I/O demands. Root causes: (1) insufficient IOPS on the volume — check if the storage tier (HDD, SSD, NVMe) is appropriate for the workload; (2) storage contention from another workload on the same spindles/LUN; (3) RAID rebuild in progress; (4) the working set is larger than the buffer pool (pair with
/sqlmemory-reviewO1 for low PLE). On Azure VMs, verify the disk is Premium SSD and the VM I/O bandwidth cap is not being hit.
Z2 — Data File Write Latency
- Trigger:
avg_write_msfor any data file (type_desc =ROWS) > 20 ms - Severity: Warning if 10–20 ms; Critical if > 20 ms
- Fix: Write latency above 20 ms on data files indicates checkpoint writes are stalling. Causes: (1) write cache disabled or write-through on the storage controller; (2) RAID write penalty (RAID-5/6 should be avoided for SQL data files); (3) storage controller cache is full due to heavy write workload. Enable write-back caching on the storage controller (with a battery-backed unit or supercapacitor to protect against power loss). For Azure, check if Azure Write Accelerator is appropriate for log files.
Z3 — Log File Write Latency
- Trigger:
avg_write_msfor any transaction log file (type_desc =LOG) > 10 ms - Severity: Warning if 5–10 ms; Critical if > 10 ms
- Fix: Log writes are synchronous — every committed transaction waits for the log write to complete. High log latency directly translates to transaction latency and WRITELOG waits (see
/sqlwait-reviewV6). The transaction log should be on a separate volume from data files, on low-latency storage (NVMe preferred). Check: (1) is the log file on its own dedicated volume? (Z6); (2) is the volume on a spinner? Migrate to SSD; (3) is another workload sharing the log volume?
Z4 — Hot Data File
- Trigger: A single data file receives ≥ 60% of all I/O operations (reads + writes) across all data files in the same database
- Severity: Warning if 60–79%; Critical if ≥ 80%
- Fix: One file is serving the majority of I/O, indicating the database's working set is concentrated in that file's extent ranges. Solutions: (1) add secondary data files to distribute I/O — SQL Server distributes new allocations proportionally across equally-sized files; (2) check that existing data files are the same size — unequal sizes cause disproportionate fill (the larger file fills first). This is the most common TempDB issue — also check Z7 for TempDB-specific guidance.
Z5 — High Stall Ratio
- Trigger:
avg_stall_per_io_msfor any file > 15 ms (combinedio_stall / (num_of_reads + num_of_writes)— average time each I/O operation spent stalled) - Severity: Warning if 5–15 ms; Critical if > 15 ms
- Fix: The stall ratio measures average wait per I/O operation across reads and writes combined. A high stall ratio independent of latency can indicate: storage QoS throttling, LUN queue depth saturation, or iSCSI/FC network congestion. If latency is normal but stall ratio is high, investigate the storage path (HBA driver, multipath I/O configuration, SAN QoS policy).
Storage Placement and Configuration Checks (Z6–Z10)
Z6 — Data and Log on Same Volume
- Trigger: Any database has data files and log files sharing the same drive letter or mount point path prefix (comparing
physical_namefromsys.master_files) - Severity: Warning
- Fix: Data and log files compete for I/O when co-located, and a data file growth event can fill the volume and immediately prevent log writes. Place log files on a dedicated volume. Separate physical spindles or SSD tiers ensure that log writes (which must be sequential and low-latency) do not compete with random data read/write traffic.
Z7 — TempDB on Same Volume as User Database Files
- Trigger: TempDB data files share a volume with user database data or log files
- Severity: Warning
- Fix: TempDB is heavily used for sort spills, hash joins, temporary tables, and row versioning. Co-locating it with user databases creates I/O interference in both directions. Place TempDB data files on a dedicated, low-latency volume. On systems with NVMe drives, TempDB is an ideal candidate for local NVMe storage since it is recreated at startup and does not need to survive restarts.
Z8 — TempDB Log File Latency
- Trigger:
avg_write_msfor the TempDB log file > 10 ms - Severity: Warning if 5–10 ms; Critical if > 10 ms
- Fix: TempDB log latency is particularly impactful because version store operations and row-level locking all pass through the TempDB log. High TempDB log latency stalls even simple update workloads. Place TempDB on NVMe or dedicated SSD. Also check Z7 for volume co-location issues.
Z9 — File Count Imbalance (Single File Handles Most I/O)
- Trigger: When a database (especially TempDB) has multiple data files, one file handles > 70% of total data file I/O
- Severity: Warning
- Fix: Files must be the same size for SQL Server's proportional fill algorithm to distribute I/O evenly. Shrink and resize files to equal sizes, then grow them together. For TempDB specifically, a common misconfiguration is having the primary file larger because it was created at setup before secondary files were added. Use
DBCC SHRINKFILE+ALTER DATABASE ... MODIFY FILEto equalize sizes. Note: for TempDB,sys.master_filesshows the configured startup size, not the current size — querytempdb.sys.database_filesfor current file sizes.
Z10 — Database File on System Drive
- Trigger: Any database file (data or log) has
physical_namestarting with the system drive letter (typicallyC:\) - Severity: Warning
- Fix: Database files on the system drive compete for space and I/O with the OS, Windows paging file, and application logs. A data file filling the system drive can freeze the OS. Move user database files to a dedicated data volume using
ALTER DATABASE ... MODIFY FILE(then detach/reattach or useALTER DATABASE ... SET OFFLINE+ file move). System databases (master, msdb, model) may be on the system drive but should be assessed for growth risk.
Auto-Growth Pattern Checks (Z11–Z15)
Z11 — Auto-Growth Events in Last 24 Hours
- Trigger: Default trace shows ≥ 1 auto-growth event for any database file in the past 24 hours
- Severity: Warning if 1–3 events; Critical if > 3 events
- Fix: Auto-growth is a reactive, blocking operation — queries stall while the file grows. The goal is to pre-size files to avoid auto-growth during normal operations. Identify the database growing, estimate its daily growth rate, and pre-grow by 1–4 weeks of projected growth. Auto-growth should be a safety net for unusual spikes, not a routine occurrence. Monitor with a SQL Agent job or custom XE session targeting EventClass 92/93.
Z12 — Data File Auto-Growth Too Small (Fixed MB)
- Trigger: Any data file in
sys.master_fileshasis_percent_growth = 0ANDgrowth * 8 / 1024< 256 MB (i.e., grows in increments < 256 MB) - Severity: Warning if 64–255 MB; Critical if < 64 MB
- Fix: Small fixed-MB increments mean the file will grow frequently in small steps, each causing a stall. SQL Server databases should grow in large increments (256 MB to 1 GB or more for large databases) or use percentage growth only on small databases (< 10 GB). Avoid percent growth on large databases — a 10% growth on a 1 TB database triggers a 100 GB allocation event that can take minutes. Use:
ALTER DATABASE [name] MODIFY FILE (NAME = logical_name, FILEGROWTH = 512MB).
Z13 — Log File Auto-Growth Too Small
- Trigger: Any log file has
growth * 8 / 1024< 128 MB (if fixed) ORis_percent_growth = 1ANDgrowth> 10 (percent growth on a large log) - Severity: Warning
- Fix: Log auto-growth events are especially impactful because they are synchronous — the writing session must wait. Set log file growth to a fixed increment large enough to avoid frequent events: typically 256 MB for small databases, 1 GB+ for databases with high transaction volume. Avoid percentage-based growth on log files as the increments become unmanageably large.
Z14 — Auto-Growth During Production Hours
- Trigger: Default trace shows auto-growth events occurring between 07:00 and 21:00 local time (configurable), especially on high-transaction databases
- Severity: Warning
- Fix: Auto-growth during peak hours directly impacts user-facing transactions. This indicates the database was not pre-sized appropriately for the day's expected load. Pre-grow the file in a nightly maintenance window using a SQL Agent job:
ALTER DATABASE [name] MODIFY FILE (NAME = logical_name, SIZE = <new_size>MB). Monitor growth trends over 30–90 days to project future needs.
Z15 — I/O Stall Trend Worsening Across Snapshots
- Trigger: When ≥ 3 time-based snapshots of
sys.dm_io_virtual_file_statsare provided,avg_read_msoravg_write_msfor the same file shows a monotonically increasing pattern across snapshots - Severity: Warning if worsening over 2 consecutive snapshots; Critical if worsening over 3+ consecutive snapshots
- Fix: Worsening I/O latency over time suggests: storage subsystem degradation (drive health — check Windows Event Log for disk errors), growing data volume outpacing storage IOPS capacity, or a storage maintenance operation (RAID rebuild, snapshot replication). Pull storage controller event logs and correlate with the latency onset time. For cloud storage (Azure Managed Disks, AWS EBS), check the cloud provider's disk metrics for throttling events.
Output Format
Present findings in this order:
- I/O Health Summary — one sentence: is I/O within acceptable latency bounds for all files?
- Findings table — one row per triggered check:
| Check | Severity | File | Metric | Finding | Fix |
|---|---|---|---|---|---|
| Z3 | Critical | AdventureWorks.ldf | avg_write_ms = 42 ms | Log write latency 4× threshold | Move log to dedicated NVMe volume |
- Top I/O offenders — list the 3 files with highest latency or stall count.
- Auto-growth summary — count of events in 24 h, files affected, growth sizes.
- Recommended next steps — ordered action list with companion skill references.
Analyzed by:
sqldiskio-review(Z1–Z15)
Companion Skills
/sqlwait-review— PAGEIOLATCH_SH/EX (V1–V3) and WRITELOG (V6) waits are the live signal of disk I/O pressure that this skill explains at the file level; run both together/sqlmemory-review— low PLE (O1) often manifests as high read latency (Z1) because the buffer pool can't keep the working set in memory; fix memory first, then reassess I/O/sqltrace-review— identifies which queries are generating the most physical I/O, pairing trace patterns with file-level stats/sqlplan-review— scan operators (N17–N19) explain why a single query is driving high read I/O; key lookups (N3) translate to random I/O on data files
VERSION_COMPATIBILITY
See skills/VERSION_COMPATIBILITY.md for the full compatibility matrix.
| Check | 2008 R2 | 2012 | 2014 | 2016 | 2017 | 2019 | 2022 | Azure SQL |
|---|---|---|---|---|---|---|---|---|
| Z1 Data read latency | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z2 Data write latency | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z3 Log write latency | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z4 Hot data file | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z5 Stall ratio | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z6 Data+log same vol | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Managed |
| Z7 TempDB co-location | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | N/A |
| Z8 TempDB log latency | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | N/A |
| Z9 File imbalance | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z10 System drive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | N/A |
| Z11 Auto-grow 24h | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Partial |
| Z12 Growth increment | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z13 Log growth | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Z14 Peak-hour growth | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Partial |
| Z15 Trend worsening | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |