If you’ve been staring at your error log and seeing things like:
SQL Server has encountered 122 occurrence(s) of I/O requests taking longer than 15 seconds to complete...
or
Long Sync IO: Scheduler 5 had 1 Sync IOs in nonpreemptive mode longer than 1000 ms
…you’re not alone. This is SQL Server waving a giant red flag that something’s wrong with your I/O path — and if you’re running Always On Availability Groups (AGs), the plot gets even more interesting.
Let’s break down what’s happening, why it matters, and exactly how to fix it.
📍 Understanding the Messages
Error 833 – The Big Stall
When SQL logs an 833, it means a disk operation (read/write) took more than 15 seconds to finish. This isn’t a slow query — it’s a storage stall. Healthy latency is:
- Data file reads: < 20–30 ms
- Log writes: < 5–10 ms
833 is SQL saying: “Your storage is choking. Fix it now.”
Long Sync IO – The Early Warning
This one logs when a synchronous I/O takes >1 second while SQL is running in its own scheduler. It’s the canary in the coal mine — ignore it, and you’ll probably see 833s next.
💡 Why This Matters Even More in AGs
In synchronous commit mode:
- Primary writes to its log.
- Sends the log block to the secondary.
- Secondary hardens it to its own log file and ACKs.
- Only then does the primary tell the app “commit successful.”
- Separately, the secondary’s redo thread applies it to the data files.
So:
- If secondary log I/O is slow → primary commit latency spikes (
HADR_SYNC_COMMIT,WRITELOGwaits). - If secondary data I/O is slow → redo queue grows, 833s hit the secondary
.mdf.
🛠 Step-by-Step Troubleshooting
1️⃣ Spot the pattern
EXEC xp_readerrorlog 0, 1, '833';
EXEC xp_readerrorlog 0, 1, 'Long Sync IO';
Note the timestamps — match them to AG queues and storage activity.
2️⃣ File-level latency
SELECT DB_NAME(vfs.database_id) AS db,
mf.type_desc AS file_type,
mf.physical_name,
vfs.num_of_reads,
vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads,0) AS avg_read_ms,
vfs.num_of_writes,
vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes,0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL,NULL) vfs
JOIN sys.master_files mf
ON mf.database_id=vfs.database_id AND mf.file_id=vfs.file_id
ORDER BY (vfs.io_stall_read_ms + vfs.io_stall_write_ms) DESC;
Look for:
- Log file avg write > 5–10 ms → log harden problem.
- Data file avg read/write > 20–30 ms → redo/data problem.
3️⃣ Top waits check
SELECT TOP 15 wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE 'SLEEP%' AND wait_type NOT LIKE 'BROKER_%'
ORDER BY wait_time_ms DESC;
PAGEIOLATCH_*→ slow data file reads.WRITELOG→ slow log writes.HADR_SYNC_COMMIT→ synchronous AG commit delay.
4️⃣ AG queues
SELECT ar.replica_server_name, drs.database_id, DB_NAME(drs.database_id) AS db,
drs.is_primary_replica, drs.synchronization_state_desc,
drs.log_send_queue_size, drs.redo_queue_size,
drs.log_send_rate, drs.redo_rate
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar
ON drs.replica_id = ar.replica_id
ORDER BY db, is_primary_replica DESC;
- Log Send Queue high → primary/network issue.
- Redo Queue high → secondary redo/data issue.
5️⃣ Autogrowth check
DECLARE @t nvarchar(260) = (SELECT path FROM sys.traces WHERE is_default = 1);
SELECT TE.name AS event_name, t.StartTime, t.DatabaseName, t.FileName, t.TextData
FROM sys.fn_trace_gettable(@t, DEFAULT) t
JOIN sys.trace_events TE ON t.EventClass = TE.trace_event_id
WHERE t.EventClass IN (92,93)
ORDER BY t.StartTime DESC;
6️⃣ Scheduler pressure
SELECT scheduler_id, current_tasks_count, runnable_tasks_count, work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;
High runnable_tasks_count = workers are piling up behind slow I/O.
7️⃣ Power plan check (yes, it matters)
EXEC xp_cmdshell 'powercfg /GETACTIVESCHEME';
You want High performance, not Balanced.
⚡ Quick Wins (Apply Now)
- Switch to ASYNC commit temporarily if end users are hurting:
ALTER AVAILABILITY GROUP [YourAG]
MODIFY REPLICA ON N'YourSecondary' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [YourAG]
MODIFY REPLICA ON N’YourSecondary’ WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
- Enable Instant File Initialization for data files.
- Pre-size data/log files; set autogrowth in MB (not %).
- Exclude
.mdf/.ndf/.ldffrom antivirus scanning. - Keep log & data on separate disks.
- 64 KB NTFS allocation size for SQL volumes.
- High performance power plan.
🏗 Lasting Fixes
- Upgrade storage tier / raise IOPS/throughput.
- Update firmware/drivers for HBA/RAID/multipath.
- Enable controller write caching (battery-backed).
- Increase RAM to cut PAGEIOLATCH waits.
- Index tuning to reduce physical I/O.
- If redo queue high → move redo/data files to faster storage.
🎯 Optional: Tuning Parallel Redo
Since SQL Server 2016, AG secondaries use parallel redo for speed. But if your waits are dominated by:
SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PARALLEL_REDO%'
ORDER BY wait_time_ms DESC;
…and your redo throughput is poor, you can test disabling it with TF 3459:
-- Disable parallel redo (secondary only)
DBCC TRACEON(3459, -1);
DBCC TRACESTATUS(3459);
To re-enable:
DBCC TRACEOFF(3459, -1);
DBCC TRACESTATUS(3459);
Note: Some older builds need a restart to re-enable. Always patch before testing.
🕵 Watch It Live: Extended Events
CREATE EVENT SESSION [Track-LongSyncIO]
ON SERVER
ADD EVENT sqlserver.database_file_size_change,
ADD EVENT sqlserver.file_read_completed(
ACTION(sqlserver.database_name) WHERE (duration > 1000)),
ADD EVENT sqlserver.file_write_completed(
ACTION(sqlserver.database_name) WHERE (duration > 1000)),
ADD EVENT sqlserver.error_reported(
ACTION(sqlserver.database_name) WHERE (message LIKE '%Long Sync IO%'))
ADD TARGET package0.event_file(SET filename='C:\XE\longsyncio.xel', max_file_size=100, max_rollover_files=5);
ALTER EVENT SESSION [Track-LongSyncIO] ON SERVER STATE=START;
📝 The Bottom Line
- 833 = >15s I/O stall (bad).
- Long Sync IO = >1s I/O stall (warning).
- In AGs, the secondary’s slowness can slow down the primary.
- Always start with DMVs to confirm if it’s log harden or redo/data.
- Apply quick wins immediately, then fix the underlying storage/network.
- Keep parallel redo ON unless you prove it’s the bottleneck.
Discover more from SQLYARD
Subscribe to get the latest posts sent to your email.


