🚨 PART 2 – SQL Server 833 & Long Sync IO in Always On: What It Means, Why It’s Happening, and How to Fix It

If you’ve been staring at your error log and seeing things like:

SQL Server has encountered 122 occurrence(s) of I/O requests taking longer than 15 seconds to complete...

or

Long Sync IO: Scheduler 5 had 1 Sync IOs in nonpreemptive mode longer than 1000 ms

…you’re not alone. This is SQL Server waving a giant red flag that something’s wrong with your I/O path — and if you’re running Always On Availability Groups (AGs), the plot gets even more interesting.

Let’s break down what’s happening, why it matters, and exactly how to fix it.


📍 Understanding the Messages

Error 833 – The Big Stall

When SQL logs an 833, it means a disk operation (read/write) took more than 15 seconds to finish. This isn’t a slow query — it’s a storage stall. Healthy latency is:

  • Data file reads: < 20–30 ms
  • Log writes: < 5–10 ms

833 is SQL saying: “Your storage is choking. Fix it now.”


Long Sync IO – The Early Warning

This one logs when a synchronous I/O takes >1 second while SQL is running in its own scheduler. It’s the canary in the coal mine — ignore it, and you’ll probably see 833s next.


💡 Why This Matters Even More in AGs

In synchronous commit mode:

  1. Primary writes to its log.
  2. Sends the log block to the secondary.
  3. Secondary hardens it to its own log file and ACKs.
  4. Only then does the primary tell the app “commit successful.”
  5. Separately, the secondary’s redo thread applies it to the data files.

So:

  • If secondary log I/O is slow → primary commit latency spikes (HADR_SYNC_COMMIT, WRITELOG waits).
  • If secondary data I/O is slow → redo queue grows, 833s hit the secondary .mdf.

🛠 Step-by-Step Troubleshooting

1️⃣ Spot the pattern

EXEC xp_readerrorlog 0, 1, '833';
EXEC xp_readerrorlog 0, 1, 'Long Sync IO';

Note the timestamps — match them to AG queues and storage activity.

2️⃣ File-level latency

SELECT DB_NAME(vfs.database_id) AS db,
       mf.type_desc AS file_type,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads,0)  AS avg_read_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes,0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL,NULL) vfs
JOIN sys.master_files mf
  ON mf.database_id=vfs.database_id AND mf.file_id=vfs.file_id
ORDER BY (vfs.io_stall_read_ms + vfs.io_stall_write_ms) DESC;

Look for:

  • Log file avg write > 5–10 ms → log harden problem.
  • Data file avg read/write > 20–30 ms → redo/data problem.

3️⃣ Top waits check

SELECT TOP 15 wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE 'SLEEP%' AND wait_type NOT LIKE 'BROKER_%'
ORDER BY wait_time_ms DESC;
  • PAGEIOLATCH_* → slow data file reads.
  • WRITELOG → slow log writes.
  • HADR_SYNC_COMMIT → synchronous AG commit delay.

4️⃣ AG queues

SELECT ar.replica_server_name, drs.database_id, DB_NAME(drs.database_id) AS db,
       drs.is_primary_replica, drs.synchronization_state_desc,
       drs.log_send_queue_size, drs.redo_queue_size,
       drs.log_send_rate, drs.redo_rate
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar 
     ON drs.replica_id = ar.replica_id
ORDER BY db, is_primary_replica DESC;
  • Log Send Queue high → primary/network issue.
  • Redo Queue high → secondary redo/data issue.

5️⃣ Autogrowth check

DECLARE @t nvarchar(260) = (SELECT path FROM sys.traces WHERE is_default = 1);
SELECT TE.name AS event_name, t.StartTime, t.DatabaseName, t.FileName, t.TextData
FROM sys.fn_trace_gettable(@t, DEFAULT) t
JOIN sys.trace_events TE ON t.EventClass = TE.trace_event_id
WHERE t.EventClass IN (92,93)
ORDER BY t.StartTime DESC;

6️⃣ Scheduler pressure

SELECT scheduler_id, current_tasks_count, runnable_tasks_count, work_queue_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255;

High runnable_tasks_count = workers are piling up behind slow I/O.

7️⃣ Power plan check (yes, it matters)

EXEC xp_cmdshell 'powercfg /GETACTIVESCHEME';

You want High performance, not Balanced.

⚡ Quick Wins (Apply Now)

  • Switch to ASYNC commit temporarily if end users are hurting:
ALTER AVAILABILITY GROUP [YourAG]
MODIFY REPLICA ON N'YourSecondary' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

ALTER AVAILABILITY GROUP [YourAG]
MODIFY REPLICA ON N’YourSecondary’ WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

  • Enable Instant File Initialization for data files.
  • Pre-size data/log files; set autogrowth in MB (not %).
  • Exclude .mdf/.ndf/.ldf from antivirus scanning.
  • Keep log & data on separate disks.
  • 64 KB NTFS allocation size for SQL volumes.
  • High performance power plan.

🏗 Lasting Fixes

  • Upgrade storage tier / raise IOPS/throughput.
  • Update firmware/drivers for HBA/RAID/multipath.
  • Enable controller write caching (battery-backed).
  • Increase RAM to cut PAGEIOLATCH waits.
  • Index tuning to reduce physical I/O.
  • If redo queue high → move redo/data files to faster storage.

🎯 Optional: Tuning Parallel Redo

Since SQL Server 2016, AG secondaries use parallel redo for speed. But if your waits are dominated by:

SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PARALLEL_REDO%' 
ORDER BY wait_time_ms DESC;

…and your redo throughput is poor, you can test disabling it with TF 3459:

-- Disable parallel redo (secondary only)
DBCC TRACEON(3459, -1);
DBCC TRACESTATUS(3459);

To re-enable:

DBCC TRACEOFF(3459, -1);
DBCC TRACESTATUS(3459);

Note: Some older builds need a restart to re-enable. Always patch before testing.

🕵 Watch It Live: Extended Events

CREATE EVENT SESSION [Track-LongSyncIO]
ON SERVER
ADD EVENT sqlserver.database_file_size_change,
ADD EVENT sqlserver.file_read_completed(
    ACTION(sqlserver.database_name) WHERE (duration > 1000)),
ADD EVENT sqlserver.file_write_completed(
    ACTION(sqlserver.database_name) WHERE (duration > 1000)),
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.database_name) WHERE (message LIKE '%Long Sync IO%'))
ADD TARGET package0.event_file(SET filename='C:\XE\longsyncio.xel', max_file_size=100, max_rollover_files=5);
ALTER EVENT SESSION [Track-LongSyncIO] ON SERVER STATE=START;

📝 The Bottom Line

  • 833 = >15s I/O stall (bad).
  • Long Sync IO = >1s I/O stall (warning).
  • In AGs, the secondary’s slowness can slow down the primary.
  • Always start with DMVs to confirm if it’s log harden or redo/data.
  • Apply quick wins immediately, then fix the underlying storage/network.
  • Keep parallel redo ON unless you prove it’s the bottleneck.


Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLYARD

Subscribe now to keep reading and get access to the full archive.

Continue reading