SQL Server system_health Alerts: Deadlock Capture, Long Waits, I/O Warnings, and More – SQLYARD

SQL Server system_health Alerts: Deadlock Capture, Long Waits, I/O Warnings, and More


Compatibility: SQL Server 2012 and later. The system_health session has been running on every SQL Server instance since 2012 and already captures deadlock events. No new Extended Events session required. All scripts deploy into a DBA utility database.

Companion article: This article focuses on system_health-driven alerts and structured deadlock capture. For broad coverage of severity-level and error-specific Agent alerts (825, 9002, severities 16–25), see SQL Server Severity Alerts: How to Set Up Proactive Error Notifications. The two systems are designed to work together — deploy both for complete coverage.

What system_health Is and Why You Should Use It

Every SQL Server instance since 2012 runs a built-in Extended Events session called system_health. It is always on, costs virtually nothing, and has been quietly collecting data about your server since the day it was installed. Most DBAs are aware of it, but few take full advantage of it.

According to Microsoft’s documentation, system_health captures all of the following automatically:

  • All deadlocks — including the full deadlock graph as XML
  • Sessions waiting on latches for more than 15 seconds — with callstack, sql_text, and session_id
  • Sessions waiting on locks for more than 30 seconds — with callstack, sql_text, and session_id
  • Sessions in long preemptive waits — where SQL Server is waiting on external API calls
  • CLR allocation and virtual allocation failures
  • Memory broker and out-of-memory events
  • Security and connectivity events
  • Scheduler monitor and instance health data from sp_server_diagnostics

If you are not using system_health as part of your daily troubleshooting toolkit, start today. It is on every server you manage right now, collecting data that answers questions you have not even asked yet. The deadlock capture system in this article is built entirely on top of it — no additional session, no additional overhead.
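A quick way to see which of these event categories your own instance has actually recorded is to inventory the file target directly. This is an ad-hoc check, not part of the deployment below:

-- Count events by type currently held in the system_health file target
SELECT
    object_name,
    COUNT(*) AS event_count
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
GROUP BY object_name
ORDER BY event_count DESC;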

File target vs ring buffer: system_health writes to both a ring buffer (in memory) and a file target (.xel files in the SQL log directory). On SQL Server 2022, the ring buffer target returns no rows for xml_deadlock_report events despite them being fully present in the file target. This is a known issue with how the ring buffer serializes events on newer versions. All scripts in this article use the file target exclusively via sys.fn_xe_file_target_read_file.
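If you want to confirm where the system_health .xel files live on a given server, the session's current file can be read from the Extended Events DMVs. This is a quick sanity check only; the capture scripts do not need it, because the wildcard pattern resolves the path automatically:

-- Show the current system_health event file on disk
SELECT CAST(t.target_data AS XML).value('(EventFileTarget/File/@name)[1]', 'NVARCHAR(260)')
       AS current_xel_file
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_targets t
  ON s.address = t.event_session_address
WHERE s.name        = 'system_health'
  AND t.target_name = 'event_file';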

The Approach: Purely Event-Driven, No Polling

The goal is to capture every deadlock into a permanent table and send an immediate email notification — without adding any scheduled polling to the server. Nothing runs between deadlocks. The system sits completely idle until SQL Server itself fires the trigger.

The complete deployment has exactly four components:

  • dbo.DeadlockEvents: permanent landing table that stores every captured deadlock with victim and survivor detail, query text, and the full deadlock graph as XML
  • dbo.usp_CaptureDeadlock: reads system_health for new deadlock events, parses victim and survivor from the XML, inserts new rows, and sends an HTML email notification
  • DBA_CaptureDeadlock: unscheduled Agent job that calls the procedure; it has no schedule and only runs when fired by the alert
  • Deadlock_Alert: SQL Server Agent alert on error message 1205; fires the job the instant a deadlock victim is chosen

Here is the exact chain of events every time a deadlock occurs:

1. SQL Server’s deadlock monitor picks a victim and raises error 1205 (which reaches the log because we configured is_event_logged = 1 on message 1205).
2. Error 1205 is written to the SQL Server Error Log.
3. The Deadlock_Alert agent alert fires on message 1205 and starts the DBA_CaptureDeadlock job.
4. The job calls usp_CaptureDeadlock, which reads system_health’s .xel file for new deadlock events.
5. The procedure parses victim and survivor details, inserts a row into DeadlockEvents, and emails the DBA team.
6. Nothing runs again until the next error 1205 — the system goes back to sleep until the next deadlock.

Prerequisites

  • A DBA utility database to hold the table, procedure, and Agent job (create one if it does not exist)
  • Database Mail configured with a valid mail profile
  • SQL Server Agent running
  • Error 1205 must be configured to write to the Error Log — Step 1 of the install script checks and fixes this automatically

If is_event_logged = 0 on message 1205, the Agent alert will never fire because the error never reaches the Error Log. The install script checks this first and corrects it if needed. You cannot skip this step.
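You can verify the current setting yourself before running the install script:

-- Check whether message 1205 is currently written to the Error Log
SELECT message_id, language_id, is_event_logged
FROM sys.messages
WHERE message_id  = 1205
  AND language_id = 1033;   -- 1033 = English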

Install Script: All Four Components

Run all steps in sequence in your DBA utility database. Before running, update two values in the procedure: your Database Mail profile name and your DBA team email address.

Step 1: Confirm Error 1205 Is Logged to the Error Log

This is the prerequisite that makes the entire chain work. If 1205 is not logged, the alert cannot fire.

IF NOT EXISTS
(
    SELECT 1 FROM sys.messages
    WHERE message_id   = 1205
    AND   language_id  = 1033
    AND   is_event_logged = 1
)
BEGIN
    EXEC master.sys.sp_altermessage
         @message_id      = 1205,
         @parameter       = 'WITH_LOG',
         @parameter_value = 'true';
    PRINT 'Error 1205 is now set to log to the Error Log.';
END
ELSE
    PRINT 'Error 1205 already logs to the Error Log. No change needed.';
GO
Step 2: Create the Deadlock Landing Table

Stores each deadlock event with both victim and survivor details, the query text from each side, and the complete deadlock graph as XML for deep analysis.

USE [DBA];
GO

IF OBJECT_ID('dbo.DeadlockEvents', 'U') IS NOT NULL
    DROP TABLE dbo.DeadlockEvents;
GO

CREATE TABLE dbo.DeadlockEvents
(
    DeadlockID         INT            IDENTITY(1,1) NOT NULL,
    EventTime          DATETIME2(3)   NOT NULL,
    DatabaseName       NVARCHAR(128)  NULL,

    -- Victim details (the session SQL Server killed)
    VictimSPID         INT            NULL,
    VictimLogin        NVARCHAR(128)  NULL,
    VictimHost         NVARCHAR(128)  NULL,
    VictimApp          NVARCHAR(256)  NULL,
    VictimStatement    NVARCHAR(MAX)  NULL,
    VictimLockMode     NVARCHAR(20)   NULL,
    VictimWaitResource NVARCHAR(256)  NULL,

    -- Survivor details (the session that continued)
    SurvivorSPID       INT            NULL,
    SurvivorLogin      NVARCHAR(128)  NULL,
    SurvivorHost       NVARCHAR(128)  NULL,
    SurvivorApp        NVARCHAR(256)  NULL,
    SurvivorStatement  NVARCHAR(MAX)  NULL,
    SurvivorLockMode   NVARCHAR(20)   NULL,

    -- Full deadlock graph for deep analysis
    DeadlockGraph      XML            NOT NULL,

    DateRecorded       DATETIME2(3)   NOT NULL
        CONSTRAINT df_DeadlockEvents_DateRecorded DEFAULT (SYSDATETIME()),

    CONSTRAINT pkc_DeadlockEvents_DeadlockID
        PRIMARY KEY CLUSTERED (DeadlockID)
);
GO
Step 3: Create the Capture Procedure

This procedure does all the real work. It uses the most recent EventTime in the table as a cutoff — anything newer than that in system_health is a new event to process. It parses victim and survivor from the deadlock XML, inserts the rows, and sends a concise HTML email with a frequency line so you can tell immediately whether this is a one-off or part of a larger pattern.

Update these two values before creating the procedure:

  • 'SQLMailProfile' — change to your actual Database Mail profile name
  • 'DBATeam@YourCompany.com' — change to your DBA team email address
USE [DBA];
GO

CREATE OR ALTER PROCEDURE dbo.usp_CaptureDeadlock
AS
SET NOCOUNT ON;
/*
    usp_CaptureDeadlock
    -------------------
    Reads system_health's .xel file target for xml_deadlock_report events
    newer than the most recent row in dbo.DeadlockEvents.
    Parses victim and survivor detail, inserts new rows, emails a notification.

    Called by DBA_CaptureDeadlock Agent job, which is fired by Deadlock_Alert
    on SQL Server error 1205. Nothing polls this procedure -- it runs only
    when a deadlock actually occurs.
*/
BEGIN
    -- Brief pause to allow system_health to flush the new event to the file target
    WAITFOR DELAY '00:00:05';

    DECLARE
        @LastEventTime DATETIME2(3),
        @Subject       NVARCHAR(255),
        @Body          NVARCHAR(MAX),
        @FrequencyLine NVARCHAR(200),
        @CountLastHour INT,
        @CountToday    INT;

    -- Use the most recent captured event as the cutoff for what is new
    SELECT @LastEventTime = ISNULL(MAX(EventTime), '19000101')
    FROM dbo.DeadlockEvents;

    -- -------------------------------------------------------
    -- Read new deadlock events from system_health file target
    -- system_health*.xel resolves to the instance log directory
    -- automatically -- no hardcoded paths, works on any server
    -- -------------------------------------------------------
    ;WITH FileDeadlocks AS
    (
        SELECT CAST(event_data AS XML) AS EventXML
        FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
        WHERE object_name = 'xml_deadlock_report'
    ),
    Parsed AS
    (
        SELECT
            DATEADD(MINUTE,
                DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()),
                EventXML.value('(/event/@timestamp)[1]', 'DATETIME2(3)')) AS EventTime,
            EventXML AS FullEvent,
            EventXML.value(
                '(/event/data[@name="xml_report"]/value/deadlock/victim-list/victimProcess/@id)[1]',
                'NVARCHAR(50)') AS VictimID
        FROM FileDeadlocks
    )
    INSERT dbo.DeadlockEvents
    (
        EventTime,      DatabaseName,
        VictimSPID,     VictimLogin,     VictimHost,     VictimApp,
        VictimStatement, VictimLockMode, VictimWaitResource,
        SurvivorSPID,   SurvivorLogin,   SurvivorHost,   SurvivorApp,
        SurvivorStatement, SurvivorLockMode,
        DeadlockGraph
    )
    SELECT
        p.EventTime,
        v.n.value('@currentdbname',  'NVARCHAR(128)'),
        v.n.value('@spid',           'INT'),
        v.n.value('@loginname',      'NVARCHAR(128)'),
        v.n.value('@hostname',       'NVARCHAR(128)'),
        v.n.value('@clientapp',      'NVARCHAR(256)'),
        v.n.value('(inputbuf)[1]',   'NVARCHAR(MAX)'),
        v.n.value('@lockMode',       'NVARCHAR(20)'),
        v.n.value('@waitresource',   'NVARCHAR(256)'),
        s.n.value('@spid',           'INT'),
        s.n.value('@loginname',      'NVARCHAR(128)'),
        s.n.value('@hostname',       'NVARCHAR(128)'),
        s.n.value('@clientapp',      'NVARCHAR(256)'),
        s.n.value('(inputbuf)[1]',   'NVARCHAR(MAX)'),
        s.n.value('@lockMode',       'NVARCHAR(20)'),
        p.FullEvent
    FROM Parsed p
    CROSS APPLY p.FullEvent.nodes(
        '/event/data[@name="xml_report"]/value/deadlock/process-list/process[@id=sql:column("p.VictimID")]'
    ) v(n)
    CROSS APPLY p.FullEvent.nodes(
        '/event/data[@name="xml_report"]/value/deadlock/process-list/process[not(@id=sql:column("p.VictimID"))]'
    ) s(n)
    WHERE p.EventTime > @LastEventTime;

    -- Nothing new to report -- exit cleanly
    IF @@ROWCOUNT = 0 RETURN;

    -- -------------------------------------------------------
    -- Build frequency context for the email
    -- Tells the reader at a glance if this is isolated or a pattern
    -- -------------------------------------------------------
    SELECT
        @CountLastHour = SUM(CASE WHEN EventTime >= DATEADD(HOUR, -1,  SYSDATETIME()) THEN 1 ELSE 0 END),
        @CountToday    = SUM(CASE WHEN EventTime >= CAST(CAST(SYSDATETIME() AS DATE) AS DATETIME2(3)) THEN 1 ELSE 0 END)
    FROM dbo.DeadlockEvents;

    SET @FrequencyLine =
          CAST(@CountLastHour AS NVARCHAR(10)) + ' in the last hour, '
        + CAST(@CountToday    AS NVARCHAR(10)) + ' today.';

    -- -------------------------------------------------------
    -- Build and send the HTML email notification
    -- -------------------------------------------------------
    SET @Body =
          N'<html><body style="font-family:Segoe UI,Arial,sans-serif;font-size:10pt;">'
        + N'<h3>Deadlock captured on ' + @@SERVERNAME + N'</h3>'
        + N'<p><b>Frequency:</b> ' + @FrequencyLine + N'</p>';

    -- Append one section per newly captured deadlock, newest first
    SELECT @Body = @Body
        + N'<hr/><p>'
        + N'<b>Event time:</b> '  + CONVERT(NVARCHAR(30), EventTime, 120) + N'<br/>'
        + N'<b>Database:</b> '    + ISNULL(DatabaseName, '-') + N'<br/>'
        + N'<b>Victim SPID:</b> ' + ISNULL(CAST(VictimSPID AS NVARCHAR(10)), '-')
        + N' (' + ISNULL(VictimLogin, '-') + N' / ' + ISNULL(VictimHost, '-') + N')<br/>'
        + N'<b>Victim statement:</b></p>'
        + N'<pre>' + ISNULL(VictimStatement, '-') + N'</pre>'
        + N'<p><b>Survivor SPID:</b> ' + ISNULL(CAST(SurvivorSPID AS NVARCHAR(10)), '-')
        + N' (' + ISNULL(SurvivorLogin, '-') + N')<br/>'
        + N'<b>Survivor statement:</b></p>'
        + N'<pre>' + ISNULL(SurvivorStatement, '-') + N'</pre>'
    FROM dbo.DeadlockEvents
    WHERE EventTime > @LastEventTime
    ORDER BY EventTime DESC;

    SET @Body = @Body
        + N'<p>Full deadlock graph stored in DBA.dbo.DeadlockEvents.</p>'
        + N'</body></html>';

    SET @Subject = @@SERVERNAME + N' - Deadlock captured';

    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = 'SQLMailProfile',           -- UPDATE THIS
         @recipients   = 'DBATeam@YourCompany.com',  -- UPDATE THIS
         @subject      = @Subject,
         @body         = @Body,
         @body_format  = 'HTML',
         @importance   = 'High';
END;
GO
Step 4: Create the Unscheduled Agent Job

This job has no schedule. It sits idle until fired by the alert in Step 5. The output file is written to the SQL log directory alongside the system_health files for easy access.

USE [msdb];
GO

-- Clean up if a previous version exists
IF EXISTS (SELECT 1 FROM msdb.dbo.sysjobs WHERE name = N'DBA_CaptureDeadlock')
    EXEC msdb.dbo.sp_delete_job @job_name = N'DBA_CaptureDeadlock';
GO

DECLARE @jobId   BINARY(16);
DECLARE @LogPath VARCHAR(255);

-- Build output log path in the SQL log directory
SELECT @LogPath = REPLACE(CONVERT(VARCHAR(1000), SERVERPROPERTY('ErrorLogFileName')), '\ERRORLOG', '\');
SET @LogPath = @LogPath + 'DBA_CaptureDeadlock_$(ESCAPE_SQUOTE(STRTDT)).txt';

EXEC msdb.dbo.sp_add_job
     @job_name         = N'DBA_CaptureDeadlock',
     @enabled          = 0,                  -- enabled in Step 6
     @description      = N'Fired by Deadlock_Alert on error 1205. Reads system_health for new deadlock events, logs to DBA.dbo.DeadlockEvents, and emails a summary. Not scheduled.',
     @category_name    = N'[Uncategorized (Local)]',
     @owner_login_name = N'sa',
     @job_id           = @jobId OUTPUT;

EXEC msdb.dbo.sp_add_jobstep
     @job_id            = @jobId,
     @step_name         = N'Capture',
     @step_id           = 1,
     @subsystem         = N'TSQL',
     @command           = N'EXEC dbo.usp_CaptureDeadlock;',
     @database_name     = N'DBA',
     @output_file_name  = @LogPath,
     @on_success_action = 1,   -- Quit with success
     @on_fail_action    = 2;   -- Quit with failure

EXEC msdb.dbo.sp_add_jobserver
     @job_id     = @jobId,
     @server_name = N'(local)';
GO
Step 5: Create the Alert on Error 1205

SQL Server only allows one alert per message_id. The script removes any pre-existing alert on 1205 before creating this one — regardless of what it was named.

USE [msdb];
GO

-- Remove any existing alert on message 1205 (regardless of name)
DECLARE @ExistingAlert NVARCHAR(128);
SELECT @ExistingAlert = name FROM msdb.dbo.sysalerts WHERE message_id = 1205;

IF @ExistingAlert IS NOT NULL
    EXEC msdb.dbo.sp_delete_alert @name = @ExistingAlert;
GO

EXEC msdb.dbo.sp_add_alert
     @name                         = N'Deadlock_Alert',
     @message_id                   = 1205,
     @severity                     = 0,
     @enabled                      = 0,          -- enabled in Step 6
     @delay_between_responses      = 0,
     @include_event_description_in = 0,
     @job_name                     = N'DBA_CaptureDeadlock';
GO
Step 6: Enable the Alert and Job

EXEC msdb.dbo.sp_update_job   @job_name = N'DBA_CaptureDeadlock', @enabled = 1;
EXEC msdb.dbo.sp_update_alert @name     = N'Deadlock_Alert',      @enabled = 1;
GO

Why We Use the File Target, Not the Ring Buffer

Many published examples — including some in Microsoft’s own documentation — query system_health’s ring buffer target to retrieve deadlock events. The query looks clean and simple. The problem is that on SQL Server 2022, the ring buffer returns no rows for xml_deadlock_report events, even when those events are clearly visible in the file target.

This is a known issue with how the ring buffer serializes certain event types on newer versions. All procedures in this article use sys.fn_xe_file_target_read_file with the system_health*.xel pattern, which resolves relative to the instance log directory automatically. No path hardcoding — it works on any server regardless of where SQL Server was installed.

-- This is what the procedure uses -- portable, no hardcoded paths
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
WHERE object_name = 'xml_deadlock_report'

-- Avoid this pattern -- ring buffer does not work reliably on SQL Server 2022
-- SELECT CAST(target_data AS XML)
-- FROM sys.dm_xe_session_targets
-- WHERE target_name = 'ring_buffer'

Testing the Setup

Create a small test table and reproduce a classic deadlock in two query windows. One session will be chosen as the victim and receive error 1205.

Create the Test Table

USE DBA;
IF OBJECT_ID('dbo.DeadlockTest', 'U') IS NOT NULL
    DROP TABLE dbo.DeadlockTest;

CREATE TABLE dbo.DeadlockTest (ID INT PRIMARY KEY, Val INT);
INSERT INTO dbo.DeadlockTest (ID, Val) VALUES (1, 0), (2, 0);
GO

Session 1 — Run First

USE DBA;
BEGIN TRAN;
  UPDATE dbo.DeadlockTest WITH (HOLDLOCK) SET Val = 1 WHERE ID = 1;
  WAITFOR DELAY '00:00:05';
  UPDATE dbo.DeadlockTest SET Val = 1 WHERE ID = 2;
COMMIT;

Session 2 — Run Within the 5-Second Window

USE DBA;
BEGIN TRAN;
  UPDATE dbo.DeadlockTest WITH (HOLDLOCK) SET Val = 2 WHERE ID = 2;
  WAITFOR DELAY '00:00:05';
  UPDATE dbo.DeadlockTest SET Val = 2 WHERE ID = 1;
COMMIT;

One session will be killed with the familiar message:

Msg 1205, Level 13, State 51, Line 4
Transaction (Process ID 79) was deadlocked on lock resources with another process
and has been chosen as the deadlock victim. Rerun the transaction.

Verifying It Worked

-- Did the alert fire?
SELECT name, last_occurrence_date, last_occurrence_time, occurrence_count
FROM msdb.dbo.sysalerts
WHERE name = 'Deadlock_Alert';

-- Did the job run successfully?
SELECT TOP 5
    h.run_date, h.run_time, h.run_status,
    h.message
FROM msdb.dbo.sysjobhistory h
JOIN msdb.dbo.sysjobs       j ON h.job_id = j.job_id
WHERE j.name = 'DBA_CaptureDeadlock'
ORDER BY h.run_date DESC, h.run_time DESC;

-- Was the deadlock captured in our table?
SELECT TOP 5 *
FROM DBA.dbo.DeadlockEvents
ORDER BY EventTime DESC;

Querying the Deadlock Log

Use these queries for ongoing monitoring and trend analysis from the permanent log table:

-- Deadlocks by day — spot frequency trends
SELECT
    CAST(EventTime AS DATE)  AS deadlock_date,
    COUNT(*)                 AS deadlock_count
FROM DBA.dbo.DeadlockEvents
WHERE EventTime >= DATEADD(DAY, -30, GETDATE())
GROUP BY CAST(EventTime AS DATE)
ORDER BY deadlock_date DESC;

-- Most frequently victimized logins
SELECT
    VictimLogin,
    COUNT(*)         AS times_victim,
    MIN(EventTime)   AS first_seen,
    MAX(EventTime)   AS last_seen
FROM DBA.dbo.DeadlockEvents
GROUP BY VictimLogin
ORDER BY times_victim DESC;

-- Most common databases involved
SELECT
    DatabaseName,
    COUNT(*) AS deadlock_count
FROM DBA.dbo.DeadlockEvents
GROUP BY DatabaseName
ORDER BY deadlock_count DESC;

-- Retrieve the full XML graph for a specific event (for detailed analysis in SSMS)
SELECT DeadlockGraph
FROM DBA.dbo.DeadlockEvents
WHERE DeadlockID = 1;
-- Paste the XML into a .xdl file and open in SSMS for the graphical deadlock view
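If you prefer to script the export rather than copy-paste from SSMS, one option is bcp from a command prompt. This is a sketch: the server name, output path, and authentication switch (-T for Windows auth) are placeholders to adjust for your environment:

-- Run from a command prompt, not SSMS (single line):
-- bcp "SELECT DeadlockGraph FROM DBA.dbo.DeadlockEvents WHERE DeadlockID = 1" queryout "C:\Temp\Deadlock_1.xdl" -S YourServer -T -c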

Going Further: Other system_health Alerts You Can Add

The deadlock capture system above uses exactly the same pattern — Agent alert on an error message fires an Agent job that reads system_health. That same architecture works for several other categories of events that system_health already captures. Here are the most useful ones to add.

Long Lock Waits (system_health captures sessions waiting on locks > 30 seconds)

Event: wait_info  ·  Query from system_health file target

System_health captures the callstack, sql_text, and session_id for any session that has waited on a lock resource for more than 30 seconds. You cannot hang an Agent alert directly off this event, but you can query it on demand or via a scheduled job:

-- Query system_health for long lock wait events
SELECT
    DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()),
        CAST(event_data AS XML).value('(/event/@timestamp)[1]', 'DATETIME2(3)'))
        AS EventTime,
    CAST(event_data AS XML).value('(/event/data[@name="wait_type"]/text)[1]',  'NVARCHAR(100)') AS wait_type,
    CAST(event_data AS XML).value('(/event/data[@name="duration"]/value)[1]',  'BIGINT')        AS duration_ms,
    CAST(event_data AS XML).value('(/event/action[@name="sql_text"]/value)[1]','NVARCHAR(MAX)')  AS sql_text,
    CAST(event_data AS XML).value('(/event/action[@name="session_id"]/value)[1]','INT')          AS session_id
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
WHERE object_name = 'wait_info'
AND   CAST(event_data AS XML).value('(/event/data[@name="duration"]/value)[1]', 'BIGINT') > 30000
ORDER BY EventTime DESC;
-- Add a predicate on the wait_type column (LIKE 'LCK%') if you want lock waits only

Error 825 — Read-Retry Required (I/O Warning)

Agent Alert on message_id = 825  ·  SQL Server 2005+

Error 825 fires when SQL Server had to retry a page read due to an I/O error. A single occurrence can be a transient storage blip. Multiple occurrences in a short period indicate storage problems that need investigation before they progress to corruption. Set up the alert exactly the same way as the deadlock alert — on message_id 825:
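When an 825 alert arrives, a good first stop is msdb's suspect_pages table, which SQL Server maintains automatically for pages that have hit read errors:

-- Pages that have already encountered I/O errors on this instance
SELECT database_id, file_id, page_id,
       event_type,      -- 1-3 indicate active errors; 4-5 indicate restored/repaired
       error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;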

-- Ensure 825 is logged to the Error Log
EXEC master.sys.sp_altermessage
     @message_id = 825, @parameter = 'WITH_LOG', @parameter_value = 'true';

-- Create the alert
EXEC msdb.dbo.sp_add_alert
     @name                    = N'IO_ReadRetry_825',
     @message_id              = 825,
     @severity                = 0,
     @enabled                 = 1,
     @delay_between_responses = 300,   -- 5 minutes between repeated alerts
     @include_event_description_in = 1;

EXEC msdb.dbo.sp_add_notification
     @alert_name          = N'IO_ReadRetry_825',
     @operator_name       = N'DBA Team',   -- your operator name
     @notification_method = 7;

Error 9002 — Transaction Log Full

Agent Alert on message_id = 9002  ·  SQL Server 2005+

When the transaction log fills completely, all writes to that database stop immediately. Error 9002 fires at that moment. An immediate alert gives the DBA time to respond before application errors cascade. The response is typically to grow the log file, back up the log, or investigate what is holding an open transaction.
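When a 9002 alert fires, the first diagnostic question is why the log cannot truncate. Two quick checks (the database name below is a placeholder):

-- Why is each database's log unable to truncate?
SELECT name, log_reuse_wait_desc
FROM sys.databases
ORDER BY name;

-- Is a long-running open transaction pinning the log? (run in the affected database)
DBCC OPENTRAN ('YourDatabase');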

EXEC msdb.dbo.sp_add_alert
     @name                         = N'TransactionLog_Full_9002',
     @message_id                   = 9002,
     @severity                     = 0,
     @enabled                      = 1,
     @delay_between_responses      = 60,
     @include_event_description_in = 5;  -- 5 = include in both email and net send

EXEC msdb.dbo.sp_add_notification
     @alert_name          = N'TransactionLog_Full_9002',
     @operator_name       = N'DBA Team',
     @notification_method = 7;

Severity 22–25 — Corruption and System Errors

Agent Alerts on severity 22, 23, 24, 25  ·  SQL Server 2005+

Severities 22 and 23 indicate table or database corruption. Severity 24 indicates a hardware problem. Severity 25 is a fatal system error that may result in SQL Server shutting down. These should always be alerted on with zero delay between responses:

-- Severity 22: table or index damaged
EXEC msdb.dbo.sp_add_alert @name = N'Severity 22',
    @severity = 22, @enabled = 1, @delay_between_responses = 0,
    @include_event_description_in = 1;
EXEC msdb.dbo.sp_add_notification @alert_name = N'Severity 22',
    @operator_name = N'DBA Team', @notification_method = 7;

-- Severity 23: suspect database
EXEC msdb.dbo.sp_add_alert @name = N'Severity 23',
    @severity = 23, @enabled = 1, @delay_between_responses = 0,
    @include_event_description_in = 1;
EXEC msdb.dbo.sp_add_notification @alert_name = N'Severity 23',
    @operator_name = N'DBA Team', @notification_method = 7;

-- Severity 24: hardware problem
EXEC msdb.dbo.sp_add_alert @name = N'Severity 24',
    @severity = 24, @enabled = 1, @delay_between_responses = 0,
    @include_event_description_in = 1;
EXEC msdb.dbo.sp_add_notification @alert_name = N'Severity 24',
    @operator_name = N'DBA Team', @notification_method = 7;

-- Severity 25: fatal system error
EXEC msdb.dbo.sp_add_alert @name = N'Severity 25',
    @severity = 25, @enabled = 1, @delay_between_responses = 0,
    @include_event_description_in = 1;
EXEC msdb.dbo.sp_add_notification @alert_name = N'Severity 25',
    @operator_name = N'DBA Team', @notification_method = 7;

Long Latch Waits (system_health captures sessions waiting on latches > 15 seconds)

Event: latch_suspend_end  ·  Query from system_health file target

System_health also captures sessions that have waited on latches for more than 15 seconds. Latch waits this long usually indicate I/O pressure, NUMA imbalance, or hot pages. Query the file target on demand:

-- Query system_health for long latch wait events
SELECT
    DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()),
        CAST(event_data AS XML).value('(/event/@timestamp)[1]', 'DATETIME2(3)'))
        AS EventTime,
    CAST(event_data AS XML).value('(/event/data[@name="class"]/text)[1]',        'NVARCHAR(60)')  AS latch_class,
    CAST(event_data AS XML).value('(/event/data[@name="duration"]/value)[1]',    'BIGINT')        AS duration_ms,
    CAST(event_data AS XML).value('(/event/action[@name="sql_text"]/value)[1]',  'NVARCHAR(MAX)') AS sql_text,
    CAST(event_data AS XML).value('(/event/action[@name="session_id"]/value)[1]','INT')           AS session_id
FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
WHERE object_name = 'latch_suspend_end'
AND   CAST(event_data AS XML).value('(/event/data[@name="duration"]/value)[1]', 'BIGINT') > 15000
ORDER BY EventTime DESC;

The pattern is always the same: if an error message number exists for the event, create a SQL Server Agent alert on that message_id. If there is no error message — like long latch waits — query the system_health file target directly. The data is already being collected. You just need to read it.

References

This implementation is based on an approach published by sqlfingers.com. The additional alert patterns, system_health event inventory, query log analysis, and SQLYARD formatting were added by David Yard.

