SQL Server Error Log Messages You Can’t Ignore: Service Broker, Endpoints, and TDS 17836

Environment: Always On availability groups

Summary

You may see these entries in the SQL Server error log:

The Service Broker endpoint is in disabled or stopped state.
Error: 9642, Severity: 16, State: 3.
Service Broker/Database Mirroring transport connection endpoint. Error: 8474.
Error: 17836, Severity: 20, State: 17.
Length specified in network packet payload did not match number of bytes read; the connection has been closed.

Short version: a few lines at startup or during a planned failover are normal. Repeating messages mean something keeps trying and failing to connect to the Database Mirroring endpoint that AG uses, or a client is sending bad or truncated TDS packets. Do not ignore repeated entries.

The guide below explains what each message means, likely causes, how to prove the cause, and how to fix it. It also covers two related messages that often appear alongside these:

  • TDE log encryption error that auto-recovered
  • FlushCache long write

Symptoms and sample messages

Service Broker endpoint stopped

The Service Broker endpoint is in disabled or stopped state.

Transport errors

Error: 9642, Severity: 16, State: 3.
An error occurred in Service Broker/Database Mirroring transport connection endpoint. Error: 8474.

TDS protocol error

Error: 17836, Severity: 20, State: 17.
Length specified in network packet payload did not match number of bytes read; the connection has been closed.
Error: 17836, Severity: 20, State: 17.
Length specified in network packet payload did not match number of bytes read; the connection has been closed.

Other messages you may also see

An error occurred while processing log encryption. The process was recovered automatically. No user action is required.
FlushCache: cleaned up 1186 bufs with 1082 writes in 65523 ms (avoided 1 new dirty bufs) for db 6:0

What each one means

1) “Service Broker endpoint is in disabled or stopped state”

Availability Groups and old database mirroring both use the Database Mirroring endpoint for their wire protocol. If that endpoint is stopped or blocked, partners cannot connect. You will also see 9642 and 8474 when something hits an endpoint that is down or unreachable.

Common causes

  • The mirroring endpoint is stopped or disabled.
  • The SQL Server service account does not have CONNECT on the endpoint.
  • Port blocked by firewall or network device.
  • Wrong endpoint URL or port on one of the replicas.
  • TLS or certificate problems if you use certificate based endpoints.

How to check

-- Endpoint state and port (the port lives in sys.tcp_endpoints)
SELECT dme.name, dme.state_desc, dme.role_desc, te.port
FROM sys.database_mirroring_endpoints AS dme
JOIN sys.tcp_endpoints AS te
  ON dme.endpoint_id = te.endpoint_id;

-- All endpoints and states
SELECT name, type_desc, state_desc AS endpoint_state
FROM sys.endpoints;

-- AG connectivity snapshot
SELECT ar.replica_server_name,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ars.replica_id = ar.replica_id;

Fix

-- Start the endpoint if stopped
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;

-- Make sure the SQL Server service account can connect
-- Replace DOMAIN\SqlSvc with your service account
GRANT CONNECT ON ENDPOINT::[Hadr_endpoint] TO [DOMAIN\SqlSvc];

Network check

  • Verify the port is listening on each replica: netstat -ano | findstr <port>
  • Test from each partner to each partner: Test-NetConnection <server> -Port <port>
  • Confirm firewalls allow inbound on that port.

Clean up ghost Service Broker routes if you do not use them

-- See if Service Broker is enabled
SELECT name, is_broker_enabled FROM sys.databases;

-- See routes that might be pointing to dead targets
SELECT * FROM sys.routes;

2) 9642 with 8474 from “Service Broker/Database Mirroring transport”

These entries are transport level failures on the mirroring endpoint. Pair them with the endpoint and network checks above.

Typical root causes

  • Endpoint stopped
  • Partner unreachable
  • Port mismatch
  • TLS or certificate chain problem
  • Packet inspection device interfering with traffic

Extra monitoring
Create a short Extended Events session to capture transport errors with client and host info.

CREATE EVENT SESSION [HADR_TransportErrors] ON SERVER
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.client_app_name, sqlserver.client_hostname)
    WHERE (error_number = 9642 OR error_number = 8474)  -- XE predicates do not support IN
)
ADD TARGET package0.ring_buffer;
ALTER EVENT SESSION [HADR_TransportErrors] ON SERVER STATE = START;

Read the ring buffer when you see errors:

SELECT
  DATEADD(minute, DATEDIFF(minute, GETUTCDATE(), SYSDATETIME()),
          x.value('(event/@timestamp)[1]','datetime2')) AS local_time,
  x.value('(event/data[@name="message"]/value)[1]','nvarchar(4000)') AS message,
  x.value('(event/action[@name="client_app_name"]/value)[1]','nvarchar(256)') AS app,
  x.value('(event/action[@name="client_hostname"]/value)[1]','nvarchar(256)') AS host
FROM (
  SELECT CAST(target_data AS XML) AS xml_data
  FROM sys.dm_xe_sessions s
  JOIN sys.dm_xe_session_targets t
    ON s.address = t.event_session_address
  WHERE s.name = 'HADR_TransportErrors'
) AS d
CROSS APPLY xml_data.nodes('//RingBufferTarget/event') AS q(x);

Stop and drop when done:

ALTER EVENT SESSION [HADR_TransportErrors] ON SERVER STATE = STOP;
DROP EVENT SESSION [HADR_TransportErrors] ON SERVER;

3) Logon Error 17836 “Length specified in network packet payload did not match…”

SQL Server terminated a connection because a TDS packet's declared length did not match the number of bytes actually read. This is a server-side protection: it keeps malformed, corrupted, or truncated packets from being processed.

Common causes

  • Old or buggy client drivers. Fix with ODBC 18 or current SqlClient.
  • TLS handshake problems that lead to truncation by a proxy or load balancer.
  • Middleboxes that rewrite or compress traffic incorrectly.
  • Network instability or NIC offload issues.
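To see why a truncated packet trips this check, here is a minimal sketch of the length validation a TDS receiver performs. The 8-byte header layout (Type, Status, Length, SPID, PacketID, Window) comes from the MS-TDS specification; the function itself is illustrative, not SQL Server's actual code.

```python
import struct

def check_tds_packet(data: bytes) -> bool:
    """Return True if the packet's declared length matches the bytes received.

    TDS packet header (per MS-TDS): Type (1 byte), Status (1),
    Length (2, big-endian, includes the 8-byte header), SPID (2),
    PacketID (1), Window (1).
    """
    if len(data) < 8:
        return False  # not even a full header
    ptype, status, length, spid, pkt_id, window = struct.unpack(">BBHHBB", data[:8])
    return length == len(data)

# A well-formed 10-byte packet declaring length 10:
good = struct.pack(">BBHHBB", 0x12, 0x01, 10, 0, 1, 0) + b"\x00\x00"
# The same packet truncated in transit (what a middlebox or flaky NIC can produce):
bad = good[:9]
```

When the check fails, the server has no safe way to continue parsing the stream, so it closes the connection and logs 17836.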

How to identify the offending client
Use Extended Events to capture 17836 with host and app information.

CREATE EVENT SESSION [TDS_17836] ON SERVER
ADD EVENT sqlserver.error_reported(
  ACTION(sqlserver.client_hostname, sqlserver.client_app_name, sqlserver.client_ip)
  WHERE (error_number = 17836)
)
ADD TARGET package0.ring_buffer;
ALTER EVENT SESSION [TDS_17836] ON SERVER STATE = START;

Read results:

SELECT
  DATEADD(minute, DATEDIFF(minute, GETUTCDATE(), SYSDATETIME()),
          x.value('(event/@timestamp)[1]','datetime2')) AS local_time,
  x.value('(event/data[@name="message"]/value)[1]','nvarchar(4000)') AS msg,
  x.value('(event/action[@name="client_app_name"]/value)[1]','nvarchar(256)') AS app,
  x.value('(event/action[@name="client_hostname"]/value)[1]','nvarchar(256)') AS host,
  x.value('(event/action[@name="client_ip"]/value)[1]','nvarchar(64)') AS ip
FROM (
  SELECT CAST(target_data AS XML) AS xml_data
  FROM sys.dm_xe_sessions s
  JOIN sys.dm_xe_session_targets t
    ON s.address = t.event_session_address
  WHERE s.name = 'TDS_17836'
) AS d
CROSS APPLY xml_data.nodes('//RingBufferTarget/event') AS q(x);

Stop and drop when finished:

ALTER EVENT SESSION [TDS_17836] ON SERVER STATE = STOP;
DROP EVENT SESSION [TDS_17836] ON SERVER;

Fixes once you find the client

  • Upgrade client drivers and providers. Use ODBC 18 or newer SqlClient.
  • If Force Encryption is on, verify the certificate chain on the server and clients.
  • Check Windows System log for Schannel errors.
  • Remove or reconfigure proxies and load balancers that touch TDS.
  • On the server, review NIC offload features. If needed, test with TCP Chimney, RSS, LRO disabled.

You can also confirm which sessions are encrypted:

SELECT session_id, client_net_address, encrypt_option, net_transport, protocol_type
FROM sys.dm_exec_connections
WHERE session_id > 50;

Two related messages worth a closer look

“An error occurred while processing log encryption… recovered automatically”

This is from TDE. SQL Server hit a transient issue while encrypting the transaction log and retried successfully. If rare, treat as informational. If frequent, check:

  • Storage latency on the log file
  • CPU pressure during heavy encryption
  • EKM or KMS latency if you use an external key provider

Health checks:

-- Confirm database encryption state (per-database key details live in the DMV)
SELECT d.name, dek.encryption_state, dek.key_algorithm, dek.key_length
FROM sys.dm_database_encryption_keys AS dek
JOIN sys.databases AS d
  ON dek.database_id = d.database_id;

-- Backup the TDE certificate and DEK if not already done
-- BACKUP CERTIFICATE ... WITH PRIVATE KEY = ...
-- BACKUP DATABASE ENCRYPTION KEY ...

“FlushCache … cleaned up … took 65523 ms”

FlushCache messages are informational: SQL Server finished writing dirty pages to disk. But sixty-five seconds for roughly a thousand writes is long, and points to slow I/O or heavy checkpoint work.
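As a sanity check, the numbers in the sample line tell the story on their own (rough arithmetic only; writes overlap in practice, so elapsed time per write is an upper bound):

```python
# Figures from the sample FlushCache line: 1186 buffers, 1082 writes, 65523 ms.
bufs, writes, elapsed_ms = 1186, 1082, 65523

pages_per_sec = bufs / (elapsed_ms / 1000)   # ~18 pages/s flushed
ms_per_write = elapsed_ms / writes           # ~60 ms elapsed per write if serial
mb_flushed = bufs * 8 / 1024                 # 8 KB pages -> ~9.3 MB total

print(round(pages_per_sec, 1), round(ms_per_write, 1), round(mb_flushed, 1))
```

Flushing about nine megabytes in over a minute is far below what healthy storage sustains, which is why a line like this warrants a latency check.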

Check storage latency:

;WITH v AS (
  SELECT DB_NAME(vfs.database_id) AS db_name,
         mf.type_desc,
         vfs.num_of_reads,
         vfs.num_of_writes,
         vfs.io_stall_read_ms,
         vfs.io_stall_write_ms
  FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
  JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id     = mf.file_id
)
SELECT db_name, type_desc,
       CASE WHEN num_of_reads = 0 THEN 0
            ELSE 1.0 * io_stall_read_ms / num_of_reads END AS avg_read_ms,   -- 1.0 * avoids integer division
       CASE WHEN num_of_writes = 0 THEN 0
            ELSE 1.0 * io_stall_write_ms / num_of_writes END AS avg_write_ms
FROM v
ORDER BY avg_write_ms DESC;

Checkpoint aggressiveness:

EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'recovery interval';  -- consider a higher value only if checkpoints are thrashing

Also review waits:

SELECT TOP 20 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE '%SLEEP%'  -- skip idle waits that dominate the totals
ORDER BY wait_time_ms DESC;

Large WRITELOG or PAGEIOLATCH waits point to I/O pressure.


Runbook: decide ignore or investigate

  1. Are AG replicas healthy and connected?
    If yes and there are only a few lines around failover or restart, you can ignore.
  2. Do you see 9642 or 8474 or 17836 repeating?
    If yes, investigate. Start with endpoint state, then network tests, then Extended Events to find the client.
  3. Do you see frequent TDE log encryption recoveries?
    If yes, check log I/O and key provider latency.
  4. Do “FlushCache” durations exceed normal baselines?
    If yes, check storage latency, recovery interval, and workload spikes.
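For teams that script their monitoring, the four questions above reduce to a small decision function. This is a sketch only: the flag names are hypothetical, and real checks would query the DMVs shown earlier rather than take booleans.

```python
def triage(ag_healthy: bool, repeating_transport_errors: bool,
           frequent_tde_recoveries: bool, long_flushcache: bool) -> str:
    """Map the runbook's four questions to a next action (illustrative sketch)."""
    if repeating_transport_errors:
        return "investigate: endpoint state -> network tests -> Extended Events"
    if frequent_tde_recoveries:
        return "investigate: check log I/O and key-provider latency"
    if long_flushcache:
        return "investigate: storage latency, recovery interval, workload spikes"
    if ag_healthy:
        return "ignore: expected noise around failover or restart"
    return "investigate: AG replicas not healthy"
```

The ordering encodes the runbook's priority: repeating transport or TDS errors always win over everything else.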

Example: quick triage in an Always On deployment

1. Confirm AG health:

SELECT ar.replica_server_name, ars.connected_state_desc, ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id;

2. Verify the endpoint is started and the port is correct:

SELECT dme.name, dme.state_desc, dme.role_desc, te.port
FROM sys.database_mirroring_endpoints AS dme
JOIN sys.tcp_endpoints AS te ON dme.endpoint_id = te.endpoint_id;
-- ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;

3. Test network reachability from each replica:

Test-NetConnection <PartnerHostName> -Port <EndpointPort>

4. Grant endpoint CONNECT to the SQL service account if missing:

GRANT CONNECT ON ENDPOINT::[Hadr_endpoint] TO [DOMAIN\SqlSvc];

5. Catch the 17836 offender if it keeps happening:

-- Create TDS_17836 XE session, let it run for 10 to 30 minutes, then read results as shown above.
6. If the client is an old app server:

  • Upgrade its ODBC or SqlClient
  • Fix TLS trust
  • Remove packet rewriting on proxies

Closing thoughts

  • A clean error log helps you catch real incidents. Repeating 9642, 8474, and 17836 entries are signals.
  • Start with endpoint state and network reachability.
  • Use Extended Events to identify noisy clients.
  • Keep TDE keys backed up and watch for spikes in log I/O.
  • Treat long FlushCache durations as a prompt to verify storage health.

