Good morning everyone! Today we will see how to monitor GAP in AlwaysOn to be able to react quickly and minimize the impact of recovering to the hardening point. This point is of crucial importance since it is the threshold where the log empties, causing the stop of other nodes due to it and leading to unexpected problems. More information about AlwaysOn at https://www.gpsos.es/2020/09/alta-disponibilidad-en-versiones-standard/
How to monitor GAP in AlwaysOn?
To monitor we will use the tables:
sys.dm_hadr_availability_replica_states sys.dm_hadr_database_replica_states sys.availability_replicas
Crossing these tables we obtain the number of seconds of difference of synchronization by DB as well as the fate they sync against:
; WITH UpTime AS ( SELECT DATEDIFF (SECOND, create_date, GETDATE ()) [upTime_secs] FROM sys.databases WHERE name = 'tempdb' ), AG_Stats AS ( SELECT AR.replica_server_name, HARS.role_desc, Db_name (DRS.database_id)[DBName] , CAST (DRS.log_send_queue_size AS DECIMAL (19,2)) log_send_queue_size_KB, (CAST (perf.cntr_value AS DECIMAL (19,2)) / CAST (UpTime.upTime_secs AS DECIMAL (19,2))) / CAST (1024 AS DECIMAL (19,2)) [log_KB_flushed_per_sec] FROM sys.dm_hadr_database_replica_states DRS INNER JOIN sys.availability_replicas AR ON DRS.replica_id = AR.replica_id INNER JOIN sys.dm_hadr_availability_replica_states HARS ON AR.group_id = HARS.group_id AND AR.replica_id = HARS.replica_id --I am calculating this as an average over the entire time that the instance has been online. --To capture a smaller, more recent window, you will need to: --1. Store the counter value. --2. Wait N seconds. --3. Recheck counter value. --4. Divide the difference between the two checks by N. INNER JOIN sys.dm_os_performance_counters perf ON perf.instance_name = Db_name (DRS.database_id) AND perf.counter_name like 'Log Bytes Flushed / sec%' CROSS APPLY UpTime ), Pri_CommitTime AS ( SELECT replica_server_name , DBName , [log_KB_flushed_per_sec] FROM AG_Stats WHERE role_desc = 'PRIMARY' ), Sec_CommitTime AS ( SELECT replica_server_name , DBName --Send queue will be NULL if secondary is not online and synchronizing , log_send_queue_size_KB FROM AG_Stats WHERE role_desc = 'SECONDARY' ) SELECT p.replica_server_name [primary_replica] , p.[DBName] ACE [DatabaseName] , s.replica_server_name [secondary_replica] , CAST (s.log_send_queue_size_KB / p.[log_KB_flushed_per_sec] AS BIGINT) [Sync_Lag_Secs] FROM Pri_CommitTime p LEFT JOIN Sec_CommitTime s ON [s].[DBName] = [p].[DBName]
The output will be following:

Once we have the seconds per DB, it is easy for us to add a sum and schedule a job so that when said sum is greater than 5 min (300) it jumps. We must change the output of the WITH with the following code:
declare @TEMPPGAP int; SELECT sum (CAST (s.log_send_queue_size_KB / p.[log_KB_flushed_per_sec] AS BIGINT))[Sync_Lag_Secs] into ## TEMPPGAP FROM Pri_CommitTime p LEFT JOIN Sec_CommitTime s ON [s].[DBName] = [p].[DBName] set @ TEMPPGAP = (select[Sync_Lag_Secs] from ## TEMPPGAP) If (@TEMPPGAP> 3600) begin EXEC msdb..sp_send_dbmail @ profile_name = 'Alert Manager', @ recipients = 'alerts @ gpsos.es', @ subject = 'GAP with replicas', @ body = 'Check nodes', @ query = 'select[Sync_Lag_Secs] from ## TEMPPGAP ', @query_result_header = 1, @query_no_truncate = 0 end
From now on we will receive an email when our replicas are out of sync.
If you want us to help you with your environment, be it management or consulting, do not hesitate to contact us !! See you next time!!
Feel free to subscribe to our monthly newsletter so you don’t miss any of our publications with just one email per month.
Still do not know Query Performance ? Find out how it can help you in your Oracle environment. More information on their page LinkedIn .
Follow GPS on LinkedIn
More information about AlwaysOn at https://docs.microsoft.com/es-es/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server?view=sql-server-ver15