Azure SQL Database offers geo-replication for resilience, but delays between primary and secondary databases can occur, especially under heavy write loads. This article explains how to monitor and troubleshoot these delays using public Dynamic Management Views (DMVs) and T-SQL. Redo lag is a common cause, where the secondary receives transaction logs but takes time to apply them to data pages. The DMV `sys.dm_geo_replication_link_status` provides information like replication state and lag in seconds. Healthy replication shows a lag of zero, while transient delays are temporary and self-correcting, unlike sustained delays that require investigation.
To assess redo activity on the secondary, `sys.dm_database_replica_states` is used, showing the `redo_queue_size` and `redo_rate`. A zero `redo_queue_size` indicates normal operation, while an increasing queue suggests pending work. Reporting workloads on geo-secondaries should anticipate near-real-time data, not guaranteed instantaneous updates. Large batch updates, index maintenance, and bursty write workloads are common triggers for redo lag.
Best practices include using UTC timestamps for correlation, monitoring both lag and redo queue size together, and implementing retry logic in applications. It is important to avoid assuming perfect synchronization and refrain from manual intervention during short-lived redo spikes. Redo lag is a normal, self-healing behavior in Azure SQL Database geo-replication, and understanding DMVs helps monitor replica freshness and make informed routing decisions. Azure SQL Database automatically stabilizes replication once redo pressure subsides.
techcommunity.microsoft.com
techcommunity.microsoft.com
Create attached notes ...
