Wednesday, January 31, 2018

SQL Server Frequent mirror failovers, think beyond DBA skills

As you are a DBA you might have observed the primary databases configured in mirror gets failed over frequently / unexpectedly which causes downtime to the users.   So how do you start troubleshoot.... probably you could see the database status and once you find this was failed over to secondary, then how do you verify as you have all connectivity in place from secondary otherwise bring it back to the primary server to make sure the application connectivity remains to the users. 

But how do you troubleshoot to fix that unexpected errors, so that it will not happen again.

The first step is to start analyzing the SQL Server logs ....  you would get lot of information from it ;

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\MSDBLog.ldf] in database [msdb] (4). The OS file handle is 0x000000000000

 
The mirroring connection to "TCP://xxxx_mir.xxx.isyntax.net:5022" has timed out for database "xxxxx" after 10 seconds without a response. Check the service and network connections.

And check in Windows Event logs as well where you can find information for hardware & OS related.

So, one of these scenarios how do you troubleshoot further and fix the issue. 

1) You could change the timeout increased to a higher value, so that if there is no response from the IO or network you can avoid frequent failovers.
2) Analyze the IO errors working with storage admin then take up the action accordingly.
3) If you observe much load happening it requires system to be tuned.
4) See any disk related errors locally and sufficient disk space available.

In my case I have observed the server is a shared database node in VMware cluster and the SAN which is hosted the storage allocated to the database node is used for the other nodes as well.    So it looks like recently there were few more nodes added which kept more load on it, hence the IOPs is not happening with the desired rate.   

Hence SQL server is getting timed out as there is no response when it writes the data on the data file.   I worked with storage admin and move one of the other node to different VMcluster and that is connected to an another SAN together.  So reduced the IO load.

Hope this info helps you troubleshooting the mirror failover issues.
 

No comments: