As you move your platform from a single database on a single server to a highly available configuration, be that a Mirror, AlwaysOn, or a cloud-based platform like Windows or SQL Azure, you will find that highly available does not equal highly reliable! The service will still go down during failover, and this downtime will not be covered by your “up time SLA”.
Unfortunately, the solution is to re-code the application tier to detect these “transient” errors and retry the offending code – hoping that the service is now running again on another node.
I will explore three common scenarios that need fault retry logic:
- SQL Mirror (SQL 2005+)
- SQL Always On (SQL 2012)
- Windows or SQL Azure
1) Dealing with Fault Retry Logic on a SQL Mirror
The sales blurb may claim that a SQL Mirror offers zero down time with its “transparent” client redirection. The truth is that during a failover you will lose your current connection, and any uncommitted transactions will be rolled back.
If you are happy for your application to fail for a little while (e.g. web users see an error page), and for in-progress transactions to be lost, then this is fine. Otherwise you will need the application code to support Fault Retry Logic.
The good news is that there are sample code blocks available on CodePlex that can be used as a starting point if you are developing on .NET.
If you are not developing on .NET, this code is still useful, as you can read from it the typical error codes that you need to look out for on failover. You can see the error numbers, along with a useful KB article, below.
// The relevant error codes which are documented in http://support.microsoft.com/kb/109787
// which seem to be worthwhile for retrying. If errors occur which are not in this list they
// are simply rethrown.
static readonly int[] RetryErrorNumbers = new int[]
{
    1608,  // A network error was encountered while sending results to the front end
    10008, // Bad token from SQL Server: datastream processing out of sync
    10010, // Read from SQL Server failed
    10018, // Error closing network connection
    10025, // Write to SQL Server failed
    10053, // Software caused connection abort
    10054, // Connection reset by peer
    10058, // Can't send after socket shutdown
    17824, // Unable to write to server-side connection
    17825, // Unable to close server-side connection
    17832, // Unable to read login packet(s)
};
The sample code block uses a delegate to facilitate the following algorithm:
a) Try the method.
b) If it throws an exception, check whether the exception number is one of the known errors associated with mirror failover (see above).
c) If it is a known error, sleep and retry.
d) Each retry attempt waits exponentially longer, until the maximum number of retries is reached.
/// <summary>
/// This method invokes the supplied delegate with retry logic wrapped around it. No values are returned. If the delegate raises
/// recoverable SQL server or client errors, then the supplied delegate is reinvoked after a certain amount of delay
/// (which grows exponentially for each retry) until the retry limit is exceeded, at which point the exception
/// is rethrown. Other exceptions are not caught and will be visible to callers.
/// </summary>
/// <param name="body">The delegate to invoke within the retry code.</param>
internal void WithMirrorRetry(RetryBody body)
{
    int retryCount = 0;

    while (true)
    {
        try
        {
            body();
            return;
        }
        catch (SqlException se)
        {
            // Errors that are not in the known list are rethrown immediately
            if (!RetryErrorNumbers.Contains<int>(se.Number))
                throw;

            if (retryCount >= _maxRetries) throw;

            // Sleep exponentially more each time through the retry loop
            Thread.Sleep((int)Math.Pow(_delayBase, retryCount));
            retryCount += 1;
        }
    }
}
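For context, here is a sketch of how a caller might use WithMirrorRetry. The MirrorRetryHelper class name, the RetryBody delegate signature, the field defaults, and the connection string are my assumptions for illustration, not part of the published sample:

```csharp
using System;
using System.Data.SqlClient;
using System.Linq;
using System.Threading;

internal class MirrorRetryHelper
{
    // Delegate type assumed by the sample: a void method with no arguments.
    internal delegate void RetryBody();

    private readonly int _maxRetries = 5;  // assumed default
    private readonly int _delayBase = 10;  // assumed default: waits 10^n ms on retry n

    // ... RetryErrorNumbers and WithMirrorRetry as shown above ...
}

internal static class Example
{
    internal static void UpdateOrderWithRetry(MirrorRetryHelper helper)
    {
        // Re-open the connection inside the delegate so that each retry
        // gets a fresh connection to whichever node is primary after failover.
        helper.WithMirrorRetry(() =>
        {
            using (var conn = new SqlConnection(
                "Server=Principal;Failover Partner=Mirror;Database=Sales;Integrated Security=true"))
            {
                conn.Open();
                using (var cmd = new SqlCommand(
                    "UPDATE dbo.Orders SET Status = 1 WHERE OrderId = @id", conn))
                {
                    cmd.Parameters.AddWithValue("@id", 42);
                    cmd.ExecuteNonQuery();
                }
            }
        });
    }
}
```

Note that the delegate re-opens the connection on every attempt; retrying a statement on a dead connection will simply fail again.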
2) Dealing with Fault Retry Logic for SQL 2012 Always On
In theory, SQL 2012 needs a similar approach to that of the Mirror Fault Retry Provider (above). However, there is now an expanded list of potential error numbers, with new messages specific to Always On. A quick search on sys.messages shows some 306 new messages that mention replicas or availability groups.
I haven’t managed to find a definitive list of which ones are relevant for detecting failover, but so far in my testing I am only hitting error codes 976 and 983, on top of the standard mirror ones listed in http://support.microsoft.com/kb/109787.
Hopefully there will be an official KB article, or a code block from the Patterns & Practices group, to help narrow down the list of error codes to use.
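If you take the Mirror sample as a base, extending it for Always On can be as simple as widening the error list. The extra numbers below are only the ones I have hit in my own failover testing, so treat this as a starting point rather than a definitive list:

```csharp
// Mirror-era errors from KB 109787, plus the Always On errors
// observed during failover testing (976, 983). Not exhaustive.
static readonly int[] AlwaysOnRetryErrorNumbers = new int[]
{
    1608, 10008, 10010, 10018, 10025,
    10053, 10054, 10058, 17824, 17825, 17832,
    976, // The target database is participating in an availability group
         // and is currently not accessible for queries
    983, // Unable to access availability database because the database
         // replica is not in the PRIMARY or SECONDARY role
};
```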
3) Dealing with Transient Errors in Windows Azure
Just to make matters more complex, environments like Windows Azure also produce “transient” errors relating to failover, dropped connections, or simply the nature of the cloud.
You can read about the Transient Fault Handling Application Block and download it from this link.
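As a rough sketch of what using the block looks like (based on the Enterprise Library "Topaz" API; check the documentation that comes with the download for the exact namespaces, strategy names, and constructor parameters):

```csharp
using System;
using System.Data.SqlClient;
using Microsoft.Practices.TransientFaultHandling;

// Retry up to 5 times, backing off exponentially between attempts.
var retryStrategy = new ExponentialBackoff(
    5,                          // retry count
    TimeSpan.FromSeconds(1),    // minimum back-off
    TimeSpan.FromSeconds(30),   // maximum back-off
    TimeSpan.FromSeconds(2));   // delta used to randomise the back-off

// The detection strategy decides which exceptions count as transient,
// so you do not have to maintain the error-number list yourself.
var retryPolicy =
    new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(retryStrategy);

retryPolicy.ExecuteAction(() =>
{
    // Connection string shown as a placeholder only.
    using (var conn = new SqlConnection("<your SQL Azure connection string>"))
    {
        conn.Open();
        // ... run commands ...
    }
});
```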
A Word of Warning on Application Transactions
If your application is using Microsoft DTC for transactions, or explicitly starting database transactions outside of a stored procedure, then the failover will cause the entire transaction to roll back. Your retry logic needs to do a lot more work than just retry the last statement after re-opening the connection. It will need to retry all statements, including the statement that starts the transaction.
The sample code on transient errors or fault retry will not help you here. You will need to ensure that the retry happens for the whole batch of statements within the application. Explicit application-side transactions seem to be very common in Java development, so make sure the project includes a review of this risk.
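To make "retry the whole unit of work" concrete, here is a sketch where the retry delegate wraps the BeginTransaction call and every statement, not just the one that failed. The helper, connection string, and table names are illustrative assumptions:

```csharp
// A failover mid-transaction rolls everything back, so the delegate
// replays the whole batch: open, begin, all statements, commit.
helper.WithMirrorRetry(() =>
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var tx = conn.BeginTransaction())
        {
            new SqlCommand("INSERT INTO dbo.OrderHeader (OrderId) VALUES (42)",
                conn, tx).ExecuteNonQuery();
            new SqlCommand("INSERT INTO dbo.OrderLine (OrderId, Qty) VALUES (42, 1)",
                conn, tx).ExecuteNonQuery();
            tx.Commit();
        }
    }
});
```

Because the rolled-back work is replayed from the top on each retry, the body must be safe to re-run; if the commit may have succeeded before the connection dropped, the retry needs to check for that before inserting again.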