Circuit Breaker for Windows Azure

No application is an island.  Every application needs to interact with other applications hosted remotely, or to consume data stored remotely.  Your application should be cautious and handle instability when interacting with these remote endpoints.

Various practices and patterns are available for building a stable system.  In his book Release It!, Michael T. Nygard describes the following stability patterns for accessing remote endpoints:

  • Timeout – don’t wait for a response beyond a given time limit
  • Retry – strategically repeat the request until it succeeds
  • Circuit Breaker – fail fast if the remote endpoint refuses, and prevent repeated attempts

These patterns are essential for applications hosted in the cloud.  The Azure managed library implements the first two patterns in its storage service APIs.  This post explains how and when to use the Circuit Breaker pattern in Azure.
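To make the first two patterns concrete, here is a minimal sketch of a timeout and a retry helper.  It is only an illustration and not the Azure managed library’s implementation: the Resilience class and its method names are hypothetical, and it assumes .NET 4.5 or later for Task.Run.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
public static class Resilience
{
    // Timeout: stop waiting once the time limit has passed.
    // Note: the abandoned call keeps running in the background, and a failure
    // inside the call surfaces as an AggregateException from Wait().
    public static T WithTimeout<T>(Func<T> call, TimeSpan limit)
    {
        Task<T> task = Task.Run(call);
        if (!task.Wait(limit))
        {
            throw new TimeoutException("No response within " + limit);
        }
        return task.Result;
    }

    // Retry: repeat the request until it succeeds or the attempts run out.
    public static T Retry<T>(Func<T> call, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return call();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts)
                {
                    throw;   // out of attempts: let the last failure propagate
                }
                Thread.Sleep(delay);
            }
        }
    }
}
```

The two helpers compose: wrapping a WithTimeout call inside Retry gives each attempt its own time limit.  Neither helper remembers earlier failures, which is the gap the circuit breaker fills.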

Problem

Remote endpoints are usually accessed from many places across a system.  The reliability of the connection to a remote endpoint is not always consistent.  Timeout and retry policies help when a failure affects a single request, or when the endpoint refuses connections for only a very short time.  However, in situations such as a cloud service outage or a remote endpoint under maintenance, timeout and retry logic cannot rescue the request.  Instead, a mechanism that detects the failure quickly lets the various access points in the system react quickly and avoids unnecessary remote invocations.

Example

Let us take an example: an online flight reservation system hosted in Azure.  It queries various flight operators’ databases through their WCF services to determine availability.  It stores customer details and booking information in SQL Azure, as depicted in the figure below:


The Flight Reservation System (FRS) should handle the following failures when interacting with these remote resources:

  • The flight availability query services (Flight A and Flight B) are unavailable daily between 11:30PM and 11:55PM.
  • The Flight B operator offers a very low SLA, so the system frequently sees refused connections.
  • A SQL Azure outage is uncommon, but the system should handle it.
  • Sometimes a specific Azure data center responds slowly, and the system should handle that as well.

In some cases, one subsystem of an application may create, update and delete a set of blobs or queue messages that another subsystem of the application requires.  Leaving such failures unhandled may result in an unreliable system.

Forces

  • Fail fast and handle the failure gracefully
  • Prevent repeated requests to a remote endpoint that has refused an invocation

Solution

The circuit breaker component keeps the recent connection state for a remote endpoint globally across the system.  It behaves like a residential electrical fuse.  Initially the circuit is in the closed state.  If the number of failed attempts to connect to the remote resource (including retries) reaches the allowed limit, the circuit breaker opens the circuit to prevent subsequent invocations for a while.  This is called “trip broken”, and the circuit breaker is now in the open state.  When a new request is made after some time (the threshold time), the circuit breaker half-opens the circuit, meaning it tries to make an actual connection to the remote endpoint.  If that succeeds, it closes the circuit; otherwise it opens the circuit again.  The attempt and resume policy is global for a remote endpoint.  Hence, a unique circuit breaker should exist for every remote endpoint.  The conceptual diagram below depicts this.
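The closed, open and half-open transitions described above can be sketched as a small in-memory state machine.  This is a simplified illustration under stated assumptions, not the library discussed later in this post: SimpleCircuitBreaker, CircuitState, the injectable clock and the use of InvalidOperationException to signal an open circuit are all hypothetical choices.

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

// Hypothetical in-memory sketch of the state transitions.
public class SimpleCircuitBreaker
{
    private readonly int maxFailures;         // failures tolerated before tripping
    private readonly TimeSpan openDuration;   // the "threshold time"
    private readonly Func<DateTime> clock;    // injectable clock, eases testing
    private readonly object gate = new object();
    private int failureCount;
    private DateTime openedAt;

    public CircuitState State { get; private set; }

    public SimpleCircuitBreaker(int maxFailures, TimeSpan openDuration, Func<DateTime> clock = null)
    {
        this.maxFailures = maxFailures;
        this.openDuration = openDuration;
        this.clock = clock ?? (() => DateTime.UtcNow);
        State = CircuitState.Closed;
    }

    public T Execute<T>(Func<T> remoteCall)
    {
        lock (gate)
        {
            if (State == CircuitState.Open)
            {
                // Still inside the threshold time: fail fast, no remote call.
                if (clock() - openedAt < openDuration)
                {
                    throw new InvalidOperationException("Circuit is open; failing fast.");
                }
                // Threshold time elapsed: let a trial call through.
                State = CircuitState.HalfOpen;
            }
        }

        try
        {
            T result = remoteCall();
            lock (gate) { failureCount = 0; State = CircuitState.Closed; }
            return result;
        }
        catch (Exception)
        {
            lock (gate)
            {
                failureCount++;
                // A failed trial call, or too many failures, (re)opens the circuit.
                if (State == CircuitState.HalfOpen || failureCount >= maxFailures)
                {
                    State = CircuitState.Open;
                    openedAt = clock();
                }
            }
            throw;
        }
    }
}
```

One instance of such a class would be shared by every caller of the same remote endpoint.  The sketch does not restrict the half-open state to a single concurrent trial call, which a production implementation would likely want.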

Behavior

The sequence diagram below explains the typical circuit breaker behavior.

The “Timeout?()” method specifies the connection timeout.  The number of attempts before moving to the open state is not shown in this diagram.  The AttemptReset() call in the half-open state happens when a request is made some time after the circuit breaker has entered the open state.  The time that must pass before the circuit can become half-open is called the threshold time.

The diagram below shows the various states of the circuit breaker for a remote resource.

Implementation and Example

I have started developing a circuit breaker library for Windows Azure with the following capabilities:

  • Handle the various types of remote invocation that happen in a typical Azure application, such as Azure storage services, SQL Azure, web requests and WCF service invocations.
  • Automatically detect and react to the exceptions that are relevant to the circuit breaker concept, such as TimeoutException for WCF’s CommunicationObject.
  • Manage all remote resources by their URIs, including differentiating resources by their sub-URIs.
  • Instead of a singleton circuit breaker per remote resource, maintain the state for a resource in a persistence store such as Azure cache, table storage or blob storage.
  • Allow a circuit breaker policy to be defined globally for a remote resource.
  • Log the open and half-open states of the circuit breaker instances.
  • Allow a global “Failure Handling Strategy” to be defined for a remote resource.
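As a rough illustration of keeping circuit breaker state per resource URI behind a pluggable store, consider the sketch below.  BreakerState, IBreakerStateStore and InMemoryBreakerStateStore are hypothetical names invented for this example and are not the library’s actual types; a cache-backed or table-backed store is assumed to implement the same interface.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical state record for one remote resource.
public class BreakerState
{
    public int FailureCount;
    public DateTime? OpenedAtUtc;   // null while the circuit is closed
}

// Hypothetical abstraction over the persistence store.
public interface IBreakerStateStore
{
    BreakerState Load(Uri resource);
    void Save(Uri resource, BreakerState state);
}

// In-memory store: state is visible only within one role instance.
public class InMemoryBreakerStateStore : IBreakerStateStore
{
    private readonly ConcurrentDictionary<string, BreakerState> states =
        new ConcurrentDictionary<string, BreakerState>();

    public BreakerState Load(Uri resource)
    {
        // An unknown resource starts with a fresh (closed) state.
        return states.GetOrAdd(Key(resource), _ => new BreakerState());
    }

    public void Save(Uri resource, BreakerState state)
    {
        states[Key(resource)] = state;
    }

    // Scheme, authority and path form the key, so sub-URIs get their own state.
    private static string Key(Uri resource)
    {
        return resource.GetLeftPart(UriPartial.Path);
    }
}
```

Because callers only see the interface, swapping the in-memory store for a shared one would make the state visible to every role instance without changing the calling code.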

In this post, I use a limited scope of the Azure Circuit Breaker for easier understanding.  I have a vanilla ASP.NET MVC3 application and a hello world WCF service; both are in the same hosted service.  The code for the WCF service is shown below:

public class HelloService : IHelloService
{

  public string Greet(string name)
  {
    return string.Format("Hello, {0}", name);
  }
}

I have hosted this service on a worker role and opened a TCP/IP port for internal access.  For demonstration purposes, I keep this service host open for one minute and then close it in the WorkerRole’s Run() method, as shown below:

using (ServiceHost host = new ServiceHost(typeof(HelloService)))
{
  // service host initialization code
  // removed for clarity

  host.AddServiceEndpoint(typeof(IHelloService), new NetTcpBinding(SecurityMode.None), endpointurl, new Uri(listenurl));
  host.Open();

  // Keep the host open for one minute; leaving the using block then closes it.
  Thread.Sleep(TimeSpan.FromMinutes(1));
}

The circuit breaker policy is defined in the MVC3 app’s Global.asax.cs as shown below:

CbPolicyBuilder
  .For("net.tcp://localhost:9001/HelloServiceEndpoint")
    .Timeout(TimeSpan.FromSeconds(30))
    .MaxFailure(1).OpenTripFor(TimeSpan.FromSeconds(30))
  .Do();

As I mentioned, the policy is defined against the remote resource URI.  Here, for the net.tcp://localhost:9001/HelloServiceEndpoint resource, an invocation that fails or gets no response within 30 seconds (Timeout) counts as a failure.  After a single failure (MaxFailure), the circuit breaker opens and stays open for 30 seconds (OpenTripFor).  After those 30 seconds, the next connection attempt half-opens the circuit breaker.  The policy is persisted in the persistence store and is accessible across the application.

The MVC3 app has two controllers, HomeController and AuthorController, which invoke this service through the circuit breaker as shown below:

//specify the resource access type, here ChannelFactory<T>
CircuitBreaker<ChannelFactory<IHelloService>>
  // the resource access type instance
  .On(new ChannelFactory<IHelloService>(helloServiceBinding, epHelloService))
  // make the remote invocation
  .Execute<string>(cf =>
  {
    var helloClient = cf.CreateChannel();
    return helloClient.Greet("Udooz!");
  },
  // if everything goes well
  msg => ViewBag.Message = msg,
  // oops, circuit trip broken
  ex =>
  {
    ViewBag.Message = ex.Message;
  });
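To show how a call shaped like the one above might route its outcome to the two handlers, here is a minimal sketch.  It is not the library’s implementation: GuardedCall and its Execute signature are hypothetical, and the circuit state handling is left out so that only the callback dispatch is visible.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class GuardedCall
{
    public static void Execute<TResource, TResult>(
        TResource resource,
        Func<TResource, TResult> call,
        Action<TResult> onSuccess,
        Action<Exception> onFailure)
    {
        TResult result;
        try
        {
            // Only the remote invocation itself is guarded.
            result = call(resource);
        }
        catch (Exception ex)
        {
            onFailure(ex);
            return;
        }

        // Runs outside the try block, so an exception thrown by the
        // success handler is not reported as a remote failure.
        onSuccess(result);
    }
}
```

Keeping the success handler outside the guarded region is a deliberate choice in this sketch: a bug in view code should not count against the remote endpoint’s failure total.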

The same code is in AuthorController.  The page has no link to this controller’s Index() action; test it yourself by entering the URL in the browser.

Final Note

You can download the above sample from http://udooz.net/file-drive/doc_details/25-azurecircuitbreaker.html.  It also contains the basic CircuitBreaker library, which this post does not cover.  The code has the basic design needed to implement CircuitBreaker for Azure, but it does not have a production-ready state persistence repository implementation or other IoC aspects.  The sample uses in-memory state persistence (hence the state is per web/worker role) and supports the WCF ChannelFactory type.

I shall announce the production-ready library once it is completed.