Reaching Stable Performance in AppFabric Cache with Non-Idle Cache Channel

When using Azure AppFabric Cache, you will notice that object retrieval time averages around 6 milliseconds. However, you may also notice spikes as high as 400 milliseconds, which by most accounts is undesirable. The purpose of this article is to explain the reason behind these seemingly mysterious latency spikes. To do this, we first need to look at the DataCacheFactory (DCF). As stated in a previous blog, its instantiation is required to establish communication with your Azure AppFabric Cache service endpoint, which under the hood is a WCF channel and as such leverages the CLR ThreadPool. Here is where the problem resides: the CLR ThreadPool has a known issue, which manifests itself by releasing all of the I/O threads in the ThreadPool (except for one) after 15 seconds of inactivity; for further details on the issue, refer to this blog. This, in turn, destroys the WCF channel running on the released thread; the spikes are simply the result of trying to recover from the lost channel.

NOTE: This may also manifest in the Azure AppFabric Server (on-prem); however, it is less noticeable there because idle times may not be as high, since we have a dedicated cluster, unlike the shared cluster in the cloud. Either way, the recommendation below should be applied both on-premises and in the cloud.

Not just a 15 second idle problem

Even after the 15 second I/O thread issue is fixed (a fix is being investigated for a future release; more details are not currently available), the fact remains that Windows Azure load balancers (LB) will close idle connections after 60 seconds. Hence, you need to avoid both of these possible sources of spikes.

Keeping a busy channel

In a system where the DCF is always kept busy, neither of these issues will be a concern, because neither of those idle times will ever be reached, unless local cache is in use (see “Other considerations” below). Similarly, for applications that may incur this idle time, keeping them artificially busy will avoid these performance spikes.

As described in the article above, a workaround requires changes on the service, which in this case means a QFE in Azure AppFabric Cache. Even then, it is uncertain whether it would also fix the 60 second connection timeout from the LB (since the service is behind the LB). Either way, keeping the channel active avoids both problems, and this can be done by simply making an API call to Put a small object at intervals below 15 seconds. I will call this preserving an active channel.
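As a rough illustration of the idea, the sketch below runs a caller-supplied ping on a timer whose period stays below the idle timeout. The ChannelKeepAlive class and its members are hypothetical names, not part of the AppFabric API; in a real application the ping delegate would presumably wrap a Put of a small object against the cache.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of the AppFabric API): invokes a ping
// delegate at a fixed period so the underlying channel never sits idle.
public sealed class ChannelKeepAlive : IDisposable
{
    private readonly Action ping;
    private readonly Timer timer;
    private int pingCount;

    // period must be shorter than the smallest idle timeout
    // (15 seconds for the I/O thread issue, 60 seconds for the LB).
    public ChannelKeepAlive(Action ping, TimeSpan idleTimeout, TimeSpan period)
    {
        if (ping == null) throw new ArgumentNullException("ping");
        if (period <= TimeSpan.Zero || period >= idleTimeout)
            throw new ArgumentOutOfRangeException("period", "Period must be positive and below the idle timeout.");

        this.ping = ping;
        // The first tick fires immediately; later ticks fire once per period.
        this.timer = new Timer(OnTick, null, TimeSpan.Zero, period);
    }

    // Number of pings that completed without throwing.
    public int PingCount { get { return Volatile.Read(ref pingCount); } }

    private void OnTick(object state)
    {
        try
        {
            ping();
            Interlocked.Increment(ref pingCount);
        }
        catch (Exception)
        {
            // Swallowed so a failed ping does not take down the process;
            // the next tick retries. Real code should log the failure here.
        }
    }

    public void Dispose()
    {
        timer.Dispose();
    }
}
```

With an idle timeout of 15 seconds, a period of 11 seconds leaves a few seconds of margin for a delayed tick. Passing the ping as a delegate keeps the sketch independent of the caching assembly.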

Where to best preserve the active channel

My first reaction was to simply add a call to my static encapsulation of the DataCacheFactory class in the RoleEntryPoint.Run() method (in public class WebRole : RoleEntryPoint) and do a Put() operation from a thread every 15 seconds. Unfortunately, even though I was invoking a static class, this created a second instance, separate from the one used by the application: the RoleEntryPoint code and the web session code do not execute in the same place (under full IIS hosting, RoleEntryPoint runs in its own process, apart from the worker process that serves the pages), so they cannot share a single static instance. I ended up with two separate DataCacheFactories, one invoked in RoleEntryPoint.Run() and the other in the web page, at the HttpContext, which defeats the purpose.

This took me to the Global class (public class Global : System.Web.HttpApplication), but starting the thread in the Application_Start() method threw an exception. It turns out that the DataCacheFactory requires ACS authentication, which in turn requires an HttpContext. To finally make it work, the code had to be added to the Session_Start() method of the Global class.

As you will see below, the interval is set to 11 seconds, which leaves enough margin that a delayed tick does not push the gap over 15 seconds. The Boolean initialized is used to prevent the creation of additional timers. Lastly, note the use of a System.Threading.Timer with a TimerCallback, as it is the most adequate tool for this type of task (waking up, doing some work and then going dormant until the next period). The following is how the code is implemented in global.asax.cs:

    using System;
    using System.Threading;

    public class Global : System.Web.HttpApplication
    {

        //Make the time lapse just a little above 2/3 of the 15 seconds, to avoid time sync issues
        private const int PingChannelInterval = 11000;
        private const int DueTime = 1000; //Time to wait before thread starts

        public const string WakeNonLocalCacheChannelObjKey = "UTC time of last wake call to non-LocalCache channel";
        private static bool initialized = false;

        //Thread to keep channel from idling.
        private static Timer t = null;

        void Application_End(object sender, EventArgs e)
        {
            //Now it is safe to dispose of the timer that kept the channel from idling
            StopPreservingChannel();
        }

        void Application_Error(object sender, EventArgs e)
        {
            //Dispose of the timer; the next Session_Start will create a new one
            StopPreservingChannel();
        }

        //Disposes the timer if one was created and resets the initialized flag.
        //The null check matters because no session may have started yet.
        private static void StopPreservingChannel()
        {
            Timer timer = t;
            t = null;
            if (timer != null)
            {
                timer.Dispose();
            }
            initialized = false;
        }

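        //Guards timer creation so concurrent sessions cannot start more than one
        private static readonly object timerLock = new object();

        //Here is where the timer that keeps the channel from idling is started
        void Session_Start(object sender, EventArgs e)
        {
            lock (timerLock)
            {
                //If the timer is already running there is no need for another, one is enough
                if (!initialized)
                {
                    //Set the flag now: waiting for the first tick would let every
                    //session that starts within DueTime create its own timer
                    initialized = true;
                    if (t != null) t.Dispose(); //never leave an older timer running
                    t = new Timer(PreserveActiveChannel, null, DueTime, PingChannelInterval);
                }
            }
        }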

        //Timer callback that keeps the channel from idling
        static void PreserveActiveChannel(object state)
        {
            //Remembers which of the 2 DataCacheFactories was in use:
            //the one with local cache or the one without it.
            //NOTE: the flag is static and shared with request threads, so this
            //toggle is not thread-safe unless MyDataCache synchronizes it.
            bool? previousUseLocalCache = null;
            try
            {
                //Stop any more timers from starting, we only need one running
                initialized = true;

                previousUseLocalCache = MyDataCache.CacheFactory.UseLocalCache;

                //Choose to work with the non-local-cache DataCacheFactory,
                //since this is the only DataCacheFactory that will be kept warm
                MyDataCache.CacheFactory.UseLocalCache = false;

                //Warm up only the non-local-cache factory via a Put(); the stored object is a timestamp
                MyDataCache.CacheFactory.Put(WakeNonLocalCacheChannelObjKey, DateTime.UtcNow.ToString());
            }
            catch (Microsoft.ApplicationServer.Caching.DataCacheException)
            {
                //The timer is still running and will retry on the next tick, so the
                //initialized flag stays set; resetting it here would let the next
                //Session_Start create a second timer.

                //Logic to gracefully handle this exception goes here
            }
            finally
            {
                //Set the DataCacheFactory back to the one previously in use, even if the Put() failed
                if (previousUseLocalCache.HasValue)
                {
                    MyDataCache.CacheFactory.UseLocalCache = previousUseLocalCache.Value;
                }
            }
        }
    }//end of class

Other considerations

When running a DCF instance with local cache turned on, the cached objects stay in the local memory of the client application; hence the WCF channel is not used and the idle timeout will likely take place. But this may not be apparent until a lapse of over 15 seconds of local-cache-only activity is followed by a trip to the AppFabric Cache service. Similarly, this will also happen after a local cache timeout of over 15 seconds.
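To make the point concrete, the sketch below models a client that tracks only the calls that actually travel to the service. All names (RemoteIdleTracker, RecordRemoteCall, IsLikelyIdle) are hypothetical, and the clock is injected so the behavior can be checked without waiting. It illustrates the reasoning only; it does not describe how the DataCacheFactory works internally, and it is not thread-safe.

```csharp
using System;

// Hypothetical model: only calls that travel over the channel reset the idle clock.
public sealed class RemoteIdleTracker
{
    private readonly TimeSpan idleTimeout;
    private readonly Func<DateTime> utcNow;
    private DateTime lastRemoteCallUtc;

    public RemoteIdleTracker(TimeSpan idleTimeout, Func<DateTime> utcNow)
    {
        if (utcNow == null) throw new ArgumentNullException("utcNow");
        this.idleTimeout = idleTimeout;
        this.utcNow = utcNow;
        this.lastRemoteCallUtc = utcNow();
    }

    // Call this when a request actually reaches the cache service.
    public void RecordRemoteCall()
    {
        lastRemoteCallUtc = utcNow();
    }

    // Local cache hits never call RecordRemoteCall, so they do not
    // postpone the moment at which the channel is considered idle.
    public bool IsLikelyIdle()
    {
        return utcNow() - lastRemoteCallUtc >= idleTimeout;
    }
}
```

In this model, 18 seconds of reads served purely from local cache leave the tracker reporting an idle channel, which is the situation in which the next remote call would pay for re-creating it.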

Another gotcha is when IIS decides to recycle. The following two blogs go over some methods to avoid this, but keep in mind that the first time the DCF is used, the delay to create the channel has to be incurred, so this has to happen at least once.

· How to get to the hosted web role

· IIS app pool recycle settings

Observing the behavior

The easiest way to observe the behavior is by hitting the service from outside the local datacenter (DC), so below I am sharing the URL to a running web role with the project. It is deployed in the South Central US region and consumes a service in the North Central US region. As such, you will also see that the normal delay is not around 6 milliseconds but closer to 30 milliseconds, which is expected since the request has to travel from one DC to the other.

Instructions

Since I am leveraging a project I used before, the basic instructions to understand the APIs can be found on this blog under the title “Running the Demos”.

URL to sample: http://test1perfofworkerrolecache.cloudapp.net.

Below is the interface you will be shown.

[Screenshot: the sample application's interface]

The interface exposes 2 DCFs, which can be picked from the drop-down menu:

Choose the “Enabled” local cache DCF

Then press the button labeled “Get (optimistic)”; the elapsed time will be shown in the Status box. The first click will be the longest, likely over 400 milliseconds; subsequent runs will take around 30 milliseconds, and then you will get 0 milliseconds once local cache kicks in. Now do nothing for, say, 18 seconds and then do a Get again. The long elapsed time will happen again (unless someone else happens to be running the exact same app at the same time, which I am assuming is unlikely).

Choose the “Disable” local cache DCF

Then follow the exact same steps as above. You will first notice that, since local cache is not used, the elapsed time never goes below 20 or so milliseconds; but if you then do nothing for 15 or so seconds, the next hit will not incur the 400 millisecond spike. To find out the last time the PreserveActiveChannel timer reached the service, click the “TimeStamp of last factory ping” button; it retrieves the object used to keep the channel warm, whose value contains the timestamp of the latest ping.
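If you want to take similar measurements in your own code, one way is to wrap the cache call in a stopwatch. The sketch below is a generic timing helper with hypothetical names (LatencyProbe, MeasureMilliseconds); it is not taken from the sample project, and the delegate stands for whatever cache operation is being observed, such as a Get.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for timing a single operation.
public static class LatencyProbe
{
    // Runs the operation once and returns the elapsed wall-clock time in milliseconds.
    public static long MeasureMilliseconds(Action operation)
    {
        if (operation == null) throw new ArgumentNullException("operation");

        Stopwatch watch = Stopwatch.StartNew();
        operation();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

Timing the same call right after an idle period of more than 15 seconds and again immediately afterwards should make the difference between a cold and a warm channel visible.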

For reference

· More information on the RoleEntryPoint class

· This link will take you to the project’s code.

Reviewers: Rama Ramani, Mark Simms, Christian Martinez and James Podgorski
