Windows Azure-Hosted Services: What to expect of latency and response times from services deployed to the Windows Azure platform

Emil Velinov

I’ve stayed quiet in the blogosphere for a couple of months (well, maybe a little longer)…for a good reason – since my move from Redmond to New Zealand as a regional AppFabric CAT resource for APAC, I’ve been aggressively engaging in Azure-focused opportunities and projects, from conceptual discussions with customers to actual hands-on development and deployment.

By all standards, New Zealand is one of the best tourist destinations in the world; however, the country doesn’t make it to the top of the list when it comes to Internet connectivity and broadband speeds. I won’t go into details and analysis of why that is, but one aspect of it is the geographical location – distance does matter in that equation too. Where am I going with this, you might be wondering? Well, while talking to customers about cloud in general and their Windows Azure platform adoption plans, one of the common concerns I’ve consistently heard from people is the impact that connectivity will have on their application, should they make the jump and move to a cloud deployment. This is understandable, but I always questioned how much of this was justified and backed up by factual data and evidence matched against real solution requirements. The general perception seems to be that if it is anything outside of the country and remote, it will be slow…and yes – New Zealand is a bit of an extreme case in that regard due to the geographical location. Recently I came across an opportunity to validate some of the gossip.

This post presents a real-life solution and test data to give a hands-on perspective on the latency and response times that you may expect from a cloud-based application deployed to the Windows Azure platform. It also offers some recommendations for improving latency in Windows Azure applications.

I want to re-emphasize that the intent here is to provide an indication of what to expect from Azure-hosted services and solutions performance rather than provide hard baseline numbers! Depending on exactly how you are connected to the Internet (think a dial-up modem, for example, and yes – it is only an exaggeration to make the point!), what other Internet traffic is going in and out of your server or network, and a bunch of other variables, you may observe vastly different connectivity and performance characteristics. Best advice – test before you make ANY conclusions – either positive or not so positive J

The Scenario

Firstly, in order to preserve the identity of the customer and their solution, I will generalize some of the details. With that in mind, the goal of the solution was to move an on-premises implementation of an interactive “lookup service” to the Windows Azure platform. When I say interactive, it means performing lookups against the entries in a data repository in order to provide suggestions (with complete details for each suggested item), as the user types into a lookup box of a client UI. The scenario is very similar to the Bing and Google search text boxes – you start typing a word, and immediately the search engine provides a list of suggested multi-word search phrases based on the characters already typed into the search box, almost at every keystroke.

The high level architecture of the lookup service hosted on the Windows Azure platform is depicted in the following diagram:

Key components and highlights of the implementation:

  • A SQL Azure DB hosts the data for the items and provides a user defined function (UDF) that returns items based on some partial input (a prefix)
  • A Windows Azure worker role, with one or more instances hosting a REST WCF service that ultimately calls the “search” UDF from the database to perform the search
  • An Azure Access Control Service (ACS) configuration used to 1) serve as an identity provider (BTW, it was a quick way to get the testing under way; it should be noted that ACS is not designed as a fully-fledged identity provider and for a final production deployment we have ADFS, or the likes of Windows Live and Facebook to play that role!), and 2) issue and validate OAuth SWT-based tokens for authorizing requests to the WCF service
  • The use of the Azure Cache Service was also in the pipeline (greyed out in the diagram above) but at the time of writing this article this extension has not been implemented yet.

To come back to the key question we had to answer – could an Azure-hosted version of the implementation provide response times that support lookups as the user starts typing in a client app using the service, or was the usability going to be impacted to an unacceptable level?

Target Response Times

Based on research the customer had done and from the experience they had with the on-premises deployment of the service, the customer identified the following ranges for the request-response round-trip, measured at the client:

  • Ideal< 250ms
  • May impact usability – 250-500ms
  • Likely to impact usability – 500ms – 1 sec
  • “No-go” – > 1 sec

Also, upon customer’s request, we also had to indicate the “Feel” factor – putting the response time numbers aside, does the lookup happen fast enough for a smooth user experience.

With the above in mind, off to do a short performance lab on Azure!

The Deployment Topology

After a couple of days working our way through the migration of the existing code base to SQL Azure, Worker Role hosting, and ACS security, we had everything ready to run in the cloud. I say “we” because I was lucky to have Thiago Almeida, an MVP from Datacom (the partner for the delivery of this project), working alongside with me.

In order to find the optimal Azure hosting configuration in regards to performance, we decided to test a few different deployment topologies as follows:

  • Baseline: No “Internet latency” – the client as well as all components of the service run in the same Azure data center (DC) – either South Central US or South East Asia. In fact, for test purposes, we loaded the test client into the same role instance VM that ran the WCF service itself:

  • Deployment 1: All solution components hosted in the South Central US

  • Deployment 2: All solution components hosted in the South East Asia data center (closest to New Zealand)

  • Deployment 3: Worker role hosting the WCF service in South East Asia, with the SQL Azure DB hosted in South Central US

  • Deployment 4: Same as Deployment 3 with reversed roles between two DCs

Notes:

1) In the above topologies except the Baseline, the client, a Win32 app that we created for flexible load testing from any computer, was located in New Zealand; tests were performed from a few different ISPs and networks, including the Microsoft New Zealand office in Auckland and my home (equipped with a “blazing fast” ADSL2+ connection – for the Kiwi readers, I’m sure you will appreciate the healthy dose of sarcasm here J)

2) Naturally, Deployment 1 and 2 represent the intended production topology, while all other variations were tested to evaluate the network impact between a client and the Azure DCs, as well as between the DCs themselves.

  • Deployment 5: For this scenario, we moved the client app to an Azure VM in South East Asia, while all of the service components (SQL Azure DB, worker role, ACS configuration) were deployed to South Central US.

The Results

Here we go – this is the most interesting part. Let’s start with the raw numbers, which I want to highlight are averages of tests performed from a few different ISPs and corporate networks, and at different times over a period of about 20 days:

I can’t resist from quoting the customer when we presented these numbers to them: “These [the response times] are all over the place!” Indeed, they are, but let’s see why that is, and what it all really means:

  • The Baseline test with the client (load test app) running within the same Azure DC as all service components, the response times do not incur any latency from the Internet – all traffic is between servers within the same Azure DC/network and hence the ~65ms average time. This is mostly the time it takes for the WCF service to do its job under the best possible network conditions.
  • Response times for Deployment 1 & 2 vary from 170ms up to about 250-260ms with averages for South Central US and South East Asia of 225ms and 184ms respectively. As mentioned earlier, these two deployments are the intended production service topology, with all service components as close to each other as possible, within the same Azure DC. The results also show that from New Zealand, the South East Asia DC will probably be the preferred choice for hosting; however the difference between South East Asia and South Central is not really significant – this will also be country and location specific. What is more important though is that the response times from both data centers are well within the Ideal range for the interactive lookup scenario. Needless to say – the customer was very pleased with this finding! And so were Thiago and I! J
  • Next we have Deployment 3 & 4 (mirrors of each other) where we separate the WCF service implementation from the database onto different Azure DCs. Wow – suddenly the latency goes up to 660-690ms! Initially when I looked at this, and without knowing much about the service implementation (remember we migrated an existing code base from the on-premises version to Azure), I thought it was either a very slow network between the two Azure DCs or something in the ADO.NET/ODBC connection/protocol that was slowing down the queries to the SQL Azure DB hosted in the other DC. I was wrong with both assumptions – it turned out that the lookup service code wasn’t querying the DB just once, meaning that this scenario was incurring cross-DC Internet latency multiple times for a single request to the WCF service – comparing apples to oranges here.

    Lesson learned: The combination between implementation logic and the deployment topology of your solution components may have a dramatic impact. Naturally – try to keep everything in your solution on the same DC. There are of course valid scenarios where you may need to run some components in different data centers (I know of at least one such scenario from a global ISV from the Philippines) – if this is your case too, be extremely careful to design for the minimum number of cross-DC communications possible. A great candidate for such scenarios is the Azure Cache Service which can cache information that has already been transferred from one DC to the other as part of a previous request.

  • Deployment 5 as a scenario was included in the test matrix mostly based on the interest sparked by the significant response times observed while testing Deployment 3 and 4. The idea was to see how fast the network between two of the Azure DCs is. The result – well, at the end of the day it is just Internet traffic; and as such it is subject to exactly the same rules as the traffic from my home to either of the DCs, for example. In fact, the tests show that Internet traffic from New Zealand (ok, from my home connection) to either of the DCs is marginally faster than the traffic between the South East Asia and South Central US DCs. This surprised me a little initially but when you think about it – again, it’s just Internet traffic over long distances.

Traffic Manager

One other topology that came late into the picture but is well worth a paragraph on its own, is a “geo-distributed” deployment to two DCs, fronted by the new Azure Traffic Manager service recently released as a Community Technology Preview (CTP). In this scenario, data is synchronized between the DCs using the Azure Data Sync service (again in CTP at the time of writing). The deployment is depicted in the diagram below:

Requests from the client are routed to one of the DCs based on the most appropriate policy as follows:

(If you’ve read some of my previous blog articles, you’d already know that I keep pride in my artistic skill. The fine, subtle touches in the screenshot above are a clear, undisputable confirmation of that fact!)

In our case we chose to route requests based on best performance, which is automatically monitored by the Azure Traffic Manager service. BTW, if you need a quick introduction to Traffic Manager, this is an excellent document that covers most of the points you’ll need to know. The important piece of information here is that for Deployment 3 and 4, combined with Traffic Manager sitting in front, the impact on the response time was only about 10-15ms. Even with this additional latency our response times were still within the Ideal range (<250ms) for the lookup service, but now enhanced with the “geo-distribution and failover” capability! Great news and a lot of excitement around the board room table during the closing presentation with our customer!

Important Considerations and Conclusions

Well, since this post is getting a little long I’ll keep this short and sweet:

  • If possible, keep all solution components within the same Azure Data center
  • In a case where you absolutely need to have a cross-DC solution – 1) minimize the number of calls between data centers through smart design (don’t get caught!), and 2) cache data locally using the Azure Cache service
  • Azure Traffic Manager is a fantastic new feature that makes cross-DC “load balancing” and “failover” a breeze, at minimal latency cost. As a pleasantly surprising bonus, it’s really simple – so, use it!!!
  • Internet traffic performance will always have variations to some extent. All our tests have shown that with the right deployment topology, an Azure-hosted service will deliver the performance, response times, and usability for as demanding scenarios as interactive “on-the-fly” data lookups…even from an ADSL2+ line down in New Zealand! And to come back to the customer “Feel” requirement – it scored 5 stars. J

Thanks for reading!

Author: Emil Velinov

Contributors and reviewers: Thiago Almeida (Datacom), Christian Martinez, Jaime Alva Bravo, Valery Mizonov, James Podgorski, Curt Peterson

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 5.00 out of 5)

One Comment

  1. Pingback: Distributed Weekly 102 — Scott Banwart's Blog

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>