High Availability Principle : Request Queueing

In my earlier post about concurrency control I mentioned that due to the exponential characteristic of response time with increasing concurrency it is better to potentially reject the extra requests than letting them negatively affect your system.  While rejecting these requests is an option it is not the most desirable option. Request Queuing offers a much more powerful solution as explained below.

Request Queueing Example

For simplicities sake lets assume the following base response time characteristics for a system with no concurrency controls or request queueing:

  • at a concurrency of 500 or below the average response times are 1 second.
  • at a concurrency of 600 the average  response times are 2 seconds for all requests.
  • at a concurrency of 700 the response times are 4 seconds for all requests.

The exponential degradation starts after a concurrency of 500.  With the above numbers in mind lets assume that you have set up your concurrency controls optimally such that 500 concurrent requests can get through and the next 1000 requests (requests 501-1500) are set up to wait.  That is : The Max Queue Size is set to 1000.

Now lets say you get a sudden traffic spike of 1501 requests. Following is what will happen

  1. The first 500 requests will get through and be served on average in 1 second.  The next 1000 requests(requests 501-1500) are queued .  The 1501st request is rejected.
  2. As requests 1-500 finish the requests 501-1000 are sent forward.  So requests 501-1000 will be served in 2 seconds total.  (1 second execution time and 1 second queue time)
  3. As requests 501-1000 finish, requests 1001-1500 are forwarded on.  So requests 1001-1500 will be served in 3 seconds total.  (1 second execution time and 2 second queue time)

Characteristics of Systems With Request Queueing

  1. Request Queuing allows your system to operate at optimal throughput.  In the above example : the optimal throughput was at 500 concurrency.  The concurrency at which optimal throughput is achieved is usually right below where exponential degradation starts to take place.  At all times the system was operating at this optimal throughput
  2. Your users only experience linear degradation versus exponential degradation. As shown in the diagram, with no request queueing your users and your system would have experienced exponential degradation after 500 requests.  Requests 0-500 take 1 second, 501-1000 takes 2 seconds, 1001-1500 take 3 seconds and so on – With Request queueing – the response times become linear
  3. Your system experiences NO degradation – This is worth repeating.  The system is always operating at an optimal throughput.  The only attribute that is dynamic is the queue size.  The system remains in the green zone as highlighted in the diagram.

If even during times of unusually high load your system can show the above 3 characteristics you are in good shape.

Haproxy is one solution that can do both concurrency control and request queueing and is one of the solutions that should be considered for high availability.

7 Responses

  1. This is a great article. Thanks for posting it.

    I agree with the points you make but I want to add one more. In some cases it is preferable to simply reject the request rather than respond slowly, even with linear degradation from the end user’s perspective. Suppose for example that you have a portal home page that makes server-side web service calls to grab the various pieces of the page. If some of those pieces are less important to the user, it’s probably better for the web service to simply reject the request than to have the web service latency cascade into whole-page latency.

    Anyway great article.

    • Thanks for the comment Willie.
      I too am a believer in the “Fail Fast” paradigm. Request Queueing along with some reasonably set queue timeouts can achieve linear degradation along with the ability to Fail Fast.

  2. Yup.

    Also, I was a bit unclear in one part of my explanation, so I want to clarify. We would fail fast on just that one panel of the page (leave the panel blank or else have a error message for that panel), but deliver the overall page itself within SLA.

  3. Your example neglects to mention that new requests are coming even while entries are being removed from the queue. If new entries come in at 500 requests per second, from the point of the spike onwards, your response time will always be 3 seconds. Another 1500 spike next week will push the response time to 6 seconds, and so forth.

    You need headroom between your usual traffic and the degeneration point in order to recover from spikes.

    • Thanks Martin,
      You make a great point that is worth highlighting : “You need headroom between your usual traffic and the degeneration point in order to recover from spikes.” So in the example above with the degeneration point being at 500 requests/second – one’s steady system state should be below that.
      As an example : With the usual traffic being 50% below the degeneration point(in the example at 500) following would be the response time profile for a system handling a spike (while getting the usual traffic):
      Spike equal to 1x usual traffic : No response time degradation
      Spike equal to 2x usual traffic : Response times doubled.
      Spike equal to 3x usual traffic : Response times tripled.
      and so on.

  4. The problem with today’s web servers is that requested are served synchronously. Typically, the request processing involves talking to a database, file system, or similar long lasting operation, which holds up the connection. A new slew of web servers explore handling requests asynchronously. Node.js (http://nodejs.org/), Tornado (http://www.tornadoweb.org/), etc. are some of the non-blocking servers that might help with concurrency issues. Queuing is required eventually, but async handling at least delays queuing requirement and helps supporting a larger concurrency.

  5. Guys

    I’ve just come across this thread googling around on ideas to implement a ‘queing concept’ for web sites. What I want to understand is this:

    Say that a web site copes with 1000 concurrent requests at any time. The web site is database driven.

    And for a specific hour, you expect the concurrent requests to possibly hit 5 fold (ie 5000 concurrent requests) due to a timed activity say advertised on MSN.

    Is there a way to allow the first 1000 users into the web site, and let them visit the 4-5 pages required to complete a shopping basket transaction, and hold the remaining 4000 queing them in.

    I’m not an expert at all in this subject, but curious as to how to implement a solution to queue up requests without having to spend loads on additional web servers to handle more traffic.

    From what I understand on web requests, each user request is separate to their previous request. So if a user is one of the first 1000 to get in, the next time they traverse to another page on the site, their request will be new and therefore queued with the remaining 4000, which is not what I want. I want this user to complete their journey on the web site pages.



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

%d bloggers like this: