Scaling Web Servers for Production

On high-volume sites, server scaling is a tradeoff between providing an acceptable level of performance to users and minimising excess capacity.

19 July 2019

# Background

You want to build a high performance web application. You've set proper HTTP cache headers, so clients make as few requests as possible. A CDN or other cache shields the origin from duplicate requests for static content. Further, micro-caching is applied to cache dynamic content. Your application server is cache-header aware, and avoids repeated work on conditional GET requests. Additionally, Russian Doll caching, or some other form of application caching, is present to improve performance.
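To make the cache-header and conditional GET behaviour concrete, here is a minimal sketch assuming a Flask application; the route, ETag value, and max-age are hypothetical choices, not part of the original setup.

```python
from flask import Flask, request, make_response

app = Flask(__name__)

PRODUCTS_ETAG = '"v42"'  # hypothetical version tag for the page


def render_products() -> str:
    return "<html>...</html>"  # placeholder for the real page


@app.route("/products")
def products():
    # Conditional GET: if the client already holds the current version,
    # answer 304 Not Modified without re-rendering the body.
    if request.headers.get("If-None-Match") == PRODUCTS_ETAG:
        return "", 304
    resp = make_response(render_products())
    resp.headers["ETag"] = PRODUCTS_ETAG
    # Micro-caching: allow a CDN to serve this page for a short window.
    resp.headers["Cache-Control"] = "public, max-age=5"
    return resp
```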

Now, you want to tune your web server for optimal performance, taking into consideration real-world hardware and real-world traffic.

# Processes & threads

There are two main tunables in web servers: the number of worker processes and the number of threads per process.

Worker processes handle concurrent requests, but they can be very expensive in RAM. A Puma worker or Apache worker can take hundreds of MB of RAM, and adding workers beyond the number of CPUs doesn't increase concurrency for CPU-bound work. The worker count is therefore capped at the number of CPUs.

Threads consume only tens of MB of RAM, but they can only increase concurrency while the CPU is waiting on IO. The optimal number of threads is therefore the inverse of the fraction of time spent on the CPU rather than waiting for IO. For most web servers, this is about 5, because the CPU is used ~20% of the time, with the remainder on iowait, either disk or network. This allows for 5 concurrent in-flight requests per CPU.
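As a rough illustration, here is a minimal sketch of this sizing rule in Python; the 20% CPU fraction is an assumption, so measure your own application's IO profile before relying on it.

```python
import os

def tune(cpu_fraction: float = 0.20) -> tuple[int, int]:
    """Return (workers, threads_per_worker) for one host.

    cpu_fraction is the share of a request's lifetime spent on the
    CPU rather than waiting on IO; 0.20 is an assumed typical value.
    """
    workers = os.cpu_count() or 1      # one process per CPU
    threads = round(1 / cpu_fraction)  # e.g. 1 / 0.20 = 5
    return workers, threads

workers, threads = tune()
print(f"workers={workers} threads={threads}")  # e.g. workers=4 threads=5
```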

# Averages & maxima

Little's Law states that the average number of requests in flight in a system is the product of the arrival rate and the response time. Rearranged, the maximum arrival rate a single request slot can sustain is the inverse of the response time. For example, where the average response time is 250ms, one in-flight request per CPU gives a maximum average throughput of 4 requests per second per CPU.
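Spelled out as a worked equation, using the 250ms figure from above:

```latex
% Little's Law: L = \lambda W, where
%   L       = average number of requests in flight (concurrency)
%   \lambda = arrival rate (requests per second)
%   W       = average response time (seconds)
\lambda_{\max} = \frac{L}{W} = \frac{1}{0.25\,\mathrm{s}} = 4\ \text{requests per second}
```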

But internet traffic does not have a constant arrival rate. Requests are often modelled as a Poisson process: the number of arrivals in a given window follows a Poisson distribution, and the time between requests is exponentially distributed. For this model, a rule of thumb is to over-provision by about 3× the average. While this might seem like wasted capacity, remember that it guarantees a good service level for the times when the request frequency is higher than average, which happens roughly half the time!
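A quick simulation sketch makes the rule of thumb concrete; the 4 requests per second average is an assumption carried over from the example above.

```python
import random

random.seed(1)
avg = 4            # assumed average arrival rate, requests per second
windows = 100_000  # number of one-second windows to simulate

def poisson(lam: float) -> int:
    """Count exponentially-spaced arrivals in one second."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > 1.0:
            return n
        n += 1

samples = [poisson(avg) for _ in range(windows)]
above_avg = sum(s > avg for s in samples) / windows
above_3x = sum(s > 3 * avg for s in samples) / windows
print(f"windows above the average:    {above_avg:.0%}")  # ~37% here; nears 50% for large averages
print(f"windows above 3x the average: {above_3x:.2%}")   # well under 1%
```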

# Burstable instances

AWS offers burstable T3 instances, which are cost-efficient where the load is variable. With T3 unlimited, there is no performance penalty as long as the average CPU load stays below a baseline. Considering the 3× over-provisioning target above, they are perfect candidates for hosting web servers; they won't be used to capacity anyway.

  • t3.nano: 5% baseline load
  • t3.micro: 10% baseline load
  • t3.small, t3.medium: 20% baseline load
  • t3.large: 30% baseline load
  • t3.xlarge, t3.2xlarge: 40% baseline load

The number of instances should be set such that the average load stays below the baseline load for the instance class in use. Given a choice between fewer-but-larger and more-but-smaller instances, prefer the latter: it provides redundancy against failure (i.e. high availability) and distributes memory and network throughput.
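For instance, a minimal sizing sketch under these assumptions; the traffic and per-instance throughput figures are hypothetical.

```python
import math

# Hypothetical inputs: fleet-wide average traffic and the measured
# throughput of a single instance running at 100% CPU.
avg_rps = 120           # average requests per second across the fleet
rps_per_instance = 80   # one instance's throughput at full CPU
baseline = 0.20         # t3.small / t3.medium baseline CPU load

# Enough instances that average utilisation sits at or below the
# baseline, leaving ample headroom for bursts.
instances = math.ceil(avg_rps / (rps_per_instance * baseline))
print(instances)  # 8 instances, each averaging ~18.75% CPU
```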

# Auto scaling

On the scale of minutes, over-provisioning by 3× provides a buffer against short-term traffic fluctuations due to the Poisson arrival of requests. On the scale of hours or more, auto-scaling provides protection against longer-term trends.

When the average load over a sufficiently long interval increases, AWS auto-scaling can increase capacity by spawning additional instances. Conversely, when average load decreases, it can reduce capacity to save costs. This elasticity is a major selling point for cloud services.
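As an illustration, here is a sketch of a target-tracking scaling policy using boto3; the group and policy names are hypothetical, and the target value should sit at or below the baseline of your instance class.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Track average CPU across the auto-scaling group: scale out when it
# rises above the target, scale in when it falls below. 20.0 matches
# the t3.small / t3.medium baseline listed above.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-servers",  # hypothetical group name
    PolicyName="keep-cpu-at-baseline",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 20.0,
    },
)
```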

# Granular scaling

Different services within an application can have different scaling requirements. For example, an e-commerce application's products service would likely see more traffic than its purchasing service, which in turn would see more traffic than its login service. Each could be scaled independently.

Carving the application into multiple micro-services provides this ability. In some scenarios, this alone might be the reason for splitting a monolith into micro-services.
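To illustrate, the earlier sizing rule can be applied per service rather than to the application as a whole; all traffic figures here are hypothetical.

```python
import math

# Hypothetical average traffic per service, in requests per second.
services = {"products": 100, "purchasing": 20, "login": 5}

rps_per_instance = 80   # assumed single-instance throughput at full CPU
baseline = 0.20         # t3.small baseline CPU load

for name, avg_rps in services.items():
    n = math.ceil(avg_rps / (rps_per_instance * baseline))
    print(f"{name}: {n} instance(s)")
# products: 7, purchasing: 2, login: 1 -- each sized to its own traffic
```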

# Summary

Tuning a web server begins with setting the correct number of processes and threads, based on the host instance size and the application server's average IO characteristics. Recognising that request arrivals follow a Poisson process, instances should be over-provisioned by at least 3× the average to cater for the roughly half of the time when the request frequency is above average. This makes T3 instances an ideal candidate for the host. Longer-term scaling is achieved through auto-scaling. Finally, to achieve granular scaling, a monolith can be split into micro-services.