Improving the resilience of API servers
The system at a glance
Before talking about resilience, let’s first see what kind of issue we are trying to solve.
The system we are working on is a Digital Rights and Rules Manager. Digital Rights Management (DRM) is an approach to copyright protection for digital media. A DRM system prevents unauthorized redistribution of digital media and restricts how consumers copy content they’ve purchased.
Our system generates a key for encryption and stores the encrypted content in the operator’s content delivery network (CDN). When a user wants to play this content, they need a license that contains a key for the decryption of the content and policies. These policies define the quality of the content (HD, UHD), set geo-restrictions and time limits, etc. To obtain the license, the user goes back to our service.
Resilience
The resilience of an application refers to its ability to continue functioning even when faced with unexpected situations such as:
- Hardware failures
- Network failures
- Partial system failures
- Spike load
We will briefly discuss each of these situations, but we will concentrate mainly on spike load.
Hardware failures
Hardware failures can occur at any time; even a hard disk only has a Mean Time To Failure (MTTF) of 10 to 50 years. A system can also crash for many other reasons, such as human error or power outages. When such issues happen, the affected machine/server can be shut down and repaired, while redundancy approaches such as master-slave replication or horizontal scaling keep the service available.
Network issues
Anything sent over the open network has a greater chance to fail than internal communication. Some common types of network issues include:
- Connectivity problems: problems with the physical connection between devices, such as broken cables or faulty network adapters, can cause connectivity issues.
- Configuration errors: incorrect configuration of network devices, such as routers or switches, can cause network issues.
- Latency and bandwidth issues: slow response times, high latency, and insufficient bandwidth can all cause network issues.
- Router and switch issues: problems with network devices, such as routers and switches, can cause network issues.
- Security issues: attacks from malicious actors, such as malware or hacking, can cause network issues.
Most of these issues have to be fixed by the network administrator, but the software can also help, by retrying on network hiccups and by falling back to serving details from local memory instead of going over the network.
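As a rough illustration of the retry part, the sketch below (with illustrative names, not our production code) retries a flaky network call a few times before propagating the failure:

```java
// Minimal retry helper: tries a network call up to maxAttempts times,
// rethrowing the last IOException if every attempt fails.
import java.io.IOException;

public final class Retry {

    public static <T> T withRetries(int maxAttempts, NetworkCall<T> call) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.execute();
            } catch (IOException e) {
                last = e; // transient network hiccup: try again
            }
        }
        throw last; // all attempts failed
    }

    @FunctionalInterface
    public interface NetworkCall<T> {
        T execute() throws IOException;
    }
}
```

A caller could then write something like `Retry.withRetries(3, () -> fetchDetailsFromRemote())`, where `fetchDetailsFromRemote()` stands in for the real remote call.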
Partial failures
A partial failure of software describes a situation where a software application or a system performs its intended functions only partially. This type of failure can be caused by bugs, configuration errors, hardware issues, or other factors.
To handle partial failures:
- always provide a fallback for each external system that may fail, and
- keep local, highly available caches holding the most recent state of each external service, so that this state can be served when the external service fails (see the sketch after this list).
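A minimal sketch of that caching fallback, with illustrative class and method names:

```java
// Serves the last known state of an external service from a local cache
// whenever the live call fails.
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PolicyClient {

    private final ConcurrentMap<String, String> lastKnownPolicies = new ConcurrentHashMap<>();

    public Optional<String> policyFor(String contentId) {
        try {
            String policy = fetchPolicyFromService(contentId); // live call to the external service
            lastKnownPolicies.put(contentId, policy);          // refresh the local cache
            return Optional.of(policy);
        } catch (ExternalServiceException serviceDown) {
            // Fallback: serve the most recent state we have, possibly slightly stale.
            return Optional.ofNullable(lastKnownPolicies.get(contentId));
        }
    }

    private String fetchPolicyFromService(String contentId) throws ExternalServiceException {
        // Placeholder for the real remote call; throws when the external service is unreachable.
        throw new ExternalServiceException("external service unavailable");
    }

    static class ExternalServiceException extends Exception {
        ExternalServiceException(String message) {
            super(message);
        }
    }
}
```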
Spike load
Here we will describe in more detail what kind of spike-related issues we had and how we tackled them.
Two things need to be considered in our application when there is a sudden increase in load, or spike:
- Flow control.
- Prioritizing the requests.
Flow control
Flow control is a way to make sure the API server is not overwhelmed by concurrent client requests.
As shown above, N clients send requests to the server; the server processes each request and sends the response back to the client.
Every server has limited resources like:
- Limited storage resources — Memory
- Limited computational resources — CPU
Flow control and memory
Every client that wants to send a request to our API server needs an HTTP connection, and each connection means allocating resources on the server, such as buffers to hold request data. So as the number of connections/requests increases, the memory utilization of the server keeps increasing.
All our servers run with limited resources (memory), so if incoming requests keep increasing, the server will eventually run out of memory and shut itself down with an “OutOfMemory” error.
This in turn means all client requests will start failing with a “Server not available” error and the whole client base will be impacted.
The diagram below shows a server that has enough memory for 500 concurrent requests; when it reaches 501 requests, the server starts failing with an “OutOfMemory” error and eventually terminates itself.
If there are already N concurrent requests at the server, then any request beyond N should fail with the error message “Server Busy”.
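The sketch below illustrates this rule with a plain semaphore, independent of any HTTP framework; the limit of 500 and the handler names are chosen for illustration:

```java
// Caps the number of in-flight requests at N; request N+1 is rejected
// with "Server Busy" instead of exhausting memory.
import java.util.concurrent.Semaphore;

public class FlowControlledHandler {

    private static final int MAX_CONCURRENT_REQUESTS = 500; // "N", chosen for illustration
    private final Semaphore permits = new Semaphore(MAX_CONCURRENT_REQUESTS);

    public String handle(String request) {
        if (!permits.tryAcquire()) {
            return "503 Server Busy"; // over the limit: reject immediately
        }
        try {
            return process(request);
        } finally {
            permits.release();
        }
    }

    private String process(String request) {
        // Placeholder for the real request processing.
        return "200 OK";
    }
}
```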
Flow control and CPU
To handle more requests concurrently, we need to increase the number of threads (workers) that execute the application code for each request and return the response.
Each thread also needs its own stack, which further increases memory usage.
As the number of threads increases, there are more executors than physical CPUs available. The operating system scheduler then has to share CPU time across all of these threads, and as a result, more context switches occur in the operating system.
An increase in the number of context switches can lead to lower performance because it introduces overhead and consumes CPU time. The more frequent the context switches, the more time the operating system must spend on saving and restoring process states, and the less time available for the actual processing. This results in reduced CPU utilization, lower throughput, and longer response times for user requests.
Moreover, context switches also cause cache misses and memory page faults, which can further increase the overhead associated with context switches. These events require additional CPU time to retrieve data from slow memory, which further reduces performance.
In general, it is important to minimize the number of context switches to ensure the efficient and optimal performance of the system. This can be achieved by optimizing the scheduling algorithms, reducing the number of processes and threads, and optimizing the memory management system.
So having a huge number of threads to handle more concurrent requests will create more problems.
Request prioritizing
HTTP servers normally open a TCP port where users can send requests. These requests are then put into a queue and a pool of workers will pick each request from the queue to process it and return the response.
Each application has multiple endpoints to serve requests. For example, in our case we have one endpoint that serves the license (the player needs this license to play the video), and another endpoint, “/health/ready”, which informs the load-balancer (client) whether the application is healthy enough to serve requests.
This health endpoint plays a crucial role in horizontally scaled applications, where you have multiple instances of your service and a load-balancer in front of them that distributes the incoming load across the instances (in round-robin fashion).
Before sending a request to an instance, the load-balancer needs to know whether that instance is ready to serve requests, and it uses the “/health/ready” endpoint to find out. It polls this endpoint periodically, say every 5 seconds, to check whether the instance is still ready. In our application, the “/health/ready” handler checks that everything the application needs is in place, such as the configuration, the connection to the database, and the connections to Kafka, and then returns READY (200); if any of these are not available, it returns UNAVAILABLE (non-200).
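A simplified sketch of such a readiness endpoint (the dependency checks are placeholders, not our actual implementation):

```java
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ReadinessController {

    @GetMapping("/health/ready")
    public ResponseEntity<String> ready() {
        boolean ready = configurationLoaded() && databaseReachable() && kafkaConnected();
        return ready
                ? ResponseEntity.ok("READY")                       // 200
                : ResponseEntity.status(503).body("UNAVAILABLE");  // non-200
    }

    // Placeholder checks; the real application verifies its actual dependencies.
    private boolean configurationLoaded() { return true; }
    private boolean databaseReachable()   { return true; }
    private boolean kafkaConnected()      { return true; }
}
```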
So, if there is a huge number of license requests from users and the load-balancer sends a health check request at the same time, the health request will either sit in the queue waiting to be processed or be rejected with a 503 if the queue is full.
If it sits in the queue and the READY response takes too long, or the queue is full and a 503 is returned, the load-balancer considers the application not ready to process requests. The instance is considered dead, and traffic is no longer forwarded to it. This can happen on every instance of the application, and eventually the service becomes unavailable.
We might say we can horizontally scale the application under such a spike load, but the load can go from a few hundred requests to tens of thousands or even millions in a matter of seconds. For example, when there is a football match, all the users tune in just a few minutes before the game, producing a huge number of simultaneous license requests to our service, a spike that resembles a DDoS, and no horizontal scaling can react that fast.
How current HTTP servers handle spike loads
Let’s see how HTTP servers handle these spike loads and what solutions they provide, based on the two frameworks we use to build our HTTP applications:
- Spring-boot
- Vertx
Both spring-boot and Vertx accept HTTP requests, process them using the application logic, and return the response.
Each HTTP server queues incoming requests (connections) before handing them to the worker threads for processing. This approach provides flow control: excess requests wait in the queue, and the application server processes at most n requests (the number of threads) at any given moment in time.
But the request queue can grow quickly when the number of concurrent users increases, and as a result the application can run out of memory and crash. To avoid this, each HTTP server provides a configuration for the maximum allowed queue size, so that once the queue is full, new requests are not accepted by the server and the user gets a “server busy” response.
This configuration makes sure the application server can work without crashing in spike load scenarios and provides flow control for the application.
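For example, with spring-boot and embedded jetty, the worker pool and request queue can be bounded roughly as below (a sketch; the thread counts and queue size are illustrative, not our exact configuration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import org.eclipse.jetty.util.thread.QueuedThreadPool;
import org.springframework.boot.web.embedded.jetty.JettyServletWebServerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BoundedServerConfig {

    @Bean
    public JettyServletWebServerFactory jettyFactory() {
        JettyServletWebServerFactory factory = new JettyServletWebServerFactory();
        // At most 200 worker threads (min 8, 60s idle timeout),
        // and at most 500 requests waiting in the bounded queue.
        QueuedThreadPool threadPool =
                new QueuedThreadPool(200, 8, 60_000, new ArrayBlockingQueue<>(500));
        factory.setThreadPool(threadPool);
        return factory;
    }
}
```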
Issues with this approach
As shown above, current HTTP servers provide flow control at the container level, which treats all the resources of our application with the same priority. If users send a huge number of requests for one resource, those requests pile up in the queue, and requests for any other resource either fail because the queue is full, or sit in the queue and time out or get processed too late.
So, in our application, the huge number of license requests from users during high-profile football matches caused the other priority requests, such as the health check the load-balancer uses to decide whether an instance is healthy enough to serve requests, to start failing.
As the number of license requests to our service spikes due to the huge number of users starting the video simultaneously, all health check requests start failing, either because the request queue in the application server is full or because they time out.
As a consequence, all the instances of our application were going down, resulting in the service being unavailable during the most important moments of the stream.
How we fixed the issue
To fix the above issues, we need logic that handles flow control and prioritizes requests, so that a huge number of requests for one resource does not starve requests for other, higher-priority resources.
We chose the approach of having flow control per resource, so that when license requests spike, we only discard new license requests and not requests for other resources like health.
This approach also gives us a way to prioritize requests for different resources, as they are flow-controlled separately.
In the case of the Vertx framework, we added an HTTP verticle, which is an always-running event loop. It accepts a request and checks what kind of request it is: if it is a priority request, it hands it to the priority handler threads, and if it is a non-priority request, it hands it to the non-priority worker pool. In both cases it checks that the number of concurrent requests does not exceed the allowed maximum, so the application always has at most N active concurrent requests. Because the number of concurrent requests is fixed, the memory used by the application stays bounded.
For spring-boot, we cannot use this approach directly, as spring-boot follows a thread-per-request model. This means that when the spring-boot container (jetty) accepts a request, it hands it over to a container thread, and that thread is responsible for processing the request and generating the response.
By default, it is not possible to hand the processing over from this container thread to another thread.
But spring-boot also provides a way to process requests asynchronously using DeferredResult. With this DeferredResult approach, we can accept the request on one thread, similar to Vertx, and hand over non-priority requests to one worker thread pool for processing and priority requests to another pool.
Also, we can apply flow control per resource as we are dispatching the request to a separate worker pool.
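A condensed sketch of that idea with DeferredResult (the class names, pool sizes, and timeout are illustrative, not our production values):

```java
// License requests are dispatched to a bounded executor and rejected with 503
// once it is saturated; /health/ready is answered directly on the container thread.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.context.request.async.DeferredResult;

@RestController
public class LicenseController {

    // Bounded pool: at most 32 license requests in flight, 500 waiting.
    private final ThreadPoolExecutor licensePool = new ThreadPoolExecutor(
            32, 32, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(500),
            new ThreadPoolExecutor.AbortPolicy());

    @GetMapping("/license")
    public DeferredResult<ResponseEntity<String>> license() {
        DeferredResult<ResponseEntity<String>> result = new DeferredResult<>(5_000L);
        try {
            // Non-priority work is handed off to the dedicated worker pool.
            licensePool.execute(() -> result.setResult(ResponseEntity.ok(issueLicense())));
        } catch (RejectedExecutionException poolFull) {
            // Per-resource flow control: only license requests are rejected.
            result.setResult(ResponseEntity.status(503).body("Server Busy"));
        }
        return result;
    }

    @GetMapping("/health/ready")
    public ResponseEntity<String> ready() {
        // Priority request, answered immediately; dependency checks omitted in this sketch.
        return ResponseEntity.ok("READY");
    }

    private String issueLicense() {
        // Placeholder for the real license generation logic.
        return "license";
    }
}
```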
Right now we handle priority requests like “health” on the event-loop thread itself, which has its advantages and disadvantages, as mentioned below.
- Handling priority requests on the event loop: no other requests are accepted while a priority request is being processed, since the event loop is busy. So if priority requests arrive very often, the event loop will always be busy handling them, and other requests will never be handled because the event loop never gets to pass them to the worker pool.
- Handling priority requests in a separate worker pool: if priority requests do not arrive frequently, the extra worker pool allocated for them will be underutilized.
So now, whenever there is a huge load of license requests on our services, other priority requests like health are handled without failing, and the application keeps processing all requests at its own pace without crashing.
Example code for an application that demonstrates how to use Vertx for flow control and request prioritization.
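Below is a condensed, illustrative sketch of that idea (assuming Vert.x 4.x; the concurrency limit, pool size, and handler names are chosen for illustration rather than taken from the production code):

```java
// One HTTP verticle on the event loop inspects each request: priority requests
// (/health/ready) are answered directly, license requests are capped and offloaded
// to a bounded worker pool, and excess license requests get 503 "Server Busy".
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;
import io.vertx.core.WorkerExecutor;
import java.util.concurrent.atomic.AtomicInteger;

public class PrioritizingHttpVerticle extends AbstractVerticle {

    private static final int MAX_CONCURRENT_LICENSE_REQUESTS = 500; // illustrative limit
    private final AtomicInteger inFlight = new AtomicInteger();

    @Override
    public void start() {
        // Bounded worker pool for non-priority (license) requests.
        WorkerExecutor licensePool = vertx.createSharedWorkerExecutor("license-pool", 32);

        vertx.createHttpServer()
            .requestHandler(req -> {
                if ("/health/ready".equals(req.path())) {
                    // Priority request: answered directly on the event loop.
                    req.response().setStatusCode(200).end("READY");
                    return;
                }
                // Per-resource flow control: cap concurrent license requests.
                if (inFlight.incrementAndGet() > MAX_CONCURRENT_LICENSE_REQUESTS) {
                    inFlight.decrementAndGet();
                    req.response().setStatusCode(503).end("Server Busy");
                    return;
                }
                licensePool.<String>executeBlocking(promise -> promise.complete(issueLicense()))
                    .onComplete(ar -> {
                        inFlight.decrementAndGet();
                        if (ar.succeeded()) {
                            req.response().end(ar.result());
                        } else {
                            req.response().setStatusCode(500).end();
                        }
                    });
            })
            .listen(8080);
    }

    private String issueLicense() {
        // Placeholder for the real license generation logic.
        return "license";
    }

    public static void main(String[] args) {
        Vertx.vertx().deployVerticle(new PrioritizingHttpVerticle());
    }
}
```

In this sketch, priority requests are served on the event loop itself, matching the first option discussed above, while license requests are both capped and offloaded to the bounded worker pool.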
After fixing the resilience of the application, we noticed that when the number of requests spikes, our application handles them gracefully and adds new instances of the service as part of auto-scaling, but these newly added instances serve requests slowly because of the Java slow-start issue (HotSpot VM warm-up).
Therefore, to address this issue, we moved to an Ahead-Of-Time compilation (AOT) approach with GraalVM, which will be explained in the next article.