Iranna Patil
Aug 27, 2021


From a Monolith to a Highly Responsive Distributed System

The stand-off between monolithic and distributed architectures started years ago, almost right after microservices premiered as a new architectural approach in 2011. By now, it is generally accepted that both have benefits and drawbacks:

A monolithic architecture usually has one code base and is deployed as one single application. In contrast, a microservices architecture builds the application up from small services, each developed and deployed independently.

Under the right circumstances, either a monolith or microservices can be the ideal option. For a real-world system, however, you do not need to choose between the two extremes; you can adopt an approach at whatever point on the spectrum best suits your product.

In this blog post, we will show how we evolved our architecture from a monolith to a distributed system in several steps, each time adding services to balance system complexity against responsiveness.

The system at a glance

The system we are working on is a Digital Rights and Rules Manager.

Generally speaking, Digital Rights Management (DRM) is an approach to copyright protection for digital media. A DRM system prevents unauthorized redistribution of digital media and restricts the ways consumers may copy content they’ve purchased.

In essence, our system generates an encryption key and stores the encrypted content in the operator's content delivery network. When users want to play this content, they need a license that contains the key to decrypt the content, along with policies. These policies define the quality of the content (HD, UHD), set geo-restrictions and time limits, etc. To obtain the license, users come back to our service.
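Conceptually, a license couples the content decryption key with the policies that govern playback. The sketch below is a minimal, hypothetical model of that shape; the field names are illustrative and not taken from our actual code:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Policy:
    """Playback policies attached to a license (illustrative fields only)."""
    max_quality: str = "HD"                # e.g. "HD" or "UHD"
    allowed_regions: list[str] = field(default_factory=lambda: ["US", "EU"])
    expires_at: Optional[datetime] = None  # time limit for playback

@dataclass
class License:
    """A license bundles the content decryption key with its policies."""
    content_id: str
    decryption_key: bytes
    policy: Policy
```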

Stage 1: Monolithic Architecture

Initially, the system had a purely monolithic architecture and performed only two tasks:

  1. Register user subscription details, keys, and business policies, and store them in a database. End users got a subscription for a content package via the management UI.
  2. Process license requests from client devices when a user tries to play a content item, and issue licenses with content keys and policies.
Figure: the existing monolithic system with a back-end API layer
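To make the two responsibilities concrete, here is a minimal sketch of what the monolith's API surface might look like, assuming a Flask-style HTTP layer; the route names and the in-memory stand-in for the database are purely illustrative:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
db = {"subscriptions": {}, "keys": {}}  # stand-in for the shared RDS database

@app.route("/subscriptions", methods=["POST"])
def register_subscription():
    """Write path: store subscription details, keys, and business policies."""
    payload = request.get_json()
    db["subscriptions"][payload["user_id"]] = payload
    return jsonify({"status": "registered"}), 201

@app.route("/licenses/<content_id>", methods=["POST"])
def issue_license(content_id):
    """Read path: issue a license with the content key and playback policies."""
    user_id = request.get_json()["user_id"]
    subscription = db["subscriptions"].get(user_id)
    if subscription is None:
        return jsonify({"error": "no subscription"}), 403
    return jsonify({
        "content_id": content_id,
        "key": db["keys"].get(content_id, "demo-key"),
        "policies": subscription.get("policies", {}),
    })
```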

This design was good enough to handle multiple tenants with the initial feature set:

  1. The application was horizontally scalable, as deploying n copies of the whole application behind a load balancer is rather easy.
  2. It was easy to develop and test: we could test the whole application end-to-end in one place and deploy a single package.
  3. Latency was good, as method calls and shared-memory access are much faster than inter-process communication using messaging, RPC, etc.

Yet, as we added new functionality, we started seeing bottlenecks in this design:

  1. Coarse-grained deployment and scaling: it was impossible to scale the read and write tasks separately.
  2. Data model compromise: with a single data model, adding a new index to improve read performance adversely affected write performance.
  3. Shared database: any increase in reads or writes affected the other.
  4. Database scalability: the RDS database was only vertically scalable, up to the hardware limits.
  5. Thread per request: due to the thread-per-request model, threads sat idle waiting for I/O, in particular for database calls. That made scaling policies based on CPU usage largely useless.

Stage 2: First Step towards Command Query Responsibility Segregation

The logical step was to group reads and writes into separate services, so they could be scaled independently in the so-called Command Query Responsibility Segregation (CQRS) pattern.

As we moved our infrastructure to AWS, we started using the replication feature of PostgreSQL. The database could scale up to 5 read replicas, allowing us to serve more license calls. But this feature didn't support multi-master, so there was only one master instance accepting writes, with up to 5 read replicas serving reads.
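In practice, this split is often implemented by routing reads and writes to different database endpoints. The sketch below illustrates the idea with psycopg2; the endpoints, table, and columns are hypothetical and only show the routing, not our actual schema:

```python
import psycopg2

# Placeholder DSNs; with RDS, the writer and reader endpoints differ.
PRIMARY_DSN = "host=db-primary.example.com dbname=drm user=app password=secret"
REPLICA_DSN = "host=db-replica.example.com dbname=drm user=app password=secret"

def register_subscription(user_id, policies):
    """Writes always go to the single master instance."""
    with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO subscriptions (user_id, policies) VALUES (%s, %s)",
            (user_id, policies),
        )

def fetch_subscription(user_id):
    """Reads are routed to a replica; the result may lag the master slightly."""
    with psycopg2.connect(REPLICA_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT policies FROM subscriptions WHERE user_id = %s",
                    (user_id,))
        return cur.fetchone()
```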

Figure: CQRS with a read-replica database

This design worked quite well for small-to-medium loads and had the following advantages:

  1. Thanks to read replicas, the system could serve more licenses.
  2. Having read replicas in different regions, we were able to serve licenses from the data center nearest to the customers.

However, this felt like only a half measure, because the other bottlenecks remained unchanged:

  1. Reads and writes still used the same data model, and normalized database tables were not good for reads, since they required multiple joins that consumed CPU resources.
  2. The database remained the main bottleneck, with no scaling for writes because of the single master and the limit of 5 replicas for reads.
  3. The system was not responsive enough, as the response time of API calls varied as the load increased, due to the limited scalability of the database.

Important Note:
By creating read replicas, we had already given up the important relational database property of consistency. All read replicas became eventually consistent: we cannot guarantee that a read replica always has the latest state, because data synchronization from the master to the replica nodes takes time. From this point on, the system is eventually consistent!

Stage 3: CQRS Using Event-Driven Data Population

To remove the database bottleneck, the next step was to split the system into separate command (write) and query (read) services, each with its own data store, and to populate the query side through events: command services publish every change to Kafka, and query services consume those events to build up their own read-optimized state in Redis.
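As an illustration of the pattern (not our production code), a command service might publish an event after each write, and a query service might consume it into Redis. The topic name, event fields, and connection settings below are assumptions for the sketch, using the confluent-kafka and redis Python clients:

```python
import json

import redis
from confluent_kafka import Consumer, Producer

# --- Command side: publish a domain event after the write succeeds ---
producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_subscription_registered(user_id, policies):
    event = {"type": "SubscriptionRegistered",
             "user_id": user_id, "policies": policies}
    # Keying by user_id keeps all events for one user in the same partition,
    # so their relative order is preserved.
    producer.produce("subscription-events", key=user_id, value=json.dumps(event))
    producer.flush()

# --- Query side: consume events and build a read-optimized view in Redis ---
cache = redis.Redis(host="localhost", port=6379)
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "license-query-service",
    "auto.offset.reset": "earliest",  # replay from the beginning on first start
})
consumer.subscribe(["subscription-events"])

def run_projection():
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Store the view exactly in the shape the license endpoint needs to read.
        cache.set(f"subscription:{event['user_id']}",
                  json.dumps(event["policies"]))
```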

As a result, we obtained a truly distributed system with separate data models for read and write operations and a separate database for each service. This design brought the following advantages:

  1. Highly Scalable: as Redis is horizontally scalable and sits in the critical path of providing data to the query services, the system is now far more scalable than before.
  2. Flexibility: as with any microservices setup, each service can be built with a different technology stack.
  3. Responsiveness: each service having its own optimized data model made the system more responsive and performant.
  4. Lower Latency: as Redis is used as the persistent data store for the query services, the system gets a distributed cache out of the box.
  5. Maintainability: having separate microservices helped with the separation of concerns, which made the code easier to maintain.
  6. Debugging: with all the events in Kafka, we can now replay them in the development environment and easily debug issues.
  7. Pluggable: at any time, a new service can be added to the system and build up its own state from the events, e.g. a new reporting service feeding a data warehouse.
  8. Replayable: as the events are stored in Kafka in the exact order they happened, we can replay them to rebuild state, as shown in the sketch after this list.
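Rebuilding a query service's state is then mostly a matter of re-reading the topic from the start. A minimal sketch, assuming the same hypothetical topic as above: a consumer started with a fresh group id and auto.offset.reset set to earliest receives all retained events again, and the projection code repopulates Redis from scratch.

```python
from confluent_kafka import Consumer

# A fresh group id has no committed offsets, so with auto.offset.reset=earliest
# the consumer starts from the oldest retained event and replays everything.
replay_consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "license-query-service-rebuild",  # hypothetical new group id
    "auto.offset.reset": "earliest",
})
replay_consumer.subscribe(["subscription-events"])
# From here, the same projection loop as before repopulates Redis.
```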

Things to take into consideration

  1. Eventually consistent: it takes some time for consumers to read the messages from the event bus and populate their own data stores, so right after a command service responds, the data may not be visible on the query side yet.
  2. Constraints & Validation: as the data is stored in a non-relational database, all of the integrity validations provided by relational databases need to be handled in the application.
  3. Temporal: parallel processing and the critical path must be handled carefully; related events should be keyed so they fall into the same Kafka partition, which preserves their order (see the keyed producer in the sketch above).
  4. Scaling metrics: the applications can no longer rely only on traditional metrics for scaling; consumer services should be scaled based on the number of unprocessed messages in the topics they consume, as sketched after this list.
  5. Event versioning: over time the event schema will change, so the latest consumers may no longer be able to process old events. If Redis goes down and the old events cannot be replayed, it won't be possible to recreate the application state from scratch. A possible solution is to take regular snapshots of Redis and replay only the events newer than the snapshot.
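For point 4, a consumer group's lag can be computed from Kafka's watermark offsets and the group's committed offsets. A rough sketch, again assuming the confluent-kafka client and the hypothetical topic from earlier; in practice you would more likely feed this from a dedicated lag exporter or a monitoring metric rather than compute it in-process:

```python
from confluent_kafka import Consumer, TopicPartition

def total_lag(consumer: Consumer, topic: str) -> int:
    """Sum of unprocessed messages across all partitions of a topic."""
    metadata = consumer.list_topics(topic, timeout=10)
    partitions = [TopicPartition(topic, p)
                  for p in metadata.topics[topic].partitions]
    committed = consumer.committed(partitions, timeout=10)
    lag = 0
    for tp in committed:
        low, high = consumer.get_watermark_offsets(tp, timeout=10)
        # No committed offset yet means the group would start at the log start.
        current = tp.offset if tp.offset >= 0 else low
        lag += high - current
    return lag

# An autoscaler hook could compare total_lag(...) against a threshold and
# add or remove consumer instances accordingly.
```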

As you can see, each solution worked well for a specific task, and it was a matter of fine-tuning to find the optimal approach for a particular challenge. When adding a new service to your system or splitting an existing one, you should always bear in mind what you are going to achieve and at what cost.

For now, our distributed system serves us well, and we are always ready to take it to the next level to deal with increasing loads and performance requirements as more advanced technologies emerge.
