Reliability, Scalability, and Maintainability

Ahmet Lekesiz
8 min readDec 27, 2020

My notes for the first chapter of Designing Data-Intensive Applications book.

What is the Data-Intensive Application?

  • Today, bigger problems are the amount of data, speed(change) of data, and complexity of the data. We call those applications that have kind of problems, data-intensive applications.

Building Blocks of a Data-intensive Application

  • Database — Store data. e.g. PostgreSQL, Couchbase, and etc.
  • Caches — Remember expensive operations to speed up reads. e.g. Memcache
  • Search indexes — Search data quickly by keyword or filter. e.g. Apache Lucene, Elasticsearch
  • Stream processing — Send a message to another process. e.g. Apache Kafka
  • Batch processing — Periodically process a large amount of data. e.g. Hadoop

How to get the maximum benefit from our applications?

There is a lot of different way and tools for storing data, caches and etc. We need to know the use and combine appropriate tools for our case in order to get maximum benefits from our application.

Thinking About Data Systems

Why do we need to think about data systems?

  • Nowadays, applications have a variety of requirements for data storage and data processing. A single tool can not fill those requirements by itself. That’s why we a term like Data Systems. We break the task into small components and each tool handles one of them.
  • Application code keeps different tools in sync.
  • You are not only an application developer but also a data system designer. Congrats! 🥳
  • Some of the Questions for this New Title are:
  • How can you be sure that data will stay correct and complete if something goes wrong internally?
  • How can you provide stable performance to clients if some of the parts of the data system are down?
  • How do you deal with a lot of users at the same time on special days?

We will answer that question with the following concepts.

1. Reliability

Reliability for a software

  1. The software performs the function that the user expected
  2. It can tolerate the user mistakes
  3. The system prevents any unauthorized access.
  • Continue to work correctly, even when things go wrong.
  • The things that can go wrong are called faults.
  • We can not make a system fault-tolerant for every possible kind of fault.
  • The fault is not the same as failure. The fault is defined as one component of the systems act unexpectedly. Failure is defined when the whole system stops providing service to the client.

Chaos Monkey

A way to increase our confidence that our system will handle different kinds of faults by killing some of the individual components. More details in the following link: The Netflix Simian Army

Hardware Faults

  • Hard disks crash, RAM becomes faulty, the power grid has a blackout
  • The MTTF is a statistical value that defines after how much time the first failure in a population of devices may occur.
  • Hard disks are reported having a mean time to failure of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.

30(years)*360 = 10.800

Until now, redundancy was the solution. When one of the components is down, we could use the redundant component and it can keep the machine running uninterrupted for years.

As more applications started to use multi machines, there is a movement toward systems that can tolerate the loss of entire machines by using software fault-tolerance techniques in addition to or preference to hardware redundancy. It also gives flexibility over single-machine reliability, for example using cloud platforms such as AWS.

Software Engineering Software Fault Tolerance — javatpoint

Software Faults

  • This kind of fault is harder to anticipate


  1. A software bug that causes every instance of an application server to crash when specific input is given.
  2. A runaway process that uses up some shared resource — memory, disk space, CPU time
  3. A service which the system depends on, slow down, becomes unresponsive, or returning problematic responses.
  4. Cascading failures, when a small fault in a component causes a bigger fault in another component, and so on…
  • “often lie dormant” → it is not easy to trigger


There is no quick solution for that problem but some of the following things may help:

  • carefully thinking about assumptions
  • process isolation
  • allowing processes to crash and restart
  • measuring, monitoring, and analyzing system behavior in production.

Human Error

  • Humans are unreliable
  • One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages. [13]


  • Design systems that are not easy to make errors, try to minimize opportunities of human error.
  • Well defined abstractions, APIs, and admin interface
  • Sandbox environments(isolated environments for untested code not to affect real users.)
  • Test thoroughly at all levels, from unit to integration tests and manual tests.
  • Allow quick and easy recovery. For example, make it fast to roll back configuration changes so unexpected bugs affect only a small subset of users.
  • Detailed and clear monitoring. It is called Telemetry in other disciplines (Once a rocket has left the ground, telemetry is essential for tracking what is happening).

How important is Reliability?

It is not just for nuclear power stations. Bugs in business cause lost productivity and lost revenue in e-commerce sites.

We may need to sacrifice reliability in some use cases, but we need to very careful when curring corners.

2. Scalability

  • Let say a system reliable and has 1 million users. This system might not be reliable in the future when the number of users increases to 10 million users because of scalability.
  • Scalability is the term to describe how a system deals with increased load.

Describing Load

Load parameters → Requests per second, the ratio of reads to writes in a database, the number of active users, the hit rate on a cache, and etc.

Twitter Example

  • Mainly two operations on Twitter. First, Post a Tweet(4.6k requests/sec on average, over 12k requests/sec at peak), and Secondly, View Home Timeline(300k requests/sec)

Two ways of implementation

  1. Post a tweet in a global collection, and when a user sends a get request, get followed users’ tweets.
  2. Maintain cache for each user timeline, and when a user posts a tweet, update the caches
  • If a celebrity posts a tweet, it is a problem to update the cache.
  • If a user with lower followers, post a tweet, it is better to use update cache

Twitter tends to use a mixture of these two implementations to give the best response time.

Describing Performance

  • If the load parameters increases and system resources(CPU, memory, network bandwidth) stays unchaged, the performance of the system will be slow down.
  • How much you need to increase system resources to give the same performance when load is increased?

In batch processing systems like Hadoop, we care about throughput — the number of recrods we can process per second.

In online systems, we care about response time — the time between the request of the client and receiving a response. Response time is not always the same. It changes over time, because of that we need to think as a distribution of values that we can measure.

  • It is better to use percentile rather than average. It gives a more clear idea. We can found outliers easily.
  • High percentiles of response times, also known as tail latencies are important because they directly affect the user’s eperience.
  • For an e-commerce website, it is important to work on tail latencies because the slowest request is caused by big data transfers, and these transfers are made by users who buy many products. So, companies need to be sure that valuable customers are happy.

Amazon has also observed that a 100 ms increase in response time reduces sales by 1%

  • Queueing delays often account for a large part of response time at high percentiles because of multitasking, so if we work on high percentiles then the processes in the low percentiles will wait a lot and average waiting time will increase. Due to this effect, it is important to measure response times on the client-side.
  • While testing, it is important to create artificial users that do not wait to send another request before getting a response for the previous request. Otherwise, the measurement will be skewed and will not fit with reality.

Approaches for Coping with Load

  • scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines)
  • It is expensive to build all system by using powerful machines. Mostly a mixture of two approaches is preferred.
  • Some systems call elastic, meaning that they can automatically add new computing resources when they detect a load increase, whereas other systems are scaled manually by a human. Elastic systems might be useful when the load increase is highly unpredictable, but manual systems are simpler and have fewer operational surprises.
  • It is more preferred to use a single node (scale-up) until scaling cost force you to make the system distributed. Because it is not easy to control a distributed system. However, with the new tools and better abstractions, it might be the default option to use distributed systems in the future.

3. Maintainability

  • More cost on maintainability(fixing bugs, adding new features and etc.) than development.
  • legacy systems — outdated platforms and technologies
  • Three Design Principles for not to create new zombie system


Responsibilities of Operations Teams

  • Monitoring health of the system
  • Tracking the cause of problems, such as system failures and performance issues
  • Keeping software and platforms up to date, includes security updates
  • Anticipating future problems and solving them before they occur e.g. capacity planning
  • Keeping the production environment stable

Good operability means making the routine tasks easy. Data systems can do various things to make routine tasks easy:

  • Avoiding dependecy on individual machines — a machine can be down, and system countuine to work
  • Good documentation — If I do X, Y will happen
  • Providing good default behavior and self-healing when it is appropriate
  • Predictable behavior, minimize surprises in the system.

Simplicity: Managing Complexity

  • When the software becomes more complex → difficult to understand the code for new engineers. It increases the cost of maintenance. It increases the risk of bugs in a change of code.
  • One of the best tools for making software simple is abstraction. — Reusable components!

Evolvability: Making Change Easy

  • After implementing a system, requirements will change a lot. New features, business priorities, legal changes, and etc.
  • Agile for adapting the changes faster.
  • TDD, Refactoring → We will think on the level of larger data systems. For example, refactoring Twitter architecture for home timelines from approach 1 to approach 2. That’s why we use Evolvability over Agility for data level.