Reliability, Scalability, and Maintainability

  • Today, the bigger problems are the amount of data, the speed at which data changes, and the complexity of the data. Applications facing these kinds of problems are called data-intensive applications.

Building Blocks of a Data-intensive Application

  • Databases — store data so that it can be found again later, e.g. PostgreSQL, Couchbase.
  • Caches — remember the results of expensive operations to speed up reads, e.g. Memcached (see the sketch after this list).
  • Search indexes — let users search data quickly by keyword or filter, e.g. Apache Lucene, Elasticsearch.
  • Stream processing — send messages to other processes to be handled asynchronously, e.g. Apache Kafka.
  • Batch processing — periodically crunch large amounts of accumulated data, e.g. Hadoop.
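
For instance, a cache “remembers” an expensive read roughly like this. A minimal read-through cache sketch; the dict stands in for a real cache client such as Memcached, and get_user_from_db is a hypothetical placeholder for an expensive query:

```python
import time

cache = {}        # stand-in for a real cache client such as Memcached
CACHE_TTL = 60    # keep entries for 60 seconds

def get_user_from_db(user_id):
    # hypothetical placeholder for an expensive database query
    time.sleep(0.1)
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    entry = cache.get(user_id)
    if entry is not None and entry[1] > time.time():
        return entry[0]                               # cache hit: fast path
    user = get_user_from_db(user_id)                  # cache miss: slow path
    cache[user_id] = (user, time.time() + CACHE_TTL)  # remember the result
    return user
```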

Thinking About Data Systems

  • Nowadays, applications have a wide variety of requirements for data storage and data processing. A single tool cannot fulfill all of those requirements by itself. That’s why we use an umbrella term like data systems: we break the work into tasks that a single tool can handle efficiently.
  • Application code is responsible for keeping the different tools in sync (see the sketch after this list).
  • You are not only an application developer but also a data system designer. Congrats! 🥳
  • Some of the questions that come with this new title are:
  • How do you ensure that the data stays correct and complete, even when things go wrong internally?
  • How do you provide consistently good performance to clients, even when parts of the data system are degraded?
  • How do you cope with many simultaneous users on peak days?
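
A minimal sketch of what “keeping tools in sync” means on the write path. The three classes are toy in-memory stand-ins, not real client APIs:

```python
class Database:
    def __init__(self):
        self.rows = {}
    def insert(self, key, row):
        self.rows[key] = row

class Cache:
    def __init__(self):
        self.entries = {}
    def invalidate(self, key):
        self.entries.pop(key, None)

class SearchIndex:
    def __init__(self):
        self.docs = {}
    def index(self, key, doc):
        self.docs[key] = doc

def save_article(db, cache, index, article):
    db.insert(article["id"], article)    # 1. write to the primary store
    cache.invalidate(article["id"])      # 2. drop the now-stale cache entry
    index.index(article["id"], article)  # 3. keep full-text search current
    # If a crash happens between these steps, the stores disagree --
    # exactly the "data stays correct and complete" question above.

save_article(Database(), Cache(), SearchIndex(),
             {"id": 1, "title": "Designing Data-Intensive Applications"})
```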

1. Reliability

Reliability for software

  1. The software performs the function that the user expected
  2. It can tolerate user mistakes
  3. The system prevents any unauthorized access.
  • Continue to work correctly, even when things go wrong.
  • The things that can go wrong are called faults.
  • We cannot make a system fault-tolerant against every possible kind of fault.
  • A fault is not the same as a failure. A fault is when one component of the system deviates from its spec; a failure is when the system as a whole stops providing the required service to the client.

Chaos Monkey

  • Netflix’s Chaos Monkey deliberately induces faults (e.g., by randomly killing processes or instances) to continually test that the system tolerates them.

Hardware Faults

  • Hard disks crash, RAM becomes faulty, the power grid has a blackout
  • MTTF (mean time to failure) is a statistical value: the average time a device of a given type runs before it fails.
  • Hard disks are reported as having an MTTF of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day (see the check below).
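
A quick back-of-the-envelope check of that claim, assuming disk failures are independent:

```python
# Expected disk failures per day = number of disks / MTTF in days.
disks = 10_000
for mttf_years in (10, 30, 50):
    mttf_days = mttf_years * 365
    print(f"MTTF {mttf_years:2d} years -> {disks / mttf_days:.2f} failures/day")
# MTTF 10 years -> 2.74/day, 30 years -> 0.91/day, 50 years -> 0.55/day:
# the 10-50 year range brackets roughly one dead disk per day.
```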

Software Faults

  • These faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults.

Examples

  1. A software bug that causes every instance of an application server to crash when specific input is given.
  2. A runaway process that uses up some shared resource — memory, disk space, CPU time
  3. A service that the system depends on slows down, becomes unresponsive, or starts returning corrupted responses.
  4. Cascading failures, when a small fault in a component causes a bigger fault in another component, and so on…
  • Such bugs “often lie dormant”: they are not easy to trigger and stay hidden until an unusual set of circumstances occurs.

Solutions

  • carefully thinking about assumptions and interactions in the system
  • process isolation
  • allowing processes to crash and restart (see the sketch after this list)
  • measuring, monitoring, and analyzing system behavior in production
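
A minimal sketch of the crash-and-restart idea using Python’s multiprocessing; the worker and its simulated bug are invented for illustration:

```python
import multiprocessing as mp
import random
import time

def worker():
    # process isolation: a crash here cannot corrupt the supervisor's state
    while True:
        time.sleep(0.1)
        if random.random() < 0.2:            # simulated latent software fault
            raise RuntimeError("worker hit a latent bug")

def supervise(max_restarts=3):
    for attempt in range(1, max_restarts + 1):
        proc = mp.Process(target=worker)
        proc.start()
        proc.join()                          # returns once the worker dies
        print(f"worker exited (code {proc.exitcode}); restart #{attempt}")

if __name__ == "__main__":
    supervise()
```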

Human Error

  • Humans are unreliable
  • One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages. [13]

Solutions

  • Design systems in a way that minimizes opportunities for error.
  • Well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing”.
  • Sandbox environments (isolated environments where people can experiment with untested code without affecting real users).
  • Test thoroughly at all levels, from unit to integration tests and manual tests.
  • Allow quick and easy recovery. For example, make it fast to roll back configuration changes so unexpected bugs affect only a small subset of users.
  • Set up detailed and clear monitoring. In other engineering disciplines this is called telemetry (once a rocket has left the ground, telemetry is essential for tracking what is happening).

2. Scalability

  • Say a system is reliable today with 1 million users. It might not stay reliable when the user base grows to 10 million, because the system may not scale to the increased load.
  • Scalability is the term to describe how a system deals with increased load.

Describing Load

Twitter Example

  • There are mainly two operations on Twitter: post tweet (4.6k requests/sec on average, over 12k requests/sec at peak) and view home timeline (300k requests/sec).

Two ways to implement it

  1. Insert new tweets into a global collection; when a user requests their home timeline, look up all the people they follow and fetch their tweets.
  2. Maintain a cache for each user’s home timeline; when a user posts a tweet, push it into the cached timeline of every follower.
  • If a celebrity posts a tweet, updating millions of follower caches (approach 2) is a problem.
  • If a user with few followers posts a tweet, updating the caches works well, so approach 2 is better for them (see the sketch below).
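
A minimal sketch of both approaches with in-memory stand-ins (hypothetical toy data structures; Twitter ended up with a hybrid of the two):

```python
from collections import defaultdict

tweets = []                      # approach 1: one global collection of tweets
follows = defaultdict(set)      # user -> set of accounts they follow
followers = defaultdict(set)    # author -> set of their followers
timelines = defaultdict(list)   # approach 2: per-user home timeline cache

def post_tweet_v1(author, text):
    tweets.append((author, text))                   # cheap write

def home_timeline_v1(user):
    # expensive read: scan the global collection on every timeline request
    return [(a, t) for a, t in tweets if a in follows[user]]

def post_tweet_v2(author, text):
    # fan-out on write: one append per follower -- painful for celebrities
    for follower in followers[author]:
        timelines[follower].append((author, text))

def home_timeline_v2(user):
    return timelines[user]                          # cheap read: cache lookup
```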

Describing Performance

  • If the load parameters increase while system resources (CPU, memory, network bandwidth) stay unchanged, the performance of the system degrades.
  • How much do you need to increase system resources to keep performance unchanged when the load increases?
  • It is better to use percentiles rather than the average: they give a clearer picture, and outliers are easy to spot (see the sketch after this list).
  • High percentiles of response times, also known as tail latencies, are important because they directly affect the users’ experience.
  • For an e-commerce website it is important to work on tail latencies: the slowest requests often involve the most data, and those requests often come from customers with many purchases. Companies need to make sure their most valuable customers are happy.
  • Queueing delays often account for a large part of the response time at high percentiles: a server can process only a small number of requests in parallel, so a few slow requests hold up the ones queued behind them (head-of-line blocking). Even if the queued requests are fast to process, the client sees a slow overall response. Due to this effect, it is important to measure response times on the client side.
  • While load testing, the artificial clients should keep sending requests independently of response times, without waiting for the previous response to complete. Otherwise the queues are kept artificially short and the measurements will not match reality.
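
A minimal sketch of computing tail latencies from client-side response times (synthetic data; nearest-rank percentile, one of several common definitions):

```python
import random

def percentile(samples, p):
    # nearest-rank percentile, p in [0, 100]
    ordered = sorted(samples)
    k = round(p / 100 * (len(ordered) - 1))
    return ordered[k]

# synthetic response times in ms: mostly fast, with a long slow tail
times = [random.expovariate(1 / 50) for _ in range(10_000)]

print(f"mean   = {sum(times) / len(times):7.1f} ms")  # the mean hides the tail
for p in (50, 95, 99, 99.9):
    print(f"p{p:<5} = {percentile(times, p):7.1f} ms")
```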

Approaches for Coping with Load

  • scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines)
  • It is expensive to build the whole system out of very powerful machines, so a mixture of the two approaches is usually preferred.
  • Some systems are called elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually by a human. Elastic systems are useful when the load is highly unpredictable, but manually scaled systems are simpler and have fewer operational surprises.
  • It is preferable to stay on a single node (scale up) until the cost of scaling forces you to make the system distributed, because distributed systems are harder to reason about and operate. However, with better tools and abstractions, distributed systems might become the default option in the future.

3. Maintainability

  • The majority of the cost of software is in ongoing maintenance (fixing bugs, keeping systems operational, adding new features, etc.), not in initial development.
  • legacy systems — systems stuck on outdated platforms and technologies
  • Three design principles to avoid creating a new zombie legacy system:

Operability

  • Monitoring health of the system
  • Tracking the cause of problems, such as system failures and performance issues
  • Keeping software and platforms up to date, including security updates
  • Anticipating future problems and solving them before they occur e.g. capacity planning
  • Keeping the production environment stable
  • Avoiding dependency on individual machines — the system should continue working even when a machine goes down
  • Good documentation — If I do X, Y will happen
  • Providing good default behavior and self-healing when it is appropriate
  • Predictable behavior, minimize surprises in the system.

Simplicity: Managing Complexity

  • When software becomes more complex → it is harder for new engineers to understand the code, which increases the cost of maintenance and the risk of introducing bugs when making a change.
  • One of the best tools for making software simple is abstraction. — Reusable components!

Evolvability: Making Change Easy

  • After a system is implemented, its requirements will keep changing: new features, shifting business priorities, legal changes, etc.
  • Agile practices help teams adapt to change faster.
  • TDD and refactoring → here we apply those ideas at the level of a larger data system. For example, refactoring Twitter’s home-timeline architecture from approach 1 to approach 2. That’s why we use the term evolvability rather than agility at the data-system level.
