By: Saurav

2018-02-07 08:52:00 UTC

Must know topics for system design interviews:


(list credit goes to Tushar Roy)

Vertical vs horizontal scaling ✓
CAP theorem ✓
Consistent Hashing
Optimistic vs pessimistic locking
Strong vs eventual consistency
RelationalDB vs NoSQL
Types of NoSQL
Key value
Wide column
Data center/racks/hosts
CPU/memory/Hard drives/Network bandwidth
Random vs sequential read/writes to disk
HTTP vs http2 vs WebSocket
TCP/IP model
ipv4 vs ipv6
DNS lookup
Http & TLS
Public key infrastructure and certificate authority(CA)
Symmetric vs asymmetric encryption
Load Balancer ✓
CDNs & Edges
Bloom filters and Count-Min sketch
Leader election
Design patterns and Object-oriented design
Virtual machines and containers
Pub-sub architecture
Multithreading, locks, synchronization, CAS(compare and set)

AWS: Database, EC2, Rails deployment
Solr, Elastic search
Amazon S3
Docker, Kubernetes, Mesos
Hadoop/Spark and HDFS

Detailed Notes(Not in order)


Vertical vs Horizontal Scaling

Horizontal scaling means that you scale by adding more machines to your pool of resources whereas Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine. Horizontal scaling usually requires a load-balancer program which is a middleware component in the standard 3 tier client-server architectural model.

With horizontal-scaling it is often easier to scale dynamically by adding more machines into the existing pool - Vertical-scaling is often limited to the capacity of a single machine, scaling beyond that capacity often involves downtime and comes with an upper limit.

When servers are clustered, the original server is being scaled out horizontally. If a cluster requires more resources to improve performance and provide high availability (HA), an administrator can scale out by adding more servers to the cluster.

Vertical scalability, on the other hand, increases capacity by adding more resources, such as more memory or an additional CPU, to a machine. Scaling vertically, which is also called scaling up, usually requires downtime while new resources are being added and have limits that are defined by hardware. When Amazon RDS customers need to scale vertically, for example, they can switch from a smaller to a bigger machine, but Amazon's largest RDS instance has only 68 GB of memory.

A good example of horizontal scaling is Cassandra, MongoDB, Couchbase .. and a good example of vertical scaling is MySQL - Amazon RDS (The cloud version of MySQL).It provides an easy way to scale vertically by switching from small to bigger machines. This process often involves downtime. However, there are also SQL based database services like EnterpriseDB(PostgreSQL) and Xeround(MYSQL) which provide horizontal scaling.

An important advantage of horizontal scalability is that it can provide administrators with the ability to increase capacity on the fly. Another advantage is that in theory, horizontal scalability is only limited by how many entities can be connected successfully. The distributed storage system Cassandra, for example, runs on top of hundreds of commodity nodes spread across different data centers. Because the commodity hardware is scaled out horizontally, Cassandra is fault tolerant and does not have a single point of failure (SPoF).

Horizontal scaling comes with overhead in the form of cluster setup, management, distributed programming, maintenance costs, and complexities. The design gets increasingly complex and programming model changes with distributed architecture.

Note Now that we understand those database architectures which provide vertical scaling have downtime due to a single point of failure and database architecture which allows horizontal scaling doesn't have any downtime, if for a use case, availability is utmost, we should choose the latter approach.

Note Also, in case of geographically distributed applications, a distributed architecture (horizontally scalable) is preferable.

In real world, a combination of both strategies are used.

Load Balancer

A load-balancer program is a middle-ware component in the standard 3 tier client-server architectural model.

Load-Balancer is responsible to distribute user requests (load) among the various back-end systems/machines/nodes in the cluster. Each of these back-end machines run a copy of your software and hence capable of servicing requests. This is just one of the various functions that load balancer may be performing. Another very common responsibility is "health-check" where the load balancer uses the "ping-echo" protocol or exchanges heartbeat messages with all the servers to ensure they are up and running fine.

Load-Balancer distributes the load by maintaining the state of each machine -- how many requests are being served by each machine, which machine is idle, which machine is over-loaded with queued requests etc. So the load balancing algorithm considers such things before redirecting the request to an appropriate server machine. It also takes into account the network overhead and might choose the server in the nearest data-center provided it is available to service the requests.

The request-response can also be done in 2 different ways:

Load Balancer always acts as an intermediary program for every response - In this case, once the request has been handed over to the server by the load balancer, any response from the server to the user will go through the load balancer. So the server machines that are actually servicing the request will never directly interface with the user machine running the client application. The machine hosting the load balancer program will be handling all the requests/responses to and from the user.

Load Balancer does not act as an intermediary for the responses coming from the server machine - In this case, once the server has received the request from load-balancer, it bypasses the load balancer and communicates it responses directly to the client.

Note: Setting up a cluster and load-balancer as a front-end interface to the client application does not really complete our scale-out architecture and design. There are still lots of critical questions to be answered and a number of key design decisions to be made which will affect the overall properties of our system.

To be contd.

CAP Theorem

Consistency means that data is the same across the cluster, so you can read or write to/from any node and get the same data.

Availability means the ability to access the cluster even if a node in the cluster goes down.

Partition(failure) Tolerance means that the cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).

In a distributed data store, at the time of network partition, you have to choose either Consistency or Availability and cannot get both". Newer NoSQL systems are trying to focus on Availability while traditional ACID databases had a higher focus on Consistency.

Of the CAP theorem’s Consistency, Availability, and Partition Tolerance, Partition Tolerance is mandatory in distributed systems. You cannot choose it. If your problem requires scale out (horizontal scaling i.e. distributed servers) and multi-server --- network partitions will happen. Instead of CAP, you should think about your availability in terms of yield (percent of requests answered successfully) and harvest (percent of required data actually included in the responses) and which of these two your system will sacrifice when failures happen.

Read more here and here

CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the partition is resolved).

CP - data is consistent between all nodes, and maintains partition tolerance (preventing data desync) by becoming unavailable when a node goes down.

AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition)

Note CA systems don't practically exist. CA is only possible if you are OK with a monolithic, single server database (maybe with replication but all data on one "failure block" - servers are not considered to partially fail).


Owned & Maintained by Saurav Prakash

If you like what you see, you can help me cover server costs or buy me a cup of coffee though donation :)