In this section, we employ a depth-first approach to teach you how to design systems. The systems described in Part 4 are as simple as possible in order to maximize the practical knowledge you can take with you. The inspiration for this section is the famous quote, “If you can do Hash maps and the Monte Carlo Method, you can solve any problem in computer science.”

🤥 OK, fine, you caught us—Einstein didn’t actually say this. However, this quote does come from a competitive programmer who made it to top 24 in the world, so that’s got to count for something!

This section takes the idea behind that quote even farther. If you understand the power of immutability, deduplication, enumeration, and how to get yourself unstuck, you can understand how to design any system. Part 4 is a master class in which you’ll get to observe the decision-making process of senior engineers as they build systems. Along the way, we’ll teach you additional tips and tricks to help you excel during an interview. Once you’ve finished reading Part 4, you’ll possess the knowledge you need to truly understand system design interviews. And you’ll see there was never any reason to be scared of designing systems in the first place.

Chapter One

Let’s get started. The best way to learn is to build, so we’ll begin by designing something that is:

Large enough, and
Does not have many interdependencies.

For something large enough, let’s look at TikTok and simplify it greatly. Here’s one way to do it:

Change from videos to pictures.

What other big element could we put aside for now?

Maybe pictures are too much?

Let’s take it even further, all the way through TinyURL to Pastebin. How so?

Let’s now look into Pastebin. It’s probably the easiest one of all the above.

To keep things simple, imagine we don’t need to edit Pastes; we just need to remove them sometimes.

This is important. Why?

Requirements

It helps to align on requirements before designing the system. A conversation with the interviewer flows better if functional and non-functional requirements are introduced.

Rule of thumb

Functional requirements (FRs) tend to map 1:1 to the functionality the service can expose to the outer world.

Functional Requirements

Non-Functional Requirements

Rule of thumb

Similar to DS&A questions, it’s easy to make assumptions about the problem. Think of the requirements stage as a fluid conversation between yourself and the interviewer. While you will drive the conversation, you should always check in with them to see if you’ve missed anything. Many people forget that they can literally ask the interviewer if they’ve missed any important requirements.

How to Get Yourself Unstuck – Tip #2:

Not sure if you got all the requirements? Think you’re missing something important, but you don’t know what? Turn your requirements gathering into a conversation and get the interviewer involved. Ask your interviewer: “Are there any important requirements you have in mind that I’ve overlooked?” This is totally allowed!

Estimates

While not strictly necessary, it’s worth it to be able to answer questions about the scale of a service in under a minute. This is known in the industry as the “back of the envelope” calculations of load estimates.

What’s the point of doing estimates at all in these problems?

Both in an interview and in real-life design, it’s useful to be able to estimate orders of magnitude, especially when it comes to users, data volumes, or costs.

An important note: There’s no “right” way to do these estimates, but there are “wrong” ways. So, in practice, this is one of the aspects of the conversation where it’s more important to focus on being “not wrong” rather than on being “right.”

A back-of-the-envelope estimate here could be along the following lines.

Estimates often can (and should!) be validated against real-life numbers, which are frequently public. Test your estimation skills here:

How to Get Yourself Unstuck – Tip #3:

Some interviewers hate this step and really don’t want to see you stumble through 5th-grade math calculations for 15 minutes. Similar to step 2, ask your interviewer if they’d like to see some calculations before jumping in and starting them—you might be able to skip these entirely if the interviewer doesn’t care about it! With quite a few system design interviews, you’d be fine as long as you do mention that the system you plan to present will be durable, resilient, and scalable.

How to Get Yourself Unstuck – Tip #4:

Sometimes it’s difficult to know what to calculate during this part of the interview. If you’ve already confirmed that the interviewer wants to see calculations as mentioned in Tip #3, then follow these rough guides to get the basic estimates for any system.

Storage Estimation:

Storage = daily data used by 1 user * DAU count * length of time to store data

Bandwidth Estimation:

Bandwidth per second = (daily data used by 1 user * DAU count ) / total seconds in a day

Also, there are roughly 100K seconds in a day, which is five orders of magnitude. If your API gateway expects to see a billion requests on a busy day, that’s approximately 10K requests per second, as 9 zeroes minus 5 zeroes is 4 zeroes. The true figure is ~15% larger, as 100K / (60 * 60 * 24) is around 1.15.

Let’s start designing!

Intuitive things first:

We need to figure out how to store Pastes. It might occur to the astute reader that all we’re doing is sharing files (aka the “Pastes”) between users. So if we are just sharing files, why not just have the “pasting” action simply copy the file with a “Paste” to a remote folder?
Folder sharing properties are a little niche and OS specific. Instead of dealing with them for each OS, perhaps we can access a folder from a tiny Python script?
Pastes always have those funky unique names in their URL. How do we generate those names for Pastes?

The shared folders analogy, from (1), immediately takes us about half way to designing a document store: a non-relational object persistence layer. It’s broadly acceptable to refer to it these days as a NoSQL database, even though along many axes it is nearly the exact opposite. It would not support transactions or joins, for example. On the other hand, for the purposes of designing Pastebin, neither joins nor transactions are required. Plus, we want the benefits of pagination. This allows us to leverage the strong sides of document stores / NoSQL databases, while not suffering from any downsides of them. The strong sides include features such as replication, automatic failover, and out-of-the-box horizontal scaling.

Remember

The takeaway message here is that document stores truly are not very different from reliable shared folders, optimized to be used by software through APIs calls (2), not by humans via a command line or via drag and drop. They also abstract away the underlying OS details (as well as cloud / containerization / etc.), allowing their user to focus on the business logic of their application instead.

The unique names generation task, (3), is a well-known problem by itself. It is sometimes referred to as KGS, the Key Generation Service. We will tackle this problem later in Part 4; for now it will suffice to say that as long as the keys can be long enough, it is straightforward to reduce the risks of collision to zero, or to an infinitesimally small number.

More Real Interview Stuff

The previous paragraphs outlined the key concepts that will help you reason about the problem. Depending on the interviewer, different parts of the problem might become more or less important with respect to getting an excellent score.

In this section we will cover some of those parts, from first principles, and loosely in the order of likelihood that you’ll be asked about them.

API

What are we not covering, but could be asked about for this Pastebin problem?

Caching. How to make stuff load faster? How to load our servers less?
How critical is deletion? How important is it to remove everything quickly (say, it’s private data, by mistake)?
Private sharing? How to authenticate? How to authorize? Is it possible to change sharing settings of a paste on the fly?
Sharing analytics? / Telemetry.
GDPR and other regulations to observe.
Authentication. By the way, ****If you are asked an Authentication question in the context of the Pastebin problem, congratulations. This means you have likely nailed everything else the interviewer had to ask you about.

Chapter Two

Time to look into mutability.

Remember

As we talked about in Chapter One, most system design problems are trivial when reduced to an immutable case.

A good way to dive into the depths of system design is to attack a problem that is all about mutability.

In other words, we need to look into a problem where the objects can be changed. The perfect problem to illustrate this point would focus on mutability and nothing else. In search of such a problem, let’s cover two warm-up ones first.

Warm-up Problem One: Deduplication

The problem statement for deduplication is deceptively simple:

The only way to call the service is: “Did you see X before?”
The service answers with is true / false.
The first time the service is called for X, it returns false.
If the service is called with X again, it should return true.

Without loss of generality (WLOG), let’s assume the X-es are strings of some reasonable length, say, under two hundred bytes. (If you want to learn more about the math-y term “WLOG,” you can do that here.)

Why is this important?

In computer science terms (or, rather, in “algorithms and data structures” terms), the problem of deduplication is effectively about building a largely distributed append-only hash set.

Remember

While system design interviews are not the same as algorithms and data structures ones, never forget that, ultimately, it’s all about writing code that works! The difference is this: doing vs. directing. In a coding interview, you’re literally writing the code. In a system design interview, your goal is to assist those writing the code by providing architecture / direction / guidance.

Let’s put together some requirements first.

Keep in mind, we work on requirements not because the script of the interview contains such a section, but because they help us to better reason about our solution.

Consistency

Availability

Persistence and Durability

What the above effectively tells us is:

The solution should be distributed. The total amount of data to store would not be huge. It may even fit in memory. Therefore, we will only be looking at a few machines—not hundreds.

Tip

Memory is fast, disk is slow. In this problem, since every record is a small number of bytes, we may be able to fit everything in memory.

The solution should be resilient. The main reason we need more than one machine is to ensure uninterrupted operation of the system if one of them is unavailable.

Now would be a good time to introduce the simple idea of R + W > N.

This is a very simple trick, and while it is not often used directly in production systems, it’s incredibly useful to know when learning about system design.

The idea is this:

If we have N machines in total,
We consider a write successful when the write has succeeded on W machines,
And a read is successful if at least one of R queried machines contains the data. Of course, if more than one machine has the data, the response from all of the machines that have this data should be identical,
Then we need R + W to be greater than N for our system to be consistent.

Keep in mind, we are still looking at a semi-immutable problem of deduplication: we can only “get to see” every possible input X; we can not “unsee” an X.

For a positive example, consider N = 5, W = 3, R = 3. What this means for our deduplication solution is:

There are N=five servers in total.
When we “write,” i.e., when we see an X we know is new, we:

Take a note that “X was seen” on at least W=three servers (you can think of choosing them randomly now), and
Consider the write successful only after these three servers have confirmed this write as successful.

When we “read,” i.e., when we need to see if a given X was seen before or not:

We ask R=three servers whether they have seen this X,
If at least one of them responds with a “yes,” we know the answer is “yes,”
And if and only if all three servers respond with a “no,” we know definitively that this X was not seen before.
These R=three machines can still be chosen randomly.

Tip

Take a moment to internalize why the above would work when R + W is greater than N, and why it may fail if R + W is equal to or less than N.

Of course, in practice many more tricks can be used to improve the solution. (These ideas, while important, are out of scope of this trivial example problem):

The redundancy factor is (N-W) for writes and (N-R) for reads; as long as there are enough machines to pick W of them to write to we can write, and as long as there are enough machines to pick R to read from we can read. The important part is that the write machines and the read machines intersect by at least one, so W+R should be strictly greater than N. This is because if W+R is equal to N, or is less than N, there's no guarantee that a set of W randomly picked machines and a set of R randomly picked machines will intersect. And if they do not intersect then the read will miss the data that was just written!

Tip

Always pay attention to solutions that keep working “just fine” if some nodes are down. Since things are going to go wrong, it’s best to have systems that can deal with that without issue, so you don’t have to! 😎

Last but not least: The above solution would use each machine’s RAM or disk capacity at its (W / N) fraction. In other words:

If one machine can only handle the key space of up to a billion X-es,
And you need to handle the key space of five billion X-es,

That is, five times that amount that one machine can handle.

Then you need (W / N) to be at most 0.2 (or at most 20%).

Always good to allow for some extra room, so, say, we plan for ~16%, not 20%.

You can go with W = 1, N = 6, but then R would have to be the whole 6. It's unwise to design a distributed system with R=N though. If just one machine dies, we can not longer have the total R of them to respond, and thus our reads will be stalled.

In other words, if you only write to one machine, you need all machines to be up and running while you are reading. Slow, and not safe with respect to machine failures.

You can go with W = 2, N = 12, R = 12.

This is acceptable since R+W, which is 12+2=14, is greater than N, which is 12. The problem persists however: losing even one machine makes writes impossible, since, with R=N, if only 11 of 12 machines are operational it is impossible to read from "any" distinct 12 machines.

Or you can go with some W = 4, N = 24, R = 22.

This is also acceptable since R+W is greater than N. And now, for reads to work, we can lose one or two machines safely, since both (N-1)=23 and (N-2)=22 is at least R.

Obviously, if you assign parts of the key space (i.e., different values of X) to N machines intelligently, not randomly, you can achieve better RAM and disk utilization. But the example is still quite illustrative: we now know that without any custom knowledge on how to design a system, we can build a resilient, highly available, multi-node deduplication engine, from first principles!

Before we declare ourselves done with the warm-up problem, let’s consider its extension.

Warm-up Problem Two: The Enumeration Problem

This problem statement is very similar to the above:

The only way to call the service is: “Did you see X before?”
The service responds with an integer: The unique, 1-based, index of this X.
From this moment on, for this X the very same index is returned.
The returned indexes should not have gaps. That is, they should go as 1, 2, 3, etc.

At first glance, enumeration is identical to deduplication, but instead of a hash set, we need a hash map.

But this first glance is deceptive. This enumeration problem is substantially harder than the deduplication one.

The challenge is in assigning IDs in a fixed order (as in: 1, 2, 3, and so on). In order to stamp the new, unique index, all other calls should wait, as this very next unique index “distributed variable” is a shared resource.

A real-life analogy to help you understand enumeration

A real-life analogy might help.

In the case of enumeration, the very same (R + W + 1) > N trick would not work, at least directly.

Because:

Shared state is the root of all problems

Tip

The more massive the shared state has to be, the more difficult it is to solve the problem in a generic fashion. And the enumeration problem is a perfect illustration of why exactly it is so difficult.

When attacking the Enumeration problem, the concept of a read-write ratio comes into play.

If there are only 1,000,000 distinct X-es in the world, eventually—and rather quickly—all of them will become seen. Since we don’t need to invalidate indexes, each particular node of our service could “memorize by heart” the indexes for these 1,000,000 X-es, after which the problem can be treated as an immutable, read-only one. And, as you know by now, immutable problems are easy to solve.

Exhausting the set of unique inputs is quite a rare case. But it does happen often that the vast majority of inputs (say, 99.9%) would be the ones that have been already seen. Since inquiring about the already seen X is an immutable operation, this would result in a 999:1 read-to-write ratio (which is safe to round to 1000:1, for all intents and purposes).

Remember (from the two warm-up problems):

Deduplication is the simplest case of a mutable problem.
Enumeration looks like a trivial extension, but it is far, far trickier.
The CAP theorem is a bitch blast, and it is important to understand the boundaries of what’s possible (and to ask clarification questions on the functional requirements step!).

Real problem: Unique ID Generation, aka the Key Generation Service (KGS)

The problem of generating unique IDs serves as an excellent real-life illustration for the ideas we’ve discussed so far. Unlike the two warm-ups above, Unique ID Generation is a real problem. And it is commonly used as an interview question, including this mock interview hosted by interviewing.io.

Requirements

We will, of course, be solving the problem in a distributed setting. In other words, the solution should be highly available and consistent. This means that we will be using several nodes that communicate with each other. This way, if some nodes die and/or become unresponsive, temporarily or permanently, the system remains:

Highly available — it keeps functioning, and
Consistent — it keeps satisfying its functional and non-functional requirements.

Pragmatically speaking, the problem is quite straightforward, as long as we answer three important clarification questions. All three fit solidly in the Functional Requirements (FRs) bucket.

Remember

During a system design interview, we focus on functional requirements not to please our interviewers, but to set the table for ourselves and to stack the deck of the upcoming story in our favor!

The three FR questions are:

Is it a strict requirement for the IDs to not to repeat?
How large is the ID space?
How many unique IDs per second should we generate (and what is the acceptable latency)?

Let’s dive deeper into why these are the important questions.

Question #1, about whether the IDs are allowed to repeat at all, is quite binary. Either generating a non-unique ID is absolutely unacceptable, or it is allowed once in a blue moon.

To answer this from a practical standpoint, consider the cost of generating a non-unique ID for the business. If this cost is bounded by a reasonably small amount, then, in practice, it is not entirely unacceptable to have a non-unique ID from time to time.

Consider Amazon, AirBnb, or Expedia order ID generation. What can go wrong if the ID is not unique? A customer might have a ruined experience. What’s the cost of this for the business? Probably in the order of dozens or hundreds of dollars; this cost is not likely to be above, say, $10,000. Thus, if the likelihood of a generated ID to not be unique is such that a duplicate may emerge once in ~10 or ~50 years, the “daily” cost of this “imperfect” implementation is less than what the company will pay for organic sweetener in the office. As a result, “fixing” this “problem” may not be worth it at all.

A strong user- and product-facing candidate should be able to articulate the above.

With this said, if the IDs are allowed to be not unique once in a while, the problem is sort of non-existent: generate a random 128-bit number, and you’re good for ages. In fact, for most practical intents and purposes, even a random 64-bit number would do.

Back-of-the-envelope math, which, unlike most SD interviews, actually is important here!

With the above deep dive being said—and you should somehow indicate to your interviewer that you understand it!—let’s approach the Unique ID generation problem under the assumption that the IDs should be guaranteed to be unique.

Question #2 effectively boils down to “64 bits vs. 128 bits.” 128 bits, as shown above, are a lot. For the record, a standard UUID is of 128 bits, although a quick Googling shows it “only” contains 122 bits of entropy.

And question #3 is only important in conjunction with question #2, as what truly matters is the rate at which the key space is being exhausted. Simply put, the very problem is almost identical if we add 10 more bits to the IDs and generate the IDs 1,000 times faster.

Tip

Though it is rare, sometimes system design interviews are indeed about doing math with estimates. A good interviewer will test your intuition on the above, and human intuition is notoriously unreliable when it comes to very small and very large numbers in general and to probabilities in particular. So if you skipped the above math-y deep dive, at least read up on the Birthday Paradox. 😊

Solution

So, effectively, the true problem is:

Do we try to utilize the key space effectively?
Or are we allowed to use a larger key space?

Really, that’s it.

For the unique key generation problem, just do a bit of trivial (*) math.

(*) Professors tend to refer to anything that’s semi-straightforward to follow as trivial. Thankfully, our interviewers are not that picky. Although the above is relatively standard math to someone who is fluent in probability and statistics, and, chances are, your interviewer may well be—system design interviews are often conducted by more senior people, and more senior people work with data more often than average.

If we need to utilize the key space effectively, we’re fully in the realm of distributed consensus, etc. Hard problem.

If we can be loose, we just need to employ a few simple tricks to minimize our risks. Easy problem.

Solution to the Easy Problem

First of all, if you are looking at the easy problem, it should be noted during the interview that just generating the IDs randomly would solve it for all intents and purposes.

Go ahead and explain this to your interviewer. Explain that even if each collision costs the company $10,000, it is still not worth an extra minute of your time, as an engineer employed by this company, to be solving this problem.

Because this is The Correct Answer, if this unique ID generation problem presents itself in the real life setting.

(Not to mention that any large company that employs you expects to make considerable money from you; for top IT giants this figure is over one million dollars per year per engineer, although things did turn south since 2020.)

Of course, after you cover this, the interviewer will ask you to actually solve the problem, in a scientifically and engineering-ly precise way, in an “academic” setting. This is what the next section is about.

But before you get too fancy, it won’t hurt to suggest a few trivial improvements. In short, having a lot of key space allows for plenty of “cheap” tricks.

For example, each node of your now-distributed service can just assign the IDs in the 1, 2, 3, … fashion. You can then prepend each generated ID with something that contains:

a) The ID of the machine (node/server) that has generated this ID.

b) The timestamp when this particular machine has started assigning IDs.

If you have a large key space (say, 128 bits), you can easily allocate ~32 bits for the ID of the machine (an IPv4 address) and another ~32 bits for the Unix timestamp, in seconds, when this particular machine has started its “session.” This leaves you with plenty of indexes per each node: 2^64, to be precise, as you’ve “used up” 64 bits of 128.

And if one of your machines suddenly runs out of whatever range you have allocated for it, you can “just” restart that machine. It will get itself a new prefix and continue afresh. Of course, the restart itself is not necessary, as the machine can just (automatically) change the value of that variable it uses internally.

Solution to the Hard Problem

The above solution should be good enough to pass the interview about Unique ID generation. Before we get to solve the Hard version of it, let’s add a few more assumptions. They do, by the way, qualify for Non-Functional Requirements. Here they are:

a) The IDs generated should not just be sequential.

b) The API is such that clients can receive IDs in bulk.

For (a), a simple solution comes from number theory. Security specialists know it. Just perform some relatively trivial mathematical function before returning the ID to the user, while keeping this function reversible, so that when F(x) = y, some G(y) = x. A Modular Multiplicative Inverse might do the job just fine.

And (b) is exactly where the solution to the broader problem comes into play!

Imagine that you have designed a system where the “users” can “request” IDs in bulk. (We will explain how to do it in a minute.) The trick is that each node of your system is effectively performing the same task: it needs to request a large “bulk” of IDs from some “central authority,” and then distribute these large bulks in smaller quantities to its “retail customers.”

It’s as simple and as genius as that.

To arrive at a “proper” solution, the “large” bulks should not be too large. Because, if a node dies, it is impossible to tell how much of its latest bulk it has “distributed” so far. In other words, when a node dies, some part of a key space will be wasted.

Moreover, if the node does not die, but terminates gracefully, it may then “return” the “unused excess” of its bulk to the system, which sounds very fair. But it may present a challenge to the “bulk provider service” to deal with “partial” bulks. It may well be better to deliberately discard the “rest of the bulk,” even if the node that has previously requested this bulk is terminating gracefully.

Simply put, the bulk size can be calibrated such that each node requests a new “bulk” to “distribute” approximately every several seconds or maybe several dozen seconds. That’s the sweet spot. Make this number (this bulk size or this time window—they can now be used interchangeably) too tight, and the “central authority” would need to handle more load than needed. Make it too loose, and, especially if machines do die often (or need to be migrated often), a non-negligible portion of the key space would be wasted unnecessarily.

Now, to the gist of the problem. The very “central authority”:

Can not truly be “central,” but should be “distributed,” and
Actually deals with very little traffic, as each node literally talks to it once every several seconds or dozen seconds.

So, in a true System Design fashion, use some standard distributed component there, FFS!

It could be Google Spanner, because, with this traffic, it will be cheap.
It could be anything with internal Leader Elections implementation, such as:

ZooKeeper,
or even Kubernetes/Consul itself!

It could be any distributed DB, as long as it supports strongly serializable transactions over a reasonable number of replicas.

Literally, both MySQL and Postgres would do, as long as you have a quorum of some N = 5 or 7 or 11 nodes. (We usually use an odd number of nodes for leader elections, so that they are impossible to break into two equally sized parts).
Kafka is another option, as you would literally be publishing single-digit messages per minute into one topic of one partition!

Even if each “call” to “reserve” another “bulk” takes a second or two to settle—because the request would have to travel across the globe, possibly more than once, to guarantee strong consistency—it’s perfectly fine with your design, as long as each node requests a new “bulk” those few seconds in advance, before it runs out of its “current” “bulk.”

That’s it for ID generation. Relatively simple.

Remember

Any design for Unique ID generation that attempts to store Every Single Issued ID is a bad idea.

Why is it a bad idea?

Because, as we know by now, if the total number of IDs to issue is small, well, there is no problem.
And if the total number of IDs to issue is large, the waste on the DB resources will be colossal, to say the least.

More junior, and/or less experienced engineers often make this mistake. As they say: “Forewarned is forearmed.” At least, store “bulks” only, not each issued ID. Better yet: only store the “latest issued bulk ID” or the “next available bulk ID.” That’s it—relatively simple still.

Want to know exactly what a FAANG System Design interviewer looks for? Get detailed feedback on your system design skills from our professional interviewers.

See available times

Chapter Three

All right, we now know enough about things like mutability, deduplication, and basic requirement gathering skills to be dangerous. Let’s now tackle a larger problem—not too complicated, but it’ll stretch us a little bit.

Design AOL Instant Messenger

This is almost certainly a system design question you haven’t encountered, and depending on when you were born, it might not even be an application you’ve seen before! 🤯 No worries, we can think of this as one of the simplest chat apps imaginable. Keeping it simple allows us to design any chat app from first principles and provides a solid structure for us to work with, regardless of what chat app we are asked to design.

For the younger readers who aren’t familiar with the AOL Instant Messenger (AIM) application before, here are a few snapshots so you can get a sense of the basic functionality of the application.

Requirements

Yikes! Those font choices clearly weren’t vetted by a UX designer! 😂 OK, so clearly the app has a few specific functionalities. A requirements list may look something like this:

Functional Requirements

Ability to sign up for AOL Instant Messenger
Ability to generate a screen name
Ability to authenticate yourself with a username and password
Ability to save username/password securely & auto-login
Ability to add a “buddy” to talk to on AIM
Ability to select a buddy and send chat messages to them in real-time
Ability to block, warn, and delete buddy

Wow, that’s still quite a large list for such a simple application. For now let’s ignore the authentication pieces—while they are important, they don’t really make up the core of this application. Let’s spend some time focusing on the actual messaging parts and assume we can log in already.

As we did in the previous chapter, we should discuss the scale of our app in terms of functional requirements. Again, this isn’t just strictly necessary for an interview, it’s also a useful tool to help frame what we actually care about in the design of the system.

Non-Functional Requirements

Ranking the order of importance with functional and non-functional requirements is silly because a single requirement not being filled will lead to a bad system.

Tip

Still, for any given design problem, there is usually at least one particularly relevant requirement. This means that there’s likely to be one requirement which is the linchpin of the system.

Question

Generally, what do you think are the most critical parts to an app that allow you to talk to other people? Seriously, think about it. I’ll be here when you get back.

Did you think about it? I mean it! Think for a moment about this. What are things that are just “givens” about a chat app? What have you come to expect when it comes to these types of apps?

OK, so you might have phrased it differently, but if you thought something like, “It’s important to get messages in a timely manner,” or maybe “it’s important to always be able to send a message when you want to,” or possibly even “It’s important to be able to access my messages when I want to see them,” then fantastic—you’re absolutely right.

In plain english, we’ve just named our most important non-functional requirements: Consistency, Availability, and Durability.

Availability

Durability

Consistency

This makes up the bulk of the most important requirements in our system!

Solution Design

Awesome, now that we’ve discussed the requirements, much of our work has already been done. Let’s talk about what the system actually looks like. 🧐

How to Get Yourself Unstuck – Tip #8:

It’s common to try to detail every part of the system’s design like you see people do on YouTube. Realistically, these videos are scripted, and the drawings are fast-forwarded. In a real interview, you won’t have time to actually detail every part of the system, and that’s OK! It’s expected that you’ll abstract away pieces that aren’t particularly relevant. It’s a good practice to call out what you’re abstracting, but just focus on the general data flow of the system.

In our diagram we need to show how data is getting from one client to another client and what it does in the system. The diagram can start simple and then evolve if the interviewer wants you to “zoom in” and discuss specific parts. In our example though, the diagram could be as simple as this:

Data flow:

Our first user (AOL Desktop User A) sends a message to our servers, and we have it hit a load balancer to avoid slamming individual servers.
We have a fleet of servers waiting to work. Upon receiving a message, they store the data in the database, which is also behind a load balancer.
The database piece is left intentionally simple, but we can deep dive into it if the interviewer wants us to. This alone can be an hour-long discussion, so we recommend that you keep this simple and call out that you’re handwaving this for now so that you can finish the rest of the system first.
The second user (AOL Desktop User B) can get data by accessing our servers (through the load balancer) and then getting sent data back directly.

Now that we have an idea of how the system has data flowing through it, we might want to discuss one other critical piece. The sending of the data makes sense from the diagram alone, but how does that second user know it’s time to receive data? Are they checking the service every few seconds to see if something new is present for them? Is there some way we can alert the AOL user that they have a new message? The app is real-time, so how do we ensure that messages are delivered as promptly as possible? These questions are answered by knowing a bit about key ways computers can interact with other computers. There are three major types of ways computers talk to one another: Long Polling, Short Polling, and WebSockets.

These can be best explained through an analogy.

Short Polling, Long Polling, and WebSockets—are we there yet?

Short Polling

Remember when you were younger and you’d always ask the question, “Are we there yet?” Repeatedly asking this same question every few minutes is a good example of short polling. Over a short period of time, we are constantly asking, “Are we there yet? Are we there yet? Are we there yet?”

This is short polling in a nutshell. We repeatedly bombard the servers with the question, “Are there new messages for me?” Just as this was a bad idea to do in your parents’ car when you were a child, it’s a bad strategy in basically every system design structure. It’s annoying to the servers (causes extra processing power) and doesn’t scale (can you imagine the amount of resources wasted by 1 million users all asking our system this every few milliseconds?).

Long Polling

Did you ever have a forgetful aunt or uncle? Imagine that you’re a little older now and driving on a roadtrip with that person. When you both get in the car, you ask them to tell you when you all reach a particular spot on the trip. This could be fine for a shorter trip because they’d probably remember to tell you. But on a really long trip, they might forget to tell you when you reached that particular spot, perhaps because the car ride took too long and they’d simply forgotten you ever even asked them to tell you.

This, again, is an analogy for long polling. Our client reaches out and asks the server to tell us when something new has updated for us, which helps us successfully avoid the waste of resources. But this fails when the time between responses can be particularly long (think more than a few minutes). So long polling can be good when we are expecting data repeatedly, but it’s not great when it could be a while before we get the data.

WebSockets

Finally, let’s imagine a car ride with our best friend. We ask them to tell us once we’ve come to a particular spot on our journey. They say, “Sure, I’ll tell you when we get there,” and then we patiently wait for them to tell us. They aren’t forgetful, so we trust them to let us know no matter the length of the car trip. This, in essence, is WebSockets.

A key part that distinguishes WebSockets from long polling is that with WebSockets we can have arbitrary lengths of time pass without needing to worry about the connection timing out and the server “forgetting” us. We also have two-way communication, so the client can talk to the server, but the server can also directly talk to the client (whereas in long polling the server can “forget” how to talk to them).

For a chat app, out of these three choices, a case could be made for either long polling or WebSockets, but WebSockets is clearly a better design choice since we don’t know how long it’ll be between messages being sent and arriving.

Conclusion

This is the core of every chat app, and hopefully it provides you with a good description of how we can model future problems. Start with requirements, decide what matters most for the application with non-functional requirements, show the flow of data in your app with a diagram (maybe a couple!), and then iron out the details.

Though we certainly could go deeper into any part of the design and unpack things further, this represents a good overview of how the first major chat apps were designed and shows the basic model that all chat apps since that time have been built on.

What did we leave out?

Chapter Four

The fourth and final problem we will cover in Part 4 is what is commonly referred to as “Design Ticketmaster”

There are two major parts to this problem when it is used as an interview question:

Consistency under concurrency. Ticketmaster, when seen in this light, is just an extreme version of a service used to book hotels or airline tickets. There just are vastly more seats in a popular show than in a hotel or in an airplane, so the system should handle far greater concurrency, and not crack under pressure.
Real-time updates. Savvy candidates, who crush the first part, are then tortured further by the second one: handle people holding on, while being put onto an “online waitlist” by the service. Granted, an online waitlist is not the best product decision for Ticketmaster (”You are in line, your number is 12,345, and if all these people before you refuse to complete their transactions, you’ll be given your chance”). Nonetheless, it is a separate sub-problem that enables probing the candidate’s skills in quite a few more unique ways.

Let’s focus on the first sub-problem. The real-time part is a special case that deserves dedicated attention.

How to Get Yourself Unstuck – Tip #9:

As usual, we begin from requirements. In fact, it’s best to postulate the problem right in the form of requirements! This way we also develop a habit of thinking of system design problems from the grounds of how to solve them, as asking the right questions is at least half of solving them.

Don’t worry if you don’t know in detail what the Ticketmaster problem is about. In fact, for any problem, if you don’t fully understand its statement, jump straight to functional requirements, and clarify them—with your interviewer or with your peers—until they are crystal clear!

Requirements

Solution

Here's our proposed solution

How to Get Yourself Unstuck – Tip #10:

The best way to reason about the value of consistency is to think of what could possibly go wrong.

What could possibly go wrong in the Ticketmaster system?

These things could go wrong:

SQL Schema Design

As a matter of fact, Ticketmaster is one of the few system design problems where designing SQL tables schema is an essential part of the interview.

For most problems, SQL tables schema is a bit like the API spec. For most problems, SQL tables schema is important, but not too important. If you have more important topics to cover, go on; and if you need to fill time with something not too valuable and information-dense, do that with the SQL schema overview or an API spec.

Tip

With Ticketmaster, since the gist of the problem is in managing concurrency, an SQL-first design is what interviewers tend to like a lot.

Strictly speaking, there are several points of potential conflict / concurrency / race condition. All of them can be handled nicely with an RDBMS, a SQL database, as a source of truth. These points are:

Each unoccupied seat can only be “locked” into one order at a time.
Each order can only be completed (paid for) within a specific time interval.
Corner case: If the payment API took too long, the seat lock may have been released by the time the payment went through, in which case the seat may already be claimed as part of another order, and the order should be refunded.
Availability + consistency: If the Ticketmaster server goes down between (1) and (2), or between (2) and (3), the freshly restarted server should be able to pick up the state of the system from the point where the first server dropped the ball.

Here, “order” (as in the “shopping cart in progress”) is a good way to refer to the user session of completing the payment for a set of seats they have selected.

Exactly once seats to orders

(1) is the canonical case of a SQL transaction: UPDATE IF. SysDesign-wise, this statement alone is sufficient. Refer to your favorite SQL tutorial for more details. In practice, at mid-level / senior interviews, your interviewer will not judge you harshly if you make it clear you understand that an SQL engine can handle this concurrency issue for you; it is not critical to know the exact syntax.

Timestamps instead of cleanup jobs

A neat trick for (2) is to avoid timers and overall timed or scheduled (”cron”) “cleanup” jobs when you don’t need them. Instead, just write the “lock expiration timestamp” into the respective column. The “lock expiration timestamp” is simply the current time, at the moment of transaction, plus the fixed delta (5 or 10 minutes).

You probably want to make it 5.5 or 10.5 minutes, not just five minutes sharp or ten minutes sharp, to be nice to your users; final seconds ticking down is a negative user experience, and the payments API may also take several seconds to respond.

In this design, the “is seat available” condition, on the SQL level, is not just “seat is not booked yet, and seat is not part of an open order,” but “seat is not booked yet, there is no order that is not yet expired of which this seat is part of.” The last condition may be easier to understand if it’s phrased as “there is no active order created in the past five/ten minutes that contains this seat,” but it’s a good habit to add the expiration time while writing to the DB, not subtract it while querying.

Keeping the payments subsystem at bay

The corner case (3) is very similar to (2).

We just give the payments API some “time window” within which it is supposed to respond. We can’t wait forever, although this really is the corner case.

Most large-scale products would have several payment providers to choose from, hidden behind some API gateway, so that a different team would be responsible for making sure the response SLA from this payment provider is right.

CAP, anyone?

And for (4), a savvy reader, as well as a savvy interviewer, would immediately ask: So, if you claim to have both availability and consistency, you’re sacrificing partition tolerance, right? Yes, this is absolutely true from the standpoint of the CAP theorem. Such an argument is rather theoretical though. In practice, it is a good thing that our system scales horizontally… as it gives us the indulgence to ignore the CAP theorem altogether!

Outlaw idea

If your system shards horizontally, just ignore the "P" part of CAP while designing, and focus 100% on "C" and "A" parts. Just mention that your data has “no inner dependencies,” that you plan to “replicate your DBs in a leader-follower fashion,” and your service will be in great shape in production.

Of course, our databases have replicas, and, broadly speaking, network split is a case of one or several of our nodes becoming unavailable.

But when our nodes become unavailable, we just take them out of rotation and spin up the same service from the backup DB shard.

Admittedly, the above is far more difficult than it sounds. But the very argument holds true. Without loss of generality, consider the “one show one DB” design. This DB would be leader-follower replicated, likely to one or two replicas. If a DB is down, we can, automagically, take one of these replicas, promote it to a leader, and restart the system, as if nothing happened.

This would require synchronous DB replication, so that whatever tickets we have already sold are correctly marked as such. But we want this anyway! Otherwise, moving from one node to another would be a cumbersome process that requires reconciliation, collating data from different databases (seats, completed payments, issued tickets, etc.)

When outlining the above, don’t forget to mention that it is important for the times on these DB instances to be somewhat close to each other. No need to be sub-sub-second synchronized; in practice, it’s enough to be within several seconds, give or take. As long as we’re conservative with timeouts, and as long as the “flipping” from an unhealthy node+DB pair to a healthy one takes more than these few seconds, no inconsistencies will be introduced.

That’s about it.

The Real-Time Part

Rendering the seat map, which the users see when they land on the page of a particular show, is just about maintaining a materialized view of the database.

If we are talking about a super-popular show, with thousands and thousands of visitors, we may just re-generate a snapshot of this view every second or so and serve it off some passive storage (or even from a CDN with a quick-to-expire caching policy).

And, especially if your interviewer is picky, you can describe a pub-sub solution, so that each and every viewer of the show page gets instantaneous updates while a seat, or several seats, goes through its lifecycle journey. This journey can be visualized as a simple Markov process:

There are five arrows on this diagram:

“Booked”: When a seat goes from “green” to “yellow,” from available to “someone’s trying to book it now.”
“Canceled”: The person who was “holding” the order to book this seat (possibly, among others) has explicitly clicked “cancel.”
“Timed out”: The person who was “holding” the order to book this seat (possibly, among others) did not complete the payment in time, and, thus, the seat(s) is/are back to the pool of the available ones.
“Paid for”: The order, which this seat was part of, was paid for, and the seat goes from “yellow” to “gray”—no longer available for purchase.
“User returned the order”: The person, who previously had successfully booked and paid for this seat, has refunded their order, which returns the seat to the pool of seats available for purchase.

Each of these five orders can yield an event that makes it into the PubSub bus. Each person, who is looking at the web page with the seats map, can subscribe to the updates to this bus, and the updates would be streamed to them, in a “push, not pull” way, likely via a WebSocket. Thus, each person would be looking at an accurate picture of the seat map, colored into green, yellow, and gray, for “available,” “part of an order,” and “unavailable,” respectively.

Bonus

To earn bonus points from your interviewer, consider…

Part 4: Outro

Fin. This is the end, my friend. 📖

If you read all 4 parts, congratulations are in order! 🎉 You learned high-level ideas to strategically approach system design interviews, the 15 fundamental system design concepts, a 3-step framework to solve any system design problem, and you watched us design simple systems from scratch. Plus, you picked up a bunch of useful tips and tricks along the way.

We want to thank you for reading this guide and for entrusting us with your precious time. We hope that you agree that you’ll be a better interview candidate and a better engineer because you took the time to experience this guide. Here’s a final tip: Give yourself some more time: for the ideas to integrate, to take the concepts further, and to practice on your own. Thank you for joining us on this journey! 🖖

Zoom call with everyone doing live long and prosper sign

Here is a delightful picture of the team that made the video Two Ex-Google System Design Experts Compete: Who will design the better system?

At this point we want to do three things: Tell you what’s on the horizon for us, give you a way to interact with us, and provide you some additional resources in case you’re still hungry to learn more.

What’s next for the team that made this guide

You! And hearing what you thought of this guide. What elements were the most useful/actionable? Which parts could benefit from additional examples or clarity? Once we grok that, we’ll know exactly what to do. The community might give us harsh critical feedback, shout from the rooftops that they’d love to see us make more stuff like this, or (ideally) some combination of both.

Some possibilities we’ve discussed:

Make this an ongoing thing. Maybe it’s a newsletter, maybe it’s a meetup, maybe it’s neither of those formats, but the idea would be to ensure that we consistently release stuff like this.
Make a V2 of this guide that takes your feedback into account.
Make a system design guide for other levels (if we did this, most likely we’d make a guide for an “above L5” or “senior plus” audience).
Make more system design content like the stuff on our YouTube channel.
Make cool live system design events (though we would want to hear from you about what kind of events you’d most like to see).

Other things we’d like to hear about:

People who want to make content with us (could be written, video, or both)

If you have feedback, ideas, or random thoughts → please fill out feedback form

(we promise a human will read every entry!)

Fill out feedback form here

interviewing.io-approved resources

Check out a bunch of our system design interview replays.
Watch this video: Senior Engineer Gets Mentored by System Design Expert
Read Designing Data-Intensive Applications by Martin Kleppmann. If you're an engineer who's above the senior level (aka “senior plus”), Kleppmann's book is the bible of system design.
Gaurav Sen on YouTube. In terms of simple and effective short-form videos about system design, Gaurav Sen is the best the world currently has to offer. A solid video to start with is System Design Primer ⭐️: How to start with distributed systems?
David Malan (teaches CS at Harvard). David is a legend. A great place to begin with him is his CS75 (Summer 2012) Lecture 9 Scalability Harvard Web Development.
Architecture Notes. This site doesn’t have a ton of content yet, but it's basically all comics and diagrams that help make hard concepts really easy to understand.
High Scalability is a good blog that will help you catch up on stuff related to system design. One fun piece of theirs to check out is called, “Scaling Kim Kardashian to 100 Million Page Views.” It's an easy read about a real-world system.
A Beginner's Guide to HTTP. This is a great lesson for anyone looking to get better mastery of HTTP (which comes up a lot in system design interviews).
The precise meanings of “must” / “must not” / “should” / “should not.” Tip: Over 99% of interviewers appreciate when candidates juggle these terms confidently. Our readers would generally benefit from understanding such terminology, as they can use it right away in their interviews.

Part 3: A 3-Step framework...

Credits

Creator and author

Kevin Landucci

Editor

Aline Lerner

Special thanks

All of the wonderful interviewers whom we’re not allowed to name 🙃

We know exactly what to do and say to get the company, title, and salary you want.

Interview prep and job hunting are chaos and pain. We can help. Really.

A Senior Engineer's Guide to the System Design Interview

4 System Design Examples: Problems, Solutions, and Walkthroughs

Table of Contents

Part 1: How to approach a system design interview

Part 2: 15 fundamental system design concepts

Part 3: A 3-step framework to crush any system design interview

Part 4: Watch us design simple systems from scratch, and learn how to get unstuck

About Part 4

Chapter One

For something large enough, let’s look at TikTok and simplify it greatly. Here’s one way to do it:

Change from videos to pictures.

What other big element could we put aside for now?

Maybe pictures are too much?

Let’s take it even further, all the way through TinyURL to Pastebin. How so?

Let’s now look into Pastebin. It’s probably the easiest one of all the above.

This is important. Why?

Requirements

Rule of thumb

Functional Requirements

Non-Functional Requirements

Rule of thumb

How to Get Yourself Unstuck – Tip #2:

Estimates

What’s the point of doing estimates at all in these problems?

A back-of-the-envelope estimate here could be along the following lines.

How to Get Yourself Unstuck – Tip #3:

How to Get Yourself Unstuck – Tip #4:

Let’s start designing!

Remember

More Real Interview Stuff

API

What are we not covering, but could be asked about for this Pastebin problem?

Chapter Two

Remember

Warm-up Problem One: Deduplication

Why is this important?

Remember

Let’s put together some requirements first.

Consistency

Availability

Persistence and Durability

Tip

Now would be a good time to introduce the simple idea of R + W > N.

Tip

Of course, in practice many more tricks can be used to improve the solution. (These ideas, while important, are out of scope of this trivial example problem):

Tip

Warm-up Problem Two: The Enumeration Problem

A real-life analogy to help you understand enumeration

A real-life analogy might help.

Because:

Shared state is the root of all problems

Tip

Remember (from the two warm-up problems):

Real problem: Unique ID Generation, aka the Key Generation Service (KGS)

Requirements

Remember

The three FR questions are:

Back-of-the-envelope math, which, unlike most SD interviews, actually is important here!

Tip

Solution

Solution to the Easy Problem

a) The ID of the machine (node/server) that has generated this ID.

b) The timestamp when this particular machine has started assigning IDs.

Solution to the Hard Problem

a) The IDs generated should not just be sequential.

b) The API is such that clients can receive IDs in bulk.

Remember

Why is it a bad idea?

Chapter Three

Design AOL Instant Messenger

Requirements

Functional Requirements

Non-Functional Requirements

Tip

Question

Availability

Durability

Consistency

Solution Design

How to Get Yourself Unstuck – Tip #8: