Flannel Artichoke: Well, Hello.
Dialectic Avalanche: Hello.
Flannel Artichoke: Nice, nice. Thanks for your time in joining this. My name is Alex. I've worked for one of the big tech companies for more than five years, and I also work as an interviewer for system design and write the questions for my current company. So for this session I'll run the interview like a real interview for your practice. How does that sound?
Dialectic Avalanche: Yeah sure sounds great.
Flannel Artichoke: Okay, before we start, why don't you give me a little bit of your background, like your backend design experience as a developer, so I can set the expectations for this session.
Dialectic Avalanche: Okay. Yeah. So basically, I'm a backend software developer. I work at a news company, basically online news. So we do news recommendations. I have worked on distributed systems and the back end of that particular team. And yeah, I worked on content integration, modelling and ranking of the news articles for particular users. We recommend news articles across the world. I've also designed a couple of projects, and one of my recent projects was picked up for a talk at our company's internal conference. So that is my design experience. When it comes to system design interviews, I barely have any experience; I would say I'm a complete noob. This is just my second system design mock interview. I'm just trying to get a feel for it as I'm preparing.
Flannel Artichoke: Okay, sounds good. Right. Okay. Let me copy and paste my question here. Before that, as you know, a system design interview is usually open ended; there's no absolute right or wrong answer. So basically, we are supposed to discuss and share our ideas. And as an interviewer, I want to check your technical background, your experience, and the reasoning backing your solution. Yep, you might have seen this question before, but let's not care about that. Again, it can go in many different ways based on your suggestions and our ideas. Take your time and start when you're ready.
Dialectic Avalanche: Okay, so basically, they're asking me to design a certain version of WhatsApp. So okay, let me ask a couple of questions. So looking like you're asking me to design a group chat so do we not have oh we have one on one messaging? That's okay. And do you want me to cover notification
Flannel Artichoke: Notification on client? Yes, it should be.
Dialectic Avalanche: I meant are we just covering images are we also covering videos?
Flannel Artichoke: Just image and file is fine I think.
Dialectic Avalanche: Do you want me to cover things like encryption and security and so on.
Flannel Artichoke: It's good for you to mention this, but let's not cover that for now.
Dialectic Avalanche: So basically, a basic messaging can be in plain text and it can go from confirmation. Okay, so I can also assume this is similar to Facebook Messenger, right?
Flannel Artichoke: Yeah.
Dialectic Avalanche: Okay. So okay. Maybe I can ask this; I'll need this data while estimating, so I can mention my assumptions. Please correct me if I'm wrong. So something like the number of users, right? Let's say WhatsApp has like a billion users. So we can say 1 billion daily active users.
Flannel Artichoke: Oh, let's say just 1 million daily active user for your assumption.
Dialectic Avalanche: And I'll assume that we're doing image compression and every image will be like what 0.5 MB.
Flannel Artichoke: Yeah, up to maybe 1 MB at this point, I think
Dialectic Avalanche: And every message will be probably 100 Byte
Flannel Artichoke: Uh, what do you think? What should the max length of each message be for users?
Dialectic Avalanche: I think 100 bytes is reasonable on an average.
Flannel Artichoke: Okay.
Dialectic Avalanche: Okay, so I'll give you the topics which we can cover. You know, in one hour we usually cannot cover everything. So I'll tell you the different topics I can cover, and let me know which ones you want me to emphasise. But I can tell you which ones I feel are important. Okay, so obviously you have mentioned the requirements, but I can go deeper into them, and I have to get into non functional requirements and so on. The ones that are important in my opinion I'm marking with a star as a suffix. We need some estimates; I think that's important. We need some API design; I can mention it verbally and may not dwell too much on it. Database schema, in my opinion, is important. And we need component design. Then we have the approach or algorithm, so this is where we cover, you know, the hashing, how we actually create IDs and so on. I can again touch upon it or go deep into it; it's up to you. Then we have data partitioning and caching. When I say caching, I include load balancing as well; they need to go hand in hand. Then I can talk about telemetry and information. So if you tell me which ones are important for you, I'll focus more on those as I go.
Flannel Artichoke: Sounds good, I think oh, we can focus on the component design obviously, and also the data replication Plus API design, that would be our main discussion topics, I think.
Dialectic Avalanche: So, you have mentioned the requirements. So, I will probably start with the non functional requirements.
Flannel Artichoke: Sure.
Dialectic Avalanche: So, I would say the system has to be reliable, not necessarily available. It definitely needs to be consistent to some extent, because the messages should not be lost. And I think availability can slightly take a hit. Yeah, so the messages should not be lost, and it has to be reliable.
Flannel Artichoke: One second, can you define the difference between available and consistent or reliable? How are they different?
Dialectic Avalanche: Okay, so basically, reliability is more about making sure that the data does not get lost, making sure that the data is there at any cost, even though it is delivered late. Consistency is more like ACID or atomic properties, where we try to make sure that every version of the group message is the same as soon as we can. Something like that. Availability is, you know, the server being available, accepting the messages and acknowledging them quickly, and taking the responsibility for the message transfer on its own. And partition tolerance is the system being able to handle the load even if one of the partitions goes down, without losing any major chunk of data or processing.
Flannel Artichoke: Okay.
Dialectic Avalanche: So that's why I'm choosing reliability, that is, you make sure that the data is there. We have to be as consistent as we can. But even if there is a delay of a few seconds, it is understandable. And we definitely need it to be partition tolerant. That goes with reliability. And preferably low latency, as much as we can, so it feels real time. So those are the important non functional requirements. If it's okay, I can start with estimates. Okay, you said 1 million daily active users. So there are four kinds of estimates: traffic, storage, bandwidth, and memory. Right? So let's calculate traffic. So we have 1 million in a day, that would end up at, sorry, I'm just opening the calculator. Approximately 12 messages per second. Oh okay, 12 users per second. So every user might send, uh, 40 messages per day, and every message can be approximately, I'll take it as 500 bytes. I'm just trying to find an intermediate value between a photo and a text message. So on a daily basis we have, let's say, 1 million times 40 messages times 500 bytes, which would be something like 20 million kilobytes. That will be, ah, close to 20 terabytes per day. Is that correct?
Flannel Artichoke: Yeah, that's fine.
Dialectic Avalanche: Giga... So that's 20 gigabytes per day.
Flannel Artichoke: So 1 million multiplied by 40 and by 500, that's 20 gigabytes per day.
Dialectic Avalanche: Okay, and let's say we are storing it for five years, because we want to store older messages as much as we can. So over five years this will come to about 36.5 TB. That's not taking replication into account; I'll come to that later.
Flannel Artichoke: Sure.
Dialectic Avalanche: So now we have bandwidth. That is, let's say, 20 GB of data per day, which comes down to about 0.23 MB per second, so 230 KB per second. Memory: we can assume our 20% hot users are always very active and we try to keep them in the cache. So per day that will translate to, let's say, 20% of 20 GB, so 4 GB per day of caching. These are rough estimates. And
Flannel Artichoke: Is the memory for a single host or the overall memory? And is it the memory usage per day or per second?
Dialectic Avalanche: No, its overall memory usage per day.
Flannel Artichoke: I see.
Dialectic Avalanche: We'll be distributing it, yeah. So yeah, I know the numbers look small. That's because our estimate of the number of users is 1 million. That's why it looks small. So
Flannel Artichoke: You've got the files and images, right? That's including files and images as well?
Dialectic Avalanche: 4 GB, yes. I'm assuming compression here; I mentioned it. WhatsApp does that compression: your original image gets compressed and they store a very lightweight version of it. So I assumed that compression, and accordingly, for every message, I said 500 bytes and went on from there.
Flannel Artichoke: Okay. Okay. Sounds good to me.
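The estimates in this exchange can be reproduced with a few lines of arithmetic. The sketch below only restates the figures quoted in the conversation (1 million daily active users, 40 messages per user per day, a blended 500 bytes per message, five years of retention, 20% kept hot); the constant names are made up for illustration, and decimal units (1 GB = 10^9 bytes) are an assumption.

```python
# Back-of-envelope estimates using the figures quoted in the interview.
DAILY_ACTIVE_USERS = 1_000_000
MESSAGES_PER_USER_PER_DAY = 40
AVG_MESSAGE_BYTES = 500        # blended average of text and compressed media
RETENTION_YEARS = 5
HOT_PERCENT = 20               # share of a day's data kept in cache
SECONDS_PER_DAY = 24 * 60 * 60

GB = 10**9                     # decimal units, an assumption
TB = 10**12

users_per_second = DAILY_ACTIVE_USERS / SECONDS_PER_DAY
daily_bytes = DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY * AVG_MESSAGE_BYTES
storage_bytes = daily_bytes * 365 * RETENTION_YEARS       # before replication
bandwidth_bytes_per_second = daily_bytes / SECONDS_PER_DAY
cache_bytes_per_day = daily_bytes * HOT_PERCENT // 100

print(f"users per second:  {users_per_second:.1f}")
print(f"storage per day:   {daily_bytes / GB:.1f} GB")
print(f"storage, 5 years:  {storage_bytes / TB:.1f} TB")
print(f"bandwidth:         {bandwidth_bytes_per_second / 1000:.0f} KB/s")
print(f"cache per day:     {cache_bytes_per_day / GB:.1f} GB")
```

The results line up with the numbers agreed in the conversation: roughly 12 users per second, 20 GB per day, 36.5 TB over five years, about 230 KB per second, and 4 GB of cache per day.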
Dialectic Avalanche: Let's say API design. So these are the APIs we'll be calling the system with. Okay, I had to clarify this in the requirements. Do you want the user active notification as part of this or not? That is, whether the user is online or offline. Do you want that notification? If not, I can skip that.
Flannel Artichoke: You can just skip that part typing for now.
Dialectic Avalanche: Okay, cool. So... and do you want the feature of deleting a message?
Flannel Artichoke: Yeah
Dialectic Avalanche: Okay, you want to delete a message. So we'll have a post API and probably a delete API. For post, we'll have an API dev key, a user ID, and the new message coming along. We can have an ID with the message coming along, which I can call room ID as a generic thing. And the deletion API will be very similar: we'll have an API dev key, a user ID, a room ID, and the message that is marked for deletion.
Flannel Artichoke: Why do we need an API dev key? What is that for?
Dialectic Avalanche: So the API dev key is to uniquely identify the application, the particular client, and make sure that our network is not abused. So yeah, after a post, we'll return an ack with the message ID, and later we can have a read notification that the other client has read it, or how many people have read it. For deletion, we can also return an ack. Okay, let's see the database schema. The database schema is definitely important. We can have a very simple MySQL database for user information. If we didn't have MySQL, it can be NoSQL also, because I don't think we are really doing many joins. So we'll have user information and messages
Flannel Artichoke: Actually, I'm interested in discussing that first, MySQL versus NoSQL. Can you give me the pros and cons of each solution for your design?
Dialectic Avalanche: Sure. So MySQL is basically consistent, and it's a highly available database. NoSQL is more partition tolerant, and you have to choose between availability and consistency. So that is the main thing, and the main design decision you take is based on how many tables you have and whether you're making joins. If you are making joins, then definitely MySQL is the way to go. But if you're not making many joins, and let's say your data is not so structured, and you want to store data with variable sizes, and every row has a different number of columns and all kinds of different structures, then NoSQL is the way to go.
Flannel Artichoke: So do you mean there are no NoSQL solutions supporting joins today?
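To make the post and delete calls from this exchange concrete, here is a minimal in-memory sketch. The class and method names (`ChatApi`, `post_message`, `delete_message`, `Ack`) are hypothetical, and the rule that only the original sender may delete a message is an added assumption, not something stated in the interview.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ack:
    ok: bool
    message_id: Optional[int] = None


class ChatApi:
    """In-memory stand-in for the post and delete endpoints (hypothetical names)."""

    def __init__(self, valid_dev_keys):
        self._valid_dev_keys = set(valid_dev_keys)
        self._ids = itertools.count(1)
        self._messages = {}  # message_id -> (room_id, user_id, body)

    def post_message(self, api_dev_key, user_id, room_id, body):
        # The dev key identifies the calling application and guards against abuse.
        if api_dev_key not in self._valid_dev_keys:
            return Ack(ok=False)
        message_id = next(self._ids)
        self._messages[message_id] = (room_id, user_id, body)
        return Ack(ok=True, message_id=message_id)

    def delete_message(self, api_dev_key, user_id, room_id, message_id):
        if api_dev_key not in self._valid_dev_keys:
            return Ack(ok=False)
        stored = self._messages.get(message_id)
        # Assumption: only the original sender may delete, and only in the same room.
        if stored is None or stored[0] != room_id or stored[1] != user_id:
            return Ack(ok=False, message_id=message_id)
        del self._messages[message_id]
        return Ack(ok=True, message_id=message_id)


api = ChatApi(valid_dev_keys={"demo-key"})
sent = api.post_message("demo-key", "user-1", "room-1", "hello")
print(sent)
print(api.delete_message("demo-key", "user-1", "room-1", sent.message_id))
```

Both calls return an acknowledgement carrying the message ID, which is the part the candidate describes; a read notification would be a separate call and is left out here.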
Dialectic Avalanche: Not as efficiently as MySQL.
Flannel Artichoke: Interesting. Okay, what if we have to scale up our system, since we're receiving more and more traffic? Which would be the better solution between the two, you think?
Dialectic Avalanche: This application?
Flannel Artichoke: Yeah.
Dialectic Avalanche: So for storing the messages themselves, I definitely recommend NoSQL. For the user information, we can choose either one of them, I think, because user information is static in the sense that the fields are static; they're not changing. Like you'll have a user ID, his creation date, his last login, and so on. So these are all things which are not changing; we can keep them at the MySQL level, because it's structured and simple to understand. Messages themselves are dynamic, and the sizes vary. So we need to be able to write lots of small pieces of information and update them quickly. Such a thing can be done very well in HBase, because it is a balance between consistency and partition tolerance. It is modelled after Google's Bigtable, and it runs on top of HDFS, the Hadoop Distributed File System. So that's why, for the messages themselves, I think HBase is a good choice. It's very well designed for that. But user information itself is almost like metadata. So it can be in MySQL or a small NoSQL store like Cassandra or DynamoDB; it should not make much difference.
Flannel Artichoke: I see. So how many other tables are you thinking about for the system, other than just the user table?
Dialectic Avalanche: So there is user info, and we need some level of message ID and metadata, I think, that is ID and synchronisation information; that is another thing. And then we need a follower table, or I can call it a member table. I'm slightly torn on whether I should maintain the group chats independent of the individual chats. I would say, from the perspective of normalisation, it probably makes sense to have that. So a group table, and an individual one, or I can call it one on one chats, something like that. So basically, here we maintain, like, a group ID, and we'll have members, member IDs, and so on. For the group table itself, I would recommend something like Cassandra, because Cassandra is designed to have a key with an unlimited number of columns, and it can change. Just because you're changing the number of columns in one row, it does not disrupt other rows. Everything is a container in Cassandra. So for the group table, Cassandra is good. But for one on one chats, maybe simple MySQL is fine.
Flannel Artichoke: That's interesting, because you're suggesting that we can adopt a hybrid, a different database per table, which is an interesting idea to me. Is that what you're trying to say? So for the one on one chats you can use MySQL, and for the group table you can use Cassandra? And for the other tables you are suggesting MySQL or NoSQL? Is that right?
Dialectic Avalanche: Yeah, because we're not doing any specific joins between them, I don't see any harm in doing this.
Flannel Artichoke: Interesting. Okay. Do you have experience operating a system consisting of multiple different databases before? Just wondering.
Dialectic Avalanche: Yes, in my team we do use different databases for storing different objects. For user profiles, we have a separate key value store. For all news articles, we use our own abstraction over Cassandra, so we store them there. And for all caching purposes, we use Redis. So I think it's fine. In fact, we call them in parallel, so I think it's fine. I think it's very common in the industry today.
Flannel Artichoke: Okay, okay, fair enough. I'll let you continue, please.
Dialectic Avalanche: Okay, so even for one on one chats, here we probably just need a simple room ID and the members, something like that, right? Here we have to see if a particular member ID has a chat with another member ID. We can call them member ID 1 and member ID 2, because there are only two of them. And so on, with message ID and synchronisation. Alright, so we need to have a separate synchronisation service, and I think it can be dealt with separately. I'll come to that in component design; the synchronisation service is very important. So I think at a high level, these should be enough. We need some user info, we need information about the groups, information about the one on one chats, and we definitely need one table for messages. So I can probably start talking about component design.
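The four tables listed here (user info, groups, one on one chats, messages) could be sketched as plain records. This is only an illustration of the shape of the data: the field names and types are assumptions, no particular database is implied, and the helper `one_on_one_members` is a hypothetical addition that stores the two member IDs in a fixed order so a pair of users always maps to the same row.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class UserInfo:
    user_id: str
    created_at: datetime
    last_login: datetime


@dataclass
class GroupChat:
    group_id: str
    member_ids: List[str] = field(default_factory=list)  # variable-length membership


@dataclass
class OneOnOneChat:
    room_id: str
    member_id_1: str
    member_id_2: str


@dataclass
class Message:
    message_id: int
    room_id: str      # a group_id or a one on one room_id
    sender_id: str
    body: str
    sent_at: datetime


def one_on_one_members(user_a: str, user_b: str) -> Tuple[str, str]:
    """Order the pair so (a, b) and (b, a) map to the same row."""
    first, second = sorted((user_a, user_b))
    return first, second


chat = OneOnOneChat("room-1", *one_on_one_members("user-9", "user-2"))
print(chat)
```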
Flannel Artichoke: Sure.
Dialectic Avalanche: Components wise, probably. How about I'll share a Google drawing is it okay?
Flannel Artichoke: Sounds good
Dialectic Avalanche: Not a google drawing. I have draw.io which is linked to my Google. I think this will reveal my identity. But I think its okay.
Flannel Artichoke: Sure, it should be fine with me.
Dialectic Avalanche: But yeah, here it is. Just see if it works for you. It's just first time I'm using it.
Flannel Artichoke: Trying to connect, let me see. Still loading. It takes some time, I think.
Dialectic Avalanche: If it doesn't work, I can probably create a Google drawing and give you that; that might be
Flannel Artichoke: Yeah, I think it's kind of stuck, I don't know. Let me just refresh the screen just in case. It's still not working for me.
Dialectic Avalanche: Okay. Can you try the second link?
Flannel Artichoke: Sure, I think I'm in
Dialectic Avalanche: Cool yeah I can see. Okay. So okay, in component design it'll make sense if I just start drawing. Probably, we can start with something like a client. So this client can be a mobile, or even a desktop. Probably it'll only make a difference in the pagination. Other than that, not much.
Flannel Artichoke: Are you assuming that a single user have only one client or multiple devices at the same time?
Dialectic Avalanche: They can have multiple devices; that's where synchronisation comes in. That's
Flannel Artichoke: Perfect. Okay.
Dialectic Avalanche: So then we need some application servers; I'll call them the chat service. There are certain details of the chat service that I have to cover; I'll probably mention them verbally, is that okay? So basically, there are multiple ways to do it: push, or pull, or long poll, and so on. One way is, let's say there is a client, and the client has to keep asking the server: okay, do you have one? Do you have one? Something like that.
Flannel Artichoke: Is it polling? Okay.
Dialectic Avalanche: So it's trying to pull something. Usually that causes an unnecessary burden on the network. So I think a good option is something like long polling, where you ask the server if it has something and keep the connection open. The server responds back if there is something, or if the request is about to time out, it'll just send a ping to keep the connection open.
Flannel Artichoke: So both the client and the server have to keep the connection open? Is that what you're thinking?
Dialectic Avalanche: Yeah.
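The long polling idea described above can be simulated in-process. This is a rough sketch, not a real HTTP server: `LongPollServer`, `send` and `poll` are hypothetical names, an `asyncio.Queue` per user stands in for the held-open request, and on timeout the sketch returns an empty response so the client polls again, which is a simplification of the keep-alive ping mentioned in the conversation.

```python
import asyncio


class LongPollServer:
    """Simulates long polling in-process; no real network is involved."""

    def __init__(self):
        self._inboxes = {}  # user_id -> asyncio.Queue of pending messages

    def _inbox(self, user_id):
        return self._inboxes.setdefault(user_id, asyncio.Queue())

    async def send(self, to_user, text):
        await self._inbox(to_user).put(text)

    async def poll(self, user_id, timeout):
        """Hold the request open until a message arrives or the timeout expires."""
        try:
            return await asyncio.wait_for(self._inbox(user_id).get(), timeout)
        except asyncio.TimeoutError:
            return None  # empty response; the client simply polls again


async def demo():
    server = LongPollServer()
    # The client's request is parked on the server before any message exists.
    waiting = asyncio.create_task(server.poll("bob", timeout=1.0))
    await asyncio.sleep(0.01)
    await server.send("bob", "hello")
    delivered = await waiting
    # A second poll with nothing pending times out and returns nothing.
    idle = await server.poll("bob", timeout=0.05)
    return delivered, idle


print(asyncio.run(demo()))
```

Compared with plain polling, the client makes one request per delivered message (or per timeout) instead of one request per check, which is the network saving the candidate is pointing at.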
Flannel Artichoke: Okay, is that different from WebSocket? Just wondering.
Dialectic Avalanche: I think that's similar. I think it's slightly different, but based on the same concepts. I think WebSocket and long polling are slightly different; I haven't looked into exactly how they are different.
Flannel Artichoke: So which protocol are you thinking of between the client and the server? What protocol has the long polling feature? Just wondering.
Dialectic Avalanche: Must be HTTP?
Flannel Artichoke: Okay, HTTP?
Dialectic Avalanche: Yeah HTTP yeah.
Flannel Artichoke: Okay. Okay. Is there any downside to maintaining the connections between client and server? What if we have tons of clients, like more than a thousand at the same time? Would it be a burden on the server to keep those connections open? What do you think?
Dialectic Avalanche: Definitely the downside is the number of servers we have to maintain. Let's say a modern, powerful server can maintain 50,000 concurrent connections; then we just need to scale linearly with the number of clients we have.
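The linear scaling mentioned here is a one-line calculation. The sketch below uses the 50,000 connections per server figure from the conversation; treating all 1 million daily active users as connected at the same time is a worst-case assumption added for illustration, and the function name is made up.

```python
import math

CONNECTIONS_PER_SERVER = 50_000  # figure quoted in the interview


def chat_servers_needed(concurrent_connections, per_server=CONNECTIONS_PER_SERVER):
    """Servers required if each one holds at most `per_server` open connections."""
    return math.ceil(concurrent_connections / per_server)


# Worst case (an assumption): every daily active user is online at once.
print(chat_servers_needed(1_000_000))
```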
Flannel Artichoke: So all the connections should be in memory, I guess, right?
Dialectic Avalanche: Yes.
Flannel Artichoke: Okay. So the server should have a huge physical memory, I guess.
Dialectic Avalanche: Correct yeah.
Flannel Artichoke: Is there a more scalable solution, you think? Can you suggest a better option, if you can think of one? Just wondering.
Dialectic Avalanche: Yes, yeah, we can definitely suggest one, but it will come at the cost of consistency and latency. We can always use a messaging queue in between, something like Kafka or something similar, which takes the message, and then we can use a pub sub model to deliver the messages to the database, and that triggers a change log to return something to the client. So that is more efficient from the perspective of the number of servers. But at the same time, it comes at the cost of latency and consistency. Yeah.
Flannel Artichoke: What's the latency that you expect, I mean delay? How slow is it? Just wondering, yeah.
Dialectic Avalanche: How much slower than keeping the connection open?
Flannel Artichoke: Yeah. A rough estimation is fine. I just want to understand what's the better option for us and why, and I want you to convince me of your design.
Dialectic Avalanche: Got it. So as far as I've understood, what Facebook actually follows is what I initially mentioned: it keeps the concurrent connections open as long as the user is active. If the user goes offline and is not responsive for a certain number of minutes, then it will just close the connection. That's why they give a very engaging experience: as soon as you text, it's immediately there, and they respond, and so on. If we take those open connections away, then we get something like a Gmail experience, where you send something and it definitely reaches there. Memory wise, it's more efficient, but it will not create an engaging experience, because you have a few seconds of delay or something like that. So that is the key point: it depends on the experience we want to give to the user.
Flannel Artichoke: Okay, I think that makes sense to me for now. Let's continue on your design.
Dialectic Avalanche: Okay, so we need chat servers, and they'll speak to a synchronisation service, and the chat service will directly talk to a couple of databases. This one is the messaging database. And for all the other DBs, I'll just call it the metadata database. We can have one more metadata one here. And we can always have a cache in front of it. We can call it a read-through cache: if something is present, we pick it up from the cache; if it is not present, we pick it up from the database, write it to the cache, and then return it. So the chat service can directly talk to messages and metadata. But I would suggest having two separate services for data access; I can call them block servers and the metadata service. This is just to scale them individually, which sometimes makes sense. And let's say some other service can handle photos more efficiently; then we can quickly and easily replace one without disrupting the other ones. So let's say the chat service talks to the block servers, the chat service talks to the metadata service, and they talk to the synchronisation service. And let's say the synchronisation service also speaks back to the chat service and triggers something. So there are multiple clients, right? Say a client sends a message in WhatsApp on mobile, and the chat server takes it and finally saves it in the database. The synchronisation service reads it from here, from the metadata service, and then it tries to update all the other similar clients. So the block servers read from the cache, and they can also read directly from the database.
Flannel Artichoke: I'm sorry, I've kind of lost you. Can you explain again the roles of the block server and the metadata server in this diagram?
Dialectic Avalanche: So the block servers are data access objects. Basically, they just fetch the data for you and give it to the chat service whenever it asks for it.
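The read-through cache described in this turn follows a simple pattern: serve from the cache on a hit, otherwise load from the database, store the value in the cache, and return it. A minimal sketch, with a plain dictionary standing in for the database and hypothetical names throughout; eviction is deliberately left out.

```python
class ReadThroughCache:
    """On a miss, load from the backing store, keep a copy, then return it."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # callable: key -> value
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._load_from_db(key)   # slow path: go to the database
        self._cache[key] = value          # populate the cache for next time
        return value


# A dictionary stands in for the messaging database (an assumption).
fake_db = {"msg-1": "hello", "msg-2": "how are you?"}
cache = ReadThroughCache(fake_db.__getitem__)

print(cache.get("msg-1"))  # first read goes to the database
print(cache.get("msg-1"))  # second read is served from the cache
print(cache.hits, cache.misses)
```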
Flannel Artichoke: I see.
Dialectic Avalanche: So, I just kept them separate just to give them opportunity to scale independently
Flannel Artichoke: Scale independently, as a microservice, you mean?
Dialectic Avalanche: Yes yes.
Flannel Artichoke: So are these arrows synchronous flows or asynchronous flows?
Dialectic Avalanche: I would say they are asynchronous.
Flannel Artichoke: Oh, which parts are asynchronous, sorry? Everything, or?
Dialectic Avalanche: So I mentioned there will be a concurrent connection between the client and the chat service. And I will say everything else is almost asynchronous, because you write it and give it to the block servers, and the block servers are connected to the database. So is the metadata service. And the synchronisation service probably has a long polling connection with the metadata service; as soon as it sees that a new message has come in, it triggers synchronisation to make sure that everybody in that room is synchronised with that particular information. So
Flannel Artichoke: Interesting. Why does it have to be asynchronous? Just wondering. I mean, there are definitely pros and cons of being asynchronous and synchronous, right? What's the particular advantage of being asynchronous in this case, you think?
Dialectic Avalanche: With asynchronous, probably less load on the servers.
Flannel Artichoke: less load on service hmm
Dialectic Avalanche: And you don't need to open as many connections, there's less load on the bandwidth, and so on. But when it comes to the client and server relationship, it makes sense to be available most of the time, just to give a very real time feeling to the client: at least okay, we have received it, just give that one tick. Tell him that, okay, we got it, we're sending it as soon as possible. But at the backend, we have a much more complex system, and there is usually a lot more happening than this: there is batch processing happening, there is real time data processing happening, there is a notification system which goes in parallel. So internally, it completely makes sense to keep it asynchronous. Okay. So hypothetically, if we have a synchronous system, we don't need this message database, we don't need this metadata database; you just get it and you send it, right? So the very fact that we have these databases says that we need the message to be stored for a while so it is not lost in case the connection breaks. So as soon as you bring a database or a queue or any storage in between, we are already choosing an asynchronous format.
Flannel Artichoke: I think even for most synchronous services you need a database to store the data, and for debugging or any other purpose. In my opinion, usually a system is asynchronous because it has to be scalable. Sometimes we use a message queue, or a database as a queue, or some other solution, so that a server can just keep the messages stored in the queue. And we have a set of workers that pick them up asynchronously and handle them properly. In such a way, the front end server can receive the message no matter what type it is, right? That's one of the biggest advantages of using an asynchronous service, in my opinion. So I'm just trying to find the reason why our system has to be asynchronous. Yeah, I think your answer makes sense to me partially, but for the time being I think you can continue to finish the components and we can continue to discuss based on that. What do you think?
Dialectic Avalanche: Sure.
Flannel Artichoke: Sure, let's continue I want to see your design for one more
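The queue-and-workers pattern the interviewer describes can be sketched with the standard library. In this illustration the front end acknowledges a message as soon as it is enqueued, and a small pool of worker threads drains the queue and "persists" each message to a list; the function name, the `None` shutdown sentinel, and the in-memory list standing in for the database are all assumptions.

```python
import queue
import threading


def run_pipeline(messages, worker_count=2):
    """Front end enqueues and acks immediately; workers store messages later."""
    pending = queue.Queue()
    stored = []                 # stands in for the messaging database
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:    # sentinel: no more work for this worker
                pending.task_done()
                return
            with lock:
                stored.append(item)
            pending.task_done()

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()

    acks = []
    for message in messages:
        pending.put(message)
        acks.append(("ack", message))  # acknowledged before it is stored

    for _ in workers:
        pending.put(None)
    pending.join()
    for w in workers:
        w.join()
    return acks, stored


acks, stored = run_pipeline(["hi", "photo.jpg", "bye"])
print(len(acks), sorted(stored))
```

The point of the design is that the front end's work per message stays constant regardless of how slow the storage step is; the order in which workers store messages is not guaranteed, which is why the example sorts before printing.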
Dialectic Avalanche: Okay. And obviously, we need some load balancing. So let me mention that here. Definitely one here; we can have one here as well. We need one between the client and the chat server, and one between the servers and the databases. Okay, something like this. As I'm drawing this I'll mention: the most rudimentary way of load balancing is round robin, and nobody will agree that that is a modern solution. So eventually we need to have an intelligent system which takes into account, let's say, the least number of connections, or a weighted least number of connections, or the least response time, so whichever server is giving the least response time, give data to that one. Or we can check the memory usage in each of the servers and put more load on whichever has the least. So we can have an intelligent system which is a combination of these conditions, and that can be a more efficient system. So
Flannel Artichoke: I think we can skip the load balancing part, because that's obvious. Yeah.
Dialectic Avalanche: Okay, fine. So basically, after the chat service, I'll have to put a pub sub model next to it. Maybe I can make that change. So there'll be a chat server and it will talk to a pub sub model. And then the pub sub model will make sure that the data is, you know, divided to different destinations and clients. No, the pub sub model will, in a reliable fashion, send the messages to the destination servers. Let's say somebody's in the USA, somebody's in India, and you're sending a message. Then the pub sub is supposed to keep track of which user is connected to which server, and send it to that server. That will come up in partitioning. So probably I can talk a little bit about data partitioning.
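Of the strategies the candidate lists, least connections is the simplest to illustrate. The sketch below is a toy in-memory balancer with hypothetical names; it ignores weights, response times and memory usage, which the candidate suggests combining in a real system.

```python
class LeastConnectionsBalancer:
    """Route each new connection to the server with the fewest open ones."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server listed first, since dicts keep insertion order.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


balancer = LeastConnectionsBalancer(["chat-1", "chat-2", "chat-3"])
first_three = [balancer.acquire() for _ in range(3)]
balancer.release("chat-2")          # a client on chat-2 disconnects
print(first_three, balancer.acquire())
```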
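The routing role given to the pub sub layer here, keeping track of which user is connected to which chat server and forwarding messages there, can be sketched as a small lookup table. Everything below is hypothetical: the class and method names, the server IDs, and in particular the handling of offline users (messages are held until the user reconnects), which the conversation does not specify.

```python
from collections import defaultdict


class PubSubRouter:
    """Tracks which chat server each user is connected to and routes to it."""

    def __init__(self):
        self._server_for_user = {}
        self.delivered = defaultdict(list)  # server_id -> [(user_id, message)]
        self.held = defaultdict(list)       # user_id -> messages awaiting reconnect

    def connect(self, user_id, server_id):
        self._server_for_user[user_id] = server_id
        # Assumption: anything held while the user was offline is flushed now.
        for message in self.held.pop(user_id, []):
            self.delivered[server_id].append((user_id, message))

    def disconnect(self, user_id):
        self._server_for_user.pop(user_id, None)

    def publish(self, to_user, message):
        server_id = self._server_for_user.get(to_user)
        if server_id is None:
            self.held[to_user].append(message)
            return None
        self.delivered[server_id].append((to_user, message))
        return server_id


router = PubSubRouter()
router.connect("alice", "chat-us-1")
print(router.publish("alice", "hi from India"))   # routed to alice's server
print(router.publish("bob", "are you there?"))    # bob is offline, so it is held
router.connect("bob", "chat-in-1")
print(dict(router.delivered))
```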
Flannel Artichoke: Sure. Sounds good.
Dialectic Avalanche: So in the component design we covered load balancing, and caching is mentioned here. So replication and partitioning, I think we can talk about that. Okay, more about partitioning. We can partition based on user ID; we will end up partitioning on user ID and on a message hash using consistent hashing. User ID mainly because of legal requirements. Some countries have strict requirements that the user information of that particular country is saved in that country. So for certain countries, you might have to store the user specific information there, make sure you have encryption while chatting, and keep as much of the rest as possible in that country. When it comes to messages, we can go with different kinds of partitioning algorithms: we can have range based partitioning, or very simple user ID based partitioning, even for messages. But once we are within a country, I think we can divide it into zones based on how interactive they are. And once that is done, I think it makes sense to have consistent hashing, because in a hotspot, let's say the USA, everybody's talking a lot, people are chatting across all the states, and everything is active. So it makes sense that the messages are divided. Every message is given a hash, and that hash is stored on different logical cores; in consistent hashing, you store them in a round robin fashion on each logical core. And the logical core itself is randomly stored on different bare metal servers. That way, the risk is distributed. Replication is also done simultaneously here. For every hash, you create three replicas, and the replicas themselves are randomly distributed across multiple data centres.
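The combination described here, consistent hashing with three replicas per message, can be sketched with a hash ring. This is one common way to implement the idea and not necessarily what the candidate had in mind: the "logical cores" are interpreted as virtual nodes on the ring, MD5 is used only as a convenient stable hash, and the node names are invented.

```python
import bisect
import hashlib


def _hash(key):
    # MD5 is used only for a stable, well-spread hash, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Hash ring with virtual nodes; each key maps to `count` distinct servers."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node appears at `vnodes` points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, key, count=3):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        count = min(count, self._node_count)
        index = bisect.bisect(self._points, _hash(key))
        owners = []
        while len(owners) < count:
            node = self._ring[index % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            index += 1
        return owners


ring = ConsistentHashRing(["db-1", "db-2", "db-3", "db-4", "db-5"])
print(ring.replicas("message-42"))
```

Because a key's position on the ring depends only on its hash, adding or removing a server moves only the keys adjacent to that server's points, which is the property that makes this attractive for the hotspot case mentioned above.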
Flannel Artichoke: So what would be the initial number of partitions in this case, to handle the 1 million DAU, based on your calculations?
Dialectic Avalanche: Okay, number of shards. So let's see how much data storage we have. We had like 20 GB per day, right? So, yeah. Let's say 20 GB per day times 360. And let's look at 36 terabytes, which is, let's say, 100 terabytes. With 100 terabytes, we can have maybe 10 shards, each with 10 terabytes.
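The numbers in this answer are stated loosely, so the following is only one way they could fit together. It takes the 20 GB per day and the 10 TB shard size from the answer, reads the 360 as days per year, and reuses the three replicas mentioned earlier. The five-year retention period is purely an assumption chosen because it reproduces the 36 TB figure; the speaker does not state it.

```python
import math

DAILY_GB = 20          # from the answer: about 20 GB of new data per day
DAYS_PER_YEAR = 360    # the 360 in the answer, read here as days per year
RETENTION_YEARS = 5    # ASSUMPTION: not stated in the interview
REPLICAS = 3           # from the three replicas mentioned earlier
SHARD_TB = 10          # from the answer: about 10 TB per shard

raw_tb = DAILY_GB * DAYS_PER_YEAR * RETENTION_YEARS / 1000  # 36.0 TB
replicated_tb = raw_tb * REPLICAS                           # 108.0 TB
shards = math.ceil(replicated_tb / SHARD_TB)                # 11 shards
```

Under these assumptions the replicated total is 108 TB, which the answer rounds down to 100 TB to arrive at 10 shards.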
Flannel Artichoke: Okay, let me take back my question; I don't want you to go through the pain of the calculation. I don't care about the specific number. My main point is: is it better to start with a single partition and, as we receive more traffic, split it into multiple partitions, or do you think the developers should have multiple partitions at the beginning to handle this much traffic? Which would you use as the starting point between the two?
Dialectic Avalanche: I would say reliability is still a major component of this application. So we should start with more than one, at least two to three, and then we can scale as the traffic increases.
Flannel Artichoke: Okay. I see. That makes sense to me. Yeah.
Dialectic Avalanche: So, yeah, replication and partitioning sort of go hand in hand. As you partition, you create replicas and put them on different racks. So I think that almost comes to the end of my design. I think I've covered the major points we had.
Flannel Artichoke: Okay, let's come back to your system components. I have a few questions there. So, if we receive much heavier traffic, which component do you think would be the bottleneck in the future? And how would you resolve that?
Dialectic Avalanche: I would say the synchronisation service is the bottleneck. Because for the block service and the metadata service, just writing to a cache and reading from it is a solved problem industry-wise, and each can be scaled individually. As you throw more boxes at it, its performance scales linearly as well. So I would say the synchronisation service is definitely a bottleneck. And even this can be handled with Kafka queues and the distributed messaging services that are available. And if we can keep it well powered and well greased, in the sense that you have as many nodes as you need, running on Kubernetes and Docker or something like that, and it can handle partition tolerance, so that when one pod goes down, another pod comes up, then I think even this can hold up under the load.
Flannel Artichoke: I see. So another question: as you know, one of the requirements is to store files, right? In your system, I don't see any storage where we would store files. Is it in the metadata store, or a different storage that you're thinking of, or was it just missed here?
Dialectic Avalanche: I think, when I drew the block service, we should divide it into messages as well as files. So there is something similar for files as well; it's not the same data, it's not the same database. I just did not draw one more database. Okay. So you can imagine a similar triangular structure for files as well. It will have its own cache, it will have its own database. That will be HBase for files, and for messages we can use Cassandra. I think Cassandra or even HBase will work.
Flannel Artichoke: Okay, right. I think, yeah, let's wrap up the interview and move on to the feedback session. What do you think? I think time's almost up. Okay. So overall, I think you did pretty well. I didn't find any red flags in your demonstration during the session, which is good. That being said, I have a few things to mention that you can improve on for your future interviews, in my opinion. Overall, I think the crux of the system design interview, from my experience, is two parts. First, as an interviewer, I want to examine whether a candidate's technical background is solid and whether the experience is real or not, right? And check technical knowledge as well. In that part, you did pretty well. I can see your technical background and knowledge are solid, and you do know about the solutions that you're suggesting, like Kafka or Cassandra or partitioning, which was good. But I think you can work on the second part. Your system should be convincing, and your suggestions should be backed up by good knowledge and rationale. I think your system is okay, but on a few questions that I asked, I didn't feel that you were strongly backing up your suggestion with your own rationale. In other words, your system as an idea was not strongly convincing to me. Again, it's an open-ended question; there's no absolute answer. But as the interviewee, you have to at least try to convince your interviewer of your system design and your ideas or suggestions during the interview, right?
Dialectic Avalanche: Got it. Yes.
Flannel Artichoke: Yeah. So you can prepare and improve that part for future interviews. That's just my overall impression. And
Dialectic Avalanche: Can you emphasise the parts where you felt I had a knowledge gap?
Flannel Artichoke: First of all, the asynchronous versus synchronous flow. I didn't feel that you were well aware of the main difference between asynchronous and synchronous. Maybe you are, but at least you didn't convince me strongly. My impression was: maybe this candidate doesn't know the main difference, or maybe he doesn't have enough experience designing with asynchronous or synchronous components. That's just my impression; I could be wrong. But again, this is an interview, and this is my impression as an interviewer. And also, going back to the non-functional requirements: I think this is where you can impress your interviewer more with good questions. I think you did pretty well in terms of giving numbers like 1 million DAU, image size, and message size. And I think there are a few more good questions you could try, like the maximum group size, for example, and the number of concurrent requests per second that your system should handle. So you can think of more good questions on the non-functional requirements, I think. Does that make sense to you?
Dialectic Avalanche: Yes, yes.
Flannel Artichoke: All right. And the final thing I want to mention, one second. Your system diagram didn't cover the file flow until I mentioned it. Basically, you have to verify that your system meets the given requirements, but I think that part was missing here. So it's better for you to verify, in some way, that the system works under the given traffic of 1 million DAU. That's pretty much it; apart from what I said, I didn't find anything negative in what you did. And actually, about the good things that you did: I was impressed, because I have given the same question to multiple candidates at this level. In the beginning, you did pretty well at narrowing down the ambiguous system question into specific topics. And you tried to set expectations between you and me, which was great, because a system design can go anywhere. I've seen a lot of candidates waste their time on unimportant topics, and I always try to give them feedback about time management. But your time management was pretty good: you focused on specific topics and asked me which main topics I wanted to discuss as an interviewer, which was great. You should keep doing that in any future interview.
Dialectic Avalanche: Got it, yeah. This was feedback I got from a previous mock, which I incorporated in this mock.
Flannel Artichoke: Nice. Nice improvement. Yeah, I think this is pretty much all I have. Do you have any questions, concerns, or things to discuss with me before we finish? By the way, after this, I'm going to provide written feedback with more details.
Dialectic Avalanche: Yeah, that would be great, thank you so much. One last question. You said that somehow I have to convince you that this system design can handle 1 million DAU. How do I do that? Because initially we discussed that we would not cover telemetry. Telemetry, or analytics, is usually one way we can convince. So how can I do it without telemetry?
Flannel Artichoke: Oh, maybe I interpreted telemetry in a different way. In that case, maybe you should ask me before finishing, just to make sure we agreed not to cover telemetry at all. Or you might still want to verify with me that your system covers this much traffic, as a kind of interview skill, I think.
Dialectic Avalanche: Okay, okay.
Flannel Artichoke: Yeah. Right. Then goodbye, and have a good night.
Dialectic Avalanche: Yeah, thank you so much, this was very helpful. Good night, have a nice one.
Flannel Artichoke: I'm happy to be helpful to you. Bye bye.