Watch a technical mock interview with an Amazon engineer
Hot Gyro, an Amazon engineer, interviewed Wicked Gyroscope for a System Design interview
Summary

Problem type 

Video upload API

Question(s) asked 

Design a backend service for uploading videos on a large scale.

Feedback

Feedback about Wicked Gyroscope (the interviewee)

Advance this person to the next round?
  Yes
How were their technical skills?
4/4
How was their problem solving ability?
3/4
What about their communication ability?
4/4
Good job tackling the design problem today. You did a good job coming up with a high-level design, and it has all the right components for this service. Because of that, I rate this interview as a pass. We discussed the feedback in the last 20 minutes, and below is a brief summary:
- Whenever you use a dependency, find out as much as possible about its SLA.
- Consider using a coordinator to manage the fleet of workers, which lets you keep track of their health and scale.
- Focus on discussing how to scale your system and how to make it fault-tolerant.

Feedback about Hot Gyro (the interviewer)

Would you want to work with this person?
  Yes
How excited would you be to work with them?
4/4
How good were the questions?
4/4
How helpful was your interviewer in guiding you to the solution(s)?
4/4
N/A
Transcript
Hot Gyro: Hello. Wicked Gyroscope: Oh, hi. Hot Gyro: Can you hear me? Wicked Gyroscope: Yeah, I can hear you. Hot Gyro: I can hear you. Okay. So it looks like we have a system design session. Wicked Gyroscope: Yeah. Hot Gyro: So the way it works is that I have a question, and we'll spend about 40 minutes for you to design. And then I'll give you some feedback, and after that, we'll do a quick Q&A. How does that sound? Wicked Gyroscope: Okay, great. Hot Gyro: So, here we go. Imagine you are designing a video processing system. Okay. So let me explain a little bit here. We're talking about a backend service, right. And this service is a video processing system, which will handle a lot of the video uploads by clients. And the way the system processes the videos is by calling a third party, or sorry, calling an internal service that another team has, to actually do the processing. So you're not building the actual processing unit; you're calling another service to actually process each video, but you need to design a workflow to handle a lot of videos that are coming into the service. Right? So, a video processing system that uses an internal service, which has a set of APIs we call for each video. And it doesn't really matter what that processing is, but you can think of it like it's trying to extract a video, identify some objects, or do scene detection, things like that. And that process will take care of any kind of metadata or anything; all you need to do is pass in the video, let that internal process do its thing, and then that's it. Right. That's what your video processing system does. Wicked Gyroscope: Okay. And what should my system do, given that the internal service already does these functions? Hot Gyro: Yeah, so there is an internal service available for you to use. But that internal service only processes one video, right?
And you need to build an overall video processing system to handle a lot of videos that are uploading to your service. Wicked Gyroscope: Okay. So can you explain what processing means? You said, okay, I send the video as-is to this internal service through its API, and in the output I have some metadata, yeah. And does it store this metadata in another component, for use by other components? Hot Gyro: Yeah, maybe I can explain this again slightly differently. So assume you have n videos coming in, right, into your video processing system. That's what you're going to design. Now, for each video that comes in, you will call this internal service. Let's just call it a kind of machine learning type of processing. And it's an internal service; you don't have to worry about what it does. It's already given to you as a set of APIs. You just call it and pass in the video, and you let that internal service do its processing. And that's it, you don't have to worry about anything else, just pass it in and let this internal service do something. You just make sure that every video that comes in calls that internal service. Wicked Gyroscope: Okay, so now I understand more. And after that I don't care about it, so I need to focus on how I store, how I take the video in from the user. Yes. And I don't care about the processing. Okay. Hot Gyro: Yeah. So I think you asked a good question here about storage. Every video that's uploaded to your server will need to be stored somewhere for at least five years. Okay, so let me write this down. That was a good question. So any video uploaded to the VPS will be stored for five years. Wicked Gyroscope: So I wrote down some features, not to forget. So yeah, we need to store videos; users can upload videos. Yes.
And we have this stored for five years. Okay. And I need to send them for internal processing. Okay, so I have questions. For example, what's the latency between when a video is uploaded to my internal storage and when the video is processed? Should it be a synchronous process, or can it be asynchronous? Could the processing start, for example, five minutes after uploading, or should it start one second after uploading? Hot Gyro: Yeah, that's a good question about the requirements here. If I understand your question, you're asking how much time passes between the upload and the actual processing, right? There is no requirement here, so you are free to design whatever you like. Because... think of it as the user just uploading a video to your server. They don't need to do anything else. They don't need to retrieve the data after they upload it. They just submit it to your service. Wicked Gyroscope: Okay. So, okay, I wrote here "as soon as possible." Because, okay, we can assume some latency, but doing this as fast as possible will be optimal. Okay, so another question about the requirements. About features, is that all? Anything else? Nothing except upload and store videos? I think that's all the features, yes? Hot Gyro: Yeah. In terms of functionality, that's all. Wicked Gyroscope: So now, okay, let's focus on requirements. So, as soon as possible. And what about the users of the system? How many users will use the system? We need to care about this in terms of scaling. Hot Gyro: Yeah, you can think of this like popular video uploading services; think of it like TikTok, for example. People make a lot of these videos, and then they just upload them to the server. Right. I know there's another part of it, like discovery and searching as well, but we don't have to worry about that right now.
Let's just focus on storing lots of videos into this service. Right. So think of it like a very popular app, where you are designing the backend. Wicked Gyroscope: Yeah. So that's interesting. Hot Gyro: What is VHS here? Wicked Gyroscope: Video hosting service. Hot Gyro: Okay. Wicked Gyroscope: Yeah, so. Okay. I think there can be really large videos sent to us every second, yes? Because a lot of people could start sending videos... Let's return to queries per second a bit later. I have another question about requirements. What is the size of one video? Could it be two gigabytes or five gigabytes, or will it be like TikTok and not more than... I don't know, what's the largest a TikTok video can be? I think it can be more than 20 megabytes, but maybe I can choose a number. So what is the maximum? Hot Gyro: Right. Yeah, that's a very good question on the limit of the input here. Let's assume a 15 second video as the maximum for now. And if we have time later today, we can talk about what would change in the architecture if we allow, say, three-minute videos, or an hour-long video, things like that. But for now, let's just assume each video is no more than 15 seconds. Wicked Gyroscope: Okay, um, yeah. So 50 seconds. Okay, but we need to... Hot Gyro: I mean, 15. Wicked Gyroscope: Oh, okay. 15. Okay. So it's not so much, yes? So can I make an assumption about the disk space for this video? Hot Gyro: Yeah, you can assume that. It's not a big deal what that number is, but you can definitely make an assumption of what a 15 second video will look like. Again, it doesn't matter exactly; I mean, obviously it has to be in the right ballpark. It's not going to be 20 gigabytes, right. But any number that's reasonable would be fine. Wicked Gyroscope: Yeah, so let's not spend time calculating performance here. Let's say 300 megabytes is the maximum.
Because it's only 15 seconds. And if it's, for example, TikTok. Hot Gyro: That works. Wicked Gyroscope: Yeah. Okay. So the next question is how many users per second will send us a video. I think for a real system we can calculate it: take all users, for example, in the whole world, and then divide to get uploads per second. And, for example, if it's Black Friday, or when some situation happens in the world, some news happens, people start to upload a lot more videos. Should we care about that? Or can we assume some queries per second for now, videos per second, that users send to our system? Hot Gyro: Right. Yeah. So I think you can assume this is a very popular app. You're designing this service assuming that it's a very popular app, like TikTok, right? So think of it at global scale. And then again, you can assume what that number is. But it definitely needs to be a very scalable service, because a lot of users are using it. That would be the requirement: you're designing it for global scale. Wicked Gyroscope: Yeah. So let's assume that at the start we have in our system, for example, 15 million users per day, yes? Okay, let's start from this number. 15 million users per day. If we take the number of seconds in one day, it's about 100,000 seconds in a day. So then 15 million divided by that number with five zeros. It means on average our users make about 150 requests per second. But that doesn't mean they all upload. Out of these 15 million users, I think we can assume the maximum, if it's TikTok, could be 10 or 100 times more requests per second. Something like that? It could be, yes, but...
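The back-of-envelope arithmetic in this exchange can be written out explicitly. This is a minimal sketch: the 15 million uploads/day figure and a 10x peak factor are the assumptions from the discussion; everything else is plain arithmetic.

```python
# Rough request-rate estimate for the upload service.
# Assumptions from the discussion: 15M uploads/day, a ~10x peak factor.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400, close to the 10^5 used for mental math

uploads_per_day = 15_000_000
avg_rps = uploads_per_day / SECONDS_PER_DAY   # average requests per second
peak_rps = 10 * avg_rps                       # assumed spike (news events, holidays)

print(f"average ~{avg_rps:.0f} rps, peak ~{peak_rps:.0f} rps")
```

With these numbers the average works out to roughly 150-175 requests per second, and the assumed peak to under 2,000, which is what motivates the load balancer and multiple stateless API instances later in the design.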
Okay, so we need to care about the maximum requests per second. So, what percentage of requests is video uploading? It's not that many, yeah? Because on, for example, TikTok or Instagram... not that much is uploaded; people mostly watch. And another question: do we need to care about watching videos, or only uploads? Hot Gyro: Yeah, no, you don't have to worry about any of the read operations. You only need to worry about the writes. So think of it as an upload service, right? You're storing all these things. But the additional thing here is that for every video you store, you need to make sure that you're calling the internal service to do the processing. Wicked Gyroscope: Gotcha. So let's assume for now that we have 50K uploads maximum. So let's assume 50 videos per second that users want to upload. Maybe we can scale this number up later, but let's start with it. Okay, so now we know about the requirements. We want to send videos to processing as soon as possible, and we need to care about parallel video uploading, because users can send more than one video per second. And our videos are not so large; we need to care about only 300 megabytes. I'm thinking about one more requirement, about network traffic. Should I care about the network here? Because when all the users start to upload these videos, a lot of internet traffic is used. Do I need to care about that? Hot Gyro: I think for now, we don't have to worry about it. Wicked Gyroscope: Okay, but I ask this because for a real system, when you, for example, want to store some videos in cloud storage, they care about network traffic. You need to pay for network traffic and for storage. That's why I asked. Hot Gyro: Yeah, yeah. We don't have to worry about that for now. I think that's something we can easily take care of later on.
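The five-year retention requirement combined with these assumptions implies an enormous storage footprint, which is worth checking. A rough sketch, taking the discussion's 50 uploads/second and the (very generous) 300 MB upper bound per 15-second video as assumptions:

```python
# Storage footprint over the 5-year retention window.
# Assumptions from the discussion: 50 uploads/s, 300 MB per video (upper bound).
uploads_per_second = 50
video_size_gb = 0.3              # 300 MB
seconds_per_day = 86_400
retention_days = 5 * 365

total_gb = uploads_per_second * video_size_gb * seconds_per_day * retention_days
total_pb = total_gb / 1_000_000
print(f"~{total_pb:,.0f} PB over five years")
```

This comes out to a few thousand petabytes, which mostly shows that 300 MB for a 15-second clip is a large overestimate; it also explains why the retention cleanup worker discussed later in the interview matters for cost.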
Maybe we can talk about that later. But for now, I think we don't need to worry about it. Wicked Gyroscope: Yeah. So I think that's all the requirements. I think now I can move on to the common components of our backend system, as we just wrote up. Hot Gyro: Yeah, sure. That's okay. Wicked Gyroscope: Yeah. So let's walk through how the whole process goes. Let's draw our user here. Can you see what I'm drawing? Hot Gyro: Yeah, I see a line, and a person. Wicked Gyroscope: Yeah. That is our user. And another question here. Do I need to care about what the front end of my system is? Is it a web service? Is it a mobile application uploading this? Or do I not care about this? Hot Gyro: Good question. No, you don't have to worry about any of the clients. You're building a back end service. So you may want to expose a certain API, so that any client can call your service through these APIs. But you don't need to worry about the client at all. Wicked Gyroscope: Yeah, okay. And for my API, can I choose what is the best way? For example, a REST API over HTTP. Hot Gyro: Well, HTTP would be a particular implementation. So you're talking about a REST API, right? It can be a REST API. It may not be. It's up to you. Maybe you can explain why you chose a certain set of APIs. But this is going to be a distributed service. So I'll leave it up to you to decide how you want to expose this API to the users or to the client. Wicked Gyroscope: Yeah. So talking about that, I prefer to use a REST API here. It has some disadvantages, but it's really widely used and there is a lot of information about it. And it's really easy for external users and for external applications, for example mobile applications and websites, to use REST. So our user sends a request.
So here, I want to have a special API service, a stateless service. It's REST, our stateless API service, and it will receive our videos. I drew multiple instances, because we can put a load balancer in front; it's a really important part of the system, because we have a lot of parallel requests, and the load balancer splits all the traffic between the stateless services. And when one video is finally uploaded, I really want to put it into some storage. Something like S3 would be a good idea here. I want to store all videos, and for that I choose S3. The API is stateless here because we need to split this traffic between services. So when a service gets a video, it sends it to S3. S3 is the classical example of how we can store images and videos. It is scalable storage. And, okay, we need to care about metadata, of course. For example, we need to store some metadata in another database. Does that make sense, what I mean? S3 is where we store the videos themselves, and in another database we store data about our videos. Hot Gyro: Yep. Wicked Gyroscope: Yeah. We need to decide at this moment on the type of our database. I think MongoDB will be a good choice for now. Why? I think relational databases are not necessary here, because we only want to store metadata. There are no strong dependencies, no strong relations between different parts of our data. We need to store simple data, but a large amount of it, so I think a NoSQL database like MongoDB will be a good choice for now. Hot Gyro: Okay. Wicked Gyroscope: Yeah. And here, let's talk about caches. Because we cannot make slow requests to the database every time.
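The split described here, blobs in S3 and per-video metadata in MongoDB, implies a small document per video. A possible shape is sketched below; the field names are illustrative assumptions, since the transcript only says the metadata stores links to the S3 videos:

```python
# Illustrative metadata document for one uploaded video.
# Field names are hypothetical; the discussion only specifies "links to S3".
video_metadata = {
    "video_id": "v1",                       # primary key, also a natural shard key
    "s3_key": "videos/2024/01/v1.mp4",      # pointer to the blob in S3
    "size_bytes": 12_582_912,
    "uploaded_at": "2024-01-01T00:00:00Z",  # also usable for 5-year retention
    "status": "uploaded",                   # uploaded -> queued -> processed
}
print(sorted(video_metadata))
```

Keeping the blob pointer and an upload timestamp in one document is what later lets the workers look up a video for processing and lets a cleanup job find expired videos.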
Yeah, I can add another database for caching. It could be Redis, the classical example. Here, and not to forget the flow: first, when we upload videos, we also upload metadata, so let's put it into our database. Okay. Now, I think these are the base components for how we store video. Okay, one interesting moment about sending video from the user. When you send video to this backend system, it's really interesting how HTTP will handle it. One possible approach is to use HTTP headers and send the video part by part. Because if you upload, for example, 300 megabytes, I think that's a lot of data for one HTTP request. And if you use some HTTP headers, you can simply upload by chunks. That would be a good idea here. And what this API service will do is take the uploaded video, combine all the chunks, send them to S3, and then send the metadata about it to the database. In the metadata we can store links to the videos in S3. We store that metadata, with a key, in MongoDB, because our internal systems will later want to get some data, and because of that, it will be faster for us. So I think that's how our upload path will work. Now let's talk about how we will send this video to our internal processing system. I'll put this component over here. It's a black box. Let's assume it could have multiple servers. Does that make sense? Yeah. And now, we want to send our video to this internal processing system as soon as possible. For doing this, one good idea that comes to me is using workers and using queues. So what do I want to do?
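The chunked-upload idea can be sketched in a few lines. This toy version only shows splitting on the client and reassembly on the API service; a real system would use something like S3 multipart upload, with per-part checksums and retries. The 5 MB part size is an assumption, not from the interview:

```python
CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB parts; real limits depend on the protocol

def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Client side: cut the video into fixed-size parts for upload."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def reassemble(parts: list[bytes]) -> bytes:
    """API-service side: combine received parts before handing off to storage."""
    return b"".join(parts)

video = b"\x00" * (12 * 1024 * 1024)   # stand-in for a 12 MB upload
parts = split_into_chunks(video)
print(len(parts), "parts")              # 3 parts: 5 MB, 5 MB, 2 MB
assert reassemble(parts) == video       # a lost part can be retried individually
```

The payoff, as discussed later in the interview, is that a dropped connection only forces the client to re-send the missing parts, not the whole video.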
I want, when a video is uploaded, to send an event to a queue. It's like a pipe here. I think it could be Kafka, it could be SQS, something like that. And we have workers now. What will these workers do? Again, they are stateless servers. They read events from the queue, and "video uploaded" is exactly what our event says. After that, our worker wants to get the metadata. First it's going to check our cache for information about the video; if it's there, let's simply get it and send it to our internal system. If not, okay, let's go to our database for the metadata, and send the information to the video processing system. Yeah, so this is how our videos are uploaded and sent to internal processing. If we're going to talk about how the workers talk to that system, it could be Thrift, it could be gRPC, for example, because between internal components inside your company, gRPC would be a good way. It gives us a lot of nice features, like HTTP/2, and you can easily generate clients. And now the internal processing system knows about the metadata. And, okay, it's a black box; it reads the video from S3. We don't need to care about that part, because our videos are stored in S3, and the processing system can access the video by its metadata. So I think that's the common pipeline of how our uploading and processing system should work. Does that make sense? Hot Gyro: Yeah, yeah, I think I follow every step here. So there's still 10 minutes left before the end of the interview. Do you wish to talk about any of these components in more detail? For example, how do you handle failures? What kind of failures do you expect to see?
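The worker loop described here (read an event, check the cache, fall back to the database, call the internal service) can be sketched with in-memory stand-ins. Everything named below is a placeholder: the `queue.Queue` stands in for Kafka/SQS, the dicts for Redis and MongoDB, and `process_video` for the internal processing service's API:

```python
import queue

upload_events = queue.Queue()                       # stands in for Kafka/SQS
cache = {}                                          # stands in for Redis
metadata_db = {"v1": {"s3_key": "videos/v1.mp4"}}   # stands in for MongoDB

def process_video(meta: dict) -> str:
    """Stand-in for the call to the internal processing service."""
    return f"processed {meta['s3_key']}"

def worker_step() -> str:
    """One iteration of a stateless worker: event -> metadata -> internal call."""
    event = upload_events.get()
    video_id = event["video_id"]
    meta = cache.get(video_id)        # check the cache first
    if meta is None:
        meta = metadata_db[video_id]  # cache miss: fall back to the metadata store
        cache[video_id] = meta
    return process_video(meta)

upload_events.put({"video_id": "v1"})
print(worker_step())  # → processed videos/v1.mp4
```

Because the worker holds no state of its own, any instance can run `worker_step`, which is what makes the crash-recovery story later in the interview work.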
Things like that, do you want to talk about them? Wicked Gyroscope: Yes, I do. I think I can talk about every component here, because every component can fail, and we need to care about it. So let's talk about... Okay, the first problems that I see start from the clients. For example, a user starts to upload a video and then loses the connection, and one chunk, the last one, is lost. To handle that, we need to support retries on our client. I think it's on the client side. If we support that, it'll be better, because when the connection returns, the client retries and starts to send the next chunk. And chunks are a good idea here because, for example, we don't have to send the whole video again. If we already uploaded, say, 80% of the video, we shouldn't start from the beginning. Using retries is better. So, let's go to the API servers. I said previously that they are stateless, and a good approach is to use a load balancer here. And okay, let's assume our system is really, really large. If our load balancer dies, because every system can die, in this case I prefer to use DNS balancing. You can create different records in DNS, and the overall system can be more scalable. Yeah. So let's talk now about another component. I see Redis here. Redis can die, of course, and because of that, for a scalable system, it will be better to use Redis Cluster. In Redis Cluster, you create different masters and replicas, and if one master dies, a replica will replace it. And in Redis you only store keys and values. The system can die sometimes, but if we cannot find a key in Redis, we go to our Mongo storage.
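Client-side retries like the ones described are commonly paired with exponential backoff and jitter, so that clients reconnecting after an outage don't stampede the service. This goes a bit beyond what was discussed; the parameters below are arbitrary assumptions:

```python
import random

def backoff_delays(max_attempts: int = 5, base: float = 0.5, cap: float = 30.0):
    """Delays (seconds) before each retry: exponential growth with full jitter."""
    return [random.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_attempts)]

delays = backoff_delays()
print([round(d, 2) for d in delays])  # e.g. [0.31, 0.9, 0.05, 2.7, 6.1]
```

The jitter (picking a random delay up to the exponential bound rather than the bound itself) is what spreads retries out in time when many clients fail at once.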
And if Redis dies on some nodes, we lose that data; we know that Redis stores data in RAM, not on disk, so sometimes a machine can lose everything in RAM. But we don't lose any data overall, because we're also storing it in persistent storage. Yeah, about Mongo. A good thing about using MongoDB is that Mongo is easy to shard and is an easily scalable system; you don't need to care much about how to combine different shards, you can easily split the data. Because we store only the metadata, we can shard it, for example, by month or by year. And another requirement we wrote down is that we store data only for five years. Let's assume that our company has no money to store more. So another good idea here is a cleaner worker, a special worker that will send a request; we can run it not so often, for example once a month. It makes a call to MongoDB and looks for data that was stored more than five years ago, and if we find these rows, these documents, in our database, let's delete them from S3 as well, because we need to care about that storage. I think we don't need queues for this, because there are not so many events; we can flush this data only once a month, for example. And for doing that, this scheduled worker requests the data from our Mongo database. S3 is usually easy to scale too; in some ways it's like Redis, it has buckets and keys to store data, and because of that it handles the scaling and sharding. And another thing I'm thinking about here is what our workers do. We send data to our internal processing system, and in the real world this internal system can die, yes, and sometimes it can return bad responses to us. And I think a good approach here for us is a retry system. Does that make sense, what I mean?
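The monthly cleaner worker boils down to one query: find documents older than the retention window, then delete them from both MongoDB and S3. A minimal sketch of the selection logic, with plain dicts standing in for MongoDB documents:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=5 * 365)  # the five-year requirement from the interview

def expired(docs: list[dict], now: datetime) -> list[str]:
    """Return ids of videos stored longer than the retention window.

    In the real system these ids would drive deletes against both the
    metadata store and the corresponding S3 objects.
    """
    return [d["video_id"] for d in docs if now - d["uploaded_at"] > RETENTION]

now = datetime(2024, 1, 1, tzinfo=timezone.utc)
docs = [
    {"video_id": "old", "uploaded_at": now - timedelta(days=6 * 365)},
    {"video_id": "new", "uploaded_at": now - timedelta(days=30)},
]
print(expired(docs, now))  # ['old']
```

Deleting the S3 object only after the metadata row is found (and ideally before removing the row) avoids orphaned blobs that no document points to.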
Hot Gyro: Can you draw the retry system and the other components that come with it? Wicked Gyroscope: Yeah. So when our workers get a request from the queues, we want to send the video, the metadata about the video, to our internal processing system, and the system responds to us. I think a good approach here is to use retries. We have queues, and in queues it's sometimes hard to implement retries, but it's possible, for example, if we use two queues. Hot Gyro: Okay, so we have about a minute left. Let me maybe ask a couple of questions here. You mentioned the retries in the workers. What happens if a worker just crashes, right? Who is going to detect that the worker is no longer functional? Wicked Gyroscope: Yeah. So in reality, a worker reads from the queue when it wants to get a new task. So the good part here is how our queue handles messages. If one worker got one message and did not finish it, for example, if you use SQS, the worker needs to receive the message and then delete it. When a worker does not delete the message, it will return to the queue again. So in this case, when a worker dies and does not send the delete for a message for a long time, the message is appended to our queue again, and a new worker will get the message. That works because our workers are stateless machines. If one worker dies, another worker will do the work. Hot Gyro: Okay, yeah, I think what you explained makes sense in terms of using the queue's timeout, or time to live, as a way to handle that. Now, what happens if all your workers die? Then your queue just keeps piling up, right? Do you have a mechanism to recover from all these workers dying? Wicked Gyroscope: So it's really hard to assume all my workers die, because in a real system we split them across different data centers.
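The receive/delete behavior described here is SQS's visibility-timeout mechanism, and it can be modeled in a few lines. This is a toy with simulated time, not the real SQS API; the 30-second timeout is an arbitrary assumption:

```python
class VisibilityQueue:
    """Toy model of SQS-style receive/delete with a visibility timeout.

    A received message becomes invisible to other consumers for `timeout`
    simulated seconds; if the worker never deletes it (e.g. it crashed),
    the message reappears and another worker picks it up.
    """

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.clock = 0.0                       # simulated time, not wall time
        self._messages: dict[str, list] = {}   # id -> [body, invisible_until]

    def send(self, msg_id: str, body: str) -> None:
        self._messages[msg_id] = [body, 0.0]

    def receive(self):
        for msg_id, entry in self._messages.items():
            if self.clock >= entry[1]:
                entry[1] = self.clock + self.timeout  # hide while being processed
                return msg_id, entry[0]
        return None

    def delete(self, msg_id: str) -> None:
        self._messages.pop(msg_id, None)       # the worker's "ack" after success

q = VisibilityQueue(timeout=30)
q.send("m1", "video v1 uploaded")
assert q.receive() == ("m1", "video v1 uploaded")  # worker 1 takes the message
assert q.receive() is None                         # hidden from other workers
q.clock = 31                                       # worker 1 crashed; timeout passed
assert q.receive() == ("m1", "video v1 uploaded")  # redelivered to worker 2
q.delete("m1")                                     # worker 2 finished and acked
assert q.receive() is None
print("redelivery works")
```

This is exactly the property the candidate relies on: a crashed worker's message is not lost, it simply becomes visible again and a surviving worker processes it.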
Yes. And we keep a lot of instances, so if all workers died, it would mean all the data centers died, in Europe, in America. Yeah. Okay. Hot Gyro: No, I guess the question is more about how you know when a worker dies. You're saying that the queue will time out, and then another worker will pick it up. But what if more workers die, it doesn't have to be all of them: let's say you just have a few workers, say you started with 10, and then all of a sudden half of them died, and your queue just keeps piling up. Right. So what would be the way to handle this? Wicked Gyroscope: Yeah. So here I want to talk about a monitoring system. Because in the real world, a lot of monitoring systems already exist, like ZooKeeper, or more modern systems like etcd. It's a scalable solution, and you can quickly know when your worker dies, for example. And if you combine it with some system like Kubernetes, it means that if a worker dies, the system knows about it and starts it back up. Hot Gyro: Yeah, sounds good. I think a monitoring system, or some sort of way to keep track of the health of the workers, is a very good choice. So that's a good call-out there. At this point, I think I will end the interview; it took about 40 minutes or so, and that's usually the right time for a system design interview. So let me give you some feedback about everything that you talked about, and then we can do a Q&A later on as well. So overall, before I get started, what do you think about the question? And how do you think you did? Wicked Gyroscope: Interviewers always ask that, yes. I think, yeah, I cannot see my own mistakes or what I didn't ask. I think I covered all the features that you asked about, and the requirements. Okay, my feeling is maybe I'm not so good on requirements and features here. Maybe I need to ask more, maybe I need to clarify all these requirements.
I think my whole system works, and I answered all of your questions. I see a lot of other things here that we could talk about, but overall, yeah, I focused on the main components. Hot Gyro: Yeah, yeah, I agree with you on a lot of these parts. First of all, I think you have all the good pieces of this design, right: the queue, S3, workers. It's really the template of a workflow system, and I think you have all the pieces. So that's a very good part. I do agree with you on the requirements part; there are a few things I want to call out that might be missing. I think some of the questions that you asked are great, but there are some areas that you'd want to clarify even more, and I'll go through that in detail with you. But overall, at a high level, I kind of agree with your assessment as well: I think the requirements part can improve a little bit, but in terms of the actual high level design, I think you have all the pieces. So let's go back to the beginning, and I'll go into detail on each section. Okay. So let's start with the features, or sometimes we call them the requirements, right? And in the requirements here, actually, you do have it at the bottom; sorry, I don't mean to write over that. We should really separate the two here, one being the functional and the other being the non-functional requirements, the latter being what the system guarantees. And typically some people will call this the SLA, the service level agreement: what does the system guarantee? As for the functional parts, I think you talked about what features we have: it is about uploading videos. So what you had in the features is about uploading videos, storage for five years, and then for each video, you send it to the internal processing. I think that right there, that's all you need to do for this problem.
I think you have all the most important features of this system. That's good. And then in the requirements for the functional part, I think a better word for this is capacity estimation, like how many resources do you need; you talked about the storage size and all that, but let's come back to that estimation in a bit. I want to talk about the non-functional requirements, because I think that's actually missing in our discussion. What I'm looking for here are the typical distributed-systems keywords: you want the system to be fault tolerant, right? That means if some parts of your system fail, the system still continues to run. You want it to be durable: videos cannot be lost once uploaded. You just need to mention that these are the requirements of this system, that when you upload a video, you can be guaranteed, you can be sure, that this video will never be lost. Another thing, you did mention that you want this to be, I guess, responsive or performant, that the processing needs to happen right away, in line 18 here. The only thing is, I don't think that's really a requirement. You don't really need to process each video right away, because it's really back end processing; there's no need to do it fast, you could do it later. But you chose to do it fast, and that's okay, but then you need to make sure that your design is actually able to deliver that. So I would say that is not that important, but it's okay to make it a requirement; it's up to you. When you have more requirements, your system will be more complex. Right. But that's okay if you want to do that. And I think the other part is being highly available, right? And I know I'm just calling out these big words, but later on, when you do your diagram, you are going to come back to these requirements and say, hey, why am I drawing this?
Why do I need a load balancer? Why do I need multiple copies of the API web service? Because I want to make sure this is a highly available service. You have so many calls, I think you said 50,000 calls per second, 50,000 RPS; you want to make sure every one of those calls will be handled, so you make sure the service is highly available. The durability here is good, because S3 takes care of that for you: S3 has 11 nines of durability. But when you use these kinds of off-the-shelf services, you also have to explain the choice. Why S3, and not, say, commodity hard drives? S3 gives you that durability in the cloud, and that provides the durability for the whole service. So these are the requirements you should always talk about, and then when you design it and draw the diagrams, you explain why you need certain components to fulfill those requirements. Wicked Gyroscope: Yeah. Sorry for interrupting you, but is that common for all interviews? Hot Gyro: You're right that most questions have these kinds of requirements, but there is always something a little different. For example, here we don't really talk about latency so much. Some systems require rapid response, but this is mostly a write service; you don't need to do any reads, so there's no requirement on latency. Some services require security as well, though not many, and we don't have that here. But you can actually think about it: what if somebody tries to abuse your service by just uploading garbage? How do you handle that? That's something I didn't ask you about, but you can bring it up yourself and say, hey, here are some vulnerabilities of the service.
In fact, the more senior the role you're applying for, the more people expect you to think about all the problems that could happen in any service. Security is a big concern, so for you to think about that and call it out is actually a big plus. You cannot expect the interviewer to tell you all these things, because they intentionally keep the problem vague; they want you to ask the right questions. Now, we've talked about all the requirements of the service you need to build. There's one thing missing in the requirements that I think you should also consider: part of this problem is that you are using a dependent service. This internal service is a dependency, and whenever you use a dependency, you need to find out as much as possible about its SLA. The kind of questions I wanted you to ask here were: how long does it take to process each video? If you had asked, I would have said, for a 15-second video, roughly one minute, something like that. Without asking, you don't know the performance characteristics of that API. You could also ask: what kinds of errors can occur? What are the usage limits? Because if there are limits, you need to handle throttling: how do you handle an internal dependency that throttles you? You have to do a retry. That's where you need to talk about all these things. The most important point here is asking those questions. Once you ask them, you can make assumptions, and then when you build your high-level diagram, you can talk about it: my workers, even though they're stateless, do a few more things than just calling the API; they do error handling and so on.
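The retry-on-throttling behavior the interviewer describes for the workers could be sketched like this. This is a minimal illustration, not the interviewee's actual design: `process_video` stands in for the internal service's API, and a plain list stands in for the dead-letter queue mentioned later.

```python
import random
import time

MAX_ATTEMPTS = 3

class ThrottledError(Exception):
    """Raised when the dependency rejects a call for exceeding its usage limits."""

def call_with_retries(process_video, video_id, dead_letter_queue):
    """Call the internal processing API, retrying throttled calls with
    exponential backoff; park the video in a dead-letter queue after
    MAX_ATTEMPTS failures instead of losing it."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return process_video(video_id)
        except ThrottledError:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s between attempts
            time.sleep((2 ** attempt) + random.random())
    # Still failing after all retries: hand it off for later inspection
    dead_letter_queue.append(video_id)
    return None
```

The backoff keeps a throttled worker from hammering the dependency, and the dead-letter queue makes the failure visible instead of silent, which is exactly the error-handling discussion the interviewer is asking for.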
And if a video still fails after three retries or whatever, you do something about it: you put it in a dead-letter queue or handle it somewhere else. Talking about the error handling is the important part. Those are the questions I wish you had asked about the dependency service. Other than that, between everything you talked about and everything I added, that should be good coverage of the requirements. Now, on to the capacity estimation. Again, you can assume a lot of things: maybe there's a million requests per day, maybe a billion, it doesn't matter. What matters is how you use that number. If it's a million per day, that's about 12 per second; if it's a billion, that's about 12,000 per second. So 12,000 requests per second is your RPS, and that's a good number to work with. When you have a high load, you're likely to need load balancers. Some people would even say: I'll assume each machine handles 1,000 requests per second, so I need about 12 machines. Again, these are numbers you make up, but they give you a logical flow: I need about 12 machines as web servers behind a load balancer to handle these requests. That's what justifies the load balancer and the web services there. The other thing we need to talk about is storage. You said 300 MB per video, but I didn't see you calculate the number for five years of storage at 1 million videos per day. And 300 is a little too big; in general, when you do this math, always use simple numbers like 1 or 10, which makes things a lot easier. So let's say we do 10 megabytes per video.
We have 1 million videos a day at 10 megabytes each, so that's 10 terabytes per day. Then the question is, how much is that for five years? Five years is about 2,000 days, so that's roughly 20 petabytes. The reason people round five years to 2,000 days is that it makes the math quick, and quick math is what the interviewer expects to see. Wicked Gyroscope: Sometimes it's hard for me to do this arithmetic: when you have millions of files at 10 megabytes each, is that gigabytes or terabytes? In a previous interview I spent a lot of time on that before I finally got it. Hot Gyro: Yeah, even I'm having a little trouble at this time of day, so I understand; it takes a little practice. But if you make the numbers easy, it should be doable. So, let's keep going. At this point, you usually want to have spent about 10 minutes, because you want the remaining 30 minutes for your high-level design, the low-level design, and, most importantly, how you handle failures. So after 10 minutes you should be moving into your high-level design, then the low-level design: the schema of your NoSQL database, MongoDB, what the tables are, briefly, what you store in them. You mentioned metadata, which is good; we'll go into detail in a bit. But most important is how to handle failures. On the high-level design, I actually agree with you on almost everything you mentioned. The workflow has all the components I mentioned earlier; it fits the bill.
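The back-of-the-envelope numbers from the discussion can be written out to check them. The inputs (1 million uploads per day, 10 MB per video, five years rounded to 2,000 days) are the assumed values from the conversation, not measured figures.

```python
SECONDS_PER_DAY = 86_400          # often rounded to 100,000 for mental math

requests_per_day = 1_000_000
rps = requests_per_day / SECONDS_PER_DAY                     # ~12 requests/sec

video_size_mb = 10
storage_per_day_tb = requests_per_day * video_size_mb / 1_000_000   # 10 TB/day

retention_days = 2_000            # five years, rounded for quick math
total_storage_pb = storage_per_day_tb * retention_days / 1_000      # 20 PB

print(round(rps), storage_per_day_tb, total_storage_pb)
```

The point the interviewer makes is that the exact values matter less than the chain of reasoning: daily rate to RPS justifies the load balancer and server count, and daily storage times retention justifies choosing S3 over anything self-managed.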
Except one thing I wanted a little more detail on: the worker. Everything else I agree with, but I asked you, what happens if a worker dies? How do you keep track of that? One common solution is to have a manager or coordinator. The coordinator tracks all the workers: it does a health check, pinging each worker every 30 seconds or every minute, and uses that information to determine whether each worker is alive and working. If a worker misses several consecutive pings, you know you're one worker short, and you can decide whether to scale up or replace it. The coordinator is also there to help you decide: if my queue is getting large, that means more requests are coming in, and I need to scale up my service. That's how you make your service scalable, which is a non-functional requirement. This is something you want to talk about. You may not need to build it all, but at least mentioning it lets the interviewer know that's one way of scaling a system: a mechanism to add or remove workers. And that's really the only thing I would add to your design. Everything else I really like, especially the S3 storage for the actual videos and the metadata database. The next thing you really want to talk about is the schema of your metadata. You can also say that because you're storing just metadata per video, the space here is negligible, but you can still come up with a small estimate, maybe 10 kilobytes per row, per video, multiply by the rows per day, and show the number is manageable.
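The coordinator the interviewer describes boils down to two checks: liveness from heartbeats, and scaling from queue depth. A minimal sketch, where the ping interval, miss threshold, and per-worker capacity are all illustrative numbers rather than anything from a real system:

```python
import time

PING_INTERVAL_S = 30      # coordinator pings each worker every 30 seconds
MISSED_PINGS_DEAD = 3     # after 3 consecutive missed pings, declare it dead

class Coordinator:
    """Tracks the last heartbeat from each worker and flags dead ones."""

    def __init__(self):
        self.last_seen = {}   # worker_id -> timestamp of last successful ping

    def record_heartbeat(self, worker_id, now=None):
        self.last_seen[worker_id] = now if now is not None else time.time()

    def dead_workers(self, now=None):
        """Workers whose last heartbeat is older than the allowed window."""
        now = now if now is not None else time.time()
        cutoff = PING_INTERVAL_S * MISSED_PINGS_DEAD
        return [w for w, t in self.last_seen.items() if now - t > cutoff]

    def should_scale_up(self, queue_depth, num_workers, per_worker_capacity=100):
        # If the backlog exceeds what the live fleet can drain, add workers.
        return queue_depth > num_workers * per_worker_capacity
```

In an interview it's usually enough to describe this mechanism; the value is showing you know how the fleet stays healthy and how the system reacts to load.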
But again, I think the table schema would be helpful to talk about. Maybe you have one videos table; you may also have a user table. So that's the low level. One thing I forgot here is API design. We spent some time on why a REST API is good, and that's great: you explained that you chose REST because it's good for mobile and the web and it's widely used. But you also want to spell out what the actual API is. For example, upload video: what are the parameters? A user ID, the actual file path, maybe some metadata, things like that. Keep it as simple as possible; if the interviewer wants a bit more, they will ask for it. Otherwise, keep it simple. One thing you do need to figure out here is whether this is a synchronous or an asynchronous call. You might say, I'm going to do both, and define two functions and explain what each is for, or maybe even a batch upload. It's really up to you how much you want to design, because if you do batch, now you have to talk about how you design that. I would start from something simple and ignore batch for now, but keep it in mind; if you have time or want to challenge yourself, you can always add it. You really need to decide what you want here: if it's async, then when you call the upload function it returns some kind of video ID, and you need another function to get the status later, so the client can check on the processing. Or maybe it's a synchronous call: I call it, wait for it to finish, and then exit.
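The async shape the interviewer describes, upload returning an ID, plus a separate status check, can be sketched as two handlers. The endpoint paths and field names here are illustrative, and in-memory structures stand in for the metadata database and the queue:

```python
import uuid

# In-memory stand-ins for the metadata store and the work queue.
videos = {}       # video_id -> {"status", "user_id", "s3_path", "metadata"}
work_queue = []   # video_ids waiting for the internal processing service

def upload_video(user_id, s3_path, metadata=None):
    """POST /videos -- async style: record the upload, enqueue it for
    processing, and immediately return an ID the client can poll with."""
    video_id = str(uuid.uuid4())
    videos[video_id] = {
        "status": "QUEUED",
        "user_id": user_id,
        "s3_path": s3_path,
        "metadata": metadata or {},
    }
    work_queue.append(video_id)
    return video_id

def get_status(video_id):
    """GET /videos/{id}/status -- lets the client check processing progress."""
    return videos[video_id]["status"]
```

A synchronous variant would instead block in `upload_video` until processing finishes; given a roughly one-minute processing time per video, the async shape is the more defensible choice here.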
So that's a decision you have to make in your service. Any questions at this point? Wicked Gyroscope: Yes, about the API. Based on the features, I assumed we only care about uploads, and that we don't need to report status back to uploaders, because I assumed this is an asynchronous system: the client uploads, and the processing happens later for some other consumer. Correct? Hot Gyro: Can you repeat the question? Wicked Gyroscope: I saw your API interface with upload and get status. I didn't think about get status, because in the features, as we discussed earlier, we only need to care about the upload and about sending the video on for internal processing. And we said this is an asynchronous process, so I didn't worry about a synchronous API. Hot Gyro: You're right. I think what you're saying is: we don't need to worry about the status; you upload and then forget about it. Fire and forget. But we still need to make sure the video is uploaded, right? How do you guarantee your video is uploaded? What if something goes wrong internally? How do you retry? That's a point you can argue about, and you're right that you did ask these questions in the beginning. But if the requirement were really just "call upload, and I don't care whether it succeeded," that wouldn't be a good experience. If you say, as the API developer, I don't guarantee anything, that's not really the requirement. The requirement is that when you upload, it has to be done properly.
So having the functionality to check whether an upload succeeded, I think that's also part of the requirements. But I agree there's some gray area here because of what we talked about early on, so I'll leave it at this: it's better to have it than not, though you can argue either way. Okay, I know we're over; I need about five more minutes. Is that okay? Wicked Gyroscope: Yes. So the tricky part for me is how much to focus on requirements and features, because there are a lot of questions you could ask, but you only have an hour per question, so you need to focus on the main components. Hot Gyro: Yeah, you're right. If I were to evaluate your time management in this interview, I think you did the right thing. Looking at my notes, you started drawing the diagram at the 20-minute mark, which is good because that's the halfway point. By the 30-minute mark you had finished the diagram and explained everything, and then you started talking about error handling, so you had 10 minutes for that. That's actually a very good time distribution. But keep in mind that the more senior the role you're going for, the more discussion you should have in the error handling part, because that's also where you talk about scaling your system to the next level, 10x, 100x. How do you change the system then? You want to have that kind of discussion with the interviewer: how do you scale, how do you handle failures, how do you handle additional features? Like we said, right now it's just 15-second videos; what happens if users can upload full movies? Then you talk about how movie-sized uploads change things.
How do you redesign or re-architect your system to handle different things? Or even security: what if the interviewer says, somebody is DDoSing your API, what do you do? Do you add a rate limiter, or something else? I suggest you give yourself 20 minutes to talk about error handling, and the other 20 minutes for everything we've covered so far: requirements, clarifying questions, API design, high-level design. Having said that: your high-level design is solid; on the low level I just want a little more detail; and on the error handling I have a few things to call out, so feel free to discuss. One thing is the cache. You did name the specific technology you want to use for it, and that's great; that should be part of the low-level design. When you say you need persistent storage, you should talk about whether it should be SQL or NoSQL. I think you went straight to MongoDB, and I forget whether you actually covered the pros and cons of Mongo versus a relational database. If you didn't, remember that it's important to explain every design choice you make. Now, the part I have questions about is the cache itself. For a write-heavy service, I think a cache is not very helpful: the cache has a limited amount of space, and very likely you'll be invalidating it while your workers are reading through it, so I don't see the value of having one here. It looks to me like for each request you write to the cache and then write to your database. And someone will look at this and ask: what happens if one of those writes fails? Now you have a data inconsistency problem.
So that makes it something you need to talk about. You also write to the queue, so you're doing three writes and hoping they're all atomic, collectively a transaction, but they're not: any one of them could fail. Whenever you draw a diagram like this, you need to bring that up proactively, because if I'd had more time I would have asked you: what happens if the write to the queue fails, or the cache write fails but the MongoDB write succeeds? These are talking points. Or you can say, hey, here's a way to make this as close to a transaction as possible. There are multiple answers; these are common distributed-systems problems that have already been solved. You just need to call them out and say, this is one way to handle it. Beyond that, we talked a bit about the internal service: what its API looks like, what its SLA is, how you handle its failures. There would be some discussion there too. So those are the things I'm calling out on the error handling part. The rest is the natural extension: how do you scale even further, how do you handle security? Those are the other two topics the discussion could go to. With that, I think that's all I have. If you have any questions, please let me know. Wicked Gyroscope: Yeah, I have one last question. Do you think I passed or not, and what level should I target at your company? Hot Gyro: I think, from talking to you and going over the question, you're very experienced.
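One well-known answer to the "three writes that aren't a transaction" problem is the transactional outbox pattern: write the record and a pending queue message in one database transaction, and let a separate relay publish from the outbox. The transcript only alludes to "making it as much of a transaction as possible," so this is one possible technique, not the interviewee's design; SQLite stands in for the metadata store, and the table names are illustrative.

```python
import sqlite3

# SQLite stands in for the metadata database here; a real system would use
# its actual store (e.g. MongoDB with transactions, or a relational DB).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE videos (video_id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (msg_id INTEGER PRIMARY KEY AUTOINCREMENT,
                         video_id TEXT, published INTEGER DEFAULT 0);
""")

def record_upload(video_id):
    """Insert the video row AND its queue message in one transaction,
    so neither can exist without the other."""
    with db:  # commits both inserts together, or rolls both back
        db.execute("INSERT INTO videos VALUES (?, 'QUEUED')", (video_id,))
        db.execute("INSERT INTO outbox (video_id) VALUES (?)", (video_id,))

def relay_outbox(publish):
    """Separate process: push unpublished messages to the real queue, then
    mark them published. Safe to re-run after a crash; duplicates are
    possible, so queue consumers must be idempotent."""
    rows = db.execute(
        "SELECT msg_id, video_id FROM outbox WHERE published = 0").fetchall()
    for msg_id, video_id in rows:
        publish(video_id)
        db.execute("UPDATE outbox SET published = 1 WHERE msg_id = ?", (msg_id,))
    db.commit()
```

The trade-off is at-least-once delivery: the relay may crash between publishing and marking, so the workers downstream must tolerate seeing the same video ID twice.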
And just from seeing you draw the diagram, I can tell you know exactly what a workflow system for this looks like. The internal service part might have thrown you off a little; maybe there was a discrepancy between what the interviewer was looking for and what you were expecting. Maybe you assumed it was a perfectly scalable service, but you don't know that: you cannot assume a dependency is always scalable, you have to ask all those questions. That part is intentionally vague. The other part is the error handling; I would like to see a little more discussion there. And like I said, the more senior the role you're going for, the more the interviewer expects you to talk about those parts. So that's what was missing, but it's not that you don't know it; it's more about time management and how much time you allocate to that part of the discussion. Once you have that mindset, putting a bit more emphasis on the error handling and scalability parts, I think you'll have no problem passing outright. The only issues here are a little bit of time management and the expectations around the dependency service. Other than that, I have no concerns. Capacity estimation is, in a sense, throwaway: nobody cares about the exact numbers. But it's a way to do some quick math and explain the resources a service like this needs, and it's common practice that everybody expects to see, the quick arithmetic, the numbers every engineer should know. It's just a way to demonstrate that here.
So with that, if I were to rate you on this system design interview, I would say it's a pass. The reason is that you have all the pieces, and I can't fault you much for not asking certain questions; it depends on the question, and not every question is clear-cut. Design WhatsApp, design Instagram: they're the same kind of thing, but something is always a little different. So I wouldn't call that a big issue. Wicked Gyroscope: Yeah, thank you. It was really helpful for me. Hot Gyro: Okay, great. If there's nothing else, I think we can end the session for today. Wicked Gyroscope: Yeah, okay. Have a good day. Hot Gyro: Yep, yep. Bye.