
Job Scheduler (System Design)

Watch someone solve the job scheduler problem in an interview with a FAANG engineer and see the feedback their interviewer left them. Explore this problem and others in our library of interview replays.

Interview Summary

Problem type

Job Scheduler

Interview question

Design a Job Scheduler that runs at scheduled intervals

Interview Feedback

Feedback about Mutant Anteater (the interviewee)

Advance this person to the next round?

How were their technical skills?

2/4

How was their problem solving ability?

2/4

What about their communication ability?

3/4

Strengths:
+ Nicely gathered functional and non-functional requirements.
+ Nice suggestion on AWS Lambda.
+ Decent high-level designs and good discussion of various tradeoffs. Nice suggestion on using Kafka queues.
+ Good communication skills.

Weaknesses:
- When probed for details, TC didn't perform too well, specifically on the scheduler's interaction with the database and how fault tolerance and cancellation will work.
- Missed providing a high-level system design.
- Missed status updates in the database, and hence the database (persistent layer) didn't know what was happening with the rest of the system.
- Haphazard suggestion on using consistent hashing to assign jobs to schedulers.

Overall:
+ Visualize the overall user journey in your mind and then use those functional requirements to drive the interview.
+ Use time wisely to get to the most interesting part of the question ASAP and not bore the interviewer with mundane stuff.
+ When answering a question, discuss your design choices with the interviewer and keep checking where they want to go into details.
+ Change the functional requirements if the system is becoming too complicated due to unreasonable requirements. Challenge the status quo!

Feedback about Doctor Squab (the interviewer)

Would you want to work with this person?

Yes

How excited would you be to work with them?

4/4

How good were the questions?

4/4

How helpful was your interviewer in guiding you to the solution(s)?

4/4

Interview Transcript

Doctor Squab: So I've been working at FAANG for about six plus years. My current level is L6. And I have interviewed about 100 plus candidates, a lot of them for system design, a lot of them for coding. I have sat in hiring committee meetings and stuff like that. So hopefully this interview will be helpful in that regard. My core expertise is in machine learning, mostly in images and videos. But in the past, I have kind of done all sorts of stuff. Yeah, that's about me. I wanted to understand what kind of roles you are applying for, so that I can tailor the interview according to that.

Mutant Anteater: Sure. So I'm applying for an e-commerce role, especially the risk management system, something like that. So, okay. Yeah.

Doctor Squab: And what is your expertise? Is it in back end, front end? What kind of back end? Okay, cool, cool, cool. And the roles you are applying for are also related to back end, I assume, right? All the back end roles? Yeah. Cool. Yeah, cool. Okay, that's good. And may I know what kind of level you are targeting?

Mutant Anteater: I'm targeting maybe. Actually, I'm targeting L4. So that's okay.

Doctor Squab: Yeah, that's good. No, it's good to know so that I know I can kind of calibrate my feedback for that role. Cool. Fantastic.

Mutant Anteater: You can assume, oh, so if I perform really badly, then you can assume that I'm applying for L4, but if not, you can use high level, maybe L5 tricks to give me feedback. I don't know.

Doctor Squab: I can discuss this later also. But typically, when you are reviewing a candidate, let's say the recruiter has mapped you to L4, right? Then most of the time the interviewers will be looking for L4 signals from your interview. If you do extremely, extremely well, more than L4, then they might recommend you for L5, but that rarely happens. That is like one in, I don't know, like one in 200 candidates. Right. So it's very rare that they might give you feedback for L5. But let's say if you had applied for L5, they might give you feedback for L4; that can happen. But L3 does not usually do design questions. So I would say that if it's L4, let's focus on L4. And of course I'll tell you what would have made it L5, right. So I'm happy to tell you that later. Okay, cool. So these interviews, although they are scheduled for 1 hour, are typically done in 45 to 50 minutes. That's what the real game is. And then the ten minutes is introduction and question answering and stuff like that, follow ups. So let's do it that way. Let's try to do it in 45 to 50 minutes depending on how we are doing. Okay, that's good. Okay, so we are starting at 10:08. So I don't know what time it is in your time zone, but you can see a clock on the bottom, it says 6:30 or something, you know. So let's try to do that 45 minutes from there. Sound good? Okay. Okay, let me. And you can use the whiteboard feature. Let me. I can come in, I'll just delete this whole thing.

Mutant Anteater: I'm already in the whiteboard.

Doctor Squab: Okay, you're already on the whiteboard. Fantastic. Okay. I don't know if it's easier to paste question. Okay, sounds good. I can paste question here. So here's the question. Oh, no, wait, 1 second. Sorry, sorry, sorry, sorry, sorry, sorry, sorry. Okay, so 1 second, 1 second. I gave you a wrong question.

Mutant Anteater: Yeah, it looks like some security.

Doctor Squab: No, I do conduct interviews too, so I just, I think the wrong thing got pasted. Sorry about that. Okay.

Mutant Anteater: Job scheduler that runs job at a scheduled interval.

Doctor Squab: Yes. And I can tell you a little bit more. See, job schedulers are used everywhere in the industry. One example is if you want to schedule a job to do something in a company, for example, roll out this PR or roll out this code at a certain time. Right. Or run this SQL query every day. And that SQL query will do A to B. So the jobs can be anything. Or say you're building a payment system, then you might schedule to pay bills on every 30th of the month, or you can schedule it every week or whatever. So something like that. So job schedulers are everywhere, but here we will focus on designing a general purpose job scheduler. So what will be the job itself? You don't have to worry about that; we are mostly focusing on the scheduler part of things.

Mutant Anteater: Okay, so will this be a service for just back, back in a background. Background thing. I don't know. It's the whole system. It's a whole host.

Doctor Squab: Yeah.

Mutant Anteater: So, okay. How are our clients going to interact with this job scheduler?

Doctor Squab: Yeah, what are your suggestions?

Mutant Anteater: Well, my suggestion is maybe we can do something like AWS Lambda, so our clients can upload their job in Python or as a Docker image, I don't know, and make it run at a scheduled interval.

Doctor Squab: Good. Yep. That works for me. Yeah.

Mutant Anteater: Image or a lambda package?

Doctor Squab: Okay.

Mutant Anteater: Okay, so, okay, let me think. Will there be any size limitation for this job?

Doctor Squab: Yeah, you can expect. Excuse me. The size limitations: you can say that the job is five megabytes. That's 5 MB, the size it needs. That's the average size, I'm telling you, that it needs to run the job.

Mutant Anteater: Okay, so 5 MB memory or total?

Doctor Squab: You know, 5 MB in memory. Disk, you don't have to worry about. Just assume that you have something like EFS or whatever, like an infinite amount of disk is available.

Mutant Anteater: Okay. Okay. Okay.

Doctor Squab: Yes.

Mutant Anteater: Okay. Of course. Let me think. Is there anything else I should notice for this system?

Doctor Squab: Tell me what kind of. I would expect you to. I'm happy to answer. And maybe you tell me what you should know to design this.

Mutant Anteater: Sure. So there should be a website where something that client can do interaction. And will this interval change?

Doctor Squab: Yeah, interval is. Of course, I mean, you can say that interval is a timestamp in the future at which you want to run, basically. So.

Mutant Anteater: Okay, so this sounds good.

Doctor Squab: And it is changeable. Yes. So you can change your mind. So think of like, say, in case of payments, you wanted to send money on 30th and you changed your mind. Now you want to send on 15th. So that should be allowed.

Mutant Anteater: Okay, cool. Okay. And we don't care about the job detail. So, yeah, as you said, like, you.

Doctor Squab: Can wrap it into some lambda or you can run and just think of it whatever, like a batch script or whatever.

Mutant Anteater: Okay, so, so we have functional requirements now. So first we can upload, change existing information maybe. Yeah. And, oh, do we need any message from this job? Like, do we need to tell our client that, oh, your job is running and it's done.

Doctor Squab: Yeah. Yep. Yeah. And it will be nice because otherwise, how will the clients know? How is your job doing?

Mutant Anteater: Okay, so we need a message. Yeah, pushing the service.

Doctor Squab: Yeah, something like, you can say status. I would use, like, basically, users should be able to track the status of their jobs.

Mutant Anteater: Okay. Okay, cool. So we have these three functional requirements, and now the non-functional ones. So first, it should be reliable. Like you said, if it should be run maybe tomorrow, but it didn't run, that's not acceptable. So reliability is our first. And oh, one more thing. If a previous job is running and it didn't finish, will we start a new one, or should this be a setting, that is, should we allow clients to decide whether we should run a new one if the previous one is still running? Yeah.

Doctor Squab: So I think what you're referring to is the deduplication kind of case. Okay, so yeah, deduplication. I would say we can come to it as an advanced feature, but I wouldn't bother about it in the beginning at least.

Mutant Anteater: Okay.

Doctor Squab: Yeah.

Mutant Anteater: So let me show you like this. Okay, cool. So the jobs, how long will it take? Maybe several seconds.

Doctor Squab: Yeah, yeah. So you can assume that a job takes like a minute to run on average. And let's say that job, they can all run on cpu, they don't need any.

Mutant Anteater: Okay, I see. So we can assume we have ECS or EC2 running the job. So that's not our concern. Okay, so how many users do we have?

Doctor Squab: Yeah, so let's assume that up to 100 million users can use this service.

Mutant Anteater: Hundred million. And how many jobs are they going to use?

Doctor Squab: Maybe, let's say that they do, they do one job every day.

Mutant Anteater: One job every day?

Doctor Squab: Yeah. On average. I mean, of course it's average. Some will do more, some will do less. Yeah.

Mutant Anteater: Okay. This is fair enough, I think. Okay, cool. So next is maybe scalability. As we have this many users, and we assume 5 MB for every job, multiplied by 100 million. Let me think how much it is. And it should be like, holy crap, it's like this number. So we need 500 terabytes just for storing the jobs. Okay, that's not a problem, by the way.

Doctor Squab: Let me, let me, let me. The five megabytes I mentioned is the memory. When a job is running, that is the amount of memory it needs to run. Storing the job and all that, you can just think that goes on the disk, depending on how you're storing it, maybe. Yeah. That part, you know, just storing the image as you said, like it was Lambda or something like that, that part I'm not including. This was, if a job is running, then that is the amount of memory, or RAM, it needs to run.

Mutant Anteater: Yeah, I see. So how large will the job be? We don't care about it.

Doctor Squab: We don't care about it. But if you, if you care about it, let's say that each job information is 1. Yeah, that's the like AWS lambda stuff things.

Mutant Anteater: Okay, so 100 terabytes in total.
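
The numbers exchanged so far can be turned into a quick back-of-the-envelope estimate. The sketch below is only illustrative: it assumes load is spread evenly across the day, and the 1 MB stored size per job is an assumption inferred from the 100 TB total (the constant and function names are hypothetical).

```python
# Back-of-the-envelope capacity estimate using the figures from the discussion.
USERS = 100_000_000           # up to 100 million users
JOBS_PER_USER_PER_DAY = 1     # one job per user per day, on average
JOB_RUNTIME_SECONDS = 60      # a job takes about a minute to run
JOB_MEMORY_MB = 5             # RAM a job needs while it is running
JOB_STORED_MB = 1             # assumed stored size per job (yields the 100 TB total)

SECONDS_PER_DAY = 24 * 60 * 60


def estimate():
    """Return rough throughput, concurrency, memory, and storage figures."""
    jobs_per_day = USERS * JOBS_PER_USER_PER_DAY
    # Assumes jobs are spread uniformly over the day (real traffic will spike).
    jobs_per_second = jobs_per_day / SECONDS_PER_DAY
    # Little's law: jobs in flight = arrival rate * time each job spends running.
    concurrent_jobs = jobs_per_second * JOB_RUNTIME_SECONDS
    running_memory_gb = concurrent_jobs * JOB_MEMORY_MB / 1000
    # One stored job definition per user.
    storage_tb = USERS * JOB_STORED_MB / 1_000_000
    return {
        "jobs_per_second": jobs_per_second,
        "concurrent_jobs": concurrent_jobs,
        "running_memory_gb": running_memory_gb,
        "storage_tb": storage_tb,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the system sees roughly 1,157 job starts per second, about 69,000 jobs running at any moment, and about 347 GB of RAM in use across the workers, which is why a single scheduler or worker cannot handle the load.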

Doctor Squab: Yeah.

Mutant Anteater: Cool. So scalability and of course fault tolerance. And we need to monitor it. Okay, so now we have three functional requirements and four non-functional requirements. I think that's good enough. Any questions so far? Anything you think I should improve or.

Doctor Squab: Yeah, I mean, so sure. Let's go further. Yeah, I want to see how you're proceeding.

Mutant Anteater: Okay. Okay, cool. So first we have a client here. Oh, wait a minute. Let me define some entities first. So we have a job entity, maybe job name, ID. Maybe a job doesn't need a name. So job ID, but it can have a name, a job description, something like that. And job code or image. So if it's code, then we will store this code here. If it's an image, we will store a URL for the image, maybe stored somewhere else. Okay, so that's the job entity, and the user entity: user ID, their job. Oh, so now we have only one job for every user every day. Maybe in the future we will have more. So let me keep this. So these two entities, and maybe APIs. So first we need to upload, a POST or something. I don't know. Next we need to edit it. We can not only change job information. We need create, read, update, and delete.
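
The job entity described here might look something like the following. This is a minimal sketch, not the candidate's actual whiteboard: the field names, the `run_at` timestamp (taken from the interviewer's "a timestamp in the future"), and the `JobStatus` values (from the requirement that users can track job status) are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    # Hypothetical status values; the transcript only says status must be trackable.
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Job:
    job_id: str
    user_id: str
    name: str
    description: str
    run_at: int                      # epoch seconds at which the job should run
    code: Optional[str] = None       # inline code, if the job is uploaded as code
    image_url: Optional[str] = None  # URL of the image, if stored elsewhere
    status: JobStatus = JobStatus.SCHEDULED

    def __post_init__(self) -> None:
        # A job is either inline code or an image reference, never both or neither.
        if (self.code is None) == (self.image_url is None):
            raise ValueError("provide exactly one of code or image_url")


if __name__ == "__main__":
    job = Job("job-1", "user-1", "print-id", "prints the user id",
              run_at=1_700_000_000, code="print('user-1')")
    print(job.status.value)
```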

Doctor Squab: Yeah. So let me, I know you are trying to get into a little bit more into like how APIs and jobs will look like. Maybe, maybe give me a high level of picture of how you're designing. Sure. I don't have an idea like how you are thinking of doing this.

Mutant Anteater: Sure, sure. So first we have a client here. This one is a client. And maybe a load balancer here.

Doctor Squab: Okay.

Mutant Anteater: And load balancer.

Doctor Squab: Oh.

Mutant Anteater: And then the client will upload their job to the load balancer. And the load balancer will firstly, I think, upload to maybe an S3 bucket somewhere to store the user ID. No, I mean to store the image or anything. Okay. And then.

Doctor Squab: Let me try to understand. So you have a client and the client gets the job information. And what else information will it give to the load balancer?

Mutant Anteater: So maybe job details.

Doctor Squab: Yeah, let's say the whole job details is like a zip file or whatever. Like that's not important. I'm just asking, what more information will it give? And what will it be on S3? What will it look like?

Mutant Anteater: You mean for APIs, or for how clients are going to upload the job?

Doctor Squab: Yeah, on S3. Is it just that we will dump it in a directory? Like what are you going to do?

Mutant Anteater: Sorry, I played around. There should be a back end.

Doctor Squab: I'm saying that there are 100 million users, right? 100 million users are uploading their jobs. How are you going to dump it in S3? What will the schema look like? Will you just create one bucket and then just dump everything there? Like I'm trying to understand that part. Okay.

Mutant Anteater: Okay, I see. So it shouldn't be just one S3 bucket. I'm thinking maybe. So first we should upload to the backend service, and the service will decide how to upload them to buckets. And it looks like it's not acceptable that we create a bucket for each user, so maybe we can group them.

Doctor Squab: Okay, maybe what I'm not getting is this. Tell me, what does the request look like for your client, and what kind of response will the client get? So I'm trying to understand these things. I know you were trying to get into these details. What I'm trying to understand is what the request and response format of these things will be.

Mutant Anteater: Okay, so like this, you mean maybe some.

Doctor Squab: Yeah, sure. Yes.

Mutant Anteater: Or trigger and maybe they will upload some, upload some files.

Doctor Squab: Okay, yeah, sure. I would say that. Forget about the job related file. Right. Okay, let's, let's what? Let's work with, with job name and, and a timestamp. Right. And you don't have to include files anywhere. I want to remove that completely from you. Let's say the job is, let's say the job is to print your name. That's it. Let's say you don't think about it anymore. Okay. Job is to literally print user id. Everybody will do trials trying to do that. Now tell me that. How are you going to store the jobs and how are you going to schedule them and stuff like that?

Mutant Anteater: Sure. Okay, so here we have a database here to store all the metadata of the jobs. And you can directly for now and we will scale this up later.

Doctor Squab: Okay.

Mutant Anteater: Currently it's just from back end directly to database. Forget about this S3 bucket; about this database. So let me think. Reads and writes are almost the same. I think both are not really that much. I assume users are not going to change their jobs every day. So currently I'm just using one database here. If there are problems in the future, I will change it. Okay. Backend to the database, and there should be another, maybe a scheduler here. And let me think, how are we going to schedule this thing? The scheduler needs to know what's the next job we are going to run. So the back end should, oh, one more thing. How are we going to run the job? Okay, so the scheduler needs to run the job, and let me think how it's going to run the job. Let's see, we maintain a queue somewhere.

Doctor Squab: Okay.

Mutant Anteater: I don't know how to maintain this queue for now, but we will discuss it later, and this queue will tell the scheduler what the next thing to deploy is. So this queue will go to this scheduler and it will do something, maybe create ECS, something else. Okay. Let me draw something here.

Doctor Squab: Okay.

Mutant Anteater: Yeah. Okay. And we don't care about how we deploy.

Doctor Squab: Yeah. So I'm happy with that, that you, you pick it up, you give it to somebody who will handle the running part.

Mutant Anteater: Cool. And also after it's scheduled we need some service to push the message to our client.

Doctor Squab: Okay.

Mutant Anteater: So this could be a third party thing. Maybe push.

Doctor Squab: Data.

Mutant Anteater: Yeah. It will directly go to our client or, I don't know. We don't care about this part. So now our question is, how are we going to maintain this queue? Exactly. Yeah. So what I'm thinking is that we can have a cache here. Let me think about it. No, we don't have a cache. Maybe we can directly query the database, and the database will tell the scheduler what is going to be next. And this could be done by the scheduler, and it could also be done by another service maybe. And that service is going to put the things in the queue. Okay. In the message queue. I think so; we can have another service here, and let's discuss which solution is better. Okay. So I don't know what to call this circle, but we have a service here and its job is to read the database and prepare for the queue. And the queue goes to the scheduler. So if the scheduler directly reads from the database to decide which job we are going to do next, then we don't need this queue. And why I think we need this queue is that this message queue, maybe Kafka, I don't know, can provide some consistency. Like if something is wrong with the scheduler, maybe it broke, we get a new scheduler here and it will need to query the database again. I think that's not efficient enough. But if we have a message queue and it's just handling the messages, then after this scheduler is down, we can directly move the offset of the message queue to make this system work again. So I think having this message queue can provide some consistency. So that's the reason I keep it here. And why I'm having this service here is.

Doctor Squab: Let's call that box publisher. I just want to give it a name so we can use something. Let's call it publisher, which publishes to the queue. Yeah, publisher. Yeah. Okay.

Mutant Anteater: Yeah, cool. Okay, so this publisher. So first, I think there are two jobs for this scheduler: if there's no publisher, then the scheduler has to do its own job and the publisher's job, so splitting it will make it easier to scale. That's the first reason. And the second reason is that I think one service doing one job is better for maintenance. So the scheduler just does the scheduling job and the publisher does the publishing job. So I think that's fair enough. Okay, so let's see our functional requirements: upload is already done, and CRUD is already done by this. The message service is done by here, and running the job is done by here. So okay, we meet all the functional requirements.

Doctor Squab: I think CRUD is not that easy. Can you think of cases where it's not as easy as you said?

Mutant Anteater: You mean this one, how this is done?

Doctor Squab: Yeah, can you think of some, there's more to it, you know, I want you to think about.

Mutant Anteater: Sure. So for create, it's just this one definition. And for read, I think read can also be handled by this backend, like query by job ID. Okay. And update. Oh I see, I see what you are talking about. If it needs some update, then there's a problem, and with delete too. Yeah, you're correct. Okay, so if we want to update the job, then the first step is that it still needs to be updated in the database. And we can have, call it change data capture, CDC, here, and the publisher will know how the database has changed. And this can come directly from the database, or the backend can tell the publisher to do something. I mean it could be like this. Yeah, I think let's step back.

Doctor Squab: Okay, tell me, tell me, tell me how it's working. Especially the updates.

Mutant Anteater: Okay, sure. So let's see. So for example, we changed the timestamp from maybe morning to the afternoon. Then we need to do two things for the publisher. The publisher can send two messages to the message queue. The first one is that disable.

Doctor Squab: What is the time right now? What is the time right now? You said morning to.

Mutant Anteater: After from morning to afternoon.

Doctor Squab: And what is the current time?

Mutant Anteater: Excuse me?

Doctor Squab: What is the current time? Like, so maybe tomorrow. So let's use numbers. Let's say that I had a scheduled job at t = 50, right. And I updated it to t = 75.

Mutant Anteater: Uh huh.

Doctor Squab: Now tell me, how is your behavior changing based on what the current time is?

Mutant Anteater: Okay I see.

Doctor Squab: Did you see the question?

Mutant Anteater: No. Type the time here.

Doctor Squab: So right now could be t = 0. Right now could be t = 45. Right now t could be 55. Right now t could be 85. Right. It could be anything. Yeah, so what I'm saying is that you can update the job anytime you want. I'm trying to understand how the update will work. Let's say I'll take the example of a payment service. Right. In the case of a payment service, I wanted to send money tomorrow at 8:00 a.m., and at 7:00 a.m. I decided that I want to send the money at 7:30. I was planning to send at 8:00, but now I want to send it at 7:30. Or let's say I wanted to send at 8:00 a.m., and at 7:59 I decided that I want to move it to 8:30. Or it is 8:00 a.m., and I had scheduled at 8:00 a.m., and now I'm trying to cancel it.

Mutant Anteater: Okay. So we have three situations need to.

Doctor Squab: Yeah, I want you to, this is just a hypothetical example but that is these are the cases I want to consider that. How are you going to handle multiple of different updates at different timestamps and how will the design get changed because of those?

Mutant Anteater: Okay I see. So in the general case, where it's not in a rush, I think we can directly do this thing. Maybe. Is this a good choice? I'm not sure, but the back end will first handle your update request and it will change the database, of course. And it cannot use this message queue, because the update will be at the end of this message queue. So we will skip this message queue here and go directly from the backend to this scheduler to tell it, oh, we are going to cancel the payment in the next minute, and the scheduler has to do this job immediately. So the scheduler will maybe terminate the instance, or send an update, or maybe send a new one. Let me think. So if it's a time change, we can terminate the old one and start the new one. Yeah, yeah, it's no harm. So change and delete, I think, are now fulfilled.
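
One way to read the interviewer's question is as a small decision table keyed on the current time relative to the job's scheduled time. The sketch below is an assumption-laden illustration, not the design agreed in the interview: the function name, the `finished` flag, and the three outcome labels are hypothetical, and it uses the interviewer's example of a job moved from t = 50 to t = 75.

```python
def plan_update(now, run_at, finished=False):
    """Decide how to apply an update or cancel request for one job.

    now      -- current time
    run_at   -- time the job was originally scheduled for
    finished -- whether the job has already completed (assumed to be known
                from a status column in the database)
    """
    if finished:
        # The job already ran to completion; there is nothing left to change.
        return "too_late"
    if now < run_at:
        # The job has not been picked up yet, so changing the stored record
        # is enough; the scheduler sees the new time on its next query.
        return "update_db_only"
    # The job may already be dispatched or running, so the backend also has
    # to tell the scheduler directly, bypassing the queue.
    return "update_db_and_signal_scheduler"


if __name__ == "__main__":
    # Job originally scheduled at t = 50, being moved to t = 75.
    for now, finished in [(0, False), (45, False), (55, False), (85, True)]:
        print(now, plan_update(now, run_at=50, finished=finished))
```

The case at t = 85 assumes the job has already finished by then; if it were still running, the sketch would fall through to signalling the scheduler.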

Doctor Squab: Yeah, sounds good. What kind of database will you consider for this?

Mutant Anteater: The database. Okay, so now I think we have only like 100 million users, or 100 terabytes. So I prefer a NoSQL DB because it's cheaper, and our case does not have any SQL relational requirements. So I think NoSQL should work for this situation. Okay, maybe.

Doctor Squab: So how is the publisher going to know what to run, what to enqueue, using a NoSQL DB?

Mutant Anteater: Query by time? Maybe by listening to the change data capture.

Doctor Squab: How is that going to help? I mean, because the change could be anywhere. I might say move it a year from now, or six months from tomorrow. So I'm trying to understand how the publisher is going to benefit from listening to changes.

Mutant Anteater: So it can only let me think. So publisher, maybe there's a job status here, let me add this so publisher can read by query by job status to see oh, which one is just added and put it in the message queue. So are you going to put everything to message queue? No, no, no. Just create like we are planning to do the changes directly from backend to scheduler. So this publisher looks like it's not necessary.

Doctor Squab: Let me add this thing that let's say that you can schedule up to one year in future. You are allowed to. Any of these hundred million users can schedule any job up to one year in future.

Mutant Anteater: So one what? One year?

Doctor Squab: Yeah, up to one year in future.

Mutant Anteater: Uh huh.

Doctor Squab: So are you saying that every single time there is a job creation, it goes to the message queue?

Mutant Anteater: Yes, I was planning to do this, but it looks like it's not a good choice. So I'm planning to directly do this. I don't know how to change it here. So we are going to delete this and have the scheduler read directly from the database.

Doctor Squab: So okay, tell me what kind of queries will the scheduler do on the database?

Mutant Anteater: So scheduler will query by the status. So if a job is already running then scheduler will skip it. But if it's just created it will.

Doctor Squab: So the query will look like, if I understood correctly, SELECT * from the job table where the job status is not cancelled or not completed, something like that. Is that what the scheduler is doing?

Mutant Anteater: Yeah, I think so.

Doctor Squab: But that will return 100 terabytes of records.

Mutant Anteater: Then we can have multiple schedulers; we can scale this into.

Doctor Squab: Okay. They'll all be calling with the same thing, and they will be getting these, I don't know, couple of terabytes of records.

Mutant Anteater: I'm sorry, your microphone. Yeah.

Doctor Squab: No, I'm saying that the schedulers will be making these queries and they'll all be getting a couple of terabytes as a result.

Mutant Anteater: You mean initially?

Doctor Squab: No, I'm trying to understand how will, what will be the query? Can you write like a high level query?

Mutant Anteater: Sure.

Doctor Squab: That, yeah. Let's start from, let's say jobs. Yeah. Okay. Yeah. Okay. Okay. So how many records will this query return.

Mutant Anteater: So we can ask this backend to push the newly updated thing to the scheduler. So this query will not be too large. Most of things will come from this arrow.

Doctor Squab: So scheduler, how, I'm trying to understand. How will a scheduler maintain such a large amount of records?

Mutant Anteater: Let me think about it.

Doctor Squab: So you are giving me information for the next one year and I cannot deploy it. I cannot send any status, because the job needs to run one year from now. So what should I do with it? What should the scheduler do with this information? You are telling me this job needs to be run one year from now. What will the scheduler do with all the information? Where will it keep it?

Mutant Anteater: I see. So maybe start time range, maybe just in 1 hour.

Doctor Squab: Okay, so now, now you got these 1 hour records, right?

Mutant Anteater: Uh huh.

Doctor Squab: So yeah, now of course this is also a very large number. So now how do you propose to handle this? Because one scheduler cannot run or deploy so many jobs.

Mutant Anteater: Yeah, you are correct. Then maybe just several seconds.
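
The idea the conversation has converged on, querying only jobs whose start time falls in a short upcoming window rather than every open job, can be sketched as follows. SQLite is used here purely as a runnable stand-in (the candidate proposed a NoSQL store), and the table, column, and function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory jobs table with an index that supports the window query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " job_id TEXT PRIMARY KEY,"
        " run_at INTEGER NOT NULL,"   # scheduled start time
        " status TEXT NOT NULL)"      # e.g. scheduled / cancelled / completed
    )
    # Without an index on (status, run_at) the query would scan every job.
    conn.execute("CREATE INDEX idx_jobs_status_run_at ON jobs (status, run_at)")
    return conn


def due_jobs(conn, window_start, window_seconds):
    """Return IDs of scheduled jobs with window_start <= run_at < window_start + window_seconds."""
    rows = conn.execute(
        "SELECT job_id FROM jobs"
        " WHERE status = 'scheduled' AND run_at >= ? AND run_at < ?"
        " ORDER BY run_at",
        (window_start, window_start + window_seconds),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_db()
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [("a", 50, "scheduled"), ("b", 52, "scheduled"),
         ("c", 51, "cancelled"), ("d", 75, "scheduled")],
    )
    print(due_jobs(conn, window_start=50, window_seconds=5))
```

With the sample rows above, the window starting at 50 picks up jobs `a` and `b`, skips the cancelled job `c`, and leaves `d` for a later window.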

Doctor Squab: Okay, let's say, let's say several seconds. So now how are, because everybody is calling like you have many schedulers, right? That's what you said. Like you have many schedulers and they are all calling this with this query and they will all be getting similar responses, right?

Mutant Anteater: Yeah.

Doctor Squab: So is that a problem? It's not a problem.

Mutant Anteater: So this will be a problem, as we are going to query it every second. So there will be a lot of queries here.

Doctor Squab: There will be a lot of queries, and there will be a lot of duplicate results. The queries, let's say, NoSQL can handle. I am curious, how are you going to prevent scheduler one from scheduling the same thing which scheduler two has already done?

Mutant Anteater: Yeah, I understand your question already. But let me think about how we are going to handle these schedulers. These schedulers can definitely get the same jobs. So we have several plans. Our first plan is to do something like consistent hashing for every job, maybe by job ID. And in this way we can assign the job to only one scheduler. That's a solution. That's an option.

Doctor Squab: Okay, so let me try to understand this part again. So scheduler one makes a request. How is consistent hashing going to help you avoid assigning it to another one?

Mutant Anteater: Yeah, so we can query by hash here, maybe a hash column, something like that. And in this case, the same hash will always be assigned to one specific scheduler instead of others. So other schedulers' queries will not get this job, because of consistent hashing. OK.
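
For readers unfamiliar with the technique being proposed, here is a minimal consistent-hash ring that maps each job ID to exactly one scheduler. It is a generic textbook-style sketch, not the candidate's design: the class name, the use of MD5, and the virtual-node count are all illustrative assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Tiny consistent-hash ring mapping job IDs to scheduler names."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Each scheduler gets several points so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def owner(self, job_id):
        """Return the scheduler owning job_id: first ring point at or after its hash."""
        if not self._ring:
            raise LookupError("no schedulers in the ring")
        idx = bisect.bisect_right(self._ring, (self._hash(job_id), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["scheduler-1", "scheduler-2", "scheduler-3"])
    before = {f"job-{i}": ring.owner(f"job-{i}") for i in range(10)}
    ring.remove("scheduler-2")  # simulate a scheduler dying
    moved = [j for j, s in before.items() if ring.owner(j) != s]
    print("jobs reassigned:", moved)
```

When a scheduler is removed, only the jobs it owned move to other schedulers; every other job keeps its owner. Note that the ring only decides ownership. It does not by itself record which jobs a failed scheduler had already dispatched.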

Doctor Squab: Cool. Now that you have, let's say that scheduler got the things to run right. And now what if after getting that information it dies?

Mutant Anteater: It dies. Yeah, the scheduler dies. Or the job dies; anything can happen.

Doctor Squab: So tell me both cases, maybe now. Good question.

Mutant Anteater: Okay. Okay, so let's see, the job dies. I think the job dying is more difficult in this part, because the scheduler needs to know the status of our instance. So I think, I mean that's not hard.

Doctor Squab: Let's say that whatever deployment services that tells you that gives you a long running operation and you can query and that will tell you whether the job has failed or still running. So you can query that information from deployment.

Mutant Anteater: Yeah, I see. That's what I'm thinking too. So the scheduler can tell this. If the instance died, the scheduler will know, and the scheduler will also know when this thing should be scheduled. In my opinion, we should tell the client first and say, oh, the job failed for some reason, like maybe it ran out of memory. If we can know the specific reason, then we will tell the client and ask them to maybe reschedule the job. I think this option makes more sense, like if something has to be done at that time. If we don't tell the client, then maybe the client will miss something and they will blame us. So I think we should tell the client and let the client decide what to do next. That's what will happen if an instance dies. But what happens if a scheduler dies? As I mentioned before, the consistent hash will automatically reassign the jobs previously assigned to the dead scheduler to some other schedulers. So that's why I'm using it.

Doctor Squab: Sounds good, sounds good. So let's spend one more minute on it. Just a time check. Let's say the scheduler died, right? How would you reschedule the job? Because you still want to run the job, right? So can you tell me the state of the database while all this is happening? How will a new scheduler use that? Consistent hashing will make sure that a new scheduler takes over, but that new scheduler needs to know where things were left off. So can you tell me quickly, in maybe 30 seconds, how it is going to know that? Where are we right now? How much was done before that scheduler died?

Mutant Anteater: Sure. So we can know what the scheduler got in its last query. Right.

Doctor Squab: Good.

Mutant Anteater: So we can do this. We can do the same thing in our new scheduler, with a similar SQL query and then a query to the deployment service.

Doctor Squab: Okay, cool. Cool. Thank you. Yeah, I think that's about time. Do you want to discuss now? Interview is over, by the way. Interview is over. So you can relax a little bit and we can casually chat about it now.

Mutant Anteater: Sure. Yeah. Thank you for your interview. And, uh, how did you feel?

Doctor Squab: How did you feel?

Mutant Anteater: Um, well, I think, I mean, I might need some, need to learn more things.

Doctor Squab: That's good. Was it helpful? I know I was a little bit more aggressive than an interviewer would be, but I thought this could help you.

Mutant Anteater: Yeah, it definitely helped me.

Doctor Squab: Right. Yeah, I know I went into way too much detail, and I thought that would give you an idea of where to do the time management, at least. Initially you were doing a lot of things which pretty much every candidate does, and I feel like they end up wasting a lot of time by doing that. You see, later is when the interesting discussions were. In the beginning we were talking about, let me put in a load balancer, and then let me design a job schema. And, yeah, I mean, it's interesting, but it's kind of boring as well. There's nothing interesting happening there, right? Like you said, job name, description, code, image, owner. Yeah, sure. It's like, who cares? Anybody can do all that. Right? So.

Mutant Anteater: But I cannot hear you anymore. Hi. Yeah, so can you hear me? I can hear you now, but not for the last minute.

Doctor Squab: I see, I see. Yeah, no, I was saying, let me go over the whole thing from the beginning, if you want to hear the whole feedback. So let me conclude. You are applying for L4, right? For L4, the typical expectation is that they can code independently, but in design it's okay to get help. L4s are not supposed to design independently. They can design smaller things. This was a bigger problem than an L4 scope. So I would say that it is a harder problem, and an L4 will not do this in a real job. But it's good to know that you might be asked to design just the scheduler part of things, for example, or you might be asked to handle only the NoSQL part of things. Nobody is going to ask an L4 to design such a big system. Having said that, you are of course supposed to do extremely well in coding, and in design you should be able to come up with the right high-level things, even if you don't do so well in the low-level details. Right. That's the typical expectation. When somebody hires an L4, that's what they are looking for: someone who can code independently and can design with assistance. So that's the expectation. Now, on your performance, I would say you are borderline for L4. Based on this interview performance, if I was supposed to give you a rating, you know, there's reject, leaning reject, leaning hire, hire, or strong hire.

Mutant Anteater: Hire.

Doctor Squab: If there are five of these, I will give you leaning hire. Three out of five. But the problem with leaning hire is that if you get another leaning hire in any other part of the interview, then you are not going to make it. But let's say that you got hires and strong hires in coding and you got a leaning hire in system design, you might still pass the interview.

Mutant Anteater: Okay, I think I already got a strong in the coding part. So.

Doctor Squab: Yeah, cool. And this question is a little hard, of course. I ask this question of everybody from L4 to L6, and of course their performance changes. That gives you the signal that tells you what you are: are you an L4, an L5, or an L6? Right. Okay, so let me quickly go over the performance. In the beginning, you did fairly well. I would say you could have done better, but you did fairly well in terms of coming up with something like AWS Lambda and the size limitation and how you will deploy it, and then breaking things down into functional requirements and non-functional requirements. That is all good. I think you know how system design interviews are set up. So in the first 10 to 15 minutes, you were doing pretty decently. In the case of functional and non-functional requirements, I would say: think about the problem, maybe close your eyes and think about how you, as a user, will use this product or whatever you are designing. I think that was not very clear. I call it the life of a query, or the life of a user, or the user journey. You were not very clear about how a user will use this whole thing which you are designing. Let's say you were saying upload job and query existing job information and job status. That's good, but a nicer way of putting it would be: I'm sitting at my browser, for example, and what can I do? I can schedule a job, I can cancel a job, I can modify a job, or I can see which jobs I have scheduled in the past, so list jobs. Very clear, right? That gives your interviewer a picture: okay, so now you are thinking of a web page or an app where these options are available.
Now, you did collect the functional requirements, but you were adding requirements as we were discussing things, like running the job or the message status service. I mean, a message status service is not really a functional requirement. The functional requirement is that you should be able to see the status of the job, basically. So all of those things are good, but they are not really functional requirements. You kind of mixed a lot of server-side stuff in with the client-side stuff. Functional requirements should be more on the client side: how this thing is going to be used, basically. In terms of non-functional requirements, again, think of the functional requirements and then think: to satisfy those functional requirements, what should I do? What should my non-functional requirements be? They should be derived from the functional requirements. It seems like you were thinking of reliability, scalability, fault tolerance, monitoring, but you were not motivating them: why do I need reliability, or why do I need scalability? Scalability, sure. You saw the numbers, so you said that. Right. But it would be nice to talk in terms of: I need reliability because if I have scheduled a job and it doesn't run, then I might be in trouble. So that's why I need reliability, and then you write reliability. I need scalability because there are so many users, and if each of them is scheduling one job every day, that's a lot of QPS. So I need scalability. The fault tolerance part was extremely unclear when you wrote it: why do you need fault tolerance? Unless you actually think about schedulers and all that, and them going down, it's not very clear why you need fault tolerance at that point in the requirements. Right. Monitoring is good. But with monitoring, to be honest, I wasn't even clear what kind of monitoring you meant.
Do you just mean whether my job is working or not? Yeah. So anyway, that is the feedback on the requirements. Then you started designing all these job entities and APIs and the user entity and stuff like that. You never discussed how you were thinking about it. You jumped from functional and non-functional requirements directly to all these details without any kind of relationship. Like, where did all this come from? At that point it would be nice to put down some boxes: client, servers, databases, and stuff like that. Start putting those things down: this is what I am thinking. So after functional and non-functional requirements, I would put a high-level design, right? In the high-level design, you know, everybody puts in these load balancers. In the beginning it felt like you were saying the load balancer will call the S3 bucket. I would ask you to read about load balancers. What do they really do? A load balancer will not talk to the back end directly. There will be a front end. Load balancers are actually pretty dumb. A request comes in, and they redirect requests to many web servers. They are not smart. They don't do much, to be honest. The request comes in, and it just balances it to something. It's hardware; it doesn't have a semantic understanding of anything. All it does is bounce the requests around. Right. So either don't put it in, or if you're putting it in, be sure about what it does. You could have just said, I have a client and I have a server, I'm running an HTTP server and that gets the request, and then it forwards it to the back end and stuff like that. Just keep it simple in the beginning. Then you put in a load balancer if you need it, but don't put things in just because that's how every single diagram looks. Cool.
So now you came to the back end, and in the beginning you were kind of confused, like, I'll use S3 buckets. That's because you didn't really give me an idea about how your high-level information looks. You didn't tell me how your requests look, how your responses look, and what APIs you are going to serve from the front end. Those are the three things, and you don't have to write details. The APIs would be, let's say, upload, modify, cancel, list. Those are the APIs. Simple, right? And they are directly related to your functional requirements. You can have four APIs. Then the request. What will the request have? A user ID that tells you who the user is, a timestamp, and, let's say, a job location. That location could be the S3 path: you as a user went ahead and uploaded your job somewhere in S3, and all you are giving me is the S3 URL. That's the nicer way of saying what you were trying to say, that I will upload to S3. But the S3 bucket you upload to will be a user bucket. It has nothing to do with the service. You can say, I will generate a pre-signed URL in my S3 bucket, and that's what I will provide to the service. The service will go and just look at that pre-signed URL and get the job from there. Right? So anyway, the main things you need to know are the user, the timestamp, and the job location. Nothing else, really. Everything else is too much detail, not important for doing the thing. Right. Then we came to the back end, and the back end goes to the database. In the beginning, you were discussing queues a lot, and you were discussing that publisher thing you had put in, and then you removed it, right? Yeah. So that's where I think you need to spend more time. I would say whatever we have discussed so far was interesting.
But you should ask yourself: if somebody is asking these system design questions, they really want to get to the interesting part of the problem, right? They are not here to listen to all the boring stuff which they think everybody knows already. So, so far it's good, but it is not really telling me anything about how smart an engineer you are. You really want to show your depth. You look at any YouTube video on system design and everybody starts like this, but the details are where the differences come from, right? Interviewers are smart. We have interviewed tons of candidates, so we know everybody starts like this. That's where you get into details. Like, okay, let me think about how I am going to do the functional requirements. The whole question is about a job scheduler. How am I going to schedule a job? That's the most important part of the whole problem, right? You discussed interesting trade-offs between schedulers and publishers and queues, but you removed the queue. That's where you need to devote more understanding. There are lots of problems which fall under that. The design you have chosen is not the great one. The one you rejected was the better one. I can just tell you, as a detail, the design where you had a publisher is a way better design than what you have right now. Yeah, having a message queue, this is better. But it was better until you explained to me how you were going to use it. I thought that you were on the right path. What you would do in this case, for example, is that a publisher will run the same select query, select star from the database where blah, blah, blah, and put those messages in the queue. And the schedulers will subscribe to those messages from the queue, right?
And you can have many, many schedulers listening to the same queue, and they are the ones who are going to forward the job. As for the publisher, I was asking you, because you will have multiple publishers, how do you know that they will not get the same thing? One way to think about it is to take locks at a row level or a table level or whatever level. When you are checking certain records, you can lock them so that the next guy, when it tries to get those records, can't even get them, because there is another guy who is working on them. Then you flip the status as soon as you have got the lock on those records. Let's say you were doing select star where status is not started, and then you flip them to queued. You have changed that status, so the next guy, when it comes in, will not even get those records, because you have changed the status from not started to queued. Now you have a status like queued. What the scheduler is going to do is look at all the messages in the queue. Let's say a scheduler has got a message and dies. Your status will still remain queued, because the scheduler never actually processed it. So after some time, you can look at which messages were never picked up and remained queued, and you can have another service which will re-enqueue them back to the queue, because it knows that those schedulers have died. So you can play with the statuses in a very clever way to figure out which scheduler and which publisher is dying. Let's say a publisher dies, then you will never see the status change. You can deal with all these problems. Of course, it gets more interesting when you work with that. But that's why I think that's a better design.
The problem right now is that you have a scheduler which is running these queries and also calling the deployment and also sending the notification. The scheduler is doing too many things. It is querying as well as deploying as well as sending. That's too much work, and you're not breaking it up. Right. And the notification, again, may go from the deployer, because a deployer, when it completes the job, can send the notification. But you never said, I'm going to change that status in the database. So the main problem with your design is that you were never changing the status. I never heard when your status is going to change. That is very important, because that's how your records get updated. Let's say that while you were working on my job, the client calls and says, hey, what's the status of my job? The database is the source of truth. Right. So if the client calls list jobs or get status, it will still say not started. These notifications are not really helpful for the client, because, yeah, you said the job has started, but when the client goes to the list page, it says not started. It's going to cause more confusion. Right.
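
The status-flipping approach described in this feedback can be sketched as follows. This is a minimal illustration under stated assumptions, not the interviewer's implementation: it uses an in-memory SQLite database, and the table layout, the column names, the status strings, and the function names `claim_due_jobs` and `requeue_stale` are all hypothetical. `BEGIN IMMEDIATE` takes SQLite's write lock for the whole select-then-flip step and stands in for the row-level or table-level lock mentioned above.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # isolation_level=None: transactions are started and ended explicitly below.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, run_at REAL, "
        "status TEXT NOT NULL DEFAULT 'NOT_STARTED', updated_at REAL)"
    )
    return db


def claim_due_jobs(db, now, limit=100):
    """Publisher step: select due NOT_STARTED jobs and flip them to QUEUED
    while holding the write lock, so a second publisher never sees them."""
    db.execute("BEGIN IMMEDIATE")
    try:
        rows = db.execute(
            "SELECT id FROM jobs WHERE status = 'NOT_STARTED' AND run_at <= ? "
            "ORDER BY run_at LIMIT ?",
            (now, limit),
        ).fetchall()
        ids = [row[0] for row in rows]
        db.executemany(
            "UPDATE jobs SET status = 'QUEUED', updated_at = ? WHERE id = ?",
            [(now, job_id) for job_id in ids],
        )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return ids  # the caller would publish these IDs to the queue


def requeue_stale(db, now, timeout):
    """Sweeper step: jobs stuck in QUEUED longer than `timeout` were probably
    picked up by a scheduler that died, so hand them back for re-enqueueing."""
    db.execute("BEGIN IMMEDIATE")
    try:
        rows = db.execute(
            "SELECT id FROM jobs WHERE status = 'QUEUED' AND updated_at <= ?",
            (now - timeout,),
        ).fetchall()
        ids = [row[0] for row in rows]
        db.executemany(
            "UPDATE jobs SET updated_at = ? WHERE id = ?",
            [(now, job_id) for job_id in ids],
        )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return ids
```

In this sketch a scheduler or deployer would move a job on to further statuses (for example, running and then done) when it actually starts and finishes work, so a list-jobs or get-status call that reads the table always reflects what is happening to the job.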

Mutant Anteater: Okay, so I see.

Doctor Squab: So you have to think about it this way: you can come to the database anytime, and it will always know what is happening to your job.

Mutant Anteater: Right, I see.

Doctor Squab: Yeah.

Mutant Anteater: So we need to keep this system consistent. But, like, is it okay for it to be outdated?

Doctor Squab: It's fine. Again, those are the things which you should discuss with the interviewer, and then they might relax it. Okay. I think I have to wrap up, but a few more things. Later on you were talking about the consistent hashing thing, and somehow the hash will be used as a way to assign jobs. It was not very clear, really. How will a scheduler map to certain hashes in the NoSQL DB? I feel like what you were trying to do is come up with a mapping of schedulers to certain shards in the database. Again, that's not a really great way of looking at it, because, first thing, how will you shard? What kind of sharding are you going to use? Are you going to shard based on timestamps? And this whole hash-to-scheduler mapping is not a great way of balancing the whole thing either. So I think the problem was that the publisher wasn't there, and then you were using schedulers to directly go talk to shards. That became a little messy. Right. Anyway, that's another one. Status was, I would say, the biggest one. And then when I was probing you about cancellation, that's where the difference between L4 and L5 or L6 comes in: an L5 or L6 will change the functional requirements themselves to make the system simpler. I see that you were trying to really allow cancellation until the last minute. Let's say some job is running and somebody sends a cancellation request; you were trying to cancel that job, right? What an L5 or L6 will do is just put it in the functional requirements that cancellation is allowed up to 30 minutes before the scheduled time. You see, that's the smart way. You were trying to just do the whole thing.
But you see how simple it is to think completely differently: hey, I'm not going to make my system super complex. Supporting cancellation until the last minute is going to create too much complexity in my system, so I'm going to keep it simple. I'm just not going to allow cancellation within 30 minutes of the scheduled time. That's why Amazon does it this way; Amazon does not allow you to cancel after some time, stuff like that. All that is because of these things, because the system becomes way too complex and you want to keep it simple. You want to be very sure that when you accept a cancellation request, you can actually cancel the job. So if a publisher or a scheduler has picked it up, you do not want to allow further cancellation, because now you don't know the status. You don't know whether the scheduler is working on it, whether the deployer is working on it, whether it's sitting in the queue, or whether you have sent the notification. You don't want to deal with all that crap, because too many things are happening now. So don't just trust the requirements, like, oh, I have to do something. If you see that things are becoming very complicated, go back to the whiteboard and see whether you can talk to the product manager and just change the requirements altogether, because it's too complicated to support. Okay.
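
The simplified cancellation rule can be expressed as a single check at request time. The sketch below is illustrative only: the function name `can_cancel`, the `NOT_STARTED` status string, and the constant name are assumptions, and the 30-minute cutoff is the figure used in the feedback above.

```python
from datetime import datetime, timedelta

# Assumed cutoff, taken from the 30-minute figure in the feedback.
CANCEL_CUTOFF = timedelta(minutes=30)


def can_cancel(status: str, run_at: datetime, now: datetime,
               cutoff: timedelta = CANCEL_CUTOFF) -> bool:
    """Accept a cancellation only if the job has not been picked up yet
    and the request arrives at least `cutoff` before the scheduled time."""
    return status == "NOT_STARTED" and now <= run_at - cutoff


run_at = datetime(2024, 1, 1, 12, 0)
print(can_cancel("NOT_STARTED", run_at, datetime(2024, 1, 1, 11, 0)))   # True
print(can_cancel("NOT_STARTED", run_at, datetime(2024, 1, 1, 11, 45)))  # False
print(can_cancel("QUEUED", run_at, datetime(2024, 1, 1, 11, 0)))        # False
```

Because the check depends only on the stored status and the scheduled time, the service never has to chase a job through the queue, scheduler, or deployer to honor a cancellation it has accepted.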

Mutant Anteater: Okay. I see. This is a really interesting point. I never thought about it.

Doctor Squab: Exactly. You were asking me whether I would judge you for L5. And as I said, these are the things which you will start noticing when you spend more time in the job: why things become the way they become. It is because of these reasons. Otherwise, you see how much complexity you are adding just to support last-minute cancellation. It's way too complex a system. It becomes very clear if you just draw a boundary: until what time you want to accept cancellation, and after what time you don't.

Mutant Anteater: Okay.

Doctor Squab: A few more things. You need to take more control of the questions, the leadership part of things. I was giving you lots of hints throughout. Probably you didn't notice, but I was guiding you toward the right path. You might find interviewers who will not give you hints and will just let you waste your time, and then you will fail the interview, because they will say that the candidate did not pay attention to the right places. So you need to take the lead, and you need to guess what the interviewer might be thinking and go in those directions. Try to make sure that you are keeping the whole conversation engaging, and if you are not clear, it's good to communicate with the interviewer rather than assuming things or going off in some direction. Right. So don't make the interview super boring, because that's not a good sign. If the interview was super boring and you did not discuss anything interesting in the entire interview, most likely you are going to fail that interview.

Mutant Anteater: Okay. Okay. Okay.

Doctor Squab: Okay. Sounds good. I hope this feedback is helpful. I think we have spent a lot of time. Yeah, yeah.

Mutant Anteater: This is really helpful. I got something to practice in the future. Thank you.

Doctor Squab: Okay, cool. Yeah. And best wishes with the process. Yeah, thank you.

Mutant Anteater: Okay. Okay. Thank you. Thank you. Bye. Have a good night.