Interview Transcript
Hyper Taco: Hello. Hello. Hey, how's it going?
Doctor Malamute: Good, good. Let's get started. So first of all, I want to understand your current status. Basically, what level are you right now, what level are you interviewing for, when is the next interview, and what area, by the way? Yeah, okay.
Hyper Taco: So I'm actually— the reason I picked you is because I'm interviewing at [REDACTED] Cloud AI at L5, with the possibility of being down-leveled to L4, so between L5 and L4. Ideally L5, but I think it's a long shot, to be honest. Uh, the interview is on Monday, uh, again Cloud AI. Uh, yeah, so that, that's about it.
Doctor Malamute: Wait, so it's L4, and they're still holding this machine learning system design interview? Yeah. Okay, okay.
Hyper Taco: Um, I mean, okay, so I think the idea is they want to do the leveling. It's a boomerang, and I left [REDACTED] as an L4. So I think they want to do the leveling, and for this, they're going to do the ML system design.
Doctor Malamute: Oh, so you worked at [REDACTED] before? Yeah.
Hyper Taco: So the thing is, I was a data scientist. Full disclaimer, I was laid off 3 years ago, in January 2023. I'm trying to bounce back as a SWE. I passed the coding and AI coding rounds. Now it's just the [REDACTED] AI system design.
Doctor Malamute: Oh, cool, cool. Glad to see a [REDACTED]-er here. OK, so let's get started. I think we'll just do the traditional machine learning system design here. Yeah. Overall, maybe I can share a little insight on what the criteria are. Basically, there are only two major parts. One is the "think big": you have to show understanding of the end-to-end structure, some high-level perspective. The other part, as the bar goes higher, is the "dive deep": once you finish the high-level design, which is basically just the connections between components, we dive deep into a certain component, and the details matter more as the level goes up. But since you are only leveling, interviewing for L4, I think as long as you can give me a brief high-level overview, that should be good enough.
Hyper Taco: Something I want to flag, sorry if it wasn't clear: I'm interviewing for L4, maybe L5. So this is going to determine the leveling. So I'm going to essentially be evaluated for both L5 and L4, from what I'm understanding.
Doctor Malamute: But to be honest, I haven't seen upleveling once during my time at [REDACTED]. Most of the time, and this is probably the blunt truth, a recruiter wants to say, oh, we can decide your level during the interview; if you do well, we can uplevel you, something like that. But it never happened in my experience.
Hyper Taco: I have a friend it happened to, but, uh, I'll take your word for it. So maybe, um, why don't we do the following? If it's fine with you, I do the whole thing and you tell me at the end what the level would be? Is this something that—
Doctor Malamute: exactly. That's exactly. So we still have to go the, like, a full set. Basically you go high level and you go details. But if you have some, like, uh, gap, I can tell you on in again. It sounds good. Yeah.
Hyper Taco: So feel free to, like, okay, be very upfront. Like, don't sugarcoat it. If you tell me I suck, tell me. If I suck, just tell me I suck.
Doctor Malamute: Okay. Okay. Okay. Sounds good. Okay. Thank you. Uh, let's get started and, uh, we do this like a YouTube recommendation. Do you want to use the whiteboard?
Hyper Taco: Yes, let me just go to the whiteboard. Okay, so, um, all right, I'm just gonna write it down: Design YouTube Recommendation. Okay, so before we get started, I'd like to spend some time on clarifying requirements. So the first question I have is: what's the business objective here? Is it to increase money, increase engagement? I mean, my assumption would be we want to increase user engagement, so increase watch time on YouTube. Is this a fair assumption?
Doctor Malamute: Mm-hmm. Yes.
Hyper Taco: Okay, good. Alright, so let's see what we're considering exactly.
Doctor Malamute: Let's skip that. Actually, I take that back. If we're talking about increasing watch time, that will bias our recommendation toward long-form videos, since a long-form video has a better chance of getting more watch time. Let's say increase engagement.
Hyper Taco: Increase, OK, increase. Oops, I'm wrong. So increase engagement.
Doctor Malamute: Yep.
Hyper Taco: Engagement, if I understand correctly, is the number of videos the user is going to click and watch up to a certain threshold. Is this fair?
Doctor Malamute: Yeah, I think a proxy. Though engagement is more subjective, but I think a proxy metric can be the click-through rate.
Hyper Taco: Okay, yeah. CTR. Okay, so good. The metric to optimize is click-through rate.
Doctor Malamute: Okay, good.
Hyper Taco: Okay, so I assume the recommendations should be personalized, so I should have access to user data. Is this a fair assumption? Yeah. User data, and also access to past user data: past watched videos, essentially a large dataset of watch-time history and user interactions in the system. Past interactions.
Doctor Malamute: Sorry, your voice is not clear. Could you maybe fix the audio issue on your side? Okay.
Hyper Taco: Yeah, I can try to speak slower.
Doctor Malamute: Or maybe get closer to the mic. That could help. I'll try to increase my volume here too. Okay. Yeah, so first of all, yes, to all those questions: user data, past interactions, definitely yes.
Hyper Taco: OK, good. Thank you very much. Also, all right, do we need to take into account factors such as language? For instance, if the user speaks a certain language, should we only recommend in that language?
Doctor Malamute: Let's limit that to just English.
Hyper Taco: OK. OK. Are there fairness or safety considerations that we should take into account, or is this out of scope?
Doctor Malamute: Uh, that's out of scope, yeah.
Hyper Taco: Uh, out of scope. Okay, let's talk briefly about the scale. So my hunch would be there's about 2 billion users on YouTube. Is this a fair assumption? Is that roughly the right order of magnitude we're talking about?
Doctor Malamute: Uh, so, um, before we go into the scale part: are you done with the core requirements? I just want to ask, are you done with the business part? That's my question.
Hyper Taco: Oh, uh, yeah, for the business part, I was also thinking about asking if you had a baseline. Like, what is the baseline?
Doctor Malamute: Um, I can give you hints, but the goal here is that you can think about this more thoroughly.
Hyper Taco: More thoroughly? So, increasing watch time, increasing engagement. If we increase engagement and watch time, I guess we can also place more advertisements, because there are advertisements on YouTube. So maybe something to think about would be the revenue generated by ads placement on YouTube. Maybe that could be one of the things.
Doctor Malamute: Oh no, no. Okay, so let me— so ads are totally separate from organic growth. But let me give you two clarifications here. First of all, we are doing the home feed recommendation, which means when the user enters the home page, we give them a list of videos. There are many recommendation systems in YouTube. For example, watch next: when you finish watching, what's the next list of videos you can pick? So that's one very important clarification you need, and they intentionally give you a very ambiguous system design prompt. Usually we are asking about home feed recommendation. And second of all, YouTube has different forms of entity. For example, long-form video, short-form video, channel. Oh, I see, I see. Yeah, yeah, yeah.
Hyper Taco: Okay.
Doctor Malamute: So we're limited. Yeah, go ahead.
Hyper Taco: I mean, okay, the question I should have asked then is about different types of videos: there's Reels, and there's regular videos. Should we consider both of them, or should we consider them separately? Yeah, it is also like—
Doctor Malamute: yeah, yeah, for scope of this, uh, interview, let's limit to long-lived video. But, uh, just give you heads up, like, uh, you don't have to be too concerned about, uh, the, the challenging— technical challenging. But you should be more thorough in thinking about the product. Uh, usually we will limit the scope to certain, like, certain product, like certain language, like you just mentioned. Yeah.
Hyper Taco: OK. So, OK. The goal is to increase watch time. I think I was also saying the point is to maximize the click-through rate. I think there's a direct connection between click-through rate and advertisements, so it's also about the amount of money the website generates. And the whole point is to generate home feed recommendations. So essentially, there's a ranking of the different videos that appear on the user's feed, say by most important. And I think it's fair (maybe you can tell me) to assume that you want to put first the one with the highest likelihood of being picked by the user. We only consider one type of video. We have access to user interaction data as well as user data. We consider English only. Fairness and safety are out of scope.
Doctor Malamute: Okay. Cool. Yeah.
Hyper Taco: So the next question I have quickly is like, do we have a baseline already or are we like implementing this from scratch?
Doctor Malamute: Nice, nice. That's a very good question. Yeah, so assume you have a baseline, which basically means there's logging in place.
Hyper Taco: Okay, logging in place. Okay, okay. So, all right, I'll take a break here from the product thinking. Maybe I want to think quickly at a more system level, as in scale. So the order of magnitude, and maybe you can contradict me, would be roughly 2 billion users and let's say maybe 10 billion videos. Is this a good order of magnitude?
Doctor Malamute: Yeah, okay.
Hyper Taco: And also, at this scale, we want to produce recommendations fast. So one requirement that I think could be good, and maybe you can discuss, is that the P99 of the generation latency should be roughly 100 milliseconds. Is this a fair assumption?
Doctor Malamute: Um, that would be very challenging, but let's say the overall can be within 1 second.
Hyper Taco: 1 second?
Doctor Malamute: Yeah, yeah, yeah.
Hyper Taco: OK. OK. Thank you. I mean, I'm just thinking about rough numbers. Full disclaimer, I never worked on recommendation hands-on. But OK. So those are the clarified requirements. Is there anything else you want to discuss? I think we have a fairly good idea. Yeah. OK, thank you. All right, so at a high level, the modeling approach I want to take is fairly standard in recommendation: candidate generation and ranking. The idea is you have a two-step approach, and that's what I want to take. I know there are different approaches for recommendation, but the one I want to use, and that as far as I know is state of the art, is: you have a user, you have a lot of videos, we said 10 billion roughly. You want to first select, let's say, 1,000 or maybe 10,000, something in that ballpark. So the first step is candidate generation, which selects, let's say, 1K videos that the user is likely to pick. And then from these 1K videos, in a ranking step, you rank the different videos, which takes O(n log n) where n equals 1,000. That's why it's quote-unquote fast. And then you display the videos by order of importance, by score. I'm gonna pause here for a second. That's the direction I want to take. I want to check in first, if it's fine with you, whether you have any questions about the approach I'm taking and whether what I'm trying to achieve is clear.
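The two-stage flow described here (select roughly 1K candidates out of a huge corpus, then rank only those) can be sketched in a few lines. Everything below is a toy: the corpus is 10,000 random vectors instead of billions of videos, both stages score with a plain dot product, and the "candidate generation" is a brute-force scan standing in for a real ANN index.

```python
import random

random.seed(0)

DIM = 8  # toy embedding dimension, assumed for illustration

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy corpus: 10,000 stand-in videos instead of YouTube-scale billions.
videos = {vid: [random.gauss(0, 1) for _ in range(DIM)] for vid in range(10_000)}
user_emb = [random.gauss(0, 1) for _ in range(DIM)]

def candidate_generation(user, corpus, k=1000):
    # Stage 1: cheaply pull ~1K plausible videos out of the full corpus.
    # (In production this would be an ANN index lookup, not a full scan.)
    return sorted(corpus, key=lambda vid: -dot(user, corpus[vid]))[:k]

def rank(user, candidate_ids, corpus):
    # Stage 2: order only the ~1K candidates, O(n log n) with n = 1,000.
    # (A real ranker would use a heavier model and richer features here.)
    return sorted(candidate_ids, key=lambda vid: -dot(user, corpus[vid]))

candidates = candidate_generation(user_emb, videos)  # 10,000 -> 1,000 IDs
feed = rank(user_emb, candidates, videos)            # best-scoring video first
```

In a real system the two stages use different models: retrieval optimizes recall under tight latency, while the ranker can spend much more compute per candidate.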
Doctor Malamute: Sounds good. Very cool.
Hyper Taco: Thank you. Okay, so here's the rough roadmap and the way I'm going to structure this; we are about 10 minutes in. First, I just want to talk briefly about the data that I have. Then maybe a modeling deep dive, where I'll also talk about the actual model I'm going to use for each step, the candidate generation and the ranking. Then I can also briefly talk about training. But most importantly, I want to talk next about evaluation of the model, both online and offline. Then I want to talk maybe about deployment and serving. And by this I mean: how are we going to deploy the model? How are we going to test it? How are we going to see if it's an improvement compared to the baseline? And also maybe how to make the serving of the model fast enough. Also, how are we going to monitor things, such as if the model stops performing, how can we detect this? All those kinds of things. So that's the high-level approach I'm going to take. Does that sound good to you? Is there any aspect you need me to cover?
Doctor Malamute: Yeah, so here's one piece of quick feedback. In a machine learning system, especially in this kind of very big design, there are multiple models involved. So I wouldn't recommend you follow this template, because there'll be a retrieval model, there'll be a ranking model; there are probably many small models involved if you want to dive deep. So I would recommend you go with the high level first. And once you finish that, depending on your interest, maybe I will challenge you on one particular component, go through one particular model. Maybe it's not the model you expect me to challenge you on. So the template here is not necessary. Just quick feedback.
Hyper Taco: Okay, okay, okay. So I mentioned there's this candidate generation and ranking approach. There are other things, such as collaborative filtering; there are different options. I think I would prefer going with this approach. Is this fine with you? I can discuss other things. I'm going to pause here for a second.
Doctor Malamute: So I feel like we want to go— so what you're calling the modeling process, this candidate generation and ranking, is actually the high-level flow. I want you to show me the life of one query: one user request comes in, how does it go end-to-end? Yeah. Yeah. OK.
Hyper Taco: I'm just going to delete like this. And we can go back to this. So, okay, one user and one—
Doctor Malamute: One user request, actually. It's not one user, actually. Yes.
Hyper Taco: Oh, one request. Okay. One request.
Doctor Malamute: Okay.
Hyper Taco: So the idea is you have— all right. So where is this?
Doctor Malamute: Okay.
Hyper Taco: So the idea is you have a user, and then you have past search history. You have user data. You have views. Video data. Essentially those are two different objects, videos and users. So I want to put everything into an embedding space: map users and videos into the same embedding space. It's very important that this embedding space is shared in dimension, because you're putting everything into the same vector space — not the same exact vector, but the same vector space. And the whole trick, and the whole fun of the model, is to find embeddings, vector representations, that satisfy the following. I'm just going to put this in brackets, as kind of a statement of the intuition: if a video is likely to be watched by the user, the user embedding and the video embedding should be close. I'm going to say "close as vectors" intentionally. Let me just put it here so it's visible.
Doctor Malamute: Okay, that's too many details. Before that, I want to see how the flow goes. For example: the user comes to the server, then, maybe I can give you something, the server retrieves the user's embedding from some other server, and then uses this embedding to call another server. I say "server" intentionally because I want to skip the name; I don't want you to mention it. Call a server, retrieve some videos that have similar embeddings, and you now have the list of videos. Then what's next? Something like that. That's the flow I want to see.
Hyper Taco: So, retrieval of videos given one user. The thing is, for one user you have, let's say, 10 billion candidates, 10 billion videos. That's way too much. In an ideal world, you'd compute the distance between the user embedding and every single video embedding, but that's way too expensive from a computational standpoint. So the idea is to use, for instance, approximate nearest neighbor search. There are some types of approximations, called ANN, approximate nearest neighbor, that retrieve roughly, using some approximation, 1K videos close to the user embedding. Does that sound good?
Doctor Malamute: Yeah. So I guess maybe that's a distinction between L4 and L5 here. Usually when we talk about high level, I can give you a rough sketch. So this is the user. Can you still see my screen? And there's the retrieval stage, or maybe this time it's candidate generation. This will be ranking. This is something I want to see. And within each, for this one, this will be ANN. Okay. And let me just give you something very, very rough: this will be something. What's the thing? Maybe you can tell me.
Hyper Taco: Okay, so yeah, it's a feature store. You also need to have the feature store. You need to pre-compute the features for all of the videos. I think that's important, because you want to pre-compute them.
Doctor Malamute: Sorry to interrupt. Since you want to level up, to do L5, right? The standard practice is we have this diagram, this high-level architecture. You draw this diagram together to give both you and the interviewer an understanding of what the system looks like. And once you finish this diagram and the interviewer is okay with it, then we kind of have a high-level sign-off. That's step one. Step two is what you're saying about retrieval, ANN, collaborative filtering, and all those details. That's what we call the dive deep. So basically we need to finish the whole high level first, for us to understand how this flow is working. Yeah, yeah.
Hyper Taco: Okay, so you have a user query, and as we discussed, it goes into a feature store for the user. Okay, this candidate generation is going to use ANN. I think here I would probably— oops, sorry. I would probably do— okay, how do I—
Doctor Malamute: This is intentional for you to modify. I didn't actually put this in very detail. Okay.
Hyper Taco: How do I— I do a little shape here. You're like a generation feature store. I'm going to do this here. So you have a feature store and there's another box here, feature store. Let's do— okay, let's do this feature store for video embeddings. Then you just use this to like— there's like, I guess, let's do it this way. You have like some approximate nearest neighbor that just goes to feature store here. So you have ANN, and then that's gonna— from the DNN is gonna like retrieve, uh, just like give the candidate generation on the NN. Let's say 1K ranking.
Doctor Malamute: Oh yeah, knowledge gap here. So a feature store is a KV lookup store, and the ANN here is actually an index, an embedding index. You can consider ANN as more like a decision tree on hyperplanes: they cut the hyperspace into multiple half-spaces.
Hyper Taco: And is it like locality hashing?
Doctor Malamute: Is it—
Hyper Taco: look, are you referring to local sensitivity hashing?
Doctor Malamute: Uh, yeah, yeah, yeah. Uh, and that's not saved as the feature store. Okay.
Hyper Taco: Oh no, I'm saying the ANN is querying from the feature store— oh, no, no, the feature store is just a key-value lookup. No, I see, I see, okay. It's like a decision tree. I see what you mean, okay. You're talking about locality-sensitive hashing, so retrieval is almost O(1), right? It should be almost something like O(1).
Doctor Malamute: Oh, no, no, O(log n). You need O(log n) to do the retrieval.
Hyper Taco: I don't know my locality-sensitive hashing complexity, so I'll trust you, it's O(log n), okay.
Doctor Malamute: Yeah, so in [REDACTED] there's a name for this. This algorithm: essentially the embeddings live in a hyperspace, right? And you cut the hyperspace into many mini-spaces with each split and try to find the nearest neighbor in there. So that's how we store the embeddings for videos. How we store the embedding for a user is a KV store, usually the feature store. The feature store is essentially just a KV lookup, and what you do here is basically map user ID to embedding. Yeah. So you look up the embedding for the user.
Hyper Taco: Yeah, it's O(1). It's O(1), yeah.
Doctor Malamute: That's O(1), yes, exactly. Then you retrieve from the ANN the list of video IDs. This will give you a list of video IDs. Yep. So this will be a list of video IDs returned. Okay. And once the list of video IDs is returned, there's another interesting thing you need to understand. The ANN is essentially optimized for latency, optimized for speed, right? Because there are 10 billion videos over there. So everything's in memory, essentially: the algorithm, everything's in memory. And once you're in memory, you cannot save a lot of information per video. You can only save the IDs.
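The index the interviewer describes (random hyperplanes cutting the embedding space into half-spaces, storing only video IDs in memory, with KV feature stores on the side) can be sketched as a minimal random-hyperplane LSH. All sizes and seeds here are illustrative toys, and probing a single bucket is a simplification; production libraries such as Faiss or ScaNN are far more elaborate and probe many partitions.

```python
import random

random.seed(1)

DIM = 8
N_PLANES = 6  # each random hyperplane halves the space; 2^6 = 64 buckets

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Random hyperplanes standing in for a real ANN index's partitioning.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]

def bucket(emb):
    # Sign of the projection onto each hyperplane -> one bit per plane.
    return tuple(dot(emb, p) > 0 for p in planes)

# KV "feature store": video ID -> embedding (plus richer features in reality).
video_store = {vid: [random.gauss(0, 1) for _ in range(DIM)] for vid in range(5000)}

# In-memory ANN index: bucket -> video IDs only, as discussed (no features fit).
index = {}
for vid, emb in video_store.items():
    index.setdefault(bucket(emb), []).append(vid)

user_emb = [random.gauss(0, 1) for _ in range(DIM)]  # O(1) user KV lookup in reality
candidate_ids = index.get(bucket(user_emb), [])      # IDs sharing the user's bucket
candidate_feats = [video_store[vid] for vid in candidate_ids]  # second KV lookup for ranking
```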
Hyper Taco: Yeah, okay, so you're only like passing the IDs ever, right? And you can always retrieve— okay, you're only passing the IDs. Okay, okay, okay, okay, okay, got it.
Doctor Malamute: In that sense, if you want to pass into the ranking stage, you need another feature-store lookup, by video ID, for the video features. Okay, so basically— yeah, yeah, yeah. So this is what we call the high level. I just want to give you the whole flow. The user feature store will probably give you the embedding plus some features, whatever you want to design. The list of video IDs goes to the feature store again and gives you a list of video features. And then you go to the ranking stage, using whatever ranking model you design, and that's how the flow goes. Hopefully this makes sense to you.
Hyper Taco: Okay, that makes sense to me. Okay, so maybe let me start again. All right, if I were to do this again, I would say: okay, the modeling approach is candidate generation and a ranking stage. That's the whole design, and I would jump straight to drawing the diagram. A user enters a query; you query the user's embedding, the representation of the user, from the feature store. It's keyed by ID, I guess, and returns vectors. You pass it to the candidate generation stage, which is akin to an ANN, which retrieves a list of video IDs. It's based on locality-sensitive hashing, and you can mention Faiss or ScaNN. Now that you've retrieved all of the different video IDs, you pass those video IDs to a ranking step. The ranking is also going to retrieve all of the features from the feature store, and you do the ranking, essentially. So I guess here you also need a feature store somewhere. How do I use that feature store? Okay, what is going on? Okay, I'm sorry, I almost skipped a thing.
Doctor Malamute: Okay. Yeah, yeah.
Hyper Taco: Feature store. Feature store. And you just do the ranking, and it spits out the recommendations, and that's the output. It goes here, and that's the final output here. Um, would that be fine?
Doctor Malamute: Yeah, so, now good. I think this is more like a semi-mock interview; I don't want to do a formal one because I want to give you the best chance of success. Yeah, so if you have a good clarification of the requirements and also this high-level diagram, clearly communicated to the interviewer, you pass the interview already. That's the L4 bar. And if you want to be leveled up to L5, we probably need to talk a little bit about either the ranking or the candidate generation phase in the design. Do you want to move on to this?
Hyper Taco: Yeah. Okay. So maybe, all right. Something I'd like to do is talk about the data that we have, because at the end of the day, we're generating features and embeddings. Should I spend some time? Do you want me to spend some time talking about how to generate features for both the users and the videos? Because there are different things you can do. Okay, so I don't want to call it data; let's call it feature engineering.
Doctor Malamute: So that's the deep part. You have to understand this yourself: before, it's all high-level design, and when you're talking about feature engineering, especially, I assume, for ranking, that's the dive-deep part. That's another-level thing. So basically you finish the first, then the second. That's the procedure here.
Hyper Taco: Yeah. So I mean, I just want to talk about feature engineering, and then there are two different models we need to tackle: I need to do a deep dive into both the candidate generation step, because you need to decide how to generate the embeddings, and the ranking stage.
Doctor Malamute: Just choose one. You do not have time for both.
Hyper Taco: OK. OK. That's good to know. OK, feature engineering. All right, so we have two types of data. For a user, we have, let's say, the ID, the location, the age, and maybe the past watch history. Actually, let's not include the watch history: ID, location, age, all that kind of stuff. So the thing is, these are all categorical variables so far. Since there are many, many combinations, you can reach a huge cardinality, and that's not good. So I think the trick is you want to generate embeddings from these categorical variables. You can also do bucketing. For age, you can bucket users into different age groups; same with location, you can bucket by country. So bucketing and embedding here are key. Yeah, OK. Now, let's talk about how we're going to generate features from the videos. For the videos, there are a few things to consider, right? There's a set of frames, a sequence of images, as well as text.
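The bucketing-plus-embedding idea for categorical user features can be sketched as follows. The age buckets, the tiny country vocabulary, and the random embedding tables are all assumptions for illustration; in training, the table rows would be learned parameters.

```python
import random

random.seed(2)

EMB_DIM = 4
AGE_BUCKETS = [(0, 17), (18, 24), (25, 34), (35, 49), (50, 120)]  # assumed buckets

def age_bucket(age):
    # Map a raw age to its bucket index, taming the feature's cardinality.
    for i, (lo, hi) in enumerate(AGE_BUCKETS):
        if lo <= age <= hi:
            return i
    raise ValueError(f"age out of range: {age}")

# Embedding tables: one small vector per age bucket / country. Random
# stand-ins here; in training these rows would be learned.
age_table = [[random.gauss(0, 0.1) for _ in range(EMB_DIM)] for _ in AGE_BUCKETS]
country_vocab = {"US": 0, "CA": 1, "GB": 2}  # hypothetical tiny vocabulary
country_table = [[random.gauss(0, 0.1) for _ in range(EMB_DIM)] for _ in country_vocab]

def user_features(age, country):
    # Bucket the raw categoricals, look up their embeddings, concatenate.
    return age_table[age_bucket(age)] + country_table[country_vocab[country]]

vec = user_features(age=29, country="CA")  # a dense 8-dim user feature vector
```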
Doctor Malamute: Plus text.
Hyper Taco: Maybe I don't want to talk about— we can also talk about sound. Plus sound. So, how do we generate useful video features? Something I would do is feed the frames to a pre-trained model; I'm thinking of ResNet-50, for instance, or a Vision Transformer, something pre-trained. I'm talking about the images here. At the last step before the classification head, it generates a feature vector. And maybe what you can do, since videos have a minimum length, is take the number of seconds corresponding to this minimum length and, I guess, sample images from it. You can also use a pre-trained model such as BERT for text embeddings. For sound, I actually don't know, so I'm just going to pass on it, but I'm sure there's some pre-trained model to generate embeddings from sound. So the idea is you generate a bunch of embeddings for the images and a bunch of embeddings for the text, and you stack them all together. Stack everything. So you have the users and the videos. Now you also have user-video interactions. And I guess a user-video interaction is: did the user like the video? Did the user watch the video? How much of it did they watch, etc., etc. OK. So far, what I've done is talk about how we can turn everything for users and videos into embeddings. But those embeddings do not necessarily have the same dimension. So that's how we're going to process the data, but maybe the next step is how we're going to generate the proper embeddings. And I think the idea is a two-tower neural network.
The big buzzword is two-tower neural network. The idea is you take the user features. I'm just going to draw it here: this is going to be user features, and this is going to be video features. Okay. As we discussed, at the output here, they just generate the raw user features and the raw video features. So here you're going to have raw user features, and here you're going to have raw video features. Okay, but you need to deal with the fact that the dimensions don't match. What you need to do is add a few layers: you add some additional layers to make sure everything is in the same dimension at the end. Okay, just like this, like this. How do you do that?
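The per-tower projection layers described above (extra dense layers so both towers end in the same dimension) can be sketched like this, with random matrices standing in for learned weights and a single linear layer per tower as a simplification:

```python
import random

random.seed(3)

def linear(x, W):
    # One dense layer (no bias): x of length n times W of shape n x m -> length m.
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

USER_RAW, VIDEO_RAW, SHARED = 6, 10, 4  # raw dims differ; shared output dim

# Random stand-ins for the learned weights of each tower's projection layer.
W_user = [[random.gauss(0, 0.5) for _ in range(SHARED)] for _ in range(USER_RAW)]
W_video = [[random.gauss(0, 0.5) for _ in range(SHARED)] for _ in range(VIDEO_RAW)]

user_raw = [random.gauss(0, 1) for _ in range(USER_RAW)]    # raw user features
video_raw = [random.gauss(0, 1) for _ in range(VIDEO_RAW)]  # raw video features

user_emb = linear(user_raw, W_user)    # both towers now land in the same
video_emb = linear(video_raw, W_video) # 4-dimensional embedding space
```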
Doctor Malamute: Yep.
Hyper Taco: Okay, and then at the end of the day, you have two different embeddings, and you want to learn the layers. You need to learn the layers such that the embeddings are close if a user has watched a video before. So the keyword here, the objective for training, is called contrastive learning. You're just going to put them here and compare them, and you have a contrastive loss.
Doctor Malamute: Okay, okay. So that's a good thing to call out here. So contrastive learning. Yeah. So what's contrastive learning?
Hyper Taco: Okay. So the idea is you have—
Doctor Malamute: okay.
Hyper Taco: So the idea is: if two vectors are supposed to be close, for instance because a user has watched a particular video, you want the user embedding to be close to the video embedding. However, if a user has not watched a video, you want the user embedding to be far away from the video embedding. So essentially you need one positive example and a bunch of negative examples in order to learn similarity and dissimilarity. I think the loss is some variant of: you compute the cosine similarity between the user embedding and each video embedding, then take a softmax over the n examples you have, the n examples being one positive and n minus one negatives. One thing you need to be careful about is including quote-unquote hard examples, that is, hard negatives. I'm going to pause here for a second; I've been talking a lot and want to make sure we're aligned.
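The loss described here (cosine similarity followed by a softmax over one positive and n minus one negatives, with the loss being the negative log-probability of the positive) can be sketched in plain Python; the toy 2-d vectors are purely illustrative:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(user, pos_video, neg_videos):
    # Softmax over one positive and n-1 negatives; the loss is the
    # negative log-probability assigned to the positive pair.
    sims = [cosine(user, pos_video)] + [cosine(user, v) for v in neg_videos]
    exps = [math.exp(s) for s in sims]
    return -math.log(exps[0] / sum(exps))

user = [1.0, 0.0]
loss_good = contrastive_loss(user, [1.0, 0.1], [[-1.0, 0.0], [0.0, -1.0]])
loss_bad = contrastive_loss(user, [-1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]])
print(loss_good < loss_bad)  # True: an aligned positive gives a lower loss
```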
Doctor Malamute: You're speaking. So basically, I want to understand why you were talking about contrastive learning, right? What's a particular— how do you pick the negative example and the positive example in this case?
Hyper Taco: OK, so a positive example is a video the user has watched. You can add some subtlety here. The baseline would be clicks: if a user clicks on a video, it's a positive example; if a user has been shown or suggested a video and has not clicked on it, it can be a negative example. Now, you can be a bit more subtle. For instance, if the user clicked on the video but didn't watch more than, say, 50% of it, you can count that as a negative example too. So there's some subtlety here. Or in that case you can make the label soft: instead of 0 or 1, say it's 0.5, for instance. Does that make sense?
Doctor Malamute: So maybe— I think what I'm looking for is the batch. Contrastive learning usually means in-batch negatives: a single batch of positive labels, basically one user paired with one video, and the rest of the videos in the batch are treated as negatives for that user. That's how we define contrastive learning. Oh, sorry.
Hyper Taco: That's what I mean by positive example and negative example. So you need to have one user that's positive, that the user has watched, and like some examples that the user has not watched.
Doctor Malamute: Yeah, yes. Contrastive learning, the technique introduced in the OpenAI CLIP paper, essentially means using batch negatives. You take a batch that contains purely positive pairs, with no negative labels at all, and within the batch you use the other users' positive pairs as the negatives for this user.
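The in-batch scheme described here can be sketched as follows: the batch holds only positive (user, video) pairs, user i's positive is video i, and the other videos in the batch serve as that user's negatives. The random embeddings are illustrative:

```python
import numpy as np

def in_batch_contrastive_loss(user_embs, video_embs):
    # Batch of B positive (user, video) pairs only; user i's positive is
    # video i, and the other B-1 videos in the batch act as negatives.
    u = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    logits = u @ v.T                              # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Cross-entropy with the diagonal (the true pair) as the label.
    return float(-np.log(np.diag(probs)).mean())

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))
aligned = in_batch_contrastive_loss(v, v)         # each user matches its video
shuffled = in_batch_contrastive_loss(v, v[::-1])  # pairs scrambled
print(aligned < shuffled)  # True
```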
Hyper Taco: I see. OK. I mean, my understanding was that we had to use one positive example and explicit negative examples. I'll just read the paper then. So, only positive examples in the batch.
Doctor Malamute: OK, let's move on. And picking negative examples, that's actually a good topic. So assuming you're not using batch negatives, now that I've explained what contrastive learning usually refers to, do you think we want to use the other positive labels as negatives? Or how do you pick negatives here? Basically, I want to challenge your assumption—
Hyper Taco: Yeah.
Doctor Malamute: —that if a user has not watched a video, it's a negative label.
Hyper Taco: Yeah, okay. One thing I should make precise: in my head, when I think about negative examples, I think of a video the user was presented with, that the user scrolled past but did not click on. That would be a negative example, not a video the user has never been suggested, because if a user has never been suggested a video, it makes sense that they never clicked on it. So I'm talking about videos the user was shown and never clicked on. Those are—
Doctor Malamute: Yeah. That's exactly what you want to call out. Because in contrastive learning in the original sense, what we usually refer to, you just randomly pick videos as negatives for a user, and that's going to bias your model a lot, because— let me just finish the thought. Originally, when OpenAI used contrastive learning, they were pairing text and images: a text description versus an image. There the assumption is solid: an image paired with a different caption is very likely a true negative, because that caption doesn't describe the image. But our setup is different: it's a user versus a video. Even if a user hasn't watched a video before, they could still watch it, could still like it. So the assumption is much weaker than in the OpenAI paper. I see many candidates bring up contrastive learning without seeing the difference between the two. What you're describing here is really a traditional classification model with a good selection of positive and negative samples.
Hyper Taco: Yeah, I mean, I'm just giving the, I would say, traditional stuff. I have to say I've never worked on recommendation before; I did the reading. I work on AV, so— but yeah, okay, fine. That's good to know. Thank you.
Doctor Malamute: Yes, and it's very good you already have this in your mind, but I don't think you called it out in the first place: your negative samples should be selected as viewed-but-not-clicked, which is better. Yeah, sorry.
Hyper Taco: OK, good to know, because I thought I'd said it, but I did not. Thanks for mentioning it. OK. So this is the two-tower network, and that gives us the embeddings. One thing I should mention is that when we compute the embeddings for the videos, they should be put into a feature store, because video embeddings are static. Or, I guess, not exactly static, because the number of views should be a feature of the video, so maybe you recompute them every day. Essentially, the point is you should not compute the embeddings on the fly; you should compute them ahead of time and store them. Now, should we move to ranking? Or do you want me to talk more about serving and how to deploy the model?
Doctor Malamute: So for this one, there are still tons of things to talk about. For example, you mentioned the sequence of images in the videos, right? Are you going to use the raw features, the image pixels? Or are you going to use an embedding-space representation of the video frames?
Hyper Taco: Oh, sorry, I thought I'd said it. I was thinking about using a pre-trained model; I mentioned ResNet-50. The idea was: each video has a minimum length, right? So if the video runs at 25 frames per second, you take the minimum number of seconds times 25, and every video gives you at least that many frames. The idea I had was to pass each of those frames through a ResNet-50 or any other pre-trained model, a Vision Transformer for instance, which generates an embedding per frame. Then you either stack all of those embeddings or do some averaging. Essentially, I want to use a pre-trained model to generate features from the raw images; I'm not going to use the raw pixels directly. That was the basic idea.
Doctor Malamute: Okay. Cool. So now, two follow-up questions on this. First of all, videos have different lengths, right, with a very wide distribution. How do you pick the sequence of images per video? There's a choice here I want to see. I will just—
Hyper Taco: go ahead. The baseline that I had in my head was just maybe let's just look at the first min length times Okay, you can, you can maybe took like the, the first like 2 seconds and the last 2 seconds, or maybe the first 5 seconds and the last 5 seconds, uh, and also maybe look at like some frames in the middle. So like maybe the thing that you can do is just like find some like 10 buckets, just like the first like, uh, and you just like, uh, kind of divide each image by like buckets of 10, um, of equal length. So you know, like for instance, you want 10 buckets of 1 second, So you take the first ones and you just spread those buckets around the length of the video, and then you just pass all of the selected frames by this method to a pre-trained model. Does that make sense?
Doctor Malamute: Yeah. Based on my experience (we're just talking here, it's not evidence or a paper), a few seconds from the beginning and a few seconds from the end don't actually hold much information about the video. Usually it's a preview and an ending. And considering you're using image features, right, what ends up there is often mostly on-screen text, or maybe some sound in this case.
Hyper Taco: Overall— You said preview and then what?
Doctor Malamute: Imagine you're watching some YouTube video, right? We're both heavy users of YouTube. The first few seconds will probably tell you nothing about the video. And the last seconds are probably some ending song, an outro, right? So it's likely you don't get that much information from them.
Hyper Taco: Okay.
Doctor Malamute: That's my, uh, my suspicion. And if that's the case, how do you tackle this?
Hyper Taco: You're right. Okay, well, the obvious workaround would be to look only at some middle section, or only at the part of the video after, say, the first 5 seconds. But I think that's a little hacky. So: we now have these wonderful models like Gemini where you can pass a video in directly, and features get generated from the video. Maybe what you can do is use a pre-trained model like that. Instead of splitting a video into frames, you use one model that takes text and video as inputs. The advantage is that in a video you have context, because there are sequential dependencies between frames, and the approach I originally had in mind doesn't take that temporal dependency into account, and you want to. So I would use something like Gemini. I don't know how feasible that is, I've never worked with it, but that would be the idea.
Doctor Malamute: Okay. First of all, I don't think that's hacky. We're living in a world of LLMs, and reusing a pre-trained model does not sound hacky to me. The question here is the compute, right? I'm not sure I've heard of an LLM that can take videos for now; I'm not in the multimodality space, I know models taking image and text but not video. But assuming there is one that takes video, processing the video and generating the embedding is going to take a long time, right? And be costly. How do you mitigate that? Because, for example, you cannot do this online for sure; this could take 10 minutes.
Hyper Taco: No, no. The video embeddings you want to compute entirely offline, and yes, they need to be stored in the feature store. So, OK, how do we speed this up? Because we have, say, hundreds of frames to process at the same time. Essentially, we need to pass a bunch of frames into a pre-trained model, and the two things I know for speeding up inference are: you can batch the predictions, and you can quantize the model. So to make this faster, I would first try to batch and quantize everything: batch all predictions and quantize the model, because at the end of the day, I don't think we need very high precision for this.
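Of the two speedups named, quantization is the easier to sketch. Below is a toy symmetric int8 weight quantization (batching would simply mean running many frames through the model in one forward pass). This is an illustration of the idea, not any specific library's API:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric int8 quantization: store weights as int8 plus a single
    # float scale, trading a little precision for 4x less memory.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes)  # 65536 262144: one quarter of the memory
print(err < scale)         # True: reconstruction error under one quant step
```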
Doctor Malamute: Yeah, what else?
Hyper Taco: Maybe, okay... how to generate embeddings faster besides batching and quantizing. Yes.
Doctor Malamute: Well, I guess that's okay. But imagine there's a discussion at the end where we challenge you a little, and you can propose some alternatives or further thinking on top of that; that's what we call dive deep. In this case, if generating the video embeddings is going to cost a lot, maybe we can downsample the video: instead of 4K, downsample to maybe 480p or even smaller, and also downsample the frame rate, so instead of 25 frames per second we use 1 frame per second, like you just suggested, and use that to generate the embeddings. That could be much less costly, right? And for the batch job, I feel we shouldn't do this for all videos; maybe only the videos that could be popular or are on target, and we can do it incrementally instead of all at once. Incremental means we do it in small batches, re-processing the updated videos, not all 10 billion in one run. Something like that; something we can discuss and optimize. That's the more realistic, practical kind of question. So I guess we can go to the feedback now. One tip here: you don't want to try to finish everything, because a machine learning system design interview is an infeasible task, mission impossible. The reason is that you'd have to squeeze all your knowledge about YouTube into one hour, and you cannot even describe the whole end-to-end system in one hour, let alone build it. What we really want to see is your capability to think at a very high level: understanding the business product, understanding how to define the product objective and translate it into a modeling objective, for example engagement or CTR, and have that reflected in your design.
And after that, I want to see how you actually manage the whole flow. Usually we call this the life of a query, but I want to avoid confusion, because a query usually means a search query in the traditional context. Life of a query here means how a user's request goes through the system: basically the life of a request in this setup, how it flows and how many components it touches. That's the second part: your high-level capability, both on the business side and on the system side. After that, we go into a particular component, and there we can talk about how you generate features, how you decide on features, how you do the modeling; there are many, many things to discuss. The goal is not to have a perfect answer; there's no right or wrong in system design. You have to have a reasonable answer, and when you face some challenges (I raise valid questions because I work in this area), you should give me some workaround, some solution to optimize. Like we're working as peers, right? You propose a design, I give you challenges, you come up with something workable; that's good enough. So that's the overall feedback: you don't have to finish end-to-end. As long as I collect enough data points, the interview is over. And I can say, as a practicing interviewer, most of this is decided in 30 minutes. The rest of the time mostly cannot change the opinion; after 30 minutes I usually already know you well enough to reach a conclusion. Questions?
Hyper Taco: I did a mock interview with a guy at [REDACTED], and he told me that, at least at [REDACTED], you have to finish everything; time management was a big piece of feedback I got. So I just want to get a sense: how different is ML system design at [REDACTED] versus [REDACTED]? Do you know? Because the expectations seem a bit different from what you're telling me.
Doctor Malamute: To me, that's not company-wide guidance. Actually, at [REDACTED] I attended an interviewer pilot workshop, and it told me we don't need candidates to finish. And I've held that bar over my career, not just at [REDACTED] (I've already left [REDACTED]). But maybe it varies by interviewer; that could be true for that particular interviewer, or it could be true at [REDACTED]. I don't have an answer for that. But for your own benefit, check with the interviewer before you proceed. Say something like: we probably don't have enough time to dive into all the components; would you rather we dive deep into one of them, or do you want me to finish describing all of them? Proactively ask the interviewer and follow their decision. What do you think? Yeah.
Hyper Taco: Was it a question? Huh? Oh yeah, okay, okay, got it, got it, got it. Okay, yeah, so we don't have the time to cover everything. Uh, yeah, okay. So essentially when I said—
Doctor Malamute: sorry, go ahead.
Hyper Taco: Just like, when I write down data, feature engineering, deep-dive modeling, serving, deployment, I just ask: is there one component you want me to go deep on?
Doctor Malamute: Yes. So my answer is that maybe it's by interviewer, or maybe it's by company; I don't know, but I'm consistent on this. So for your own benefit, it's better to ask the interviewer before making the assumption. Basically you say: we only have 60 minutes and many components we could talk about; do you want me to go over everything, or to go deep on one thing? Ask that question explicitly.
Hyper Taco: Yeah, yeah, it's a good one. Uh, we cover everything.
Doctor Malamute: Okay.
Hyper Taco: Uh, all right, [REDACTED], is there like anything else you want to tell me?
Doctor Malamute: Yeah, so maybe I can give you honest feedback. I feel you're on the fence for L4. That's what I would say, because there are things you can improve in the requirements part and in the high-level part, but you have a good understanding of modeling.
Hyper Taco: Sorry, when you say I'm on the fence for— is it like I'm borderline L3, L4, borderline L4, L5?
Doctor Malamute: Borderline of L4. Oh, and pass or no pass. Basically pass or no pass.
Hyper Taco: There's no way it's L5 and it's more of an L4, maybe?
Doctor Malamute: Yeah, but I feel that if you fix the requirements and the high-level part, you should be a pass for L4.
Hyper Taco: Okay, but L5 is not gonna happen.
Doctor Malamute: For L5, I feel we'd need to do many more rounds of this so I can see your overall readiness, because you'd need to be very fluent in the flow: you'd need to drive everything, and everything would have to be up to my expectations. Then I would give you L5. That's my feedback.
Hyper Taco: Okay, so that's okay, that's borderline then for—
Doctor Malamute: Okay, [REDACTED], that's good to know. I feel you're very close. On some days I would give you a pass, on some days a decline, for L4. That's much better. Yes. And the good thing is that for L4, system design is not a very hard criterion here. I see.
Hyper Taco: Okay, got it. So besides that honest feedback, you said borderline on L4, what else? You said my requirements should be better. So for requirements, I guess you mean I should talk more about things like CTR? I should have brought that up.
Doctor Malamute: No, no, well, you're right, but the important thing is that you need to define what the product is. Is it home-feed recommendation or Watch Next? And the recommended entity, especially since it's a recommendation system: is it videos, is it posts, is it channels? That's something you need to define. And one bonus point you got, which shows you do think about this, is that you talked about the baseline.
Hyper Taco: If it's a—
Doctor Malamute: do we have a baseline or not? That's something like a good bonus point. So it's like a little, little plus, little minus, something like that. I see.
Hyper Taco: Okay. I see.
Doctor Malamute: Okay. Yeah. So maybe let's move to your question. What do you think?
Hyper Taco: Sure. I mean, I don't have much questions. We covered a lot of stuff already, but if I had like one thing to work on before the interview, what would you suggest? Because, okay, so what I would do is just—
Doctor Malamute: Next Monday, right?
Hyper Taco: Yeah, it's something like introduce.
Doctor Malamute: So maybe talk to ChatGPT, go over this again, and try to be fluent. I see you have a tendency to focus on details first. Now, first of all, I think your details are good; you're better than many candidates. But you dive deep too early. You should give more of the high level first. You're designing YouTube; you're the architect of YouTube, so you shouldn't focus on details first. You should cover the high level, then dive deep; that's the more logical order, right? As the architect of YouTube, you first explain how the whole thing works at a high level and what each component does. And also, current models like Claude or ChatGPT have very strong system design capability; talk to them, have them mock-interview you, and practice as many times as possible beforehand.
Hyper Taco: And you'd say I should really do the drawing that I did? Okay, okay. In terms of the deep dive that I did, was it okay? Is there something I should have done better? Or—
Doctor Malamute: yeah, you're better than I think you That is better than L4. It's more close to L4 to L5. So I would say if you get everything ready, it's more like connecting the dots and filling the gaps, right? If you're filling the gaps for everything, you can, you can reach for L5 too.
Hyper Taco: Okay, that's good to know because I feel like I have the ML knowledge or I have like at least some intuition. I can bring the ML knowledge. It's just about presenting the whole stuff now.
Doctor Malamute: Okay, cool. Yeah, presenting: this is more like a demo or a presentation, right? You need to make sure the slides are smooth, that they connect logically, something like that. It's not out of your reach, so don't feel bad about it; it's just some gaps you need to fill before the interview.
Hyper Taco: I see.
Doctor Malamute: Okay.
Hyper Taco: Okay. So we didn't have the time to cover deployment, serving, A/B testing, monitoring, shadow deployments, maybe in comparison to the baseline. Is it fine if we don't get there? I mean, you already told me to ask whether we should cover everything, but—
Doctor Malamute: But listen to yourself. Each of those things is at least one team's work. You cannot cover a whole team's scope within 30 or 60 minutes, right?
Hyper Taco: Sounds good.
Doctor Malamute: You probably can briefly talk about it. But each thing you mention here is at least a 10-person job, and you cannot cover a 10-person job in 60 minutes. And you're covering many teams here.
Hyper Taco: Yeah, I know. I know. I would design this also myself in practice.
Doctor Malamute: Cool, cool, cool.
Hyper Taco: One last question. So my situation is as follows: I'm trying to boomerang, kind of, and I have to get through some interviews to get back in. I know I will not go through the hiring committee; in my case it goes straight to a VP.
Doctor Malamute: Boomerang means to go into VP directly?
Hyper Taco: I'm sorry, what?
Doctor Malamute: So the boomerang program means going to a VP directly, not the hiring committee?
Hyper Taco: Okay, so apparently at [REDACTED], there was just one AI depth interview, which I did extremely well on; they asked me a bunch of AI questions and I did very well. But since I was a DS when I left and I want to come back as a SWE, they added coding and ML design. So, OK, the question I had for you is: is the bar for boomerangs different? Do you know anything about that?
Doctor Malamute: I don't know. I'd like to know too. But I don't. Based on my interviewing before, I think I've done at least 100 interviews at [REDACTED], and I've never seen a candidate tagged as "this is a boomerang." So either it's possible I never happened to interview a boomerang candidate, or it's possible the recruiters intentionally hide that information from me.
Hyper Taco: I think they're hiding it, because in my [REDACTED] interview, which I already did, I told the guy I was trying to boomerang, and he said, oh, OK, I didn't know that. So I don't think they know I'm a boomerang.
Doctor Malamute: Okay, then we probably hold the same bar; we never treat them differently.
Hyper Taco: Okay, got it.
Doctor Malamute: Cool. Maybe a personal question: if they give you an L4, will you accept it?
Hyper Taco: Maybe, maybe. I mean, honestly, I did two ML system design mock interviews on this platform, and your feedback is consistent with the previous interviewers'. So there's no way I'm going to reach L5.
Doctor Malamute: Yeah, but I feel that with one more day or one more week, depending on your speed of learning, you can definitely fit L4. That's something I can see. So yeah, it's up to you whether you'd accept the L4 offer.
Hyper Taco: I'll think about it, because I'm at a startup right now and it's terrible. So we'll see. I mean, I left [REDACTED] as an L4 DS, and I'm trying to come back as a SWE, so if there's also a down-level there, oh my God. That's fine.
Doctor Malamute: Yeah, and as a data scientist changing tracks, right? I'm not pushing you either way, but changing to the right track also gives you some bonus in compensation, right?
Hyper Taco: Yeah, I was very well paid as a DS, and I think that's why I got laid off, but I don't want to be a DS anymore. I think it sucks. I mean, I was more of a cool [REDACTED] DS, almost an ML engineer, but I want to be a SWE. I don't want to be a DS anymore.
Doctor Malamute: Okay, anyway, good luck and best of luck to you.
Hyper Taco: Yeah, thank you very much. Have a good day.
Doctor Malamute: Bye. Bye.