
An Interview with a Meta engineer

Watch someone solve the Design Instagram Reels problem in an interview with a Meta engineer and see the feedback their interviewer left them. Explore this problem and others in our library of interview replays.

Interview Summary

Problem type

Design Instagram Reels

Interview question

Design Instagram Reels

Interview Feedback

Feedback about Giant Robot (the interviewee)

Advance this person to the next round?
Thumbs down: No
How were their technical skills?
3/4
How was their problem solving ability?
2/4
What about their communication ability?
3/4
Strengths and what went well

* Demonstrated solid conceptual grasp of modern recommender-system architecture (two-tower candidate generation + heavy ranker, learnable ID embeddings, multitask heads).
* Asked clarifying questions on business goals, scale, and user/reel signals, and quickly recognised the task as a ranking problem.
* Brought up key ideas such as cold start, negative sampling for class imbalance, and implicit vs explicit feedback signals.
* Comfortable discussing alternative label definitions (binary watch-through vs regression on watch time, likes as a sparse signal).
* Showed awareness that candidate generation and ranking stages require different feature sets and latency constraints.

Areas of improvement

* Time management: ~20 min went into scoping questions and entity diagrams before choosing an ML objective; that forced a rushed treatment of feature engineering, training-data handling, serving, and evaluation.
* Feature set depth: missed simple but high-value counters/ratios (e.g. user-category watch-through rate, reel global CTR), author/social-graph features, session context, and freshness-related signals.
* Training-data design: did not articulate a strict time-based train/valid/test split, how to snapshot features at event time, or techniques to mitigate feedback-loop bias and ensure exploration data.
* Serving/ops: no discussion of feature drift monitoring, automated retraining schedule, rollback strategy, or online A/B measurement beyond "watch-time goes up".
* Decision paralysis: spent energy comparing "simple vs state-of-the-art" instead of picking a concrete baseline and iterating; confidence dipped when pressed for specifics.

Advice for future interviews

1. **Time-box the opening:** ≤5 min to state the business metric, ≤5 min to lock the ML objective and label. Declare assumptions instead of polling the interviewer for every detail.
2. **Lead with a clear end-to-end plan:**
   * Candidate generation (two-tower, recall ≈ 1k)
   * Ranking (deep MLP with user, reel, and context features, multitask heads)
   * Post-ranking exploration/novelty logic
3. **Feature checklist:** user profile, session context, interaction-history embeddings, author graph, content embeddings, freshness, global ratios/counters. Mention cold-start fallback (text/video embeddings).
4. **Data and evaluation:** explain time-based splitting, online exploration buckets, offline-to-online metrics alignment, and how to monitor and retrain.
5. **Ops signals:** feature store reuse for parity, canary rollout, automatic rollback on watch-time/CTR alerts, feature-distribution drift dashboards.
6. **Practice 45-minute dry runs:** allocate rough minutes (10 framing, 10 data/labels, 10 features + model, 10 serving/ops, 5 deep-dive/Q&A) and stick to it.
7. **Show decisiveness:** pick a reasonable baseline quickly, then layer in refinements; state why each addition matters to engagement.
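The three-stage plan in point 2 can be sketched end to end. The pipeline below is a toy illustration only: every function name, field name, weight, and the exploration fraction are assumptions for the sketch, not anything from the interview or a real Meta system.

```python
import random

def recommend(user, candidate_pool, k=10, explore_frac=0.1):
    """Toy sketch of the three-stage plan: candidate generation -> ranking -> exploration."""
    # Stage 1: candidate generation (two-tower recall, ~1k candidates).
    # Stubbed here as a dot product between user and reel embeddings.
    def retrieval_score(reel):
        return sum(u * r for u, r in zip(user["embedding"], reel["embedding"]))
    candidates = sorted(candidate_pool, key=retrieval_score, reverse=True)[:1000]

    # Stage 2: heavy ranking (a multitask model in practice; stubbed as a
    # weighted sum of precomputed per-task probabilities).
    def rank_score(reel):
        return 0.7 * reel["p_watch"] + 0.3 * reel["p_like"]
    ranked = sorted(candidates, key=rank_score, reverse=True)

    # Stage 3: post-ranking exploration -- reserve a slice of the feed for
    # reels outside the top ranks (novelty / freshness).
    n_explore = max(1, int(k * explore_frac))
    head = ranked[: k - n_explore]
    tail = ranked[k - n_explore :]
    explore = random.sample(tail, min(n_explore, len(tail))) if tail else []
    return head + explore
```

The point of the sketch is the shape, not the scoring: a cheap recall stage narrows millions of reels to roughly a thousand, an expensive ranker orders them, and a final step deliberately gives up a little ranking quality to gather exploration data.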

Feedback about Admiral Hex (the interviewer)

Would you want to work with this person?
Thumbs up: Yes
How excited would you be to work with them?
4/4
How good were the questions?
4/4
How helpful was your interviewer in guiding you to the solution(s)?
4/4
This was a challenging problem but I really enjoyed the discussions that it led to. We covered a few of the signals that interviewers are looking for, but it would be great to structure the feedback a little bit better. Also the sound quality got a little worse during the interview, maybe because you moved back. That said, overall very useful interview for preparation. Thanks!

Interview Transcript

Admiral Hex: Hello. Hello. Hey, how's it going?
Giant Robot: Good. How are you doing?
Admiral Hex: I'm good. So hi, welcome to your practice interview. We have one hour for the interview. We'll need only 45 minutes for the actual mock, and then we can go over feedback towards the end. Before we get started, I want to understand from you which company you're interviewing for and when your next interview is.
Giant Robot: Sure, sure. So it seems like something's broken with the website. Let me reload. Doesn't show the UI. So I'll be right back.
Admiral Hex: Okay.
Giant Robot: Okay, cool. Sure. So my next interview-- I'm starting the Meta, kind of E6, interview onsite tomorrow and the next day. And yeah, I have two coding rounds, two ML system design rounds, and a behavioral. So this is really just the polish. And the system design portion, or the ML system design portion, is the one that I'm actually the most worried about. So anyway, this is why I'm doing this.
Admiral Hex: All right. Cool. So whenever you're ready to get started, just hop on over to the whiteboard and then we can start. Okay, cool.
Giant Robot: Yeah, we can get started. Do you want to know anything about my background, or do you want to just roll with it?
Admiral Hex: Yeah, let's just go into the mock itself.
Giant Robot: Okay.
Admiral Hex: I don't ask a lot about the background and stuff just so that I remain sort of objective and unbiased in a way.
Giant Robot: Okay, sounds good. One preference: I did a harmful content detection one before, so maybe we could do a different problem?
Admiral Hex: Yeah, sure. Sure, we'll do that. Yeah, I wasn't planning to ask it anyway. All right, so I'll give you a high-level problem statement, and I want you to drive most of the discussion around the solution, but feel free to ask clarifying questions as required. So the system we'll be designing today is Instagram Reels. You might have used this product. It's a feed of short videos. So when you enter the app, there's a short video that's playing, and then you can swipe up to see the next one, and it's like an infinite scroll. Yeah. So over to you.
Giant Robot: So first, just so I can frame the discussion a little bit and guide it, I want to do a quick outline of the way I want to approach this problem. First, I want to understand the constraints. Then I want to see how ML can help, and maybe classify the problem. Then talk about data and feature engineering: how we'll build a training set and train the model. Then talk about serving, and the quality metrics: how do we know we're performing well? And then we can talk about any deep dives. Does that sound good?
Admiral Hex: Yep. Cool.
Giant Robot: Okay, so we have this Design Reels thing on Instagram. So let's talk a little bit about the business objective. What are we trying to optimize for? Do we want to increase engagement, or maybe show the most ads while doing this, or have the most interactions? What is the high-level business case for this?
Admiral Hex: Yeah, so what do you think?
Giant Robot: So we can optimize for views or engagement with the platform. But there are more concerns about the long-term health of an ecosystem like this. So we could optimize for quality-weighted views or user satisfaction. I think we should maybe try for quality-weighted views or interactions?
Admiral Hex: Yeah. Okay. And how would you define that metric itself?
Giant Robot: Well, one approach is to say that we can gauge the quality of a post based on its content, some heuristics on, like, duration or the number of, say, comments on it, or--
Admiral Hex: Yeah.
Giant Robot: Maybe the.
Admiral Hex: Okay, sure, but maybe we take a step back and just look at it again from first principles. So like quality-weighted views is one good metric, but if you had to choose something simpler, like what are you trying to maximize here? Okay.
Giant Robot: So yeah, if we don't want to do long term, maybe we can do that in sort of refinement. We can just maximize engagement with the Reels. So maybe the watch time, like some sort of normalized watch time or the engagement with the platform through likes and sort of completed views, something like that.
Admiral Hex: Got it. So, yeah, that works as a business objective. Yeah.
Giant Robot: Okay. Maximize engagement with the platform, as measured by maybe-- actually, impressions and watch time, I don't know, and/or interactions.
Admiral Hex: Okay, cool.
Giant Robot: So the other thing we should probably talk about is, okay, what is the scale of the system? How many reels do we get? How many users, that kind of thing. So what would be something that could guide us there, in terms of the user count, say?
Admiral Hex: So yeah, it's like a billion daily active users.
Giant Robot: And daily active users. And what is the creation rate of new Reels? Like how many new Reels are we creating?
Admiral Hex: Yeah, we can estimate that, I guess, but I would say it would be sort of two orders of magnitude lower than the DAU. Like, you know, roughly 1% of users make a reel. So that's, yeah, 10 million.
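The estimate quoted here checks out as quick arithmetic (a sketch of the numbers above; the 1% creation rate is the interviewer's rough figure):

```python
dau = 1_000_000_000              # ~1 billion daily active users (interviewer's figure)
new_reels_per_day = dau // 100   # ~1% of users post a reel per day
print(new_reels_per_day)         # 10,000,000 -- two orders of magnitude below DAU
```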
Giant Robot: Of new reels, okay, cool. And then what regions are we in? Is there any regional kind of information we should care about?
Admiral Hex: It's a global product, but obviously the highest sort of scale is in the US and Europe, but there's also significant presence in Latin America, Africa, Middle East and Asia.
Giant Robot: Okay, but also operating in other regions. Okay, and obviously-- the hypothesis is that we want to basically show users reels in some personalized way. What kind of information is available about the users or the reels? For example, do we make users say what topics or categories they're interested in? Do reels have tags? Or are we kind of learning this in an unsupervised way?
Admiral Hex: I think we don't have a lot of structured information that is input directly by the user. So we can infer all kinds of things, but from the user point of view it's very simple in terms of like they're just uploading videos and like, you know, as a consumer you're just coming in and watching the videos.
Giant Robot: Okay, so nothing structured on users upon creation. Okay, and then about the videos-- maybe initially we just have the video content. Do we allow anything like comments, or...
Admiral Hex: Yes, we do allow likes and comments and maybe even video descriptions. Okay.
Giant Robot: Likes, comments.
Admiral Hex: It was a free form, so there's not much structured input that the user can put in. Like, we don't ask them to choose a category, for example, but they can add free form text.
Giant Robot: Okay, so, okay, cool. Can I assume that we at least have location information, or some metadata? What other metadata would we actually need? Are there any interactions between users that we have, for example follows or anything like that?
Admiral Hex: Follows, we do support follows.
Giant Robot: Okay.
Admiral Hex: So you can follow the author of a reel.
Giant Robot: How do you spell it?
Admiral Hex: Follow author. Okay.
Giant Robot: Let's see. Is there anything else that we should talk about here? I mean, okay. Maybe this is enough for the initial problem; maybe I'll go back and refine this as we do our initial design. I like to think about this in terms of an iterative approach. So we can view this as having certain hypotheses and then trying to choose-- oh, actually, one thing I should ask: is there any system that already exists? Are we building on top of an existing system or data set? Or is this the first kind of MVP? What do we have access to?
Admiral Hex: So we have user data. So let's say we have a heuristic running for ranking and the whole system is running and has a lot of users, so we have lots of user activity data and now we are transitioning to an ML-based system.
Giant Robot: I see. So we have a heuristic running on the platform for ranking, with lots of users. So our goal is to maximize engagement on top of this baseline.
Admiral Hex: Right.
Giant Robot: So the first thing I was going to ask about, or talk about, is essentially what we use as the baseline. But it sounds like we already have some sort of simple heuristic. So we should maybe talk about the metrics that we've been using for this heuristic. The metrics we said we were using are sort of engagement, as measured by maybe daily impressions or daily watch time of users, maybe by region or stratified by some demographic. So model improvement means that our watch time improves, or the number of interactions with our system improves over time, maybe as measured daily, weekly, or in A/B tests, something like this. Okay. All right. So, sorry?
Admiral Hex: Makes sense, makes sense. Great.
Giant Robot: Okay, so then let's talk about-- okay, so this is fairly clearly some sort of ranking problem. We have a lot of reels that get created, so the inputs, I guess-- you have an interaction that's like Create Reel. Well, actually, inputs to the system are reel creation, and then we don't do queries, but we ask for recommendations from the system, right? And the system must do something with the created reel information, and the output should be something like a ranked list of reels. The other interaction we can have is maybe Create Reel or Interact with Post, something like this. And then during serving, when working with the system, you ask for recommendations and get a ranked list of reels. So we can talk a little bit about the data that we're operating on. The hypothesis is-- so what are the entities? We have user metadata; location is maybe the primary one, but maybe also age, gender, things like this. And we have users' interaction history.
Admiral Hex: Sorry to interrupt you, but before we jump into that, maybe we should think about what is the output of the model and just the overall structure of the training data. Like what's the ML objective? What's the label? What's the training data format?
Giant Robot: Sure, sure. So I was going to do that after the entities, but we can maybe switch the order. So, okay. If this is a ranking problem, which it sounds like it is, we need to process whatever information we have and output a ranked list of reels. There are several ways of doing this. The classic one is, for each reel and user pair, you assign a score for relevance, like the likelihood of them engaging with it, right? Essentially you can treat this as a binary classification problem: given input features and, say, history and things like this, what is the probability that the reel is relevant? And then, given a large group of reels on your platform that you could show, you can rank them based on the score and show the most relevant things first. So that's one approach. Does that make sense?
Admiral Hex: Yeah, that makes sense. I want to understand how you define relevance here.
Giant Robot: Sure. Relevance here is if a user, say, interacts with a post in one of various ways; there's explicit and implicit feedback that we can get. For example, one hypothesis is that a post is relevant if a user, when shown it, will watch it to completion, or a large fraction of it. So we could treat this as a classification problem where the positive is that the person finished watching the reel. Another hypothesis is that a post is relevant if the user follows the author of it. Or you could treat this as a regression problem: what fraction of the reel does the user watch, as a normalized time. Or maybe you can treat this as an ordinal regression problem, where you bin the watch time, like the probability that a person watches five minutes of a reel, or some relevant time bins. Each of these is viable. Is there one that you would like to focus on in particular, or would you like me to choose one? Because, I mean, realistically, we should just choose one.
Admiral Hex: Okay, so. Well.
Giant Robot: We can focus on a simple version first. In real systems, I think the trend has been to actually have multi-task objectives. So based on whatever user and session information you have, plus the reel information, you optimize multiple tasks: predicting, for example, the probability of a user watching to the end of a reel, or leaving a comment, or sharing, or doing a like. And those are kind of trained as multitask terms. But if you like, we can focus on a single one. Or I guess you'd like me to choose.
Admiral Hex: Yeah, I think multitask makes sense. But if you were to prioritize, what signal do you think would have the highest value? Because, okay, fine, it's multitask, but you mentioned classification and regression separately. So out of all the things that you said, if you were to choose-- of course it can be multitask, but even in the multitask setting there would be some task that you would train first. Yes, sure. Yeah.
Giant Robot: Well, I think watch-through is a very strong signal. Here's the reasoning. Say that reels are, I don't know, all roughly the same length; they're typically at most maybe a minute. So unlike longer-form video, we don't have a huge spread in reel duration; that's an assumption that maybe you can correct. The exact duration of watch is maybe irrelevant here. We really just want to see if a person basically scrolls to the next video, which means it's negative feedback, they don't care about this, or they watch to the end. So we can treat this as one term: it's a positive impression if they watch to the end. That's probably the one I would train first, and it's probably the simplest to train. The second term I would do is probably likes. It's a very strong explicit signal that a user likes a post, but they tend to be very rare. So one is more common and implicit and probably noisy; sometimes you just watch reels by accident. The other one is very strong and explicit but has low density. So I'd focus on the first one first and then maybe add other tasks. If so, then you have a standard classification problem: a user either watches to the end, or if they scroll past, they don't. And you maybe set some threshold, like if they watch only, say, five seconds, you treat it as a fast scroll. Something like this.
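The labeling scheme described here can be sketched in a few lines. This is a toy illustration: the event field names, the 5-second fast-scroll threshold, and the near-completion fraction are assumptions for the sketch, not values from the interview.

```python
def make_labels(events, skip_threshold_s=5.0, completion_frac=0.9):
    """Turn raw impression events into binary training labels.

    Each event is assumed to look like
        {"user_id": ..., "reel_id": ..., "watch_s": ..., "duration_s": ...}
    Watch-to-(near-)completion -> positive; fast scroll -> negative;
    the ambiguous middle ground is dropped from the training data.
    """
    rows = []
    for e in events:
        frac = e["watch_s"] / e["duration_s"]
        if frac >= completion_frac:
            rows.append((e["user_id"], e["reel_id"], 1))   # watched to the end
        elif e["watch_s"] < skip_threshold_s:
            rows.append((e["user_id"], e["reel_id"], 0))   # fast scroll
        # else: partial watch -- ambiguous, skip
    return rows
```

Note this only labels reels that actually had an impression, which matches the interviewer's later point that unseen user/reel pairs never enter the training data.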
Admiral Hex: Okay, if we go with that, now what's one row in the training data?
Giant Robot: Sorry, could you say that again?
Admiral Hex: What's one row in the training data?
Giant Robot: Oh, what's one row of the training data?
Admiral Hex: Sure.
Giant Robot: OK, so we have an ID, a user ID, right? We have their, I don't know, some metadata like age, gender, location. And importantly, we should have-- well, their history is an important predictive feature, but maybe we can leave that out for now and then make further improvements based on their view history. So do you want me to create-- okay, so this would be, I don't know, some user ID, say, 25.
Admiral Hex: We would want-- There are some user features. Yeah. So user ID, user features, what else is there in the training data? Like what's there in one row of the training data?
Giant Robot: Sure, sure. So I think there are three entities here. There's the user. There's the reel, which has the video content, a description, an author, and maybe some comments, I don't know, associated with it, something like this. And then we also have user-reel interactions, where we can have a user ID, a reel ID, a type of interaction like watch, a timestamp, and maybe extra fields. We can have other interactions with the same kind of user ID and reel ID, like a comment, and similarly a like.
Admiral Hex: Right.
Giant Robot: Cool. So since our first task so far has been watch, we can actually drop all the other rows and just encode that information in various ways. In our case, a watch is a positive label and a short watch or a skip is a negative interaction. If that's the case, then these days you would build features based on the user, and you would also process the reel in some way, and then you'd use the watch interactions as your label set. Would you like me to talk about the feature engineering, how I would encode these features?
Admiral Hex: Yeah, just a couple of things before going to feature engineering. So positive samples are where the user has interacted with the reel, in the sense that they've watched it more than whatever threshold. What are the negative samples?
Giant Robot: Scrolling to the next one or ending the session.
Admiral Hex: So for a user you have the reels which they've been shown.
Giant Robot: The reels which--
Admiral Hex: They've been shown, right? If I have not seen a reel, that user ID/reel ID pair is not in the train data.
Giant Robot: Yes, that's true. So you're saying, how would I actually gauge that? So in the training data, we can have a skip label or some-- that's fine.
Admiral Hex: What I'm saying is, for a user, you first start with all the reels that have an impression, and then you use the ones where there's a significant watch as positive labels and the ones where they're skipped as negative labels. Right? Is that a correct understanding? All right. And how do you split the data into train, test, and validation?
Giant Robot: Okay. So we have a bunch of these interactions. Right now I have an assumption that there's no learning behavior over time. Basically, well, if that's true, then we can literally do a time-agnostic split. So we have some large amount of data, and we make sure that we have a representative sample of our users in the training set based on demographic information, so you can maybe do a stratified split. And right now I'm not doing anything special with the timestamp. In principle, I know that for these kinds of systems you want to make sure that you're not peeking at user behavior by having the same user at different times-- basically in the train split and then, say, at a future time in the test split-- because you can then sort of unmask or get a dependency: you know more about that user than you would about a random user. So, said another way, I want to stratify by user ID. I want to build a train/validate/test split such that the same user is only in one of the folds. That's probably the biggest thing I'm worried about. The other thing is I want to have a nice mix of the other features so that, for example, we don't only have people from the US in the train set and people from other regions in the other set. So I would mix those.
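The user-grouped split described here can be sketched by hashing the user ID, which makes the fold assignment stable across runs. A minimal sketch, assuming interactions are `(user_id, reel_id, label)` tuples; note that the interviewer's written feedback recommends a time-based split for production systems, which this sketch does not do.

```python
import hashlib

def user_fold(user_id, val_frac=0.1, test_frac=0.1):
    """Assign all of a user's interactions to exactly one fold, so the same
    user never appears in both train and test."""
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 1000 / 1000.0
    if h < test_frac:
        return "test"
    if h < test_frac + val_frac:
        return "val"
    return "train"

def split(interactions):
    """Partition (user_id, reel_id, label) rows into train/val/test by user."""
    folds = {"train": [], "val": [], "test": []}
    for row in interactions:
        folds[user_fold(row[0])].append(row)
    return folds
```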
Admiral Hex: Does that make sense? Makes sense. All right, let's get into the feature engineering.
Giant Robot: Okay, so in the feature engineering, a lot of what we care about is essentially how you create useful numeric features for the user information, and then how you actually encode the video content and the description for the reel information. For the tabular features for the user, initially you do the standard thing: if it's a continuous feature-- maybe the age itself is not particularly important, but some binned version of age is useful. For the gender, you do some one-hot encoding. For things like location, I wouldn't-- well, for location you could do a label encoding, but you probably also want some sort of binning, like maybe you only care about regions, unless you want to show very specific content. For the reels, as a first pass, there are a few things you can do for the video content. For example, you can use a video embedding model where you directly process the video content. Alternatively, you can ignore the video content itself: you just have a video ID, and you learn an embedding for that video directly. As a first pass, I would probably do that rather than trying to process the video content. The downside is you have a cold-start problem if you don't know anything about the video itself. But since we also have description information, maybe we can infer the content from the description.
Admiral Hex: Right.
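The tabular encoding just described (binned age, one-hot gender, one-hot region) can be sketched like this. The bin edges, category lists, and field names are illustrative assumptions, not values from the interview.

```python
AGE_BINS = [(0, 17), (18, 24), (25, 34), (35, 54), (55, 120)]       # illustrative bins
GENDERS = ["female", "male", "other"]
REGIONS = ["us", "europe", "latam", "africa", "mideast", "asia"]

def encode_user(user):
    """One row of tabular user features: binned age + one-hot gender + one-hot region."""
    age_vec = [1.0 if lo <= user["age"] <= hi else 0.0 for lo, hi in AGE_BINS]
    gender_vec = [1.0 if user["gender"] == g else 0.0 for g in GENDERS]
    region_vec = [1.0 if user["region"] == r else 0.0 for r in REGIONS]
    return age_vec + gender_vec + region_vec
```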
Giant Robot: So maybe as a first pass, I would just treat the video content as an ID, and then for the description I would have some text embedding model. We can start with something as simple as FastText or TF-IDF, but these days maybe you would just use a pre-trained transformer model to create an embedding for it. You can use something like BERT, or if you wanted to use a decoder model, you can pull the embeddings from it. Right now we're not really using the author feature, so I think maybe I want to skip the author feature for now. We could actually use it for simple heuristics, to recommend reels to users based on their follow graph. OK. Does the feature engineering make sense, or would you like me to expand on something?
Admiral Hex: Yeah, it makes sense. Is there anything you want to add to the feature engineering? Because you have only the user profile and reel embeddings, if I understand correctly. Yeah.
Giant Robot: Well, for now, okay. It's a question of whether we want to build a very simple model or a more elaborate one. The standard way of building recommendation systems like this, at least a decade ago, was collaborative filtering, where you would use maybe some feature information and the reel information, and then you would look for the impressions you cared about, and based on that you could find similar reels; you wouldn't use the other features. Maybe that's a good approach for an initial system. But these days, what you would probably do is use the video embedding that you learn, concatenate the description information, and then learn a similarity embedding between the user features and the reel features. Based on that, you can predict the watch interaction, because users that interact with reels will tend to watch them, things like this. But generally, for latency reasons, you would build this as an initial-stage system for candidate generation. So you would pre-train this candidate generation system, it would generate a large number of reels that you could propose, and then you would go on to a ranking or refinement phase afterwards, where, in addition to this user information, you also have things like watch history-- things they have watched before or recently. So you could gauge what their current interest is and dynamically personalize their recommendations. Similarly, if you wanted to, you could expand the video content understanding beyond the standard embedding: you could, for example, sample certain frames and try to understand the content of the video in a way that goes beyond just an embedding ID, and then build the ranking based on that. Is that what you had in mind, or would you like me to add extra things?
Admiral Hex: No, makes sense. Let's get into the next part which is the actual ranking system itself and you've sort of explained it a bit. So candidate selection ranking, what model architectures would you choose for each stage?
Giant Robot: Sure.
Admiral Hex: I understood the difference in the features. But yeah, let's talk about the model architecture.
Giant Robot: In the candidate generation stage, generally these days you can start with a simple model where you have basically some concatenated user features, right? And you have your reel features. You do some maybe relatively shallow neural network for processing the features, and then you have a similarity embedding, where you treat the reel features either directly as the video ID or you also process them into a separate embedding. And then you do a dot-product similarity, so like e_reel dot e_user, right, and then you find similar items like this. That's if you do the contrastive learning case, right? So the other one we talked about is more for the ranking.
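The dot-product retrieval just described can be sketched in a few lines, assuming the user and reel embeddings have already been produced by their respective towers. A toy sketch: real systems replace the full scan with an approximate nearest-neighbour index.

```python
def dot(u, v):
    """Dot-product similarity between a user embedding and a reel embedding."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_emb, reel_embs, k=3):
    """Candidate generation: score every reel by dot(e_user, e_reel), keep top-k.
    reel_embs is assumed to map reel_id -> embedding vector."""
    scored = sorted(reel_embs.items(), key=lambda kv: dot(user_emb, kv[1]), reverse=True)
    return [reel_id for reel_id, _ in scored[:k]]
```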
Admiral Hex: So.
Giant Robot: Before I go on to the ranking step: at the candidate generation stage you have users and the reels they interact with. So for each user you have some positive and negative examples, as determined by the user-reel interactions. Generally users have skipped a lot more reels than they've watched, so you need to handle the imbalance between positive and negative interactions, and you can do that with negative sampling: you downsample and reweight the positive and negative examples so that the model trains stably. That gives you the ability to look up relevant reels for a given user based on their features at the initial stage. You can also use other heuristics to suggest things like reels that are popular within a certain region, or reels you want to surface for an advertisement.

For the ranking step, we specifically want to predict how likely the user is to have a long watch versus a skip. Here we can use richer user features: in addition to the basic features, we have their history, say embeddings of reels they've watched in the session, and then we have the proposed reel. Unlike the two-tower architecture, we now have some shared layers, or in the limiting case we could use a very simple model: some set of neural network layers that is shared, that takes the user features and the proposed reel for each element of the list we want to rank, coming from the candidate stage. We had previously talked about multitask; for now, let's do a single task. For a single task, we can predict the probability of a positive interaction, so we treat this as a classification task and train it with a binary cross-entropy loss. When we add more tasks, we can have a task-specific head for, say, probability of a like or a comment, and weight these tasks in the objective. The overall score of a reel for a given user, based on their session, would then be a weighted combination of these predicted interaction probabilities.
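A minimal sketch of the downsample-and-reweight idea described here. All names and the interaction tuple layout are illustrative, not from the interview; kept negatives carry a compensating weight so the expected loss is unchanged.

```python
import random

def downsample_negatives(interactions, neg_keep_prob=0.1, seed=7):
    """Downsample skip (negative) events and attach a compensating weight.

    interactions: list of (user_id, reel_id, label), label 1 = watched, 0 = skipped.
    Kept negatives get weight 1/neg_keep_prob so the expected loss contribution
    of the negative class is preserved after sampling.
    """
    rng = random.Random(seed)
    sampled = []
    for user_id, reel_id, label in interactions:
        if label == 1:
            # keep every positive example with unit weight
            sampled.append((user_id, reel_id, label, 1.0))
        elif rng.random() < neg_keep_prob:
            # keep a fraction of negatives, upweighted to compensate
            sampled.append((user_id, reel_id, label, 1.0 / neg_keep_prob))
    return sampled
```

In a real pipeline the weight column would feed into a weighted binary cross-entropy loss; here it is just carried alongside each example.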
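The weighted combination of per-task probabilities into one ranking score, as described above, can be sketched like this. Task names and weight values are made up for illustration; in practice the weights are tuned against business metrics.

```python
def reel_score(probs, weights):
    """Combine per-task predicted probabilities into a single ranking score.

    probs: dict of task name -> predicted probability (e.g. watch, like, comment).
    weights: dict of task name -> business-tuned weight; missing tasks contribute 0.
    """
    return sum(weights.get(task, 0.0) * p for task, p in probs.items())

score = reel_score(
    {"watch": 0.8, "like": 0.1, "comment": 0.05},
    {"watch": 1.0, "like": 2.0, "comment": 4.0},
)
```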
Admiral Hex: Cool, I think we're out of time, so we can end here. All right. Before I share my feedback, I want to hear from you how this interview went: what did you do well, and what are the areas for improvement?
Giant Robot: I think maybe I did the initial question gathering in a more interactive way than I have before, but I really struggled with time management and with settling on a particular choice. I didn't get through most of the things I wanted to talk about: I explored a lot of options, but I didn't get down into specifics. I think that's the biggest thing. And I couldn't decide whether to do a simple version at the top or just dive into something more complex, like what I would actually probably build in terms of the state of the art.
Admiral Hex: Right, right. Yeah, those are all very good observations. So overall, the subjective sense I get, from having done a lot of these interviews and from working as an ML engineer at a staff level, is that you definitely know your stuff, right? But based on how this interview went, I couldn't recommend a hire for you, to be honest. I didn't get enough signal in this interview for E6, and even E5 is honestly borderline. There are a couple of reasons for that. The biggest is time management, and within that, the amount of time you spent at the beginning. Just to give you a sense: we started maybe five minutes into the hour, so 20 minutes in, which is 15 minutes out of the 45 that you get, you were still at this point here. And this high-level diagram was not required; for a second I felt like you were going into traditional system design, talking about entities and all these things. You mentioned binary classification roughly 20 minutes in, and even then there was a big discussion on binary versus regression on percentage watched, and multitask. That was a decent discussion, but you should be having it maybe five to ten minutes into the interview. So you need to go through things a lot faster: understand the problem really quickly, understand the business objectives really quickly. And it doesn't have to be super interactive; you don't need to ask everything from me. You can make some assumptions on your own, and if you make a wrong assumption, I'll correct you.
Giant Robot: Okay.
Admiral Hex: And then that whole part, the business objectives, should be over in less than five minutes, and the ML objective in another five. So ten minutes in, you're at the point where you're saying: I'm going to model this as binary classification, predicting a significant watch versus no watch, and use a secondary signal in a multitask approach. Getting to that point should take no more than ten minutes, whereas in your actual interview it took you almost 30. And then feature engineering was light; we had to rush through all of that. I think your overall approach, multitask learning and multi-stage ranking with candidate selection using a two-tower model and a heavy ranker, with learnable embeddings on video ID, makes sense. Those are all good ideas, and you should mention them. One more thing: you said you were torn between the simple and the state-of-the-art approach. Always go for the state of the art, especially nowadays. Don't shy away from the complexity or the depth; that's the whole challenge of this interview. Think about it: a staff-level ML engineer position at Meta is pretty rare, but if you look at the actual loop, the coding interviews are the same regardless of level, and behavioral is just asking about your past experience. So it's these two system design interviews that actually get you the job as a staff engineer and as an ML engineer. How do they judge whether you know your ML stuff, and know it at a staff level? Just this interview. So it's extremely critical, and it's also very difficult. The problem itself is high-level and tough, and in 45 minutes the interviewer has to come away impressed that you were able to do so many things in just 45 minutes.
Giant Robot: I see.
Admiral Hex: So by that criterion, you're still missing a lot. And let's say you did what your original intention was and had enough time to complete the design. Even taking the time management aspect out of it, say you did proper feature engineering, multi-stage ranking, learnable embeddings, everything: nowadays that's the E5 bar. What's different for E6? Things like the training data. The time-based split: yes, you mentioned it, but the way you said it didn't give me confidence that you understand what's happening there; you were vague and not confident about your answer on splitting the training data by time. And there was no exploration, no freshness. Remember, the user only sees the reels that we give them. It's not like a normal feed or YouTube, where you can search and browse categories; there's usually a lot of emphasis on browse, so users can go explore videos and watch them, and you get a handle on their preferences. In this sort of system, the user only sees what the system serves. So there's a very high possibility of bias: your training data is the output of the model from a day ago. If the model starts missing something, that bias propagates, and you end up with, say, certain topics being underexplored. So what's the bar? The bar is that you complete your design, a reasonable design with a good amount of depth, covering all the sections you listed. You only got to about this point here.
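The strict time-based split being discussed can be sketched minimally as follows. The event tuple layout and boundary values are assumed for illustration; the point is that train, validation, and test are separated purely by timestamp, never by random shuffling, so future interactions cannot leak into training.

```python
def time_based_split(events, train_end, valid_end):
    """Split interaction events strictly by timestamp.

    events: list of (timestamp, ...) tuples.
    train:  timestamp <  train_end
    valid:  train_end <= timestamp < valid_end
    test:   timestamp >= valid_end
    """
    train = [e for e in events if e[0] < train_end]
    valid = [e for e in events if train_end <= e[0] < valid_end]
    test = [e for e in events if e[0] >= valid_end]
    return train, valid, test
```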
Giant Robot: Yes.
Admiral Hex: You'd cover serving, quality, deep dives, everything, in 45 minutes. And then you're also able to talk intelligently about some of these considerations: exploration, splitting training data into train and test, feature drift. For example, how do you actually get the feature values in the training data? Do you just recompute them? If you recompute them fresh every time, you're enabling data leakage. Say you have a user watch history feature: across multiple days of data, if you pick a day, the feature value at that point should be the watch history up to that point, right?
Giant Robot: Yes.
Admiral Hex: So if you just have a feature store with the latest user watch history and you use that feature everywhere in your training data, that won't work. But that's an important detail which gets missed.
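The point-in-time lookup being described, taking the watch history as of each training event rather than the latest snapshot, can be sketched like this. The history layout is an assumption for illustration; in production this would be an as-of join against a feature store with historical versions.

```python
import bisect

def watch_history_asof(history, event_ts):
    """Return the user's watch history as of event_ts (exclusive).

    history: list of (timestamp, reel_id), sorted by timestamp.
    Using the *latest* history for every training row would leak future
    watches into past examples; slicing at the event time avoids that.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect.bisect_left(timestamps, event_ts)
    return [reel for _, reel in history[:idx]]
```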
Giant Robot: Yeah. So I had done one of these before, and part of the reason I was hesitant about the timestamp discussion was that I wasn't sure I could afford the time to explain it in detail; but that just speaks to more time management problems. In a previous interview, I was told that one of the marks of an E6 engineer may be a bias towards simpler solutions that solve the problem initially. But maybe this is just a difference in strategy: you can either be impressive, particularly if you've worked on these systems before and know the state of the art, or, I guess, keep it simple.
Admiral Hex: Yeah, I mean, that's part of the journey. I understand your confusion; I also understand the previous feedback and my feedback, so let me try to put it in context. What do you mean by simple? For example, the idea of watch history, where you take the embeddings of all the previously watched videos and average them: that's a simple idea, not very complex, to be honest. But it shows an understanding of the underlying concepts, a deep conceptual understanding; someone who doesn't understand how embeddings work wouldn't come up with it. That's the notion of simplicity. Simplicity doesn't mean old; it just means you're not overcomplicating things. A good example of where you showed simplicity is your whole approach to the label towards the end, where you said: after considering everything, I just want to know whether they swiped away or watched for a significant time, and that's a binary label; then I have likes, which are a stronger but sparser signal, so I combine both, and that solves 90% of the problem. That's the simple approach. Now, the multitask approach with learnable parameters is of course way more complex, but that's where you realize the simple approach gets you most of the way and leads to a productive discussion. Just don't confuse simplicity with missing important details. You can choose the level of complexity of the solution you want, but when you describe a solution, you need to know all its details. And the kinds of problems I described are mostly about data; 90% of machine learning is just dealing with the data. There you might expect some higher-level conceptual thinking, like the points about the split and so on.

So what was the thing with the timestamp that you weren't sure of, that you felt could take too much time to explain?
Giant Robot: Yeah, so I was waffling, because initially, I know that the timestamp issue, feature drift, especially if you're releasing multiple models and there are feedback issues in these systems, is an important thing to watch out for. In fact, back around 2017, when they talked about the YouTube recommendation algorithm, they mentioned that label leakage due to mismatched timestamps was a big problem they had to solve. But in my initial statement I wanted to just stratify based on age, gender, and location, because I would expect that if you're only good in one region and yet you need to deploy across multiple regions, regional information is important at this scale.
Admiral Hex: But the feedback loop is also important. See, that's part of the seniority signal: you don't add complexity for complexity's sake, but you know where it's important to add it. The reason feedback loops are so important here is the context I gave you: this is not YouTube. YouTube has categories and lots of surfaces, and the interface is built in a way where you can go and explore.
Giant Robot: Yes.
Admiral Hex: So you do get enough signal about the user, and you can assume that whatever videos the user has seen is a more or less unbiased sample. Of course some things are amplified more than others; that always happens. But at least more so than with Reels. With Reels, you literally see one video and then you swipe, so you have more or less no control over choosing your own path in terms of what you want to see. It's an extreme on the other end. So feedback loops become way more important, and it's important to solve for them.
Giant Robot: Yeah, for sure.
Admiral Hex: Okay. So drawing this link and recognizing this problem is part of the seniority signal, I would say.
Giant Robot: I see. So, for example, at the start, I may have focused on too much detail and talked about too many options. I should instead say: okay, cool, we want to maximize engagement, impressions of reels, interactions. We talked a little about scale; the point is that this is multi-regional, so maybe we have to build a model per region or something. But the important aspect of the interface is that you intrinsically have a feedback loop when you show only reels, and the only signal you get is whether they watch, like, or scroll past. Therefore your model changes the data that you collect. How do you deal with that?
Admiral Hex: That is a very important consideration in this sort of system. And contrast that with the quality thing you mentioned, the long-term aspect. That one I would put in the category of added complexity, in the sense that yes, it's an important consideration, and you want to evolve your system to reach it one day, but it may not make a 5 to 10% difference in accuracy; maybe it makes a 1 to 2% difference over the long run. Imagine you have a really good model; once you do, these things start to matter. You can talk about such considerations towards the end, once you have the design working.
Giant Robot: Okay, cool. So in short, breeze through the design. Maybe part of this is that I've never seen this Reels design before; I'm relatively new to it.
Admiral Hex: It's not an easy problem, to be honest. And I'm not even sure whether Meta actually asks this exact problem in the wild that much; they definitely ask a News Feed version of it. But I know from having worked there that this system was very different in its design when we actually built it, and I think Zuck spoke about that as well. It took them a while: they first tried to take their News Feed models and adapt them to Reels, and that didn't work. The lessons learned from that were exactly these feedback loops, plus the difference between friend-network-based recommendation and pure interest-based recommendation. Facebook is all based on your friends and your network: what did my friend like? I might like the same thing. Whereas Reels is: I just like X, so show me X out of the entire universe of videos, with no notion of friends at all.
Giant Robot: Yeah, yeah. So I guess for next time: time management, going through more of these case studies, and not shying away from complexity when it's warranted, but trying to nail down the nuances of the problem very early and then solving them in a minimal way. Is there anything I should have particularly emphasized in the training data section? And are there any other features you would have liked to see; did I capture the majority?
Admiral Hex: You had the reel-ID-based learnable embeddings, but watch history can be built from learnable embeddings too: once you have learned embeddings of videos, you can also have learned embeddings of users. And then you can have all kinds of cross features: for a user, their watch history; for a video, the list of users who liked it, as a similar kind of embedding. You only had video-ID learned embeddings, but if you take that concept of ID-based embeddings, there's a whole universe of features you can build on it.
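The watch-history-as-averaged-embeddings idea from earlier in the discussion, deriving a user vector from the learned video embeddings, can be sketched like this. Embedding storage and dimensionality are assumptions for illustration.

```python
def user_embedding(watched_reel_ids, reel_embeddings, dim=4):
    """Derive a user embedding by averaging the embeddings of watched reels.

    reel_embeddings: dict of reel_id -> list[float] (learned ID embeddings).
    Returns a zero vector for cold-start users with no usable history.
    """
    vecs = [reel_embeddings[r] for r in watched_reel_ids if r in reel_embeddings]
    if not vecs:
        return [0.0] * dim
    # element-wise mean over the watched reels' embeddings
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]
```

The same pattern gives the cross features mentioned here, e.g. a video vector averaged over the users who liked it.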
Giant Robot: For sure.
Admiral Hex: And the whole category of features that you missed was simple ratios and counters. I know you asked early on whether there are hashtags and categories, and I said no, but even if there are no categories, you can always assign them: label them manually, or cluster and derive categories. And then you can ask: in the same category as this reel, how many watches does this user have? What's their watch-through ratio? Just numbers. That whole class of simple numerical features, ratios and counters, was missing.
Giant Robot: So for you, what would be more important: diving deeper into the feature stuff? I'm just trying to get a sense of strategy as I go through the design.
Admiral Hex: Training data is an important signal for leveling and conceptual understanding, but feature engineering is required for completeness. If your feature engineering section is too light, you might just get rejected; whereas if your training data section is too simple, in the sense that you don't handle the time-based split or the feedback loop, or don't talk about those considerations, you might get down-leveled. So a basic level of feature engineering has to be there, and I think you were almost there; the only glaring miss was the ratios and counters. There were no ratios, no counters at all. Beyond that, once you have the basic level of feature engineering, ticking most of the boxes or themes, having a very exhaustive list is not required. It's more about showing which techniques you're familiar with in feature engineering.
Giant Robot: I see.
Admiral Hex: Okay.
Giant Robot: And then, yeah, we didn't even get to serving or gauging quality. I want to avoid basically generating a system design for serving, even though I have some familiarity with that. Are there any leveling criteria for those sections?
Admiral Hex: I think serving and all that is fine; you can go with generic answers. But typically, and I don't know whether to bucketize this as serving or evaluation, it's more about the practical aspects. At the end of all this, you'd basically be at the level where you have a model in a notebook, or in some interface where you have a bunch of data and you've trained a model. What's the step from there to actually having a working system running in production? Understanding some of the nuances of what happens when you productionize. There are nuances around serving, and nuances around training, data management, and feature management. In particular, I would focus on the data-related aspects, like: how do you prevent feature drift between training and inference?
Giant Robot: Yeah, so here's an example of this; I talked to a friend about it, though I'm not sure how to bring it up in the context of the whole interview. Very often you use different training systems versus serving systems, and to avoid mismatches in how the data is processed, literally the features you use between training and production, you can reuse the same features: your training data literally comes from the production system.
Admiral Hex: And you ingest that. Yep, that's a good example; that's one important detail. The other thing is: okay, what if something goes wrong? What's your process for going from "our watch time has declined" to "there's a problem in the model" or "there's a bug somewhere"? What process do you follow? That's the other practical aspect.
Giant Robot: Okay. And what kind of things could we do? Is this like basically tracking?
Admiral Hex: Sorry? Mostly around data drift, like, you know.
Giant Robot: Monitoring features and things like this.
Admiral Hex: Yeah. And the thing you mentioned is also relevant here, but there are other things as well: user interests changed, some popular event came up, things like that. Just: what can go wrong, and how do you get to the point of knowing what went wrong? In practice it's data drift: looking at changes in feature distributions, adding monitoring on the feature values, and so on. Cool.
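One common way to monitor a feature distribution for drift is the Population Stability Index; a minimal sketch follows. The binning, range, and the usual rule-of-thumb thresholds (under 0.1 stable, 0.1 to 0.25 worth watching, above 0.25 investigate) are conventional but illustrative, not from the interview.

```python
import math

def population_stability_index(expected, actual, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """PSI between a baseline sample and a current sample of one feature.

    Both samples are histogrammed into the same fixed bins over [lo, hi];
    eps avoids log/division problems for empty bins.
    """
    def hist(xs):
        counts = [0] * bins
        width = (hi - lo) / bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        total = max(len(xs), 1)
        return [(c / total) + eps for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A monitoring job would compute this per feature between the training-time distribution and the live serving distribution, and alert above a chosen threshold.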
Giant Robot: Yeah, so maybe the deep dives become the real-world-problems section, where we can talk about how to troubleshoot certain things.
Admiral Hex: All the practical aspects of going from a model in a notebook to a production ranking system, without going too much into system design as such, in the sense of distributed systems. Don't go too deep into serving rates and all those aspects; talk more about the data-related and ML-related aspects, like feature drift. It's mostly features, right?
Giant Robot: One thing I'm not sure about in a practical setting: if you're retraining a model, say, every day or every week, how do you actually roll out the new model to users? And how do you detect problems and roll back? Do you do A/B deployment?
Admiral Hex: It's a combination of A/B testing and monitoring. You obviously A/B test before you launch, and then you also run backtests after you launch, because of the feedback loop. You have monitoring on all the features and so on. So yeah, those kind of cover that aspect.
Giant Robot: Is it typical to actually backtest old models, or, for example, periodically just serve random videos, or have popular videos bypass the ranking model?
Admiral Hex: Holdouts, and those are usually used for measurement. At Meta scale you won't have a proper popularity holdout, but in lower-scale systems it's pretty common. In the one I work with, we have a popularity holdout, just to benchmark the gains we're getting from the ranking system.
Giant Robot: Okay. So you want some way of breaking the feedback loop, or measuring its effect on the models, by bypassing it somehow, or doing random ranking or something.
Admiral Hex: Typically it's handled in the training data itself: you inject some amount of random user-reel pairs just for exploration. And in practical settings you can also combine exploration with exploitation using something like bandits. Those are more advanced techniques. Well, I think we're out of time as well.
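A crude epsilon-greedy version of the exploration injection mentioned here, replacing a random fraction of ranked slots with uniform-random candidates, could look like this. Function names and the fixed epsilon are illustrative; a bandit policy would choose exploration slots adaptively instead.

```python
import random

def mix_in_exploration(ranked_reels, candidate_pool, epsilon=0.1, seed=7):
    """Replace roughly an epsilon-fraction of ranked slots with random candidates.

    The randomly served reels yield unbiased interaction data, which later
    training runs can use to counter the model's feedback loop.
    """
    rng = random.Random(seed)
    out = list(ranked_reels)
    # candidates not already in the ranked list
    pool = [r for r in candidate_pool if r not in set(ranked_reels)]
    for i in range(len(out)):
        if pool and rng.random() < epsilon:
            out[i] = pool.pop(rng.randrange(len(pool)))
    return out
```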
Giant Robot: Thanks a lot. If you could maybe highlight the most important feedback, that'd be great; I don't have very much time to prepare, so I'll do my best, obviously. Otherwise, this was very useful.
Admiral Hex: Thanks. Best of luck with the interview and have a nice day.
Giant Robot: Yeah, take care. Bye.
Admiral Hex: Bye.
