An Interview with a FAANG engineer

Watch someone solve the harmful content removal problem in an interview with a FAANG engineer and see the feedback their interviewer left them. Explore this problem and others in our library of interview replays.

Interview Summary

Problem type

Harmful Content Removal

Interview question

Design a system for a site like Reddit that can handle harmful content removal.

Interview Feedback

Feedback about Golden Pheasant (the interviewee)

Advance this person to the next round?
Thumbs up
How were their technical skills?
4/4
How was their problem solving ability?
3/4
What about their communication ability?
4/4
Excellent interview overall. Clearly has a strong grip on the technical knowledge and only needs minor process refinement:

- Excellent communication: clear and easy to follow.
- Excellent line of questioning with respect to background processes as well. [Be mindful of this, as you need to denote which data you are thinking of; this will be key to defining scale down the line.]
- With respect to the data input, think deeply with the goal of standardizing inputs. I tend to approach this with the aim of getting data ready for your model down the line. This could entail conversations on video frame sampling and text embedding/vector generation.
- Excellent work thinking through the preprocessing. The only caveat is that I had to push you in this direction; in an interview, you want to automatically take charge and drive the conversation there yourself.
- Multi-word tokens (vs. single-word tokens) tend to reduce the number of tokens, which can help with performance.
- Preprocessing is always going to be the starting point for most ML system design interviews. You showed excellent knowledge of this bit, although I had to push you to get the conversation going. Make it a habit to break down the flow as follows: Requirements → Preprocessing → API Defs → Fun Model Stuff [Initial Training/Deployment] → Monitoring → Maintenance [Handling Data and Model Drift]
  - Requirements: Identify the business problem to be solved, the stakeholders involved, and the constraints and limitations of the project. For example, if we are building a sentiment analysis model for a social media platform, we need to identify which social media platform, which languages the model needs to support, and what kind of accuracy and latency requirements we have.
  - Preprocessing: Clean and prepare the data for the model. This may involve tasks such as removing irrelevant data, handling missing values, and transforming the data into a suitable format for the model. For example, if we are building a text classification model, we may need to preprocess the text by removing stopwords, stemming, and converting the text into a numerical format such as word embeddings.
  - API Defs: Define the input and output interfaces of the model. This may involve defining the format of the input data, the expected output format, and any constraints on the input data such as size or type. For example, if we are building a speech recognition model, we may define the input interface as a raw audio signal and the output interface as a text transcription.
  - Fun Model Stuff: Develop and train the model. This may involve selecting an appropriate algorithm or architecture for the task, tuning hyperparameters, and training the model on the preprocessed data. For example, if we are building an image classification model, we may select a convolutional neural network architecture and train it on a dataset of preprocessed images.
  - Monitoring: Monitor the performance of the model in production to ensure that it is meeting the desired accuracy and latency requirements. This may involve setting up monitoring tools and logging systems to track the performance of the model. For example, if we are building a recommendation system, we may monitor the click-through rate of the recommended items.
  - Maintenance: Handle data and model drift over time to ensure that the model remains accurate and relevant. This may involve updating the model with new data, retraining the model periodically, and monitoring the model's performance over time. For example, if we are building a fraud detection model, we may need to update the model with new fraud patterns and retrain it periodically to ensure that it is detecting the latest fraud techniques.
- Excellent work with the expected output for the results. This showcases excellent multi-label output intuition. You didn't need any hints with this, and I absolutely loved how you presented the results in JSON format. Really brilliant job here!
- Absolutely brilliant job with the monitoring stage. You provided strong metrics to monitor and went in-depth on why each would be useful.
- For some basic metrics like precision and recall, I totally get how confusing these can be. Frankly, I usually generate a confusion matrix since it is easier to intuit and speak to (see the short sketch after this list). Once you have their definitions nailed 100%, make it a point to speak of them without defining them, to showcase experience with them.

Resources:

Activation functions:

- **Sigmoid**: use when you want to output probabilities.
- **ReLU**: use when you want a simple and efficient function that doesn't saturate.
- **Tanh**: use when you want a function that is zero-centered and outputs values between -1 and 1.
- **Leaky ReLU**: similar to ReLU but allows for a small, non-zero output for negative inputs. Useful for preventing "dead neurons" in deep networks.
- **Softmax**: used in multi-class classification problems to convert a vector of arbitrary real values into a probability distribution.
- **ELU** (Exponential Linear Unit): similar to ReLU but with negative values, which can lead to faster learning.
- **Swish**: a newer activation function that has been shown to outperform ReLU in certain contexts.

Loss functions are used to evaluate how well a neural network is performing on a given task. Here are a few examples of loss functions:

- **Mean Squared Error (MSE)**: used for regression problems where the output is a continuous variable. Calculates the average squared difference between the predicted and actual values.
- **Binary Cross-Entropy**: used for binary classification problems where the output is a probability of the output belonging to one of two classes. Measures the difference between predicted and actual probability distributions.
- **Categorical Cross-Entropy**: used for multi-class classification problems where the output is a probability of the output belonging to one of several classes. Measures the difference between predicted and actual probability distributions.
- **Hinge Loss**: used for support vector machine (SVM) models. Penalizes the model for incorrect predictions and aims to maximize the margin between classes.
- **Kullback-Leibler Divergence (KL Divergence)**: measures the difference between two probability distributions. Used in generative models.
- **Cosine Embedding Loss**: used for learning embeddings of text or other types of data. Measures the cosine similarity between two vectors.
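As a quick reference for the precision/recall point above, here is a minimal, self-contained sketch of deriving both metrics from confusion-matrix counts. The label encoding and the example arrays are hypothetical, not taken from the interview.

```python
# Illustrative only: per-label precision/recall derived from 2x2 confusion-matrix
# counts, using made-up predictions for a single label (e.g. "nudity").

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = actually harmful
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # 1 = model flagged as harmful

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of everything we flagged, how much was truly harmful
recall = tp / (tp + fn)      # of everything truly harmful, how much did we flag
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Laying the four counts out this way makes the precision/recall trade-off easy to talk through without having to recite the formal definitions.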

Feedback about The Legendary Avenger (the interviewer)

Would you want to work with this person?
Thumbs up
How excited would you be to work with them?
4/4
How good were the questions?
4/4
How helpful was your interviewer in guiding you to the solution(s)?
4/4
I think my interviewer did a really great job actually. He did a good job of prompting me when needed, letting me sit back and talk, and helping me stay on track when necessary as well.

Interview Transcript

The Legendary Avenger: Hey. Can you hear me?
Golden Pheasant: Yes. Can you hear me?
The Legendary Avenger: Yes. Loud and Clear. Okay, cool. So just to get it started, maybe give me a quick run through of what you're prepping for as well as what you're looking to get out of this. That way I can make sure we target that exactly and get the most value for you out of it.
Golden Pheasant: Yes. So I'm interviewing for some senior machine learning ML ops positions, and I will be getting some machine learning system design questions along the way. And that's kind of the angle, I think. Now, I'm not sure the company I'm actually interviewing for does a lot of payments stuff. I'm not sure if they'll give me one that's focused on that or if they'll just give me, like, a random machine learning ops question. So I'm not sure.
The Legendary Avenger: I see. Okay, so lots of infrastructure monitoring and something along the lines and given that it's a payment, probably along the lines of anomaly detection in this case, right? Could be.
Golden Pheasant: I don't know.
The Legendary Avenger: I'm just saying yeah, it's like a best bet kind of situation because what I usually try to do is given the company or the target company, I try and think of problems that have a high probability of showing up. Who knows? If you practice with that and it shows up, that's a good thing, right?
Golden Pheasant: Yeah. So the company, Upstart, they do lending. I don't know if it's as much anomaly. I mean, it could be anomaly detection, but yeah, I'm not sure. I hear you. Okay.
The Legendary Avenger: Makes sense. I mean, at the end of the day, fortunately, the infrastructure roughly stays the same because it's usually if you have a model handy, then it's more or less about scaling up and making it available. So the process stays roughly the same regardless of the application. Although, contextually, it can be different, especially when thinking about requirements. And I noticed that this is a mentorship session, so have you done any such interview before? And if so, what's your comfort level with machine learning interviews overall?
Golden Pheasant: So I've done plenty of normal SWE interviews. I am not as experienced in system design stuff in general, technically. It's kind of weird. The way it worked was basically like, I wanted a staff engineer who did machine learning, and we couldn't find a match, so they said, we'll throw in a free machine learning infra interview for you. So I was like, okay, cool. So kind of how this is marked as a mentorship session.
The Legendary Avenger: I see.
Golden Pheasant: Yeah. Okay. That makes sense. Cool. And final question, at least, before we jump in here. And have you done any prep along the lines?
The Legendary Avenger: A little bit of prep with Alex Xu's book.
Golden Pheasant: Oh, nice.
The Legendary Avenger: Yeah, I've read, like, the first four chapters or so. Perfect.
Golden Pheasant: Yeah. If you've done that, at the very least, then you're in a good place. And TLDR just to set expectations. The process is roughly the same as traditional system design. The only difference will be maybe less of the estimates, a bit of extra time on the APIs. And at the end of the day, in terms of monitoring, it's not the traditional stats like P90 and whatever it is, it will focus on ML specific metrics depending on the model you choose. So it's a fairly straightforward process. Okay, I have a proposal, at least for this interview, in order to help make sure that you get that practice. So I can give a question to you, a prompt to you, as though it's a normal question and you'll go through it. And now, unlike traditional mocks, where I'll reserve feedback for the end, if there are any glaring weaknesses, I can kind of stop you in the moment, suggest a way we can improve and then try again. And then if there's anything that's good enough, I can reserve feedback towards the end. So minor things, reserve feedback for the end; major things, kind of stop, we iterate and then think of better ways to do that. That way we target the big weaknesses, if any. And then for the minor ones, just give resources or like stuff you can practice on async. What do you think?
The Legendary Avenger: Yeah, sounds good to me.
Golden Pheasant: Perfect. Okay. And so, building on the same problem, so what you're going to try and do here is essentially something more traditional. We're going to again target then, okay, what is happening with this thing? Are you able to see what's going on on the screen?
The Legendary Avenger: Yeah.
Golden Pheasant: Okay. I'll reload so that the bug goes off. I don't know what's going on with.
The Legendary Avenger: Oh, okay.
Golden Pheasant: You're able to see my screen, right? You see me type?
The Legendary Avenger: Sorry. Yeah, it was some weird infinite loop thing. Okay.
Golden Pheasant: We're going to focus on an anomaly detection problem. And in this regard, we are going to focus on actually, let me even twist it up a bit, so let me generalize. Even instead of anomaly detection, let's think of harmful content removal. So it's somewhat anomaly detection, but I think it gives us a better, more generalized framework. And we're not going to target a payments company yet. So think in terms of Reddit. And the beauty of it is it gives us multiple media types. So I'll give you an initialization here. Imagine we have Reddit. So we have the subreddits, we have the reactions, and a couple of things to get you started. We have the ability to report posts. We have the ability to, let's say, react posts. And so we also support the following types of posts. So images, short videos, but mostly text. Okay, so this is like our watered down version of reddit. So given that, what I would like you to do is think of how you will build out a system to essentially handle harmful content removal and by harmful content and feel free to add to this. This is what we mean. Content related to abuse, content related to nudity, content related to what else? Violence, misinformation. This one is especially tricky since it's hard to know. But we can think of it as an extended goal. We can discuss it once we have the rest done. So that's kind of the synopsis of this. Feel free to take over and just run me through how if you were to build this system, assuming this watered down version of Reddit exists, how you would go by that?
The Legendary Avenger: So I have a couple of questions in terms of the overall, I guess, requirements gathering. So I guess the first question is how soon do we need to moderate content? Right? So if somebody posts a picture, I guess even before this, is this a reactionary system or is this a proactive system?
Golden Pheasant: Excellent question. So I would say maybe let's target a hybrid where for some content, especially categories of content that are extremely harmful, we basically react ASAP. And for some other contents we can maybe take a day.
The Legendary Avenger: Right?
Golden Pheasant: So we are open to different variation in timing in this case, depending on.
The Legendary Avenger: The type of so certainly like with nudity violence, I think another one is called like that. Those are all things that we'd want to take down immediately. Sorry, this would be proactive and then other pieces of content abuse would go in here as well. But we would say for the reactive case, perhaps misinformation is a different thing here and this is just posts and reactions to posts. There's no comments or anything like that. And then I'm assuming there's other pipelines that when you post something, it's not like in terms of Reddit, I guess slo for post to feed time is probably not on the order of milliseconds, it's probably on the order of seconds. Just an assumption here. Are there other background data processing jobs that are actually going on in this system or will this be the first of these?
Golden Pheasant: Excellent question. So you can assume you have some background jobs. That said, I'll also give you the freedom to define what jobs you think will be convenient and if you think exposing the data that they are processing either in state or after.
The Legendary Avenger: Okay, first async system. Okay, cool. Let's see, let's get a sense of what are the number of the volume of posts and reactions per day? So this is like Reddit, right? Reddit is supposed to have 50 million daily active users. I don't know if they're posting 50 million videos a day, it's probably well then we can just do sort of like a range estimate of in the one to hundreds of millions of posts and then just a clarification. Does that include comments? Not comments. Okay, so we don't care about comments. I probably revise this to be down a little bit, maybe like to 50 million posts per day including all of the content that's in it.
Golden Pheasant: Let's see question on that. So when you say 50 million posts, do you have a distribution in mind? Because given what you have on line six, it will be good to think about how this is distributed, right?
The Legendary Avenger: Oh, I see. Yeah. So if there's 50 million posts, then if we're saying most of the texts are posts sorry, most of the posts are text only. If we said most, meaning maybe 80% if we say 80%, okay, 80% are text only, this would leave us with, what would that actually be? I am not good at math right now. 40 million text posts per day, with the rest being a mix of image and video. So like up to 10 million images and videos per day.
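For the record, the back-of-the-envelope split discussed above works out as follows. This is a rough sketch; the 50M total and 80% text share are the assumptions stated in the interview, not measured figures.

```python
# Rough traffic estimate from the assumptions stated above.
daily_posts = 50_000_000      # assumed total posts per day
text_share = 0.80             # assumed fraction of text-only posts

text_posts = int(daily_posts * text_share)   # 40,000,000 text posts/day
media_posts = daily_posts - text_posts       # 10,000,000 image/video posts/day
avg_per_second = daily_posts / 86_400        # ~579 posts/sec on average
print(text_posts, media_posts, round(avg_per_second))
```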
Golden Pheasant: In terms of videos in this regard, are there any assumptions you're making with respect to what the size of the video is going to be as well as what the input format is going to be? So are you thinking somebody will upload an actual video or will they link a YouTube video, a vimeo video that's going to be, let's say, rendered in an iframe in reddit?
The Legendary Avenger: Yeah, so good question. My assumption is that's true. So text can have links, so that'll be something to consider. We'll need like a link detection deal here service. If there's a video, I'm assuming the videos are direct uploads so we'll be able to analyze them within our system. These are directly attached? Yeah, because otherwise it's really hard to moderate if somebody's adding like a YouTube embed. I mean, one, we assume that YouTube is doing their own moderation, they're doing a good job, and two, if there's a sketchy link, then what we may want to do is actually queue this stuff up. So I guess the next question is do we have some data sets around of existing content that is labeled or do we need a labeling manual review system?
Golden Pheasant: Excellent question. So on line 31 I'll make reference to line four. So feel free to make assumptions about having a human in the loop. So we will actually have a continuous stream of some of the posts that will be labeled depending on the category. So we do have humans in the loop and we actually do have a mechanism available for reports.
The Legendary Avenger: Okay, cool. Yeah. So that mechanism from what I'm familiar with is basically the way the mechanism is structured is they monitor the number of reports. If reports exceed some threshold, queue for manual review and then we get the labels there. So in terms of data set size and distribution, can we assume that the training data that we have mirrors the population data set distribution? An assumption is maybe images are more frequently moderated than text only and so it could be the case that the data set has 50% images, 50% text, whereas we're getting 80% text from our prod distribution.
Golden Pheasant: I see. Yeah. I think it's fair to assume that the rate of report should the size of the data set in this case, especially since from my point of view, any abuse is always going to be abuse. So once it's in the database, we can always reuse it in the future. So to a degree, it's kind of hard to argue that the distribution of data we have currently will be the distribution all the time. But for now, I think for simplicity purposes, I think that's a fair assumption to make, at least for design purposes.
The Legendary Avenger: Okay, right, cool.
Golden Pheasant: In addition to that, I was actually going to suggest it might be worth thinking of data drift ahead of time in this case because one year ago, two years ago, even five years ago, we remember images were the thing. Like it was the bigger deal. But now with the age of TikTok, now that's a bigger deal.
The Legendary Avenger: Right.
Golden Pheasant: People are looking at videos more often. And this is why I'm saying it might be hard to argue that the distribution will stay static, but at the end of the day, the data is always going to be the same. It's always going to be abuse.
The Legendary Avenger: Right? Yeah. Another question is what percentage of posts are marked harmful? So I'm assuming it's in like a 1% range kind of thing here. Like a very small number. Right? Yeah.
Golden Pheasant: Reddit is crazy. So yes, that's a fair assumption.
The Legendary Avenger: Maybe reddit is like 5%, I don't know.
Golden Pheasant: Could be with that platform.
The Legendary Avenger: Man, Twitter is lost.
Golden Pheasant: But yeah, I agree.
The Legendary Avenger: Yeah. Or like Twitter. Yeah. So I guess the last thing I think I wanted to cover here well, I think I have enough in terms of requirements to kind of move forward here. I know in terms of machine learning today, multimodal is starting to really take over where you can just sort of give a machine learning model. I think Salesforce came out with Instruct Blip pretty recently and you can just feed the model like kind of anything and it'll figure out whether the model is sorry, it'll figure out whether the actual content is something like abusive or violent or might be misinformation or something like that. The more traditional way of setting it up is just having kind of one model per modality. And so that's the system I'm more familiar with. But today, if I were setting this up from scratch or even just like joining a company, I'd really look at multimodal first to see if there are wins there. But the system I'm proposing for the interview is the one I'm more familiar with, which is just unimodal. Right.
Golden Pheasant: Let's do this. Let's make the assumption that either we have an ensemble that essentially gives us a multimodality functionality, so it kind of marries the two ideas where you have one model that can do it all with all data types. Alternatively, you can spin out separate models. In fact, I would actually argue it might be worth going with your intuition here. Let's go with the traditional approach. Reason being we already made the assumption that there might be more images that are harmful. So if we have one model yet, most of our content is going to expect, let's say images, we might want to scale that up without we might want to scale the processor further, what do you call it for the images in this case while leaving the text on to stay constant. But that said before, with all the model assumptions made, I would like you to talk about preprocessing. Like given this data that we have, is there any preprocessing steps you might want to make in order to make this data available? Because my assumption is regardless of the type of model we have, we still have to process the different data types to make it ready for the model with the target standard input type, right?
The Legendary Avenger: Yeah. So there's a few pieces here in terms of preprocessing, we can kind of split it up. So for text there's tokenization that you need to do. I can't spell tokenization and so basically it's the process of what in NLP you used to do where you do stemming, removing stop words and trying to clean up capitalizations, fix misspellings and things like that. Traditionally that's the thing you'd do; nowadays there's libraries that basically break words up into tokens and then creates kind of like I want to say it just assigns like an ID to each of the tokens and then that's sort of like the sequence of features that are fed into an NLP model. I think a popular one was tiktoken, which will actually do the tokenization of the text. And then for images and video, some of the things that are really important for preprocessing, I think the first one is making sure that the dimensions of the images are correctly sized. Because if there's images that are bigger or smaller, trying to fit all those things into the model definitely will cause some issues and so you want to scale the image. So I haven't worked with video much, but I can say for images you definitely want to scale it. And then also if we are working with a very low percent of harmful content, sometimes it can help to upsample these images by doing things like adding a rotation what's it called where you change the hue saturation brightness? I don't. Changing HSL values. So like brightening an image, darkening an image, doing all that or changing the contrast of the image in terms of preprocessing to kind of get more samples in.
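A minimal sketch of the preprocessing steps described above, assuming the `tiktoken` and `Pillow` packages are available. The encoding name and the 224x224 target size are illustrative choices, not requirements from the interview.

```python
import re
import tiktoken                 # assumed available; BPE tokenizer library
from PIL import Image           # assumed available; used for image resizing

def preprocess_text(text: str) -> list[int]:
    """Normalize, then tokenize text into integer token IDs."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    enc = tiktoken.get_encoding("cl100k_base")    # illustrative encoding choice
    return enc.encode(text)

def preprocess_image(path: str, size: tuple[int, int] = (224, 224)) -> Image.Image:
    """Load an image, force RGB, and resize to a fixed model input size."""
    img = Image.open(path).convert("RGB")
    return img.resize(size)

print(preprocess_text("Some example post text goes HERE!!!"))   # -> list of token IDs
```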
Golden Pheasant: My question, actually I have two questions, and it's totally okay if you're not familiar with any of this, but this is actually a very interesting point, especially in the context of Reddit. So when it comes to tokenization here, there are multiple strategies we can use you don't have to cite any specific tokenization strategies. But in terms of token size, how long do you think each token should be? And for reference, Ngram is usually like intraword. It can be two or three syllables. Yet you can also have bag of words, which in that regard looks at entire words in this case. Or you can even have multi word tokenization strategies where it looks at samples of phrases or something like that and maybe looks at overlap. So in terms of tokenization, which strategy do you think would make most sense in our case and why?
The Legendary Avenger: Good question.
Golden Pheasant: I think.
The Legendary Avenger: So, like single I mean, like kind of like single word tokenization sort of seems to make I mean, you can't tokenize a word. I believe there's something like you pick like eight characters and that should be enough to sort of encompass most words and then some words get chunked into two tokens. I think that's at least what I'm kind of familiar with in the space, I guess. What are the trade offs there? Well, there's definitely a trade off between the more unique tokens you sorry, there's a plane.
Golden Pheasant: I live next to the Boeing HQ, so trust me, I hear that a lot.
The Legendary Avenger: Yeah. So I think there's a trade off with the number of token values that your vocabulary has. So you have to be careful. If you only do one character, you end up with like if you use ASCII like 256 values. But I don't think the models are as good at being able to pull out context on a per character basis. Now if you do a bigger token size, then you'll have a lot more unique values. I think up to a certain extent models will do pretty well on attention. It's just that now you're feeding in larger I guess you kind of blow up the size of your embedding table a little bit. So you have to kind of watch out for your memory of your system.
Golden Pheasant: Excellent. That's exactly what I was targeting right there. And it's really good that you brought in context because abuse is in context, dirty words always have the same characters as praise. So in that regard, I think you're right. I think I would also lean towards single word, probably towards multiple word tokenization, especially since nowadays a lot of abuse will lean towards the three, four words. Like people will combine multiple words. A single word that is sometimes abusive might not necessarily be abusive. Let's say I'm describing the word stupid in itself, it might not be abusive, but if you're calling a person abuse or stupid, then that is actually an insult. So I think multi word tokenization, you can also justify single word tokenization in that regard. And then there was something you mentioned that made me think a bit. So you mentioned that we need to be mindful of performance, but that's where maybe I might question it a bit. Because if we have, let's say, single word or multi word tokenization, then to a degree, we are reducing. The total cardinality of the total number of tokens generated from any particular text input which could actually end up helping the performance compared to, let's say doing an Ngram tokenization where we are looking at individual characters. So do you think let's say this approach would actually end up meaning something in terms of path? What's your intuition on that?
The Legendary Avenger: Okay, basically, just to reiterate what you just said, the idea is as you choose a multi-word token, that should result in an increase in the number of what do you call it? It's like a decrease in the sequence length, but also an increase in the kind of, like, embedding table size. Right. So I guess in terms of performance, I'm not actually 100% sure on what would actually end up causing the most issues. I think my feeling is that at training time, having a larger embedding table can be more memory intensive, but at inference time, your context, like the sequence length that you actually need should be shorter, and so the inference should be quicker because you're just doing less matrix multiplication at inference time. That's my intuition on it.
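To make the trade-off above concrete, here is a rough back-of-the-envelope sketch. The vocabulary sizes, embedding dimension, and byte width are hypothetical numbers chosen only to illustrate the direction of the effect.

```python
# Hypothetical numbers: coarser tokens -> bigger vocab/embedding table,
# but shorter sequences at inference time.
def embedding_table_mb(vocab_size: int, dim: int, bytes_per_param: int = 4) -> float:
    return vocab_size * dim * bytes_per_param / 1e6

dim = 512
char_level = embedding_table_mb(256, dim)       # ~0.5 MB table, but very long sequences
word_level = embedding_table_mb(50_000, dim)    # ~100 MB table, shorter sequences
multi_word = embedding_table_mb(200_000, dim)   # ~400 MB table, shortest sequences

print(f"char={char_level:.1f}MB word={word_level:.1f}MB multi-word={multi_word:.1f}MB")
```

Per-request compute, on the other hand, scales with sequence length, which is why coarser tokens can still come out ahead at inference time even though the embedding table grows.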
Golden Pheasant: I think that kind of confuses me, though, because technically, with the reduced number of token sizes, with the reduced number of tokens, we should technically also have a reduced embedding table size because the size of the embeddings themselves, actually the number of embeddings should stay roughly the same. The size of the embeddings is what will vary.
The Legendary Avenger: Right, right, yeah.
Golden Pheasant: Just a bit further, in terms of sparsity, do you think there are any implications with, let's say, a multi word tokenization compared to, let's say, Ngram tokenization.
The Legendary Avenger: In terms of sparsity? Yeah, like, I mean, so I I guess, like, with this is just a guess because I don't actually know, but I'm guessing, like, there's a I'm guessing with, like, Ngrams, there's probably a higher sparsity of tokenization. Hello? Did I lose you? Hello? All right. Hello? I hear something.
Golden Pheasant: Oh, can you hear me now? Yes. I don't know what happened there. Yeah, sorry.
The Legendary Avenger: I was saying you are right.
Golden Pheasant: Your intuition is very correct on that, because objectively, if we have Ngrams, we have a huge embedding table because there are too many unique values. And so for it to make sense, we have to have huge embeddings. In this case, probably more than 1036, which we have with TikTok. So in that regard, most of the values will be zeros. And so we'll end up with too many zeros and too little information. But with multi word tokenization, smaller embeddings, which will mostly have values, because in this case, essentially, with the same size of input, we have fewer boxes to eat. So not only will they be more useful because they encapsulate context too, and so it also ends up helping with performance in that regard. But anyway, sorry, we spent a bit too much time on that. So I think we can move on in this regard. Let's maybe talk about the API. So given what we have here, talk to me about some APIs you might want to design in order to serve this data.
The Legendary Avenger: Yeah, let's see. So I could see like a few different systems here. So the idea is that we kind of said we're going to do single.
Golden Pheasant: Modal.
The Legendary Avenger: Inference and inference. In this case, it could be done as like a batch inference or online. To me they kind of both work because technically it's an async process. But also this data is kind of coming in as a stream. So to me, I think it's just easier to kind of model it as an online inference system and just sort of like a little more future proof. If you do need it to be more real time, it will still be there. You don't have to migrate from batch to real time in that case. So there's basically going to be three services. One for image, one for text, and one for video. They'll all have the same APIs. So maybe it's like moderate. So you could say service moderate. And then the request should be like post request. And if we're thinking about what kind of data we want for the body, so we should have features and then we'll pass in a list of features. The services should be doing the same preprocessing that we're doing to generate the data and train the model with it. I think these features in reality, the way this would kind of work is you use like feature IDs and then sort of work through some kind of a feature store. So this is actually something I didn't cover, but I feel like we should talk about is it's not just image text and video, but also user profiles might also be something that we want to consider at least in the future. So it's not just like show me an image, it's like show me an image by a particular user. And then maybe that helps with identifying things like misinformation. Like if there's a post and that user has been reported a number of times, maybe that actually positively impact, but maybe that's like a sorry, I'm kind of thinking out loud here and I'm a little disorganized. I think having the three services here and then another service, a fourth one for just text misinformation makes sense. These are a lot easier to moderate. And then with misinformation, it's a little more subjective and the model may end up having features in it that the text model doesn't need to include. So that's why I'm kind of building this out as a separate service here. So for this API, you'll pass in a list of features. Maybe it's as simple as just saying like a dictionary and then we can say the text for texts, right? One, two, three. Or for same with images, it would just be the Blob or video. So the Blob and the bytes. And then same thing with the video, right? So that's just kind of the API there. And then the response is going to be a list of labels and confidences. So you'd say like results and then, so we'll have a label and then, so we may want to say this is the abuse label and then we can give a confidence and then this will be a float between zero and one. So this could be like nine and three. And then we can have other labels here as well. So we'd have like labels for nudity, violence and then yeah, these, these would be like the labels. Yeah. So that's basically the APIs of the service. In terms of like there's a few important things here which is as a system as a whole, the thing I want to touch on is there's training metrics and then there's online metrics. And how are we doing? So for things like online metrics, like actual production metrics, we want to cover things like the number of we call it harmful let's we'll call it harmful impressions. So that's just the number of people that have seen these things and then like number of reports.
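For illustration, a hypothetical request/response pair for the moderation endpoint sketched above. The field names and confidence values are made-up stand-ins, not the exact schema discussed in the interview.

```python
import json

# Hypothetical request body for the text moderation service (POST /moderate).
request_body = {
    "features": [
        {"text": "example post body goes here"},
    ]
}

# Hypothetical response: one entry per label, with a confidence in [0, 1].
response_body = {
    "results": [
        {"label": "abuse",    "confidence": 0.93},
        {"label": "nudity",   "confidence": 0.02},
        {"label": "violence", "confidence": 0.11},
    ]
}

print(json.dumps(response_body, indent=2))
```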
Golden Pheasant: Nice.
The Legendary Avenger: And maybe even so there's impressions and then also maybe like a time based one. So like average time to take down. So when something is posted reactively and we need to take it down, we want to be able to take it down quicker and hopefully models will facilitate.
Golden Pheasant: That take down time.
The Legendary Avenger: In terms of actual training metrics, some of the things that we want to consider here so when we're training on image, text or video, we're doing a.
Golden Pheasant: Multi class classification.
The Legendary Avenger: Which means that we'll be doing a get them mixed up all the time.
Golden Pheasant: Precision and recall.
The Legendary Avenger: Oh, sorry. Yeah, so we care about precision and recall. And so precision is basically the number of true positives divided by the true positives plus the false positives. And recall is true positives divided by true positives plus the false negatives. In terms of how we want to actually present this, I think just presenting in terms of precision and recall is fine; we don't have to do an F1 score or like a receiver operator curve, just leaving it as precision/recall works. Any questions on this piece?
Golden Pheasant: No question on that. Maybe just a quick suggestion because I was literally typing feedback on this bit because I totally relate to you on one bit. I absolutely hate these two terms because I rarely remember what exactly they are. And so my fallback is literally saying I'm just going to generate a confusion matrix because it's going to allow me to interpret the same metrics the same way. Like I'm better with the intuition rather than the terminologies. But I totally hear you on this. Makes sense.
The Legendary Avenger: Yeah. That makes sense. Okay, so that's that for training metrics.
Golden Pheasant: Actually, sorry, quick question, objectively speaking, because given these training metrics, can you define to me the objective of the model in terms of the two metrics?
The Legendary Avenger: Yeah, you're talking about like the loss.
Golden Pheasant: Function, not necessarily the loss function. We can speak on that later given we have a multiclassification problem here. But I'm interested in what is the goal? Like do you want to minimize precision while maximizing recall or do you want to maximize both? What's the goal? Particularly in our case? Yeah.
The Legendary Avenger: So this is actually interesting because it depends on the business objectives and time. So I've actually seen that at my company where we basically said for a while we were okay with businesses, we were okay with being very strict in terms of, I want to say, like false negatives. And then economic downturn kind of happened and then we kind of said like, hey, we want less false positives because we want people to actually spend on our platform. And so we were actually willing to trade off like recall, I think back for precision or it might have been the other way around, business metrics, but if you're trying to juice engagement, then you want to be careful with okay, so a false positive in this case would be classifying something as harmful when it's not. And so if you're trying to juice engagement metrics, having a very high precision and low recall is good because if you have high precision, low recall, what we're basically saying is we're okay with harmful content making it past the model because we don't want to block good content. Makes sense.
Golden Pheasant: So in this case, essentially what you're saying is we are okay with people being too sensitive rather than being less sensitive and us missing on actual abuse content. Like it's okay if they report something and we find out it's not harmful and restore it rather than the opposite way they fail to report or we fail to flag an excellent so we are basically making our thresholds very low for what abuse or what a flag for abuse should be. But then of course we have the human in the loop, so we'll always have a restoration process for those.
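A small sketch of the threshold-plus-human-review policy just discussed: auto-remove only at high confidence, flag aggressively at a low threshold for review, and rely on the human-in-the-loop to restore false alarms. The threshold values and label names are illustrative assumptions.

```python
# Illustrative routing policy for per-label confidences from the moderation model.
AUTO_REMOVE_THRESHOLD = 0.90   # hypothetical
REVIEW_THRESHOLD = 0.30        # hypothetical, deliberately low to catch more candidates

def route(label: str, confidence: float) -> str:
    if confidence >= AUTO_REMOVE_THRESHOLD:
        return f"auto-remove ({label})"
    if confidence >= REVIEW_THRESHOLD:
        return f"queue for human review ({label})"
    return "allow"

print(route("abuse", 0.95))   # auto-remove
print(route("abuse", 0.45))   # human review; content can be restored if reviewers disagree
print(route("abuse", 0.05))   # allow
```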
The Legendary Avenger: I think that's an excellent approach and.
Golden Pheasant: I was actually very happy with that because I was going to grill you after that on that point, like would you want it to be the opposite and why? But I think you immediately went to it, which was excellent and then final question at least, sorry, I was going.
The Legendary Avenger: To sort of make a joke, which is this is until you have to advertise on the platform and brand, then you have to change your strategy.
Golden Pheasant: Actually, if you think about it though, that strategy can actually help because if we are worried about advertisement then unless we are worried about the rate of reporting on adverts because we know they will always annoy people, and then they'll report them. But I think advertisers will probably be happy with a platform that's a bit more strict because look at, let's say Twitter, right? Twitter is not strict at all right now on abuse. And advertisers pulled out because they don't want the possibility of being mixed in with that.
The Legendary Avenger: Right.
Golden Pheasant: So to a degree, that strategy will actually help with Ads. What do you think?
The Legendary Avenger: The opposite strategy. Right? Because what we're saying is we'll let some harmful content through, and because we don't want to block good content accidentally, and then if we want to be brand safe, we'll actually start blocking more harmful content at the expense of blocking some good ones every once in a while. So we're more brand safe.
Golden Pheasant: Exactly.
The Legendary Avenger: Yeah.
Golden Pheasant: Makes sense. Okay, we in line on that. Okay. And then, let's see. I had a final question with respect to metrics, and then the next section I wanted you to focus on before, at least to get a feedback, is just scaling. So just pure architecture talk. So I was thinking, how are you going to scale it up? But before we do that, now that we have our objective function, objective in this case, can you talk to me about the loss function? Like, if you're going to choose a loss function here for your model, this is the most technical I'll get to the model, but what loss function do you think would make most sense?
The Legendary Avenger: Yeah, the loss function is I'm blanking, so it's a multiclass classification. And I know that you apply well, you're supposed to apply a soft max at the very end. And then what you do is you compute you compute the loss. The loss is, in this case, like, um, the only one I'm thinking of is well, there's two mean squared error or like, maximum likelihood. I think, like, maximum likelihood sounds like it'd be the right loss function. But I'm going to have a hard time with this section because I definitely need to study up on my loss functions. All right.
Golden Pheasant: Are you familiar with cross entropy?
The Legendary Avenger: Somewhat familiar with cross entropy. I know the word. I don't know what it represents at the moment.
Golden Pheasant: Yeah, so the reason I brought that up is because, at least from my understanding, it's been a while since I had to meddle around with maximum likelihood. But I remember it as a way of estimating my parameters for my model, but not necessarily for a loss function. It's actually used more so for tuning the parameters. But the loss functions, they tend to be things like cross entropy. There's binary cross entropy when you have two levels, or multilevel cross entropy when you have multiple levels. There are things like hinge loss, maybe focal loss, and a couple of other examples. I cannot think of all of them off of the top of my head, but honestly, and this is just a quick tip, cross entropy will almost always be a good answer when you're asked about the loss function. The only difference might be you need to showcase the intuition on the variants. So things like multilevel or binary, as well as understanding how it penalizes false predictions. Now, the downside to it though, is it's excellent for supervised problems where we have labeled input. In our case that would work well, but for unsupervised problems, that becomes a bit of a problem. And this is where if you think of, let's say, pre-clustering in order to get a sense of, let's say, some sense of labels, that might actually be helpful.
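As a concrete follow-up to the cross-entropy discussion, a minimal PyTorch sketch (assuming `torch` is available) contrasting the multi-class and multi-label cases. The tensor values are arbitrary examples, not outputs from any real model.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5]])   # raw model outputs for one post, 3 labels

# Multi-class (exactly one label applies): softmax + categorical cross-entropy.
multiclass_target = torch.tensor([0])       # index of the single true class
ce = nn.CrossEntropyLoss()(logits, multiclass_target)

# Multi-label (several labels can apply at once, as in harmful-content tagging):
# per-label sigmoid + binary cross-entropy.
multilabel_target = torch.tensor([[1.0, 0.0, 1.0]])
bce = nn.BCEWithLogitsLoss()(logits, multilabel_target)

print(ce.item(), bce.item())
```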
The Legendary Avenger: Okay.
Golden Pheasant: Does that make sense?
The Legendary Avenger: Yeah.
Golden Pheasant: All right. And then final bit of probing in this case. So talk to me about deployment. So say we've set this up. How are you thinking of, let's say, scaling the system up, if need be? And are you going to have a multi service architecture or monolith? What's your intuition on it?
The Legendary Avenger: Yeah, so there's a few things here. The first point is my initial launch, I'd want the models to be small enough to fit on a single GPU for inference, because if I have to do a distributed GPU thing, that just adds more complexity to the system. So, like single GPU inference, that would be nice. And we basically put a load balancer in front of a number of these instances. I think each of the major frameworks has their own serving framework, so PyTorch and TensorFlow have their own serving infrastructure. So I would just run it on top of that. And yeah, I mean, like in terms of doing a couple of things that are important, I'd want to make sure that I'm logging all of the features at request time. That's one thing I want to do to make sure I minimize the amount of potential data leakage that's happening in terms of rolling that data out into the new data set. And then also we want some metrics and monitoring to watch for data set distribution shifts. So things like anomaly detection just to figure out, like, hey, are we getting a bunch of nulls for a particular feature? Like in this case, it's text, images and videos. But if we're doing something with misinformation where we care about the user profile, you do care about that kind of data distribution shift, and also you want to look for your label shifts too. There could just be all of a sudden one day, everyone's posting a bunch of crypto nonsense and you're getting a lot more abuse on the platform that day. So you do need to be able to alert and say like, hey, our models are kind of going off the rails here. The system that I would actually set up, okay, so there's scaling. Scaling, I would just do a load balancer with a cluster of machines; as CPU or GPU utilization, I guess, reaches a threshold, add more machines to the cluster and autoscale. The thing I would want to do though, is on any new model rollout to do an A/B study just to see how the model is performing online versus the existing baseline before just blindly deploying it out to production.
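A rough sketch of the request-time feature logging and distribution-shift checks mentioned above. The log format, field names, thresholds, and the naive null-rate check are all hypothetical illustrations, not the exact monitoring stack discussed.

```python
import json
import time
from collections import Counter

def log_features(features: dict, prediction: dict, path: str = "inference_log.jsonl") -> None:
    """Append the exact features used at inference time, to limit training/serving skew."""
    record = {"ts": time.time(), "features": features, "prediction": prediction}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def null_rate_alert(rows: list[dict], field: str, threshold: float = 0.05) -> bool:
    """Very naive drift check: alert if a feature is suddenly missing too often."""
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return (missing / max(len(rows), 1)) > threshold

def label_share(predicted_labels: list[str]) -> Counter:
    """Track the predicted-label distribution so sudden shifts can be alerted on."""
    return Counter(predicted_labels)

print(null_rate_alert([{"text": "hi"}, {"text": ""}], "text"))   # True: 50% missing
```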
Golden Pheasant: Excellent. I think that certainly makes sense. And just a quick point. In addition to what you've just mentioned, there are also other benefits to think about. Given that our service, and maybe you should have clarified this initially, ought to be global, abuse in the US might not necessarily be abuse in some other place. And in similar regard, we have inputs that might be different, that might completely distort the model. We want to stratify the models by geographic features, like language, what might happen in China versus what might happen in Russia. The languages are completely different, the input is different. So we probably want different flavors of the models. And so your approach here, where we have mini models or microservices essentially, and distributing them depending on the data they serve, that will then be a very useful approach for this purpose too, right?
The Legendary Avenger: Yeah, having regional support.
Golden Pheasant: Exactly. Okay, we have about five minutes left. I will add a couple more to make sure we are going through feedback exhaustively. But for context, I never give feedback straight up. I usually want to first get your intuition. So, looking back at the interview, what do you think? You did well and what do you think you could improve on?
The Legendary Avenger: I think I did all right with the initial part, the sort of like requirements gathering. I definitely need to go deeper. Well, I got stuck on the tokenization piece, so I think there's just a little bit more focus on some of the machine learning aspects of things like what are some of the more common preprocessing steps. Tokenization be one of them. Looking at the different scalings and things like that, like doing log transformations when your data distribution isn't exactly gaussian. Things like that, I need to just refresh on. I thought the API endpoint part was fine. Yeah, I don't know if there's anything really glaring there. Ultimately, I feel like I'm not sure I called it out in the beginning, which was like, really should be doing a multimodal system that's just like, more intelligent. I have a feeling that would actually perform better. I'm glad I called that out and then went with the system that I'm a little bit more familiar with. So I could actually each of these things. I'm not actually sure if it's better to do it that way in a real interview or if it's like by going this way, I'm giving the interviewer an opportunity to dig on. Let's talk about image related models. Let's talk about text related models versus potentially. If I'm saying let's build some multimodal system and the interviewer doesn't know how to or isn't familiar with multimodal systems, then I'm not sure if they're getting signal. So I'm not sure if that's like the right path to take there. I need to just do better on my loss function stuff and just kind of didn't get that piece. And then other than that, I think everything kind of did all right. So that's sort of how I feel. Like there's some ups and downs. Definitely not like the perfect interview.
Golden Pheasant: Here's the thing, I rarely find a.
The Legendary Avenger: Person who signed up for mentorship, but then at the end I'm telling them I will probably give you a pass. So you actually did pretty darn well. So I still took a ton of notes, as I said. So to start with, keeping it short, I frankly think you're almost there. If I was to give you feedback in case you want to sign up for more sessions, you probably need at most five, if not three, just to kind of refine the process. Because in terms of content, you probably are one of the few people who've been able to keep up with the line of questioning I've asked, which was excellent. So just to go through the feedback and mind you, I'll paste all of this in in addition to some extra resources you can look at at the end. For the most part I agree with you, but objectively communication excellent. Like you are really easy to communicate with, easy to follow, even throwing jokes on occasion in the interview, which keeps it light hearted. So really did well with that. And then the line of questioning. With respect to background processes, I really like that because most people fail to think about that. They think they have to design reddit. In fact, one of the biggest problem I've had with candidates is you ask them to detect to design a harmful content detection process, but they don't think about the background processes that exist. So you started with that. Make it a point in any interview like nobody is asking to design the actual system. They are designing a new system in addition to the platform that's already there. So feel free to denote exactly what you think already exists. And maybe this is where one bit of improvement we can have here is actually detailing those bits. So if we expect that, let's say, ready to give us APIs that can extract text, or alternatively, if we expect that we have a push model where if a post is made, our model automatically consumes that post, or has that post pushed into, let's say, some message queue where we can consume it and process it. It will be good to talk about it from that stage. And in fact, for most machine learning interviews, I think that's probably the best stage to start at. Just thinking about what's my input like, what's the current platform giving me in terms of input. Of course, if it's not like a platform based system, it could be different. But typically data source is always the first stage. But it was really good to hear you actually talk about that because it showed that you had the intuition to think about that bit. And then on line 85, I mentioned the preprocessing bit. I felt as though you were not thinking about it until I pushed you in that direction. I had to actually ask you, okay, how are you going to standardize this? How are you going to preprocess it? But once I asked you, you went in depth on tokenization preprocessing removal of stopwords. It is key to think about that because part of the ML system design, in fact, sometimes you may find that's the stage that requires the most system design work, because these are heavy processes. Like can imagine when you have a bunch of images or even let's say videos where you have to do frame sampling, then that's computationally intensive. And it's usually worthwhile thinking about how that setup is going to look like. What those features once you've extracted the features, because this is basically the feature engineering part. Once you've extracted them, where you're going to store them because you don't want to process all of that. 
And then all that data is basically output into some log file. You want to actually think about the data output. In that case, talk about, let's say, Redis, or whatever tiered data storage solution you have where the model can consume from in a quick manner. So it might be worth talking about API endpoints there, maybe talk about them being backed by, let's say, I don't know, Redis or some other system. So that's stage one of the system design. And then I give you a breakdown. You can clearly see on line 82 how I typically do it. It's kind of worked well for me. Maybe it can inspire you to curate a process for yourself. So typically it's usually requirements, preprocessing, and then defining my APIs, both the input and output, before I even jump into the model stuff. And then I think we delved deep here. Yeah, I actually made the point on that. You went deep into it. So I didn't really have any line.
Golden Pheasant: Of questioning on that.
The Legendary Avenger: So that was good. So for the tokenization bit, I think we kind of covered much of that conversation in the interview. But refresh on that. That way at least you have a stronger sense of the intuition between of the implication of performance between short and long. So clearly that's probably the weakest point of the interview because I felt like we were going through it's kind of like throw something at the wall, let's see if it sticks. But for the most part, it was good that you even knew of the different tokenization strategies. You even went to a level of giving me examples which I could identify. So I really appreciate that. And then let's see. I think this is the same point I've made there, where preprocessing is always step one. So it was good that we talked about that. And then in terms of output, I felt like once you were done with the core APIs, the rest of the stages were perfect. It was like you talked about the output you expect. In fact, the funny thing, when you said the API was okay, to me, it was like, this is probably closer to what I would want to see in an interview where somebody actually gives me example payloads. I always ask, for example, payloads. Most people won't give that. So this was actually really good. So these are nits, I would say maybe include the expected protocol. So though for machine learning system design, I don't think I'll push anybody down to that level. But the reason why I might want to converse about that is sometimes it's worth thinking about, let's say the concurrency or the number of requests I expect. So if this API, it's going to expect a lot of, let's say get requests. Get requests tend to be more performance than, let's say, post requests. So when you're talking about that, especially in this case, we know there will be a lot of text request coming in. Like as soon as a post is made, a request is made. So there should be a ton of them, especially since the sheer volume is going to be a lot. But most of the payloads will be tiny. So in that case, we probably want to have lightweight post requests. In this case, maybe the downside might be since most of the requests will correspond to a unique thread, then maybe post might is that actually good? Is it post or put? Like in terms of method? Because I want to create a post.
Golden Pheasant: Is kind of well, there's two ways of thinking about it's, like rest and then there's just like JSON RPC. So if you're thinking about it in terms of like rest, a post request is more tend to create something. And a put is like an update on the full entity. But people who like gRPC is all post requests, even if you want to get something.
The Legendary Avenger: I see. Which one is the one that has the idempotency? I think it's a post or put in terms of like a session idempotency. I can't remember which one.
Golden Pheasant: I would assume, like, they're both supposed to be idempotent, I think. Yeah, I don't think it should matter. For instance, if you're doing a payment endpoint, you want to post to create the payment, but that's idempotent. Like if you try and call it twice, it needs to have the same result twice.
The Legendary Avenger: Yeah, actually, I do believe put is actually idempotent while post is not. And this is actually the key bit right there. So if we made a put request, especially since we know most requests will not be updating, they'll be sending this one request. We process, we're done. And that's where maybe you can justify post because I don't think we worry too much about reprocessing the same thread. But the downside might be if somebody decides to spam our system then once they start sending threads and maybe they know we are preprocessing them for abuse or whatever so they just send like 20-30 threads. If you are processing them using post, each and every thread will be processed. But if you are using a put type of request then once they've done it then we kind of might monitor the session. Although if it's, let's say along the same request line, then we could have a conversation on that. But you kind of see why depending on user behavior, the method might be worth talking about. But other than that, most of the content was perfect. In fact, I really appreciated that you thought of the multi-label approach without prompting. I've honestly not had anybody immediately think of that. So you've done a good job with that, thought of how you're going to present them in terms of multiple labels and then I think I break this down for you in order to give a sense of the sequence of steps. For the most part, I think we already touched on the precision versus recall bit. Honestly, I'm not going to grill you on that because I know I struggle with that too. So the only suggestion I had is just use a confusion matrix. That way you avoid having to talk of the terms themselves. But that said, it always looks good if you immediately talk about it without even defining it because it kind of gives that interviewer the illusion that I work with these metrics regularly, I know what they mean, so I don't need to think too heavily about what they mean. So essentially just review the resources to kind of have a sense of what high precision, low precision means. And then maybe I forgot to mention this, but this objective, the objective function or the objective of the modeling altogether, it might be something we want to highlight in the initial stages, but it's totally understandable to talk about it even at the end; I felt like it gave the interview a more natural flow. But at the end we want that optimization in terms of process. So maybe try and touch on it right at the get go. That way it's not something that you might miss out on. Let's say if the interview is kind of hard and you're stuck with some steps, you want to just knock it out of the way because it's easy to do and then keep going. Yeah, does that make sense? For sure. But other than that, as I said, I will give you a pass. I think you actually really did a good job. I could clearly pick up on the knowledge. So if you have interviews tomorrow, I honestly will be confident you'll do well. It's more or less about refining the process. So still put in a bit more work and try and get the processes refined and final bit, maybe spend a bit less time on the requirements. But I was okay with this because I felt like you were talking about design as you went through it, and that's why I think that was time well spent.
But in case the interview is proving to be a bit more technical, just a bit more crazy, because you'll find some interviewers who are generally hard to work with, just try and maybe time box yourself there to at most seven minutes or so.
Golden Pheasant: Okay. All right, cool. That makes sense.
The Legendary Avenger: Awesome.
Golden Pheasant: All right, any other questions? Anything else I can clarify?
The Legendary Avenger: Nothing else today?
Golden Pheasant: Thank you so much.
The Legendary Avenger: Absolutely.
Golden Pheasant: I enjoyed this. I hope you did, too. And it was nice talking to you.
The Legendary Avenger: Yeah, this was really great. Really appreciate appreciate taking your time today.
Golden Pheasant: Absolutely. All right. Thank you.
The Legendary Avenger: All right. Thank you. Bye.
The Legendary Avenger: Hey. Can you hear me?
Golden Pheasant: Yes. Can you hear me?
The Legendary Avenger: Yes. Loud and Clear. Okay, cool. So just to get it started, maybe give me a quick run through of what you're prepping for as well as what you're looking to get out of this. That way I can make sure we target that exactly and get the most value for you out of it.
Golden Pheasant: Yes. So I'm interviewing for some senior machine learning / MLOps positions, and I will be getting some machine learning system design questions along the way. And that's kind of the angle, I think. Now, I'm not sure the company I'm actually interviewing for does a lot of payments stuff. I'm not sure if they'll give me one that's focused on that or if they'll just give me, like, a random machine learning ops question. So I'm not sure.
The Legendary Avenger: I see. Okay, so lots of infrastructure monitoring and something along those lines, and given that it's payments, probably along the lines of anomaly detection in this case, right? Could be.
Golden Pheasant: I don't know.
The Legendary Avenger: I'm just saying yeah, it's like a best bet kind of situation because what I usually try to do is given the company or the target company, I try and think of problems that have a high probability of showing up. Who knows? If you practice with that and it shows up, that's a good thing, right?
Golden Pheasant: Yeah. So the company, Upstart, they do lending. I don't know if it's as much anomaly... I mean, it could be anomaly detection, but yeah, I'm not sure. I hear you. Okay.
The Legendary Avenger: Makes sense. I mean, at the end of the day, fortunately, the infrastructure roughly stays the same because it's usually if you have a model handy, then it's more or less about scaling up and making it available. So the process stays roughly the same regardless of the application. Although, contextually, it can be different, especially when thinking about requirements. And I noticed that this is a mentorship session, so have you done any such interview before? And if so, what's your comfort level with machine learning interviews overall?
Golden Pheasant: So I've done plenty of normal SWE interviews. I am not as experienced in system design stuff in general, technically. It's kind of weird. The way it worked was basically like, I wanted a staff engineer who did machine learning, and we couldn't find a match, so they said, we'll throw in a free machine learning infra interview for you. So I was like, okay, cool. So that's kind of how this is marked as a mentorship session.
The Legendary Avenger: I see.
Golden Pheasant: Yeah. Okay. That makes sense. Cool. And final question, at least, before we jump in here: have you done any prep along those lines?
The Legendary Avenger: A little bit of prep with Alex Xu's book.
Golden Pheasant: Oh, nice.
The Legendary Avenger: Yeah, I've read, like, the first four chapters or so. Perfect.
Golden Pheasant: Yeah. If you've done that, at the very least, then you're in a good place. And TLDR, just to set expectations: the process is roughly the same as traditional system design. The only difference will be maybe less of the estimates, a bit of extra time on the APIs. And at the end of the day, in terms of monitoring, it's not the traditional stats like p90 and whatever; it will focus on ML-specific metrics depending on the model you choose. So it's a fairly straightforward process. Okay, I have a proposal, at least for this interview, in order to help make sure that you get that practice. I can give a question to you, a prompt, as though it's a normal question, and you'll go through it. And now, unlike traditional mocks, where I'd reserve feedback for the end, if there are any glaring weaknesses, I can kind of stop you in the moment, suggest a way we can improve, and then try again. And then if there's anything that's good enough, I can reserve feedback towards the end. So minor things, reserve feedback for the end; major things, kind of stop, we iterate and then think of better ways to do it. That way we target the big weaknesses, if any. And then for the minor ones, just give resources or stuff you can practice on async. What do you think?
The Legendary Avenger: Yeah, sounds good to me.
Golden Pheasant: Perfect. Okay. And so, building on the same problem, what you're going to try and do here is essentially something more traditional. We're going to again target... then, okay, what is happening with this thing? Are you able to see what's going on on the screen?
The Legendary Avenger: Yeah.
Golden Pheasant: Okay. I'll reload so that the bug goes off. I don't know what's going on with it.
The Legendary Avenger: Oh, okay.
Golden Pheasant: You're able to see my screen, right? You see me type?
The Legendary Avenger: Sorry. Yeah, it was some weird infinite loop thing. Okay.
Golden Pheasant: We're going to focus on an anomaly detection problem. And in this regard, we are going to focus on... actually, let me even twist it up a bit, so let me generalize. Instead of anomaly detection, let's think of harmful content removal. So it's somewhat anomaly detection, but I think it gives us a better, more generalized framework. And we're not going to target a payments company yet. So think in terms of Reddit. And the beauty of it is it gives us multiple media types. So I'll give you an initialization here. Imagine we have Reddit. So we have the subreddits, we have the reactions, and a couple of things to get you started. We have the ability to report posts. We have the ability to, let's say, react to posts. And we also support the following types of posts: images, short videos, but mostly text. Okay, so this is like our watered-down version of Reddit. So given that, what I would like you to do is think of how you will build out a system to essentially handle harmful content removal. And by harmful content (feel free to add to this), this is what we mean: content related to abuse, content related to nudity, content related to, what else, violence, misinformation. This one is especially tricky since it's hard to know, but we can think of it as an extended goal. We can discuss it once we have the rest done. So that's kind of the synopsis of this. Feel free to take over and just run me through, if you were to build this system, assuming this watered-down version of Reddit exists, how you would go about that.
The Legendary Avenger: So I have a couple of questions in terms of the overall, I guess, requirements gathering. So I guess the first question is, how soon do we need to moderate content? Right? So if somebody posts a picture... I guess even before this, is this a reactive system or is this a proactive system?
Golden Pheasant: Excellent question. So I would say maybe let's target a hybrid, where for some content, especially categories of content that are extremely harmful, we basically react ASAP. And for some other content we can maybe take a day.
The Legendary Avenger: Right?
Golden Pheasant: So we are open to different variations in timing in this case, depending on...
The Legendary Avenger: The type of content. So certainly, like, with nudity, violence, and things like that, those are all things that we'd want to take down immediately. Sorry, those would be proactive, and then other pieces of content, abuse, would go in here as well. But we would say, for the reactive case, perhaps misinformation is a different thing here. And this is just posts and reactions to posts; there's no comments or anything like that. And then I'm assuming there are other pipelines: when you post something, in terms of Reddit, I guess the SLO for post-to-feed time is probably not on the order of milliseconds, it's probably on the order of seconds. Just an assumption here. Are there other background data processing jobs that are actually going on in this system, or will this be the first of these?
Golden Pheasant: Excellent question. So you can assume you have some background jobs. That said, I'll also give you the freedom to define what jobs you think will be convenient, and whether you think they should expose the data they're processing, either in flight or after.
The Legendary Avenger: Okay, so it's an async system. Okay, cool. Let's see, let's get a sense of the volume of posts and reactions per day. So this is like Reddit, right? Reddit is supposed to have 50 million daily active users. I don't know if they're posting 50 million videos a day; we can just do sort of a range estimate of one to hundreds of millions of posts. And then just a clarification: does that include comments? Not comments. Okay, so we don't care about comments. I'd probably revise this down a little bit, maybe to 50 million posts per day, including all of the content that's in them.
Golden Pheasant: Let's see, a question on that. So when you say 50 million posts, do you have a distribution in mind? Because given what you have on line six, it will be good to think about how this is distributed, right?
The Legendary Avenger: Oh, I see. Yeah. So if there's 50 million posts, and we're saying most of the posts are text only... if we say most, meaning maybe 80%... okay, 80% are text only, this would leave us with, what would that actually be? I am not good at math right now. 40 million text posts per day, with a mix of image and video, so up to 10 million images and videos per day.
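A quick back-of-the-envelope version of the estimate above, assuming the 50 million posts per day and 80/20 text-to-media split stated in the conversation:

```python
# Volume estimate from the assumptions above:
# ~50M posts/day, ~80% text-only, the rest images/short videos.
POSTS_PER_DAY = 50_000_000
TEXT_SHARE = 0.80

text_posts = int(POSTS_PER_DAY * TEXT_SHARE)      # 40,000,000 text posts/day
media_posts = POSTS_PER_DAY - text_posts          # 10,000,000 image/video posts/day
posts_per_second = POSTS_PER_DAY / 86_400         # ~579 posts/s on average

print(f"text posts/day:  {text_posts:,}")
print(f"media posts/day: {media_posts:,}")
print(f"average posts/s: {posts_per_second:,.0f}")
```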
Golden Pheasant: In terms of videos in this regard, are there any assumptions you're making with respect to what the size of the video is going to be as well as what the input format is going to be? So are you thinking somebody will upload an actual video or will they link a YouTube video, a vimeo video that's going to be, let's say, rendered in an iframe in reddit?
The Legendary Avenger: Yeah, so good question. My assumption is that's true. So text can have links, so that'll be something to consider; we'll need some kind of link detection service here. If there's a video, I'm assuming the videos are direct uploads, so we'll be able to analyze them within our system. These are directly attached. Yeah, because otherwise it's really hard to moderate if somebody's adding, like, a YouTube embed. I mean, one, we assume that YouTube is doing their own moderation and they're doing a good job, and two, if there's a sketchy link, then what we may want to do is actually queue this stuff up. So I guess the next question is, do we have some datasets of existing content that are labeled, or do we need a labeling manual review system?
Golden Pheasant: Excellent question. So on line 31 I'll make reference to line four. So feel free to make assumptions about having a human in the loop. So we will actually have a continuous stream of some of the posts that will be labeled depending on the category. So we do have humans in the loop and we actually do have a mechanism available for reports.
The Legendary Avenger: Okay, cool. Yeah. So that mechanism, from what I'm familiar with, is basically structured so that they monitor the number of reports. If reports exceed some threshold, queue for manual review, and then we get the labels there. So in terms of data set size and distribution, can we assume that the training data that we have mirrors the population data set distribution? An assumption is maybe images are more frequently moderated than text only, and so it could be the case that the data set has 50% images, 50% text, whereas we're getting 80% text from our prod distribution.
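A hedged sketch of the report-threshold mechanism described above; the threshold value and names are assumptions, not a real Reddit API:

```python
from collections import defaultdict
from queue import Queue

# Once a post crosses N user reports, push it to a human-review queue;
# the reviewer's decision later becomes a training label.
REPORT_THRESHOLD = 5          # assumed value; tuned per category in practice

report_counts = defaultdict(int)
review_queue = Queue()
queued = set()

def record_report(post_id: str) -> None:
    report_counts[post_id] += 1
    if report_counts[post_id] >= REPORT_THRESHOLD and post_id not in queued:
        queued.add(post_id)
        review_queue.put(post_id)   # human in the loop labels it from here

record_report("t3_abc123")
```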
Golden Pheasant: I see. Yeah. I think it's fair to assume that the rate of reports will shape the size of the data set in this case, especially since, from my point of view, any abuse is always going to be abuse. So once it's in the database, we can always reuse it in the future. So to a degree, it's kind of hard to argue that the distribution of data we have currently will be the distribution all the time. But for now, I think for simplicity purposes, that's a fair assumption to make, at least for design purposes.
The Legendary Avenger: Okay, right, cool.
Golden Pheasant: In addition to that, I was actually going to suggest it might be worth thinking of data drift ahead of time in this case because one year ago, two years ago, even five years ago, we remember images were the thing. Like it was the bigger deal. But now with the age of TikTok, now that's a bigger deal.
The Legendary Avenger: Right.
Golden Pheasant: People are looking at videos more often. And this is why I'm saying it might be hard to argue that the distribution will stay static, but at the end of the day, the data is always going to be the same. It's always going to be abuse.
The Legendary Avenger: Right? Yeah. Another question is, what percentage of posts are marked harmful? So I'm assuming it's in, like, a 1% range kind of thing here. Like a very low number. Right? Yeah.
Golden Pheasant: Reddit is crazy. So yes, that's a fair assumption.
The Legendary Avenger: Maybe reddit is like 5%, I don't know.
Golden Pheasant: Could be with that platform.
The Legendary Avenger: Man, Twitter is lost.
Golden Pheasant: But yeah, I agree.
The Legendary Avenger: Yeah. Or like Twitter. Yeah. So I guess the last thing I wanted to cover here... well, I think I have enough in terms of requirements to kind of move forward. I know in terms of machine learning today, multimodal is starting to really take over, where you can just sort of give a machine learning model... I think Salesforce came out with InstructBLIP pretty recently, and you can just feed the model kind of anything and it'll figure out whether the model is... sorry, it'll figure out whether the actual content is something like abusive or violent or might be misinformation or something like that. The more traditional way of setting it up is just having kind of one model per modality, and so that's the system I'm more familiar with. But today, if I were setting this up from scratch or even just joining a company, I'd really look at multimodal first to see if there are wins there. But the system I'm proposing for the interview is the one I'm more familiar with, which is just unimodal. Right.
Golden Pheasant: Let's do this. Let's make the assumption that either we have an ensemble that essentially gives us multimodality functionality, so it kind of marries the two ideas, where you have one model that can do it all with all data types. Alternatively, you can spin out separate models. In fact, I would actually argue it might be worth going with your intuition here. Let's go with the traditional approach. Reason being, we already made the assumption that there might be more images that are harmful. So if we have one model, yet most of our content is going to be, let's say, images, we might want to scale up the processing for the images in this case while leaving the text one constant. But that said, before all the model assumptions are made, I would like you to talk about preprocessing. Like, given this data that we have, are there any preprocessing steps you might want to make in order to make this data available? Because my assumption is, regardless of the type of model we have, we still have to process the different data types to make them ready for the model, with a target standard input type, right?
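A minimal sketch of the per-modality setup being agreed on above, assuming hypothetical service names; the point is just that each media type hits its own model so the image/video path can be scaled independently of text:

```python
# Hypothetical routing layer for the "one model per modality" approach:
# each media type goes to its own moderation service.
def route_for_moderation(post: dict) -> str:
    if post.get("video_blob"):
        return "video-moderation-service"
    if post.get("image_blob"):
        return "image-moderation-service"
    return "text-moderation-service"

assert route_for_moderation({"text": "hello"}) == "text-moderation-service"
```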
The Legendary Avenger: Yeah. So there's a few pieces here. In terms of preprocessing, we can kind of split it up. So for text there's tokenization that you need to do. I can't spell tokenization. And so basically it's the process that NLP pipelines used to do, where you do stemming, removing stop words, trying to clean up capitalization, fix misspellings and things like that. That's the traditional thing you do; nowadays there are libraries that basically break words up into tokens and then, I want to say, just assign an ID to each of the tokens, and then that's the sequence of features that are fed into an NLP model. I think a popular one is tiktoken, which will actually do the tokenization of the text. And then for images and video, some of the things that are really important for preprocessing... I think the first one is making sure that the dimensions of the images are correctly sized, because if there are images that are bigger or smaller, trying to fit all those things into the model definitely will cause some issues, and so you want to scale the image. I haven't worked with video much, but I can say for images you definitely want to scale them. And then also, if we are working with a very low percentage of harmful content, sometimes it can help to upsample these images by doing things like adding a rotation, or, what's it called, changing the hue, saturation, brightness, changing HSL values. So brightening an image, darkening an image, or changing the contrast of the image in terms of preprocessing, to kind of get more samples in.
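A minimal sketch of the two preprocessing paths just described, assuming tiktoken for text and torchvision for the image side; the encoding name, target size, and augmentation parameters are illustrative assumptions, not the one right choice:

```python
import tiktoken                       # pip install tiktoken
from PIL import Image
from torchvision import transforms    # pip install torchvision

# Text path: turn the post body into a sequence of integer token IDs.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("example post text")          # list of ints fed to the text model

# Image path: resize to a fixed input size, plus light augmentation
# (rotation / brightness / contrast / saturation) to upsample the rare
# harmful class, as discussed above.
image_pipeline = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
])
tensor = image_pipeline(Image.new("RGB", (640, 480)))   # stand-in for an uploaded image
```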
Golden Pheasant: Actually, I have two questions, and it's totally okay if you're not familiar with any of this, but this is actually a very interesting point, especially in the context of Reddit. So when it comes to tokenization here, there are multiple strategies we can use; you don't have to cite any specific tokenization strategies. But in terms of token size, how long do you think each token should be? And for reference, an n-gram is usually intra-word; it can be two or three characters. You can also have bag of words, which in that regard looks at entire words. Or you can even have multi-word tokenization strategies, where it looks at samples of phrases or something like that and maybe looks at overlap. So in terms of tokenization, which strategy do you think would make most sense in our case, and why?
The Legendary Avenger: Good question.
Golden Pheasant: I think.
The Legendary Avenger: So, like, single-word tokenization sort of seems to make sense. I mean, you can't always tokenize a whole word. I believe there's something like, you pick eight characters and that should be enough to encompass most words, and then some words get chunked into two tokens. I think that's at least what I'm kind of familiar with in the space, I guess. What are the trade-offs there? Well, there's definitely a trade-off between the more unique tokens you... sorry, there's a plane.
Golden Pheasant: I live next to the Boeing HQ, so trust me, I hear that a lot.
The Legendary Avenger: Yeah. So I think there's a trade-off with the number of token values that your vocabulary has. So you have to be careful. If you only do one character, you end up with, if you use ASCII, like 256 values. But I don't think the models are as good at being able to pull out context on a per-character basis. Now, if you do a bigger token size, then you'll have a lot more unique values. I think up to a certain extent models will do pretty well on attention. It's just that now you're feeding in larger... I guess you kind of blow up the size of your embedding table a little bit. So you have to kind of watch out for the memory of your system.
Golden Pheasant: Excellent. That's exactly what I was targeting right there. And it's really good that you brought in context, because abuse is in context; dirty words always have the same characters as praise. So in that regard, I think you're right. I think I would also lean towards single-word, probably towards multi-word tokenization, especially since nowadays a lot of abuse will lean towards three or four words. Like, people will combine multiple words. A single word that is sometimes abusive might not necessarily be abusive. Let's say I'm describing the word stupid; in itself, it might not be abusive, but if you're calling a person stupid, then that is actually an insult. So I think multi-word tokenization, though you can also justify single-word tokenization in that regard. And then there was something you mentioned that made me think a bit. So you mentioned that we need to be mindful of performance, but that's where maybe I might question it a bit. Because if we have, let's say, single-word or multi-word tokenization, then to a degree we are reducing the total cardinality, the total number of tokens generated from any particular text input, which could actually end up helping the performance compared to, let's say, doing an n-gram tokenization where we are looking at individual characters. So do you think, let's say, this approach would actually end up meaning something in terms of performance? What's your intuition on that?
The Legendary Avenger: Okay, basically, just to reiterate what you just said, the idea is, as you choose multi-word tokens, that should result in, what do you call it, a decrease in the sequence length, but also an increase in the, kind of like, embedding table size. Right. So I guess in terms of performance, I'm not actually 100% sure on what would actually end up causing the most issues. I think my feeling is that at training time, having a larger embedding table can be more memory intensive, but at inference time, your context, like the sequence length that you actually need, should be shorter, and so the inference should be quicker because you're just doing less matrix multiplication at inference time. That's my intuition on it.
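A toy illustration of the trade-off being discussed, using a made-up three-sentence corpus: coarser tokens shorten the sequence fed to the model per post, but enlarge the vocabulary (and hence the embedding table held in memory):

```python
# Coarser tokens -> shorter sequences per post, larger vocabulary.
corpus = ["you are so stupid", "great post thanks", "total misinformation"]

char_vocab = sorted({c for s in corpus for c in s})
word_vocab = sorted({w for s in corpus for w in s.split()})

sample = "you are so stupid"
char_seq_len = len(sample)           # 17 tokens at character level
word_seq_len = len(sample.split())   # 4 tokens at word level

print(len(char_vocab), char_seq_len)   # small vocab, long sequences
print(len(word_vocab), word_seq_len)   # larger vocab, short sequences
```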
Golden Pheasant: I think that kind of confuses me, though, because technically, with the reduced number of token sizes, with the reduced number of tokens, we should technically also have a reduced embedding table size because the size of the embeddings themselves, actually the number of embeddings should stay roughly the same. The size of the embeddings is what will vary.
The Legendary Avenger: Right, right, yeah.
Golden Pheasant: Just a bit further: in terms of sparsity, do you think there are any implications with, let's say, multi-word tokenization compared to, let's say, n-gram tokenization?
The Legendary Avenger: In terms of sparsity? Yeah, I mean, this is just a guess because I don't actually know, but I'm guessing with n-grams there's probably a higher sparsity of tokenization. Hello? Did I lose you? Hello? All right. Hello? I hear something.
Golden Pheasant: Oh, can you hear me now? Yes. I don't know what happened there. Yeah, sorry.
The Legendary Avenger: I was saying you are right.
Golden Pheasant: Your intuition is very correct on that, because objectively, if we have n-grams, we have a huge embedding table because there are too many unique values. And so for it to make sense, we have to have huge embeddings, in this case probably more than the 1,036 we have with tiktoken. So in that regard, most of the values will be zeros, and so we'll end up with too many zeros and too little information. But with multi-word tokenization, we get smaller embeddings which will mostly have values, because in this case, essentially, with the same size of input, we have fewer boxes to fill. So not only will they be more useful, because they encapsulate context too, it also ends up helping with performance in that regard. But anyway, sorry, we spent a bit too much time on that, so I think we can move on. Let's maybe talk about the API. So given what we have here, talk to me about some APIs you might want to design in order to serve this data.
The Legendary Avenger: Yeah, let's see. So I could see a few different systems here. So the idea is that we kind of said we're going to do single-modal inference, and inference in this case could be done as, like, a batch inference or online. To me they kind of both work, because technically it's an async process, but also this data is kind of coming in as a stream. So to me, I think it's just easier to model it as an online inference system, and it's a little more future-proof: if you do need it to be more real time, it will still be there, and you don't have to migrate from batch to real time in that case. So there's basically going to be three services: one for image, one for text, and one for video. They'll all have the same APIs. So maybe it's like moderate. So you could say service/moderate, and then the request would be a post request. And if we're thinking about what kind of data we want for the body, we should have features, and then we'll pass in a list of features. The services should be doing the same preprocessing that we're doing to generate the data and train the model with it. I think these features, in reality, the way this would kind of work is you use, like, feature IDs and then work through some kind of a feature store. So this is actually something I didn't cover, but I feel like we should talk about: it's not just image, text and video, but user profiles might also be something that we want to consider, at least in the future. So it's not just show me an image, it's show me an image by a particular user. And then maybe that helps with identifying things like misinformation. Like, if there's a post and that user has been reported a number of times, maybe that actually positively impacts... but maybe that's... sorry, I'm kind of thinking out loud here and I'm a little disorganized. I think having the three services here, and then another service, a fourth one, for just text misinformation makes sense. The first three are a lot easier to moderate, and then with misinformation it's a little more subjective, and the model may end up having features in it that the text model doesn't need to include. So that's why I'm kind of building this out as a separate service here. So for this API, you'll pass in a list of features. Maybe it's as simple as just saying, like, a dictionary, and then we can say the text for texts, right, one, two, three. Or, same with images, it would just be the blob, or video, so the blob and the bytes. And then same thing with the video, right? So that's just kind of the API there. And then the response is going to be a list of labels and confidences. So you'd say results, and then we'll have a label, and we may want to say this is the abuse label, and then we can give a confidence, and this will be a float between zero and one. So this could be, like, 0.93. And then we can have other labels here as well, so we'd have labels for nudity, violence, and then, yeah, these would be the labels. Yeah. So that's basically the APIs of the service. In terms of the system as a whole, the thing I want to touch on is there's training metrics and then there's online metrics, and how are we doing? So for online metrics, like actual production metrics, we want to cover things like the number of, we'll call it harmful impressions. So that's just the number of people that have seen these things, and then, like, number of reports.
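For concreteness, a hedged sketch of what the request and response bodies for the moderation endpoint described above might look like; the field names (including the optional profile-style feature) are assumptions, not a fixed contract:

```python
# Illustrative payload shapes for a POST to <service>/moderate.
request_body = {
    "post_id": "t3_abc123",
    "features": {
        "text": "example post body",   # or an "image" / "video" blob for those services
        "author_report_count": 3,       # optional user-profile-style feature (assumed)
    },
}

response_body = {
    "results": [
        {"label": "abuse",          "confidence": 0.93},
        {"label": "nudity",         "confidence": 0.02},
        {"label": "violence",       "confidence": 0.01},
        {"label": "misinformation", "confidence": 0.11},
    ]
}
```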
Golden Pheasant: Nice.
The Legendary Avenger: And maybe even... so there's impressions, and then also maybe a time-based one, like average time to takedown. So when something is posted reactively and we need to take it down, we want to be able to take it down quicker, and hopefully the models will facilitate
Golden Pheasant: That takedown time.
The Legendary Avenger: In terms of actual training metrics, some of the things that we want to consider here: when we're training on image, text or video, we're doing a...
Golden Pheasant: Multi-class classification.
The Legendary Avenger: Which means that we'll be doing a... I get them mixed up all the time.
Golden Pheasant: Precision and recall.
The Legendary Avenger: Oh, sorry. Yeah, so we care about precision and recall. And so precision is basically the number of true positives divided by the true positives plus the false positives, and recall is true positives divided by true positives plus the false negatives. In terms of how we want to actually present this, I think just presenting precision and recall is fine; we don't have to do an F1 score or a receiver operating characteristic curve, just leaving it as precision and recall works. Any questions on this piece?
Golden Pheasant: No question on that. Maybe just a quick suggestion, because I was literally typing feedback on this bit, and I totally relate to you on one thing: I absolutely hate these two terms because I rarely remember what exactly they are. And so my fallback is literally saying, I'm just going to generate a confusion matrix, because it allows me to interpret the same metrics the same way. Like, I'm better with the intuition rather than the terminologies. But I totally hear you on this. Makes sense.
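A small sketch of the precision/recall definitions above, written in terms of confusion-matrix counts so the terms don't have to be memorized; the example counts are made up:

```python
# Precision / recall for one label, from confusion-matrix counts.
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp)   # of everything we flagged, how much was truly harmful
    recall = tp / (tp + fn)      # of all truly harmful posts, how much we caught
    return precision, recall

# e.g. 90 true positives, 10 false positives, 30 missed harmful posts
print(precision_recall(90, 10, 30))   # (0.9, 0.75)
```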
The Legendary Avenger: Yeah. That makes sense. Okay, so that's that for training metrics.
Golden Pheasant: Actually, sorry, quick question, objectively speaking, because given these training metrics, can you define to me the objective of the model in terms of the two metrics?
The Legendary Avenger: Yeah, you're talking about, like, the loss function?
Golden Pheasant: Not necessarily the loss function. We can speak on that later, given we have a multi-class classification problem here. But I'm interested in what the goal is. Like, do you want to minimize precision while maximizing recall, or do you want to maximize both? What's the goal, particularly in our case? Yeah.
The Legendary Avenger: So this is actually interesting because it depends on the business objectives and time. So I've actually seen that at my company, where for a while we basically said we were okay with being very strict in terms of, I want to say, false negatives. And then the economic downturn kind of happened and we said, hey, we want fewer false positives because we want people to actually spend on our platform. And so we were actually willing to trade off recall for precision, or it might have been the other way around, for business metrics. But if you're trying to juice engagement, then you want to be careful with... okay, so a false negative in this case would be classifying something as harmful when it's not. And so if you're trying to juice engagement metrics, having a very high precision and low recall is good, because if you have high precision, low recall, what we're basically saying is we're okay with harmful content making it past the model, because we don't want to block good content. Makes sense.
Golden Pheasant: So in this case, essentially what you're saying is we are okay with people being too sensitive rather than being less sensitive and us missing actual abusive content. Like, it's okay if they report something and we find out it's not harmful and restore it, rather than the opposite way, where they fail to report or we fail to flag. Excellent. So we are basically making our thresholds very low for what a flag for abuse should be. But then of course we have the human in the loop, so we'll always have a restoration process for those.
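A tiny sketch of the operating-point idea just restated, assuming a hypothetical flagging threshold: lowering it flags more content (higher recall on harmful posts, lower precision), with the human-review loop restoring the false flags:

```python
# Low threshold -> aggressive flagging; humans restore wrongly flagged posts.
def decide(confidence: float, threshold: float = 0.3) -> str:
    return "flag_for_review" if confidence >= threshold else "allow"

scores = [0.95, 0.42, 0.31, 0.12]
print([decide(s) for s in scores])
# ['flag_for_review', 'flag_for_review', 'flag_for_review', 'allow']
```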
The Legendary Avenger: I think that's an excellent approach, and...
Golden Pheasant: I was actually very happy with that, because I was going to grill you after that on that point, like, would you want it to be the opposite and why? But I think you immediately went to it, which was excellent. And then, final question at least...
The Legendary Avenger: Sorry, I was going to sort of make a joke, which is, this is until you have to advertise on the platform and think about brand; then you have to change your strategy.
Golden Pheasant: Actually, if you think about it, though, that strategy can actually help, because if we are worried about advertisement... unless we are worried about the rate of reporting on adverts, because we know they will always annoy people, and then they'll report them. But I think advertisers will probably be happy with a platform that's a bit more strict. Because look at, let's say, Twitter, right? Twitter is not strict at all right now on abuse, and advertisers pulled out because they don't want the possibility of being mixed in with that.
The Legendary Avenger: Right.
Golden Pheasant: So to a degree, that strategy will actually help with Ads. What do you think?
The Legendary Avenger: The opposite strategy. Right? Because what we're saying is we'll let some harmful content through, and because we don't want to block good content accidentally, and then if we want to be brand safe, we'll actually start blocking more harmful content at the expense of blocking some good ones every once in a while. So we're more brand safe.
Golden Pheasant: Exactly.
The Legendary Avenger: Yeah.
Golden Pheasant: Makes sense. Okay, we're in line on that. Okay. And then, let's see. I had a final question with respect to metrics, and then the next section I wanted you to focus on, at least before I give feedback, is just scaling. So just pure architecture talk. I was thinking about how you're going to scale it up. But before we do that, now that we have our objective function, our objective in this case, can you talk to me about the loss function? Like, if you were going to choose a loss function here for your model, and this is the most technical I'll get about the model, what loss function do you think would make most sense?
The Legendary Avenger: Yeah, the loss function... I'm blanking. So it's a multi-class classification, and I know that you apply... well, you're supposed to apply a softmax at the very end. And then what you do is you compute the loss. The loss in this case is, um, the only ones I'm thinking of, well, there's two: mean squared error or, like, maximum likelihood. I think maximum likelihood sounds like it'd be the right loss function. But I'm going to have a hard time with this section because I definitely need to study up on my loss functions. All right.
Golden Pheasant: Are you familiar with cross entropy?
The Legendary Avenger: Somewhat familiar with cross entropy. I know the word. I don't know what it represents at the moment.
Golden Pheasant: Yeah, so the reason I brought that up is because, at least from my understanding, and it's been a while since I had to meddle around with maximum likelihood, I remember it as a way of estimating the parameters of my model, but not necessarily as a loss function. It's actually used more for tuning the parameters. But the loss functions tend to be things like cross-entropy. There's binary cross-entropy when you have two labels, and multi-class cross-entropy when you have multiple labels. There are things like hinge loss, maybe focal loss, and a couple of other examples; I cannot think of all of them off the top of my head. But honestly, and this is just a quick tip, cross-entropy will almost always be a good answer when you're asked about the loss function. The only difference might be that you need to showcase the intuition on the variants, so things like multi-class or binary, as well as understanding how it penalizes false predictions. Now, the downside to it, though, is that it's excellent for supervised problems where we have labeled input. In our case that would work well, but for unsupervised problems, that becomes a bit of a problem. And this is where, if you think of, let's say, pre-clustering in order to get some sense of labels, that might actually be helpful.
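A minimal worked example of softmax plus multi-class cross-entropy along the lines described above; the logits and the label ordering are made up:

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_index):
    probs = softmax(logits)
    return -math.log(probs[true_index])      # small when the true class gets high probability

logits = [2.0, 0.5, -1.0, 0.0]               # scores for [abuse, nudity, violence, misinformation]
print(cross_entropy(logits, true_index=0))   # confident and correct -> low loss
print(cross_entropy(logits, true_index=2))   # confident and wrong   -> high loss
```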
The Legendary Avenger: Okay.
Golden Pheasant: Does that make sense?
The Legendary Avenger: Yeah.
Golden Pheasant: All right. And then final bit of probing in this case. So talk to me about deployment. So say we've set this up. How are you thinking of, let's say, scaling the system up, if need be? And are you going to have a multi service architecture or monolith? What's your intuition on it?
The Legendary Avenger: Yeah, so there's a few things here. The first point is, for my initial launch, I'd want the models to be small enough to fit on a single GPU for inference, because if I have to do a distributed GPU thing, that just adds more complexity to the system. So single-GPU inference, that would be nice, and we basically put a load balancer in front of a number of these instances. I think each of the major frameworks has its own serving framework, so PyTorch and TensorFlow have their own serving infrastructure; I would just run it on top of that. And then, in terms of a couple of other things that are important: I'd want to make sure that I'm logging all of the features at request time. That's one thing I want to do to minimize the amount of potential data leakage when rolling that data into the new data set. And then also we want some metrics and monitoring to watch for data set distribution shifts. So things like anomaly detection, just to figure out, hey, are we getting a bunch of nulls for a particular feature? In this case it's text, images and videos, but if we're doing something with misinformation where we care about the user profile, you do care about that kind of data distribution shift, and you also want to look for your label shifts. It could just be that all of a sudden, one day, everyone's posting a bunch of crypto nonsense and you're getting a lot more abuse on the platform that day. So you do need to be able to alert and say, hey, our models are kind of going off the rails here. In terms of the system that I would actually set up, for scaling I would just do a load balancer with a cluster of machines, and as CPU or GPU utilization reaches a threshold, add more machines to the cluster and autoscale. The thing I would want to do, though, is on any new model rollout, do an A/B test just to see how the model is performing online versus the existing baseline before just blindly deploying it out to production.
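A toy sketch of the drift monitoring just described, assuming hypothetical baseline statistics recorded at training time and a simple alert threshold; real systems would use proper statistical tests and per-window aggregation:

```python
# Alert when a feature's null rate or a label's share moves too far
# from the training-time baseline. Values below are illustrative.
BASELINE = {"text_null_rate": 0.01, "abuse_label_share": 0.05}
ALERT_DELTA = 0.03

def check_drift(window_stats: dict) -> list[str]:
    alerts = []
    for key, baseline in BASELINE.items():
        observed = window_stats.get(key, 0.0)
        if abs(observed - baseline) > ALERT_DELTA:
            alerts.append(f"{key}: baseline {baseline:.2f}, observed {observed:.2f}")
    return alerts

print(check_drift({"text_null_rate": 0.08, "abuse_label_share": 0.06}))
# ['text_null_rate: baseline 0.01, observed 0.08']
```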
Golden Pheasant: Excellent. I think that certainly makes sense. And just a quick point, in addition to what you've just mentioned: there are also other benefits to think about. Given that our service, and maybe you should have clarified this initially, ought to be global, abuse in the US might not necessarily be abuse in some other place. And in a similar regard, we have inputs that might be different, that might completely distort the model. We want to stratify the models by geographic features, like language: what might happen in China versus what might happen in Russia. The languages are completely different, the input is different. So we probably want different flavors of the models. And so your approach here, where we have mini models, or microservices essentially, and we distribute them depending on the data they serve, will end up being a very useful approach for this purpose too, right?
The Legendary Avenger: Yeah, having regional support.
Golden Pheasant: Exactly. Okay, we have about five minutes left. I will add a couple more to make sure we are going through feedback exhaustively. But for context, I never give feedback straight up; I usually want to first get your intuition. So, looking back at the interview, what do you think you did well, and what do you think you could improve on?
The Legendary Avenger: I think I did all right with the initial part, the sort of requirements gathering. I definitely need to go deeper. Well, I got stuck on the tokenization piece, so I think there needs to be just a little bit more focus on some of the machine learning aspects of things, like what are some of the more common preprocessing steps, tokenization being one of them. Looking at the different scalings and things like that, like doing log transformations when your data distribution isn't exactly Gaussian. Things like that I need to just refresh on. I thought the API endpoint part was fine. Yeah, I don't know if there's anything really glaring there. Ultimately, I feel like... I'm not sure I called it out in the beginning, but we really should be doing a multimodal system that's just more intelligent. I have a feeling that would actually perform better. I'm glad I called that out and then went with the system that I'm a little bit more familiar with, so I could actually go deeper on each of these things. I'm not actually sure if it's better to do it that way in a real interview, or if, by going this way, I'm giving the interviewer an opportunity to dig in: let's talk about image-related models, let's talk about text-related models. Versus, potentially, if I'm saying let's build some multimodal system and the interviewer doesn't know how to, or isn't familiar with, multimodal systems, then I'm not sure if they're getting signal. So I'm not sure if that's the right path to take there. I need to just do better on my loss function stuff; I just kind of didn't get that piece. And then, other than that, I think everything else went all right. So that's sort of how I feel. There are some ups and downs. Definitely not, like, the perfect interview.
The Legendary Avenger: Here's the thing: I rarely find a person who signed up for mentorship but then at the end I'm telling them I will probably give you a pass. So you actually did pretty darn well. I still took a ton of notes, as I said. So to start with, keeping it short, I frankly think you're almost there. If I were to give you feedback in case you want to sign up for more sessions, you probably need at most five, if not three, just to kind of refine the process. Because in terms of content, you're probably one of the few people who've been able to keep up with the line of questioning I've asked, which was excellent. So just to go through the feedback, and mind you, I'll paste all of this in, in addition to some extra resources you can look at at the end. For the most part I agree with you, but objectively, communication: excellent. Like, you are really easy to communicate with, easy to follow, even throwing in jokes on occasion in the interview, which keeps it lighthearted. So you really did well with that. And then the line of questioning with respect to background processes: I really liked that, because most people fail to think about it. They think they have to design Reddit. In fact, one of the biggest problems I've had with candidates is, you ask them to design a harmful content detection process, but they don't think about the background processes that exist. So you started with that. Make it a point in any interview: nobody is asking you to design the actual platform; you are designing a new system in addition to the platform that's already there. So feel free to denote exactly what you think already exists. And maybe this is where one bit of improvement we can have here is actually detailing those bits. So if we expect, let's say, Reddit to give us APIs that can extract text, or alternatively, if we expect that we have a push model where, if a post is made, our model automatically consumes that post, or has that post pushed into, let's say, some message queue where we can consume it and process it, it will be good to talk about it from that stage. And in fact, for most machine learning interviews, I think that's probably the best stage to start at: just thinking about what's my input, what's the current platform giving me in terms of input. Of course, if it's not a platform-based system, it could be different, but typically the data source is always the first stage. But it was really good to hear you actually talk about that, because it showed that you had the intuition to think about that bit. And then on line 85, I mentioned the preprocessing bit. I felt as though you were not thinking about it until I pushed you in that direction. I had to actually ask you, okay, how are you going to standardize this? How are you going to preprocess it? But once I asked you, you went in depth on tokenization, preprocessing, removal of stopwords. It is key to think about that, because it's part of the ML system design; in fact, sometimes you may find that's the stage that requires the most system design work, because these are heavy processes. Like, you can imagine when you have a bunch of images, or even, let's say, videos where you have to do frame sampling, then that's computationally intensive. And it's usually worthwhile thinking about how that setup is going to look, and about those features once you've extracted them, because this is basically the feature engineering part. Once you've extracted them, where are you going to store them? Because you don't want to process all of that again.
And then all that data is basically output into some log file; you want to actually think about the data output. In that case, talk about, let's say, Redis, or whatever tiered data storage solution you have, where the model can consume from in a quick manner. So it might be worth talking about API endpoints there, maybe talk about them being backed by, let's say, I don't know, Redis or some other system. So that's stage one of the system design. And then I gave you a breakdown; you can clearly see on line 82 how I typically do it. It's kind of worked well for me. Maybe it can inspire you to curate a process for yourself. So typically it's usually requirements, preprocessing, and then defining my APIs, both the input and output, before I even jump into the model stuff. And then I think we delved deep here. Yeah, I actually made the point on that. You went deep into it, so I didn't really have any line of questioning on that.
The Legendary Avenger: So that was good. So for the tokenization bit, I think we kind of covered much of that conversation in the interview, but refresh on that. That way, at least, you have a stronger sense of the intuition on the performance implications of short versus long tokens. So clearly that's probably the weakest point of the interview, because I felt like we were going through it kind of like throwing something at the wall and seeing if it sticks. But for the most part, it was good that you even knew of the different tokenization strategies; you even went to the level of giving me examples which I could identify, so I really appreciate that. And then, let's see. I think this is the same point I've made there, where preprocessing is always step one. So it was good that we talked about that. And then in terms of output, I felt like once you were done with the core APIs, the rest of the stages were perfect. You talked about the output you expect. In fact, the funny thing is, when you said the API part was okay, to me it was like, this is probably closer to what I would want to see in an interview, where somebody actually gives me example payloads. I always ask for example payloads; most people won't give that. So this was actually really good. So these are nits: I would say maybe include the expected protocol, though for machine learning system design I don't think I'd push anybody down to that level. But the reason why I might want to converse about that is, sometimes it's worth thinking about, let's say, the concurrency or the number of requests I expect. So if this API is going to expect a lot of, let's say, GET requests: GET requests tend to be more performant than, let's say, POST requests. So when you're talking about that, especially in this case, we know there will be a lot of text requests coming in. Like, as soon as a post is made, a request is made. So there should be a ton of them, especially since the sheer volume is going to be a lot, but most of the payloads will be tiny. So in that case, we probably want to have lightweight POST requests. In this case, maybe the downside might be, since most of the requests will correspond to a unique thread, then maybe POST might... is that actually good? Is it POST or PUT, in terms of method? Because I want to create a post.
Golden Pheasant: Is kind of well, there's two ways of thinking about it's, like rest and then there's just like JSON RPC. So if you're thinking about it in terms of like rest, a post request is more tend to create something. And a put is like an update on the full entity. But people who like gRPC is all post requests, even if you want to get something.
The Legendary Avenger: I see. Which one is the one that has the idempotency? I think it's POST or PUT, in terms of, like, session idempotency. I can't remember which one.
Golden Pheasant: I would assume, like, they're both supposed to be idempotent, I think. Yeah, I don't think it should matter. For instance, if you're doing a payment endpoint, you want to POST to create the payment, but that's idempotent. Like, if you try and call it twice, it needs to have the same result both times.
The Legendary Avenger: Yeah, actually, I do believe PUT is idempotent while POST is not. And this is actually the key bit right there. So if we made a PUT request, especially since we know most requests will not be updating, they'll be sending this one request: we process, we're done. And that's where maybe you can justify POST, because I don't think we worry too much about reprocessing the same thread. But the downside might be, if somebody decides to spam our system, then once they start sending threads, and maybe they know we are preprocessing them for abuse or whatever, they just send, like, 20 or 30 threads. If you are processing them using POST, each and every thread will be processed. But if you are using a PUT type of request, then once they've done it, we kind of might monitor the session. Although if it's, let's say, along the same request line, then we could have a conversation on that. But you kind of see why, depending on user behavior, the method might be worth talking about. But other than that, most of the content was perfect. In fact, I really appreciated that you thought of the multilabel approach without prompting. I've honestly not had anybody immediately think of that. So you've done a good job with that, and thought of how you're going to present them in terms of multiple labels. And then I think I break this down for you in order to give a sense of the sequence of steps. For the most part, I think we already touched on the precision versus recall bit. Honestly, I'm not going to grill you on that, because I know I struggle with it too. So the only suggestion I had is to just use a confusion matrix; that way you avoid having to talk about the terms themselves. But that said, it always looks good if you immediately talk about it without even defining it, because it kind of gives the interviewer the impression that "I work with these metrics regularly, I know what they mean, so I don't need to think too heavily about what they mean." So essentially just review the resources to kind of have a sense of what high precision and low precision mean. And then, maybe I forgot to mention this, but the objective, the objective function or the objective of the modeling altogether, might be something we want to highlight in the initial stages. It's totally understandable to talk about it even at the end; I felt like it gave the interview a more natural flow. But in the end we want that optimization in terms of process. So maybe try and touch on it right at the get-go; that way it's not something that you might miss out on. Let's say the interview is kind of hard and you're stuck on some steps: you want to just knock it out of the way because it's easy to do and then keep going. Yeah, does that make sense? For sure. But other than that, as I said, I will give you a pass. I think you actually really did a good job. I could clearly pick up on the knowledge. So if you have interviews tomorrow, I honestly will be confident you'll do well. It's more or less about refining the process. So still put in a bit more work and try and get the process refined. And a final bit: maybe spend a bit less time on the requirements. But I was okay with this because I felt like you were talking about design as you went through it, and that's why I think that was time well spent.
But in case the interview is proving to be a bit more technical, just a bit more crazy, because you'll find some interviewers who are generally hard to work with, just try and maybe time box yourself there to at most seven minutes or so.
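A small sketch of the PUT-versus-POST idempotency point made in the feedback above, assuming a hypothetical in-memory cache keyed by post ID, so resubmitting the same thread returns the cached result instead of triggering reprocessing (PUT-style semantics), whereas a plain POST would score it every time:

```python
# Idempotent handling of repeat moderation requests for the same thread.
processed: dict[str, dict] = {}   # post_id -> cached moderation result

def moderate_put(post_id: str, text: str) -> dict:
    if post_id in processed:                          # same call, same result, no extra work
        return processed[post_id]
    result = {"label": "abuse", "confidence": 0.93}   # stand-in for real model inference
    processed[post_id] = result
    return result

first = moderate_put("t3_abc123", "some post")
again = moderate_put("t3_abc123", "some post")
assert first is again   # the second call did not reprocess the thread
```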
Golden Pheasant: Okay. All right, cool. That makes sense.
The Legendary Avenger: Awesome.
Golden Pheasant: All right, any other questions? Anything else I can clarify?
The Legendary Avenger: Nothing else today?
Golden Pheasant: Thank you so much.
The Legendary Avenger: Absolutely.
Golden Pheasant: I enjoyed this. I hope you did, too. And it was nice talking to you.
The Legendary Avenger: Yeah, this was really great. Really appreciate you taking the time today.
Golden Pheasant: Absolutely. All right. Thank you.
The Legendary Avenger: All right. Thank you. Bye.
