Interview Transcript
Red Maelstrom: Before we start, can you tell me what level you're applying for at Meta?
The Legendary Avenger: Yeah, so I have an onsite scheduled on Monday and Tuesday, which will include a system design round. I know the target level is E5, but if you can overperform, it's always helpful. I've scheduled a series of mock interviews, one focusing on E6/E5 and two focusing on E5, and this is the first of them. So I want to get a sense of where I am and how I can improve. I've also had some conversations before, and I got the sense that, while technically I might have a good handle on how to design systems at scale, one of the key things I was encouraged to focus on was user-oriented design. So I tried to focus on that in my previous practice. But overall, I want to keep it neutral: give me a real interview experience, and then give me feedback on ways I can improve.
Red Maelstrom: Sure.
The Legendary Avenger: Cool.
Red Maelstrom: Let's start with the system design problem then. So, yeah, this is the problem. Let's say that you have a food delivery app, and on this app you want to run a marketing campaign. So during this marketing campaign, you would like to distribute 6 million burgers in ten minutes. Another key requirement is that no one should be promised more than one free burger.
The Legendary Avenger: I see. Okay, full context, by the way: my interview is for an infra role, so I'm going to try to take that perspective. All right, so let's do a quick recap here. We have a food delivery app, the likes of DoorDash, and the goal for us is specifically to design the marketing campaign. So this is the key feature, and I need to distribute 6 million burgers. So that's 6M burgers, and obviously we'll need to track distribution in aggregate. We also have a time limit, which is ten minutes. And no one should be promised more than one free burger, so each user gets at most one free burger. Right. So I'll start with some key assumptions. Can I assume that each user has a unique account? Can I assume that on the back end we have some verification means to make sure that each account belongs to one user?
Red Maelstrom: That's a good question. Yeah, you can do that. All right.
The Legendary Avenger: Each user equals one account. Okay, that's good. And can I also assume a lot of the infrastructure... will you require me to design the infrastructure needed to track the burgers, or is it okay to assume that it already exists?
Red Maelstrom: Okay, no. You're asking about delivery, right? You don't need to worry about actual delivery and tracking.
The Legendary Avenger: Okay, good note. So the app already exists, I'll assume; we have an app like DoorDash, and this is just a feature within that app. So what we're designing here is a feature within an app, and we don't have to worry about delivery. In terms of APIs that I have access to, I'm just going to flesh out the ones I need or assume I need, and then you can tell me whether it makes sense to assume those exist or whether it's my job to design them. So if I was to assign a burger in this regard, I'm assuming we're actually going to make the offer available for people to use. I'm imagining something like an email campaign, or, if you're signed in on the app, maybe a banner. So let's actually start discussing features. I'm assuming it will be a banner that just pops up in the app. That's how I imagine it looking. And I'm also thinking we'll have users either accepting the offer or not, but in our case we need to support a traffic estimate of about 6 million burgers within ten minutes. And is it going to run continuously for a while, or how long is this campaign expected to run? Just ten minutes and then that's it, or how do you expect it to work?
Red Maelstrom: Yeah, it's ten minutes and that's it. We might want to have a platform which helps us run multiple campaigns, but an individual campaign will run for ten minutes, and the platform can be a stretch goal. The key capability we're expecting is running a single campaign.
The Legendary Avenger: All right. And just for purposes of user experience here, can I assume that our users will have been notified that this campaign will be running at a certain time? The reason I'm asking is that we obviously need to anticipate traffic, right?
Red Maelstrom: Yeah, if it's 6 million, that's a good suggestion. We'd like to notify users.
The Legendary Avenger: All right. So while this is a stretch goal, I'll assume that we preemptively notified users of the campaign, and as such, we can expect about 10x traffic. It could even be more, but let's say for each person that gets a burger, there are about nine more that don't. So we can assume maybe we have 100 million active users.
Red Maelstrom: No, we are not expecting that much. Let's assume that there are 20 million people we are expecting.
The Legendary Avenger: Okay, fair enough. I was going with a higher ratio there because I usually think of these things as lotteries, but fair enough. So, 20M active users at campaign time, of which 6 million will get burgers. And then, is there a strategy we're using for the burger assignment? Is it randomly picked, or is there a strategy we're going to leverage in this case?
Red Maelstrom: That's a good question. I think random is a fine option, but we would like to go with a strategy like first come, first served. We don't want it to apply strictly, but we want to provide the sense that it's first come, first served.
The Legendary Avenger: Makes sense. Okay, so in order to increase the hype here: FIFO burger assignment. All right, this makes sense. Let me do a quick recap. I'll assume we have this app, and a good number of people have signed in because we've announced this campaign or giveaway. So we have 20 million people active and waiting. And can I assume that you already have a scalable payment system, or should I worry about designing the payment system, or can I assume it exists?
Red Maelstrom: No, it's already there.
The Legendary Avenger: Noted. Okay. And in that regard, as soon as the user accepts the offer, are they expected to make the payment within the ten minutes, or can they accept it and process it after? Should all processing happen within those ten minutes?
Red Maelstrom: No, it can happen after ten minutes also.
The Legendary Avenger: All right, so payment and offer processing can happen after. To summarize: our goal really is to have about 6 million people accept the offer. So if it's a promo code, accepting the offer means processing the promo code. The way I'm envisioning this is that we provide this offer to users here. Actually, before I even jump in: is it a geographically limited offer, or can I assume this is a global thing?
Red Maelstrom: Yeah, that's a good question. Can you just clarify how it would matter for your design?
The Legendary Avenger: So in this regard, if I were designing a global system, that becomes an issue, because I'd probably have multiple data centers to track across, multiple languages, and so on. But if it's, say, US only, then I'm looking at a set of three or four data centers. It's not too much data when you're operating at that scale, so synchronization can happen fairly fast, and I can make choices even on the validation structure. For example, enforcing one free burger per user is easier if it's geographically confined. In that case I'm thinking of leveraging structures like Bloom filters or cuckoo filters to verify whether a user has actually already received a burger.
Red Maelstrom: Yeah, it's a single region.
The Legendary Avenger: So, geographically confined; for now, say US only. So this is fleshing out to be pretty much DoorDash, and we can assume we're working for Burger King or some other company. So these will be our functional requirements; this is what we expect to do. Now, for the non-functional requirements, let's break this down a bit. Clearly it's an eventual consistency model, by which I mean we will assign these offers and then we need to validate them. Now, what in particular is eventually consistent here? Let me take a step back. What's eventually consistent is the number of assigned codes. We would not expect everything to register within, say, the first minute or so; we just need to make sure it's all processed within the ten-minute SLA. So it's fairly close to instantaneous and within a fixed duration, but it's a ten-minute SLA. Availability is clearly a high priority, so: highly available. We'll also want to minimize latency, because we have a bit of flex with our consistency. So I'm imagining we'll end up leveraging queues; promos need to be processed as fast as possible, hence this model: low latency, high availability. And given that we don't want to frustrate our users, the idea is that as soon as I apply my promo code, it's accepted immediately; that way I at least get into the queue for processing and my spot isn't taken. That enhances the user experience, and by going with an eventual consistency model, we get the flexibility to process accepted promotions later. Those two choices give us that flexibility, and I think that's a fair set of technical selections. There are also things I could consider around cost optimization, but given that this is a feature in an app, it probably won't be too expensive. But I'm going to ask a couple of questions here. For the consistency model: I know you've said 6 million burgers, but in my experience, especially at this scale, it's never exactly 6 million. Do we have some flex with respect to how many we can assign? Can we go slightly over or stay slightly under? Or let's say somebody accepts...
Red Maelstrom: Slightly over is not acceptable. Slightly under is okay, all right.
The Legendary Avenger: Makes sense. So, slightly under is acceptable. The reason I'm asking is that I'll build a model here under the assumption that, say, about 80% of users will accept the burger, so we'll have a reserve we can assign a bit later. I'm curious how we handle that, and we can discuss potential models for those who do not accept. I think if slightly under is okay, we can decide to assign exactly 6 million, and then if, say, 80% accept, that's fine; it assures us we're not going over. So: assign exactly 6M. To show why I'm thinking about this in terms of tracking: I'm thinking of a structure like a Bloom or cuckoo filter, which I mentioned earlier. The reason I like it is that it works at low latency and it assures us of one thing, which is one of my favorite properties: it gives a guaranteed true negative. If a user is not in the set, we know for sure they're not there; if the filter says they are in the set, there's some probability it's a false positive. What we need to know, for whatever user we're looking up, is that they have not received a burger before. So we can hash in every user we assign, and for new users we're trying to assign, use the filter as a validation to make sure we're not adding extra users. Following this model, we can also maintain a global count. So: a combination of Bloom filters and a centralized global count to track assigned, or rather accepted, promos. Let's call them accepted promos. I hope this is making sense so far; let me know if you have any suggestions or questions.
Red Maelstrom: I think so far so good. I would like to go into more detail on how the Bloom filter would be used.
The Legendary Avenger: All right, I'll make a note of that so I deep dive on it, especially on how we'd structure it. I'll mark it as a priority. All right. For this next section, I'm thinking we can do some quick estimates of the amount of traffic we expect. I feel it's fairly straightforward: it's 6 million burgers, maybe about 20 million requests at most within those ten minutes. So that's about 20 million divided by ten, about 2 million, that is 2 x 10^6 requests per minute. What just happened there? Something happened with my screen. Okay. Can you hear me?
Red Maelstrom: Yeah, I can hear you.
The Legendary Avenger: All right, so that's per minute. If I estimate it per second, that's divided by 60; let's round and scale down by a factor of 100, so roughly 2 x 10^4 requests per second. And assuming each request, or each promotion application, is a code, and that code is a short string, I'm assuming we're sending them some promo code, uniquely generated or something along those lines. A standard code, really. We assume the user applies it, which invokes a request to the back end; our back-end system enqueues this request from the user, and then, if it's successfully applied, we process it. So for all these requests, let's project how much server capacity we should expect with this spike. I'm assuming what we receive here is a promo code, and the request will probably contain the user ID, some other basic information to identify the user, maybe location, or even a burger type; there could be other meta information related to the food. We probably don't need to delve too deep into the schema for the food specification, so I'll just call that a food spec: some information about whatever food was specified. Can you imagine that?
Red Maelstrom: All right.
The Legendary Avenger: But you can imagine this just being a generic payload containing exactly which burgers they're interested in. For the most part, we only need the user and the promo code they're applying. At a high level, assuming each field is about ten bytes, it shouldn't be too big a payload; at most, say, 100 bytes or so. We can even go overboard and call it a 1 KB payload, assuming a lot of meta information. That's probably overkill unless we're including images or other data we need to transfer, but it's a good upper estimate. As such, we can expect 2 x 10^4 requests per second times 1 KB each. Let me verify my math: 2 x 10^4 requests per second at 1 KB per request is about 20,000 KB per second, and since 1 MB is 1,000 KB, that's about 20 MB per second. That's lower than I anticipated, but it sounds reasonable. Which means, honestly, we can spin up a server with a 1 Gbps link fairly cheaply, and that capacity should handle this campaign. So in terms of the sheer throughput capacity we need, it's not bad; this is actually a fairly cheap campaign. And given that it's about ten minutes, we could even spin it up an hour early and leave it running an hour afterwards to make sure everything is processed correctly. Similarly, since it runs for ten minutes, if we were to enqueue all of these requests, that's this rate multiplied by 10 x 60, so 600 seconds; or, taking all 20 million requests at 1 KB each, about 20 GB. That's the whole queue capacity needed to hold every request in queues. The whole reason I'm doing this is that it tells us we really don't have that much data to deal with, so we can afford to be very accurate; this is really easy data to distribute. In fact, I know I mentioned the Bloom filter and said I could talk about how I'd use it for low latency, but at this rate I'm not too worried about even being exactly consistent. It's such a low amount of data that it should be easy to verify whether a user and a promotion code have actually been successfully processed. We can even perform retries. It's not as bad as I anticipated.
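A quick back-of-envelope sketch of the numbers discussed above, assuming 20 million requests over a ten-minute window and a roughly 1 KB payload per request (both figures come from the conversation; the exact division gives slightly higher values than the rounded 2 x 10^4 and 20 MB/s quoted there):

```python
# Back-of-envelope capacity estimate for the campaign window.
# Assumptions (from the discussion): 20M requests, 10-minute window, ~1 KB/request.

TOTAL_REQUESTS = 20_000_000
WINDOW_SECONDS = 10 * 60
PAYLOAD_BYTES = 1_024  # generous 1 KB upper bound per request

requests_per_second = TOTAL_REQUESTS / WINDOW_SECONDS          # ~33,000 rps (rounded down to ~2 x 10^4 above)
ingress_bytes_per_second = requests_per_second * PAYLOAD_BYTES  # ~34 MB/s, well under a 1 Gbps link
total_queue_bytes = TOTAL_REQUESTS * PAYLOAD_BYTES              # ~20 GB if every request is enqueued

print(f"{requests_per_second:,.0f} req/s")
print(f"{ingress_bytes_per_second / 1e6:,.1f} MB/s ingress")
print(f"{total_queue_bytes / 1e9:,.1f} GB total queue capacity")
```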
Red Maelstrom: When you say 20 GB queue capacity, that means it's for all 20 million requests, right? That's what you mean, exactly?
The Legendary Avenger: Exactly. This is the worst of the worst case: I can literally put everybody's request in a queue and process them in a FIFO manner until I hit my 6 million count. I wanted to get a sense of how much data I'm really operating with, in order to judge how accurate I can be per user and whether I even need probabilistic models. At this rate, especially given that we only need this running for at most an hour, it's honestly super cheap, so we can afford to be exact; we can afford to throw spare instances at this to make sure we're accurate. So achieving high consistency in this case is fairly easy, and it should be performant enough to handle everything within ten to fifteen minutes. All right, so I can switch to our APIs. I think we need a couple of things here. We need an apply-promo endpoint. We can imagine it as a v1, a simple HTTP request; we really don't need fancy protocols here. So imagine something like POST /v1/promos/{promoId}. The client can be designed so the promo ID is an input field, and the payload is what we discussed above: the promo code, the user ID, basic generic user information. The way I anticipate this working is that the person has already logged in. Given the assumption about existing infrastructure doing account validation, we can assume they're logged in, so we'll receive their user ID, and as part of the headers we can have an auth token. On the back end there's maybe an auth service doing all that validation, but for now I'm not too worried about it because it seems out of scope; this endpoint is the key bit we care about. It's fine if you want me to expand on it, but I'll assume we have that, plus maybe a keep-alive to keep the session alive; it could be short, maybe 30 minutes per session, with some cookie data among other things. For the most part, this is the most important endpoint I think we need. There might be more.
Red Maelstrom: What is your assumption about how the user will know which promo code they need to use?
The Legendary Avenger: So in this regard, when we roll out the campaign, the way I'm envisioning it is that they receive a banner. That banner will be an image, and maybe it contains a link. As soon as they click it, on the back end, because most of this will be served through the application, the client will already have the promo code encoded, and it doesn't even have to be a user-friendly promotion code, so to speak, because they don't have to apply it manually. As soon as they click, we automatically prefill it and make this request, as if they're applying a promo. For it to be a good design, I imagine we make it as simple as possible for the user: if you have to tell everybody, okay, copy this promotion code, paste it into the input field, and submit, that's not a nice user experience, especially given how time-crunched this is, right? So on the platform itself, they just click, it immediately populates everything, and there's a loading screen while we try to get it enqueued. That means we're hitting this endpoint in an async manner, so I'm imagining it would need to be an async endpoint. It won't really block them; as soon as they click, it tells them, hey, your promotion code is being applied. Now, this is the interesting part: there are two compromises you can make here, sync or async. I'm leaning towards async because it gives us the flexibility to tell the user later whether they were lucky or not, and from the computational side it gives us the freedom to be very accurate about the number of people we assign this to. The downside is that it's less deterministic for the user, because they have to wait. If we went the sync route, the user waits to see immediately whether they were successful, which gives them closure, so to speak. I think I'll go with async, because I feel that anticipation buys us time for them to stay on the app while they wait; they can scroll and see other offers, and we could even take that opportunity to introduce more offers. We could prioritize showing them, say, newly marketed posts by companies we've partnered with. So we can combine this campaign with other promotions on the site so that, in case they're not lucky, at the very least we've also prioritized showing them cheaper options, and we might even see a spike in other activity beyond the one we're focused on right now. I'm leaning in this direction from a purely business point of view; that's me being a bit selfish. Being business-oriented means making it async. But to show that I'm aware of the other side: if we wanted to prioritize the user experience, if we just wanted to be nicer and didn't care about revenue spikes, only about the user, we would probably block until the promo is processed. Granted, that can backfire, because if, say, 14 million of our users don't get the promotion, they'll be mad at us and will probably see it as a scam. So to a degree, even from the business-oriented perspective, the async path can also save us here.
So the perception I'm imagining is: as a user, I click on it, and even if I don't get it, it's not wasting my time while I wait. If I do get it, well and good, it's a pleasant surprise, and I've also scrolled through more options that I can bundle with whatever order I've made. The other approach is that I get a quick answer immediately; it's fairly straightforward, but the downside is it can anger me, and it will probably anger a good chunk of our users if they click apply and then don't get it. It feels like a scam, which is not good. I don't know — are there any questions related to that, or should I move on to the next section? Is there anything more you want to see here?
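A minimal sketch of what the async apply-promo endpoint could look like, assuming a Python web framework (FastAPI here) and a message queue; the route shape, field names, and in-memory queue are illustrative assumptions, not details from the original discussion, and in the described design the enqueue step would target Kafka or a similar broker:

```python
# Hypothetical async apply-promo endpoint (FastAPI used for illustration).
import queue
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
promo_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for a real message broker


class PromoRequest(BaseModel):
    user_id: str
    promo_code: str
    food_spec: dict | None = None  # optional burger/meta details


@app.post("/v1/promos/{promo_id}", status_code=202)
async def apply_promo(promo_id: str, req: PromoRequest):
    # Accept immediately and enqueue; the user is told the outcome later
    # (push notification or polling), which is the async trade-off discussed above.
    promo_queue.put({"promo_id": promo_id, **req.model_dump()})
    return {"status": "accepted", "detail": "Your promotion is being applied."}
```

The 202 status signals "accepted for processing", which matches the non-blocking user experience described above.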
Red Maelstrom: No, I think yeah, let's move to the high level design.
The Legendary Avenger: Okay. So I could go ahead and do some schema designs, but they'd look roughly like what we've discussed; I don't think we need to store too much data. If anything, the more important thing will be to break down what the queue looks like, because clearly we'll need a queue. If you're okay with that, I'll start doing some drawings and we can discuss in light of those. Is that okay? All right, I'll toggle back to drawing mode and start quickly here. Assume we have the client, and that's our DoorDash app or whatever it is. I'll do a quick high level: we have our system, and each person makes this request to it; this is just the journey where they click apply promotion. Let's assume we have an API gateway, and at the API gateway, given that we want to process things as fast as possible, let's think about rate limiting strategies, because there will probably be some people who abuse the system. We can assume each person is able to click the promo, and in case it doesn't register, we don't want people retrying endlessly or trying to abuse the system; there will be that smart person who bypasses the click itself, captures the back-end URL, and sends manual requests. So we can rate limit per user, and I think a token bucket approach will work, limited to maybe five requests per second per person (I'll sketch that below). That should be a good strategy to make sure the system isn't abused. And of course we can also load balance here. Simple round robin will probably suffice, although I imagine we have a limited number of DCs, so rather than pure load balancing, I'm thinking we prioritize fanning out the writes; it's still load balancing, really, prioritizing writes to the closest data center, but I'd actually want it replicated across multiple DCs. Let's note that: load balance. Now, at this scale — at a previous company, I worked at Microsoft on Teams, and I knew we had about nine to ten data centers serving Teams throughout the US, with capacity way higher than what I'm seeing here — I'd imagine we're working with at most ten data centers, if not fewer, so synchronizing across them should be fairly easy. What I anticipate happening after this is a fan-out, and this fan-out will allow us to write to multiple DCs. So let's add it right here: fan out. What happens is this fan-out operation will write — I don't know why this is not scrolling — okay. I'll only draw one queue, but imagine multiple queues we're writing out to. Let me throw in our queues here. We enqueue the data such that all our data centers are consuming from it. Given that it's FIFO, first in, first out, we want to prioritize those who actually come in first. There are a couple of services I anticipate will be subscribing to this, or draining our queue. We only need one topic, really, at least if we designed this in the context of Kafka.
But let's start throwing in a couple of services. There's the primary draining service. Let me break it down a bit further — this is not drawing — so this is a drain queue. Let's break down the series of events we anticipate. Step one, we have a poller that pulls a message from the queue. What I expect it does first is, what do you call it, pop a message; so let's just call it the pop service and not be too specific about that. Now, I don't want to deprioritize users who, say, take too long, and neither do I want to block the write path; I don't want this thing to get too backed up. But what I want it to do is, as soon as it pops a message, register it so that we attempt to process it. So it needs to, one, forward it to an order-processing service, or at least an order-enqueuing service or something like that. Let's assume we have our order service for those users who can be processed immediately, so the promo code is applied. Let's assume we have our order processor here, and for this particular user, it starts processing immediately and hopefully succeeds. Once the order has been successfully processed, the pop service will receive — come on, this thing is killing me here — it will receive a success status from the order-processing service, in which case that's a successfully applied promotion, and it can forward it to a counter service. So once it receives that, let's assume we have a counter service here. What this does is, for each individual user, we maintain a simple DBMS, and since we have their user ID, it does a check: this person has already bought. And at least for the central counter, it could be a simple database, or a simple per-DC database; each data center has its own counter, so to say, but they're all updating this global counter value. The numbers aren't so big that we need to worry about consistency — sorry, not consistency — they're not so big that we need to worry about synchronizing everything. Banks work with much bigger numbers, but...
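A rough sketch of the per-user token bucket rate limiter mentioned above. The five-requests-per-second figure is from the discussion; everything else is an illustrative assumption, and at a real API gateway the bucket state would typically live in a shared store such as Redis rather than in process memory:

```python
# Per-user token bucket rate limiter (illustrative; in-memory state for simplicity).
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    rate: float = 5.0        # tokens added per second (5 req/s per user, as discussed)
    capacity: float = 5.0    # maximum burst size
    tokens: float = 5.0
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict[str, TokenBucket] = {}

def is_allowed(user_id: str) -> bool:
    # One bucket per user; requests over the budget would be rejected (e.g. HTTP 429).
    return buckets.setdefault(user_id, TokenBucket()).allow()
```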
Red Maelstrom: We have a good amount of QPS, right? So we are expecting like 20,000 requests per second.
The Legendary Avenger: Yeah. Okay. So what I'm going to do, at least, is assume that as soon as we receive that successful processing signal — so this is the people who order; let's note that, those who order immediately — as soon as the pop has happened, we immediately count their orders towards our total. I'm thinking of this as a simple key-value entry, something like current count by DC, which is continuously incremented. Then I imagine we can have a periodic aggregator: this DC current count is a statistic maintained per DC, and an aggregator takes snapshots of these numbers, aggregates them, and validates them against another, global count. To make sure we're accurate with the number — it's not too big, but we want multiple aggregations — we could have one count per DC, updated by the individual DC counters. In fact, drawing the update from here is a bit misleading visually; you can think of this as one DC, but I imagine we'd maintain a similar counter in each, where as soon as our service processes a request, it also writes to a separate central counter. The reason is so we have two references, so to say. So I'll label this one a DC-specific KV and this one a global counter. The global one is what we update immediately on each write, while the DC counts keep being aggregated a bit later. Then I imagine a recurrently running service that fetches the counts from all the DCs; call it our count aggregator. It picks up all these statistics, aggregates them, and validates them against what's in the global counter, checking that the two values agree. The approach I'll take is: if one is bigger than the other, go with the bigger one, because it's better to overestimate. So pick the max of the aggregated DC counts and the global counter. At any given point we take the maximum; the reason is that it biases us towards an overestimate, which guarantees the number we actually give out is either exactly 6 million or slightly under. This serves as the source of truth: at each point we update the count in the global counter to whichever is the max. At least we have two references to tell us roughly where we are. It won't be exact, but whatever async issues arise with the updates, or failures in the middle, this lets us maintain that looseness. We're not committed to perfectly accurate numbers, but we're always overestimating the count, so we're not going over 6 million. This model should ensure we meet the requirement of not going over. I don't know if that makes sense to you.
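A small sketch of the max-based reconciliation described above, assuming each successful promo bumps both a per-DC counter and the global counter; the class, method names, and soft-stop threshold wiring are made up for illustration (the 5.5M soft threshold comes up later in the conversation):

```python
# Illustrative reconciliation of per-DC counters against a global counter,
# taking the max so the system errs on the side of overcounting and therefore
# never hands out more than the 6M budget. Names here are hypothetical.
BUDGET = 6_000_000
SOFT_STOP = 5_500_000  # switch to strict, exact accounting near the end


class CampaignCounters:
    def __init__(self, dc_ids):
        self.dc_counts = {dc: 0 for dc in dc_ids}
        self.global_count = 0

    def record_success(self, dc_id: str) -> None:
        # On every successfully applied promo, bump both references.
        self.dc_counts[dc_id] += 1
        self.global_count += 1

    def reconcile(self) -> int:
        # Periodic aggregator (e.g. every ~5 seconds): keep whichever view is larger.
        aggregated = sum(self.dc_counts.values())
        self.global_count = max(self.global_count, aggregated)
        return self.global_count

    def should_stop_fast_path(self) -> bool:
        # Stop the loose, fast-path assignment once the soft threshold is reached.
        return self.reconcile() >= SOFT_STOP
```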
Red Maelstrom: Yeah, so basically my question is how would you decide for any particular request which counter to update?
The Legendary Avenger: So I will update both; I won't just update one of them. Here I'll do an increment by one on the global counter, and here I'll do the DC-specific increment. I anticipate the global one will probably take a bit longer if we're updating both.
Red Maelstrom: Right. So if you're updating both, why do you need the DC-specific counter anyway? Because...
The Legendary Avenger: If I'm a what?
Red Maelstrom: So what you're saying is that as part of processing the request, you're updating both counters, right? One is specific to the DC and one is a global counter. Right. So if you're maintaining the global counter anyway, do you need the DC-specific counter?
The Legendary Avenger: Yeah. So the reason I'm maintaining a DC-specific counter is in case there are any issues, say, in aggregating into the full global counter. That one may take a bit longer, especially since it will be facing writes from multiple sources. We can make it an MVCC-protected system like Postgres or something along those lines, but the whole idea is that it will be receiving concurrent writes, which can distort the number; it can end up either missing some or overestimating. The DC counters, on the other hand, receive a stream of writes from the same service, so we can even make them somewhat synchronous, and they might end up being a bit more accurate. The reason I'm going with this model is that the goal is something like a three-way confirmation: if we have writes to multiple locations, we can always use the aggregation as one reference and the global counter as a second reference, and the source of truth can...
Red Maelstrom: ...be what? What would be your source of truth if you use both?
The Legendary Avenger: Yeah, and "truth" here is also somewhat subjective. The source of truth in this case, the heuristic I'm going with, is whatever is bigger: the max of the aggregate across all DCs and the global counter. Whichever has the bigger count, that's what I'll register as my current snapshot.
Red Maelstrom: And the reason is — wouldn't there be a difference? Basically, let's say you decide to take a snapshot at timestamp 15, right? And by the time you compare that snapshot with the global counter, it's timestamp 18. You'd be comparing values from two sources at two different times. There would be a delay in collecting the snapshots, aggregating them, and actually comparing with the global counter. Right?
The Legendary Avenger: Indeed, that's a fair point, and this is why the aggregation can be periodic. The fact that it's periodic means we're pretty much just tracking point-in-time snapshots, right? And we can make it fairly quick: if we have, say, fewer than 100 DCs, it should be fast to compute, so I wouldn't be too worried about the latency. But nonetheless, we can make this happen, say, every two to five seconds; even five seconds I think is a reasonable interval. So: ping, get the counts, immediately aggregate, ping again — almost like a constant heartbeat. Now, the aggregation won't be perfect, but the reason I really like using the max is that it's literally an overestimate. So while it might not be perfect, it biases me towards overestimating how many burgers we've given out. Even if I'm not 100% accurate, at the very least I know I'm not violating the critical condition, which is going over 6 million.
Red Maelstrom: Okay. If you're saying this happens every five or six seconds, what is your logic for stopping requests once the count has been reached? What if some inconsistency happens in between — say at the current time it's very close to 6 million, but after five or six seconds it's more than 6 million?
The Legendary Avenger: That's an excellent point. I can propose multiple strategies for this. An initial approach: rather than targeting 6 million directly, how about we target maybe 5 million or 5.5 million, and then for the last stretch we become exact? So as soon as we hit 5.5 million, stop the fast path, then process the rest strictly, step by step. There will be a bit of added latency at the end, but it makes sense.
Red Maelstrom: Okay, that's a simplified solution. I really like this.
The Legendary Avenger: All right, so let's say we target 5.5 million. And I know I've talked about Bloom filters but haven't really detailed them, so let me touch on that quickly, because I want to respect the time here. Under the assumption that each request has a unique ID: as soon as our drain queue receives a request, there will be users who process immediately and some who do not. The immediate ones are quick; those are the nice ones, the users we want. And while the Bloom filter approach is probably not what I'll ultimately go with, I think it's worth covering because I discussed it initially, so I'll run you through why it could be useful here. I'm thinking we maintain a Bloom filter with, say — I don't know what the exact number of hash functions should be, but let's say about ten. I'll be honest: I know how it works overall, not the exact math, but I know that the more hash functions and the larger the bit array, the lower the chance of a false positive, so I'm throwing out a rough number. The idea is that we maintain a bit array, and at our scale we can keep a pretty sizable array while still getting the performance we want. As soon as we successfully process a request, or drain it from the queue, we hash the user with all our hash functions, which sets the corresponding bits in the filter. Then, for new requests, we immediately check whether we've seen that user before. This could sit alongside the rate limiting — in fact it could be a strategy at the rate-limit level, since all we care about is the user ID — so we could do it either at the pop-service level or at the rate-limiting level. But given that we want minimal latency on the user-facing path, we can do this check during back-end validation; it's better to take a bit longer to process than to take longer to give users feedback. So my thought is: we maintain a Bloom filter on the back end, hashing in requests we've successfully processed, and if a lookup says we've seen this user before — that is, all the bits come back as ones — we don't even need to process the request; we fail it immediately, keeping in mind that could occasionally be a false positive.
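A minimal sketch of the Bloom filter idea described above, using a plain bit array and double hashing; the bit-array size, hash count, and helper names are illustrative choices, not tuned figures from the discussion:

```python
# Minimal Bloom filter sketch for "has this user already been assigned a promo?".
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 200_000_000, num_hashes: int = 10):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)  # ~25 MB for 200M bits

    def _indexes(self, key: str):
        # Derive k indexes from two independent digests (double hashing).
        h1 = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely not seen; True means seen with a small
        # false-positive probability, so a positive still warrants an exact check.
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key))


seen_users = BloomFilter()

def already_assigned(user_id: str) -> bool:
    if not seen_users.might_contain(user_id):
        seen_users.add(user_id)   # definitely new: record and allow
        return False
    return True                   # possibly seen before: fall back to an exact lookup
```

With roughly 6 million entries in a 200-million-bit array and ten hashes, the false-positive rate is on the order of one in a million, which is why a positive can simply be double-checked against the exact store.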
Red Maelstrom: Can there be any race condition in this? What if there are multiple requests from the same user coming in almost concurrently?
The Legendary Avenger: That's an excellent question; let me think for a second. So, multiple requests from the same user trying to write to the Bloom filter concurrently — that could potentially happen, and I have a proposal that I think helps. Given that the promo codes, or even the user IDs, are, I'm assuming, literally UUIDs or something along those lines, they're not values that will skew away from an even distribution. We can have as many queues as the capacity we want to process. The downside might be complications in where we actually look up a given user, but there's nothing stopping us from taking a consistent hashing approach, where a hashing scheme tells us which Bloom filter, and which queue, to look at. That could actually improve performance, because we can then maintain smaller queues. So the proposal is: we have a consistent hashing ring of sorts, with a hash service, and each user has a unique hash of their ID. The assumption I'm going with is that these are uniformly distributed. There's this common problem — call it the Mohammed problem — where a large share of users in the world are named Mohammed, so if you distribute purely by name, the partition holding Mohammed gets a bigger backlog. But since we're the ones generating the promo codes, and even the user IDs, we can make sure they're uniformly distributed, which lets us spread them uniformly across our partitions. We can have as many partitions, or at least as many queues, as we need, under the assumption that each needs to handle some number of users per second. Actually, what were the numbers? Let's go back to that. I believe it was about 20,000 requests per second.
Red Maelstrom: Okay, yeah, I think we almost hit like 45 minutes mark. So I think this is what you would be able to do in the actual interview in 45 minutes.
The Legendary Avenger: Right, makes sense.
Red Maelstrom: Okay, so let's take a pause here. In terms of feedback: I really liked the way you scoped down the problem. You might know that at Meta there are four rubrics: one is problem scoping, another is designing a working solution, then designing for scale, and the fourth is communication. On problem scoping, I think you did very well. You reiterated that the marketing campaign is the key feature of this design. You stated the assumption that each user has one account and that one account belongs to one user, and that that's your way of identifying a user; that's a key assumption you called out. You clarified whether you should focus on the order-tracking infrastructure, which was important. You clarified whether you need to keep running marketing campaigns beyond this one, and whether we need to send a notification to users about the campaign. Rather than just assuming how the burgers would be distributed, you also clarified what the strategy for assigning them should be. These are the important questions I expect from an E6; this is one of the places where I got an E6 signal. You also clarified that the payment system exists, and that order placement can happen after the ten minutes. I really liked the discussion where you mentioned that going cross-region would involve more challenges than a single region; that's again a good E6 signal — foreseeing the challenges and narrowing and defining the scope. Then you moved to the non-functional requirements. You mentioned that you'd go with an eventual consistency model, and that high availability and low latency would be important for the user experience, which were good points. I really liked the way you brought the user-experience angle into the discussion throughout the interview. You clarified whether you can distribute slightly under or slightly over; this is again a good E6 signal. Rather than trying to solve a complex problem, you worked around the requirements and pushed back to the stakeholders, and for E6 I expect contribution to the definition of business requirements, so that was one such point. You mentioned that you could use a Bloom filter to make sure a user isn't assigned a burger again if they already got one. That is something I would refrain from doing at that stage; it's an implementation detail. When you're still at the high-level solution, it signals that you're marrying yourself to a known solution before exploring what the expected solution should be. What I'd expect instead is to first go into detail about the problem and the load, and then, if a known solution fits, talk about it, rather than introducing the known solution early. I know a Bloom filter can solve this problem, but it gives the feeling that you know about Bloom filters and want to use one here without thinking it through; that's the kind of signal it sends, so I'd refrain from that. You mentioned you'd keep a global counter for counting the accepted promos. You fairly estimated the storage and the QPS for the system. And I really liked your API design.
It was well abstracted; you talked about what the user needs to provide as input and what would be fetched automatically, so that was a good discussion. You also identified the trade-off between the API being sync or async, with pros and cons in terms of technical simplicity and user experience, which was again a good E6 signal. That said, I do feel you could have involved the interviewer more, because as an engineer you don't want to make decisions on behalf of users or your stakeholders; you'd rather propose. How I want that conversation to go is: you identify the trade-offs — okay, if we want to focus on this, I think this is a good solution; if we want to focus on that, I think that other one is a good solution — and then ask, what do you think is important for our problem?
The Legendary Avenger: I see.
Red Maelstrom: So it would have shown more collaboration.
The Legendary Avenger: Understood.
Red Maelstrom: Indeed. You talked about the—
The Legendary Avenger: I was going to ask, just to do a quick recap of that: so basically, avoid going down a rabbit hole and make sure I'm bringing the interviewer in, especially on points that involve more opinion. It's not that there's one fully correct answer; where it's a matter of opinion, make sure the interviewer is also involved so that it's more collaborative.
Red Maelstrom: Right. At least when you're making a trade-off on the user experience or the business side of it.
The Legendary Avenger: Right.
Red Maelstrom: So basically, I'm okay with you making a decision on your own if it's purely technical. But let's say, for example, you're deciding between asking the user to wait for some time versus making your system reply as soon as possible — for those kinds of decisions I'd expect more involvement with the interviewer.
The Legendary Avenger: Understood. That certainly makes sense.
Red Maelstrom: Right. I think you talked about rate limiting to avoid abuse of the system, which was a good point. I do feel your solution for handling the counter is a bit complex and there was duplication, because at the end of the day, if you're taking the max of both the global counter and the count aggregator, why not just use the count aggregator's data? Your system is also updating both the global counter and the individual counters, which is extra work. You could have simplified or deduplicated the work there; that's what I feel.
The Legendary Avenger: To your point, yeah — I was actually thinking that with that approach, instead of even going to 6M, if we just went to 5.5 and used a global counter, most likely it would have just worked. We wouldn't even have needed to worry about aggregation in the first place.
Red Maelstrom: Yeah, but if you had just a global counter, my question would have been that these update operations are expected to be atomic, right? Okay, but you're saying the volume is small enough that we can afford that coordination while updating it. Yeah, that could have been a good, simple solution. Another simple solution could have been this: you're already using a hash of the user ID to decide which Bloom filter to apply, right? So, based on that hash, you could have decided which single machine a request should go to. Then you don't need to worry about consistency across machines; you just check whether this user ID has already been processed by this machine before. And each machine gets its own budget — say, with 100 machines, a counter of 6 million divided by 100. That would have simplified your solution a bit more. There are cons, because we cannot guarantee the load is fairly distributed across the machines, so a few machines would exhaust their counters earlier than others and it wouldn't be perfectly fair. But I'm just giving you that line of thought. And considering that the storage requirement was just 20 GB, most of this could have been handled in memory.
The Legendary Avenger: Indeed, yes, I agree with you. I certainly see your point, and I really like this problem overall because, to your point, there's always an alternative approach, each with its pros and cons. I really value your approach, because I can see how it's not only simpler but actually comes with extra consistency: if we do the counting at the hashing level, we get more accurate counts, because in my design we're not assured that the write will succeed 100% of the time, nor that the counting will succeed, and those are handled by two separate services rather than one pipeline, which could break and cause inconsistency. So to your point, there are definitely pros and cons to each: mine is more async, which is a bit faster with minimal latency, but it also comes with a bit less reliability.
Red Maelstrom: Right. But again, I have seen candidates go down a rabbit hole when I bring up consistency across counters, and they try to come up with very complex solutions. I really liked your suggestion of targeting 5.5 million, so that even if we under-distribute, that's okay; you really focused on the requirement you gathered and used it to simplify your solution rather than going with a fully complex one. I really liked that; it's again an E6 signal. I had fairly good E6 signals across the interview. The only reason I wouldn't suggest E6 is that there wasn't enough time to handle the critical part of the system, which was the race condition and how those cases would be handled. I think you need to work a bit on your time management: for the functional requirements alone you spent close to nine minutes, which is a bit much. My suggestion would be to target completing estimation by twelve to fourteen minutes and keep the rest of the time for API design and high-level design.
The Legendary Avenger: Understood, that certainly makes sense. Yeah, I definitely agree with that. All right, makes sense. Is there any other feedback you have?
Red Maelstrom: Yeah, based on this interview I would have recommended E5, though I certainly found some signals at E6.
The Legendary Avenger: Yeah, that certainly makes sense, and that's actually what I was looking for. For me, at least, if I'm going to target E5, I want to deliver an E6-level performance and make a good case for a strong E5. My target level is E5, but I want to make sure I can showcase signals for the next level so the interviewer actually sees my potential. Overall, it's good that I'm showing those signals, and I appreciate the pointers you've given me. I think they're very clear and also very actionable.
Red Maelstrom: Awesome. Glad I could help. Just a few things I always tell people interviewing with Meta. For the coding interview, you don't need to worry about the leveling part; the coding interview at Meta just decides whether you pass the coding bar or not. When feedback is written for a coding interview, there is no level suggestion — it's basically pass or fail — so don't focus on leveling there; just focus on completing two problems in 45 minutes. That should be your first goal, and once you're done with that, maybe spend some time talking about testing and how you could have modularized the code, but keep that for late in the interview. That's one thing. You might also be aware that you don't need to worry about dynamic programming at Meta, because they don't ask dynamic programming in coding interviews. For behavioral, Meta does have a language of impact; people talk about impact a lot internally. So if you have examples, prepare them with impact numbers. Those numbers could be dollar figures — money you saved for your company or revenue you increased — or reduced latency, improved user onboarding, whatever the numbers are. If you can talk in terms of numbers, that would be really great. My suggestion is to put together a document where you list three or four problems: the context of each problem, your contributions, the timeline, and lastly the impact. Also have some examples where you mentored other engineers — not just on a project, but helping them improve their skills, whether in writing documents, better prioritization, or whatever it is. If you have such examples, keep them ready. The last thing: I've seen a couple of cases where behavioral interviewers gave feedback saying the candidate hasn't worked on long-running projects and they were doubtful the candidate could handle complex, long-running initiatives at E5 at Meta. So have at least one example of a project that spanned multiple quarters, or, if you don't have one, club multiple projects together as milestones of a single goal and present them that way, so you give a sense of confidence that you've worked on projects involving multiple milestones.
The Legendary Avenger: Understood, that certainly makes sense. I do have a good set of examples; most of the projects I've worked on so far have been multi-quarter, and one of them at Microsoft was multi-year. So that could be a good one to use. I'll take your suggestion and break them down so that I have exact numbers to refer to.
Red Maelstrom: Perfect. All the best for your actual interviews.
The Legendary Avenger: Absolutely. Thank you so much, this was very helpful. So thank you.
Red Maelstrom: Have a nice weekend. Bye bye.
The Legendary Avenger: Likewise. Bye.