Interview Transcript
Supersonic Seahorse: Great to have you here. So first, I'll briefly introduce myself. And I also want to know more about your background and your expectations. After that we can jump into the practice.
Occam's Chameleon: Sounds good.
Supersonic Seahorse: Sweet. So yeah, my name is Max. I'm a senior engineer at Meta. And after joining Meta, I've conducted more than 500 interviews, ranging from coding and behavioral all the way to manager interviews. And with that, I'll hand it over to you. Would you please briefly introduce yourself?
Occam's Chameleon: Yeah, sure. My name is Brian. I'm a software engineer with 13 years of experience. I'm currently interviewing with Meta for the E5 loop that's happening this upcoming week. At this point, I'm pretty deep into my prep, and I'm mostly looking to get some feedback on gap areas where it might make sense to do some high-value, focused studying over the next day or two, and also to work on how I actually present my ideas — making sure I have good pacing and things like that.
Supersonic Seahorse: Gotcha. So Brian, is this your first time interviewing with Meta?
Occam's Chameleon: No, it is not.
Supersonic Seahorse: It is not, right? Okay, so when was the last time?
Occam's Chameleon: The last time was a year ago.
Supersonic Seahorse: One year ago. Still the same level, the senior role?
Occam's Chameleon: Last year's was actually at the E6 level, and I did two system design interviews instead of one in that loop. This time around, the recruiter's recommendation after the phone screen was to go for E5. So I thought I would try it at E5 and see if I have more success.
Supersonic Seahorse: I wish you all the best. Okay, so that being said, since you're already familiar with the whole setup, maybe let's jump into the whiteboard.
Occam's Chameleon: Yeah, sounds good.
Supersonic Seahorse: Yep.
Occam's Chameleon: Let's do that.
Supersonic Seahorse: Okay. So here's the thing. Since you already know the setup, I just want to remind you a little bit: whether it's the E5 or E6 role, we have 45 minutes for each round. And specifically, as the interviewer, I'll try to finish everything — to collect the signals from you — within maybe 40 minutes. So that being said, we'll start around :05 and end around :45, because I want to reiterate that time management is always key for the senior or even the staff role. Okay?
Occam's Chameleon: Sounds good.
Supersonic Seahorse: Great, Brian. With that, let's try this question: if you were the main developer, how would you design a centralized machine learning management platform? With that, I'll hand it over to you.
Occam's Chameleon: Okay, cool. So I think I'm going to need to ask some clarifying questions to make sure I understand. I have a couple things that come to mind immediately as far as what the core operations are here, but maybe I can clarify them with you. So when we say "management" here, does this include serving, for the purposes of actually operating those models? Or is it more for model iteration — training, validation, and the production of the models? Or is it an end-to-end platform?
Supersonic Seahorse: Great question. It should be the end-to-end platform. But more than that, let me give you some context here. We have so many machine learning engineers at Meta, and a machine learning effort typically has three stages. The first is data collection, the second is model training, and the third is model serving. So this platform needs to cover all three stages.
Occam's Chameleon: Yep, understood. Cool. So let's just go ahead and write those down. So we have data collection, model training, and model serving. Cool. So yeah, I think we can probably talk about each of these different things kind of as a discrete phase and they'll sort of lead into each other one into the next. So let's maybe start with data collection. Well, actually, you know, I would like to take a pause for a second here and just make sure that I understand the constraints. So are there any constraints as far as like the number of models, the size of the data that we intend to be using, maybe the sort of the request rate or transactions per second that we can expect on the inferences during the actual model serving. Is there anything that we can kind of dial in to make sure that we understand where the key scaling points are?
Supersonic Seahorse: Great. So Brian, this is a great question, but I think maybe you can just do some ballpark estimations down the line. And at any point, I can share numbers with you.
Occam's Chameleon: Yeah, I'm happy to do some ballpark estimates if it'll move us faster here. So on the data collection side, most of these models, I imagine, will be training over potentially years of historical data at relatively high cardinality. So we're talking terabytes at least, maybe even petabytes in certain circumstances. Our data collection and data management systems are definitely going to need to handle big data.
Supersonic Seahorse: Sounds good.
Occam's Chameleon: So I'll call it hundreds of terabytes at least there. Model training, again — the actual training process is going to require relatively horizontally scalable systems that can accommodate training over that large amount of data, so I'll just put "same." And then on the serving side of things, I'm going to make some ballpark numbers here. I imagine there are at least thousands to tens of thousands of individual models in use, and the actual request rate for those models can probably vary a lot between them — I think the specific types of use cases will be important when we talk about model serving. But at the sharp end of model serving, we should be prepared to be executing these models many times a second for user-facing workloads. So let's just call it — is a thousand TPS completely off base as a peak load for one of the higher-scale models, or am I off by a lot from what you're looking for? Is that a reasonable jump in scale?
Supersonic Seahorse: This is reasonable.
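(An editor's aside: the ballpark figures above can be sanity-checked with quick arithmetic. All inputs here are Brian's assumptions from the conversation — "hundreds of terabytes," "thousands to tens of thousands of models," "a thousand TPS" — not confirmed requirements.)

```python
# Back-of-envelope check of the ballpark figures discussed above.
# Every input is an assumption from the conversation, not a real requirement.

TRAINING_DATA_BYTES = 100 * 10**12      # "hundreds of terabytes at least"
NUM_MODELS = 10_000                     # "thousands to tens of thousands"
PEAK_TPS_PER_HOT_MODEL = 1_000          # "a thousand TPS ... peak load"

SECONDS_PER_DAY = 86_400
daily_inferences_hot_model = PEAK_TPS_PER_HOT_MODEL * SECONDS_PER_DAY
print(f"{daily_inferences_hot_model:,}")  # 86,400,000 inferences/day, one hot model

# If even 1% of models see that peak rate, the aggregate serving load is large:
hot_models = NUM_MODELS // 100
aggregate_peak_tps = hot_models * PEAK_TPS_PER_HOT_MODEL
print(f"{aggregate_peak_tps:,}")          # 100,000 TPS across the fleet
```

Numbers like these are why the serving tier, not training, usually dominates the horizontal-scaling discussion.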
Occam's Chameleon: Yes. Okay, cool. All right, I think we can go ahead and get started with that then. So, data collection — maybe let's ask a few more clarifying questions there. When I hear "collection" here, I think one of two things. There's computation of aggregate data — collection in forms that are readily encodable as features, data that is basically already being written into databases. And then there's data collection in the form of online telemetry about user behavior, engagement numbers, and things like that. So let's maybe just start there. For the purposes of this, just to save time, I'm going to make some assumptions about the existence of the data collection code in the clients of our systems and just assume that they're able to make requests. Maybe I can write a brief API — I'm going to completely simplify it for now and just call it "metrics" — and assume that this exists in the actual clients for publishing telemetry about user behavior. Okay, so for starters, we're going to need to receive and catalog all of this information, and that is going to be relatively high cardinality and relatively high frequency. I just want to come up for air here, because I'm realizing that I could spend a lot of time on this — it could be a whole system design interview in and of itself. Is it fair for me to start with the data already existing in a database, or would you like me to include what's upstream of the actual writing of telemetry data?
Supersonic Seahorse: I see. Hey, Brian, for the data collection part, I would expect you to just write some data, or maybe read some data, from the database.
Occam's Chameleon: Yeah, okay. So let's go ahead and assume that that data actually lands in some databases. And so we're probably going to have a bunch of different databases here.
Occam's Chameleon: I'll just give them a couple of names.
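(Editor's sketch: the "metrics" publishing API Brian hand-waves above might look something like the following. Every name and field here is invented for illustration — this is not an actual Meta API, just a toy stand-in for the collection path that lands telemetry in a database.)

```python
import time
from dataclasses import dataclass, field

@dataclass
class MetricEvent:
    """One telemetry event published by a client, as sketched in the discussion."""
    name: str                                       # e.g. "feed.impression"
    value: float
    timestamp: float = field(default_factory=time.time)
    dimensions: dict = field(default_factory=dict)  # high-cardinality tags

class MetricsCollector:
    """Toy in-memory stand-in; a real system would write to durable storage."""
    def __init__(self):
        self._events = []

    def publish(self, event: MetricEvent) -> None:
        self._events.append(event)

    def read(self, name: str) -> list:
        return [e for e in self._events if e.name == name]

# A client publishing one engagement event might do:
collector = MetricsCollector()
collector.publish(MetricEvent("feed.impression", 1.0, dimensions={"surface": "home"}))
```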
Supersonic Seahorse: Brian, I have a question here. Are you trying to start from the high-level design? I want to understand what your presentation flow would be here.
Occam's Chameleon: Yeah, absolutely. That's a great question. So I'd like to identify key components at a high level first, and then we can dive deeply into the underlying architecture in places.
Supersonic Seahorse: I see. But before that, Brian, I do have a very quick suggestion, because I would like you to start from the feature and product functional requirements. Basically, right now we have three stages that are pretty high level. Let's try to break them down a little bit.
Occam's Chameleon: Yeah, sure, that sounds good. So let's break that down a little further. I'll just go ahead and move this over here as an assumption. So, on the model training side of things, I'll start with the models themselves. What do we actually need to do for model training? There's going to need to be a data access layer, and we're going to need a mechanism for execution — the actual training execution in terms of the compute that we'll run this on. We're going to need model publishing: a new version of the methodology, or some tweaks to the features that we use, needs to be publishable somewhere, and then we need to actually execute training on those. Then we're probably going to need model validation and guardrails after the actual training, to make sure that we have a high quality bar for the predictive power of the models that we produce. And once we've hit that quality bar, we want to move ahead to model serving — so I'll add model deployments. Those are the core areas I think about for an ML management platform. Well, I guess I'm beginning to blend model training and serving — it's not that all of this is part of the training step, but rather that we're just breaking it down a little further. Let me collect my thoughts for one second.
Supersonic Seahorse: Cool.
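(Editor's sketch: the functional breakdown just given — data access, training execution, model publishing, validation, deployment — can be summarized as a speculative platform interface. Every method name below is invented to capture the requirements; none of this is a real platform's API.)

```python
from abc import ABC, abstractmethod

class MLPlatform(ABC):
    """Illustrative interface for the stages discussed in the transcript.
    Method names are this sketch's invention, not any real system's."""

    @abstractmethod
    def register_dataset(self, dataset_id: str, location: str) -> None:
        """Data collection: make a dataset discoverable to training."""

    @abstractmethod
    def publish_model_iteration(self, model_id: str, code_ref: str) -> str:
        """Model publishing: upload a new version of model code."""

    @abstractmethod
    def run_training(self, model_id: str, version: str, config: dict) -> str:
        """Training execution over registered data; returns a run id."""

    @abstractmethod
    def validate(self, model_id: str, version: str) -> bool:
        """Validation and guardrails on the trained artifact."""

    @abstractmethod
    def deploy(self, model_id: str, version: str) -> None:
        """Model serving: push a validated model toward the edge."""
```

Writing the stages down as one interface makes the later question — which component owns which operation — easier to audit.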
Occam's Chameleon: Yeah, okay. So the mechanisms here — what are the data stores that we're going to need? One thing that comes to mind for me is: what does it mean for a model to actually be deployed? We're going to need the data that gets used for training execution and also validation — presumably some split, like a training/test split, of that data. We're going to need a feature store for the actual features themselves. So there's probably also going to be a step at the very beginning, on the data collection side, that creates a registry of potentially heterogeneous data tables — a data registry. And then we might also have a feature encoding step. There might be, for example, feature sets that are reused by multiple different types of models, and being able to share those efficiently — to prevent retraining, or large-scale reprocessing over and over again — would be really useful. And when I say model publishing here, "publishing" is maybe a misnomer: what I'm really talking about is iterations of models by actual ML engineers and data scientists. So there's going to need to be an operation for them to upload the latest code — or maybe it would be triggered off of a merge in a code pipeline — and then we would automatically flow into the remaining steps downstream of that. So I'll come up for air there. I think I have maybe enough to go off of here to start putting pieces on the board, so to speak, and identify the relationships between the components. One more thing before I do that: let's write down the core operations that we need to support. So, we need to support registering new data.
For example, if there's a new data set that comes in, or as we start collecting new types of data from our users — new pages, things like that. We're going to want to publish new model iterations. The way I'm imagining this would work at a high level is that, in the general case, the training and the validation of that training are things we would definitely want to automate downstream of the actual publishing of a model iteration. But I can imagine we'll also want automated retraining — drift detection to trigger retraining and things like that. So we're probably going to want an operation for just running training, and that will probably take a specific model. And then we'll also want the ability to deploy, and to redeploy, specific models. That's something we need in cases where we might have to do a rollback to reduce the blast radius of failures, things like that.
Occam's Chameleon: And then finally, for the actual serving side of things, I think we're going to need operations for executing the inferences. The nature of the inference may vary by model, but I'll just put it under serving and call it "inference" for now. Cool. All right, so now we have some operations that we can run through as we build this, to make sure we've covered all the core use cases that make the platform usable. And then I think we can go ahead and get started. So, how I would choose to build this: in the general case, I would like to not think too hard about something like the data registry. The idea there is that we're going to want to make data discoverable across the organization. There will probably be a couple of different ways data might be stored, based on the type of data. So we'd like there to be some kind of service where, when uploading a new data set, when creating one, or maybe when calling a partition of streaming data finished in real time, we register it in a way that makes it accessible to the training mechanisms. So I'm going to assume that we have potentially a bunch of different types of data here, and there's going to be some layer on top of that which is a registry. It will be the responsibility of this registry to act almost as a search engine for all of the data that you have, where the ETL processes that load data will also make an API call to the registry. It's maybe a little out of scope for our purposes at this exact moment, but we can go back and talk about it in more detail.
So I'll just assume that there's some ETL that will write this data into, you know, blob storage, or into — sorry, how do I — oh, this is the line tool. I'm still getting used to this.
Supersonic Seahorse: No worries.
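(Editor's sketch of the ETL-to-registry handshake being described: ETL jobs write data to blob storage, then register the finished partition so training can discover it. A toy in-memory model — the class, method names, and key scheme are all invented for illustration.)

```python
from collections import defaultdict

class DataRegistry:
    """Toy catalog of datasets and their finished partitions, per the
    discussion: a 'search engine' layer over heterogeneous data stores."""

    def __init__(self):
        self._partitions = defaultdict(list)  # dataset_id -> [(partition, uri)]

    def register_partition(self, dataset_id: str, partition: str, uri: str) -> None:
        """Called by an ETL job after it finishes writing a partition."""
        self._partitions[dataset_id].append((partition, uri))

    def list_partitions(self, dataset_id: str) -> list:
        """What a training manifest builder would call for discovery."""
        return sorted(self._partitions[dataset_id])

# An ETL job finishing a daily partition might do:
registry = DataRegistry()
registry.register_partition("user_engagement", "2024-01-01",
                            "blob://warehouse/user_engagement/2024-01-01/")
```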
Occam's Chameleon: Gotta get used to the hotkeys here. So some ETL processes will publish data, and they will make that data available to all of our downstreams by registering a new partition or indicating that some data set is now complete. The actual ML platform will be built on these data building blocks. Now, when it comes to the actual feature encodings, I can think off the top of my head of two ways that could happen. Some of them might be very model specific, in which case they could almost be included in the model code itself. But there are probably plenty of well-known encodings that are reusable, and so we may want to run a second layer of ETL, in effect, that is responsible for reading these raw tables, staging them, potentially doing some transformations on that data, and generating partial feature encodings — data that is beginning to look like what will actually be in use in the training itself. For now I'll just call that feature encoding and keep it generic. And in fact, I think I would like this to feed back directly into the data registry itself, so that there are feature sets registered with the data registry; we can use those as if they were raw data and make them available to downstream processing in the same way. And then, when it comes to the actual model training — okay, this is where we get into the meat of the system a little more. Let's assume for now that model training is going to be a small platform of its own. I'll just write "training" over it for now to organize my own thoughts. And rather than starting at the edge of the system, let's start with the building blocks, right?
There are going to need to be some compute instances that are ideally very horizontally scalable — I work with AWS a lot, so something like SageMaker compute or the underlying EC2 instances, for example. So we're going to need the actual compute instances, and these are what gets allocated for running training when a new training request comes in for a specific model. Then we need a layer on top of this — a supervisory layer — that is responsible for allocating specific partitions from the data registry to be used in training, and for sharding the individual training out to individual instances. So I think we can go ahead and assume that there will be some kind of training management layer. I'm just going to call this.
Occam's Chameleon: The training organizer. Sorry, let me grow these a little bit. Okay, so as written right now, this is kind of like a single instance. I think that itself would probably be something that we could load balance across a couple of different instances, so maybe I'll just note that. And then ultimately the actual edge of the system would be some kind of API gateway that is responsible for receiving requests to train a model and then farming those out. We would probably want to separate concerns here: the API gateway might do authentication or make the data-access middleware permission requests, things like that, decoupled from the actual responsibility of "okay, we have a specific model version and we know what to train." And so the training organizer is probably going to be communicating with the data registry — that would be bidirectional, in order to get a manifest of the number of different partitions within a specific set of date ranges, or something like that. And then the allocation of those partitions to compute units for training, I think, is something that would happen directly: the compute would request data directly from the data registry. And — I spoke very briefly, I kind of glossed over "model version" as part of the API gateway here. So yeah, there's going to need to be some repository of the actual models. I'm realizing that I did not really set up my space effectively here; I'll move some stuff around, if that's okay. Okay. So I'm going to assume that we have some store of the actual models themselves.
Supersonic Seahorse: Okay.
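(Editor's sketch: one way to picture the training organizer's job of farming data partitions out to horizontally scalable compute instances, as just described. The function is invented for illustration; a real organizer would weight shards by partition size and worker capacity rather than round-robin.)

```python
def shard_partitions(partitions: list, num_workers: int) -> list:
    """Round-robin assignment of data partitions to training compute
    instances, as the training organizer above might do."""
    assignments = [[] for _ in range(num_workers)]
    for i, partition in enumerate(partitions):
        assignments[i % num_workers].append(partition)
    return assignments

# Example: seven daily partitions spread over three compute instances.
shards = shard_partitions([f"2024-01-0{d}" for d in range(1, 8)], 3)
# worker 0 gets days 1, 4, 7; worker 1 gets 2, 5; worker 2 gets 3, 6
```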
Occam's Chameleon: And there will be some requests triggered to the API gateway — oh, that's the wrong thing — that will use models. So I identified a couple of these off the top of my head. We might want to do ad hoc training, which is something an actual user would trigger, maybe for development purposes: an ad hoc request might say, "train this model." So if we assume that there's a "submit training" operation, that might take some configuration — hyperparameters, which model to use, how far back to go for specific date ranges on the data — things like that can be included there. But ultimately, there could be a couple of different types of events that trigger this. There might be one that's triggered by continuous integration or continuous deployment, whenever a new version of the model — a new version of model code, I should say — is published. Maybe there's actually an event... I feel like I've seen a way for people to do curved lines that I don't actually know how to do. But the idea is that there could be a stream of events from the publishing of new models here — oh, is this what I want? Perfect. So: there's ad hoc, maybe when stuff is unpublished or still in development; and then, as a new model is finalized — after it gets reviewed and ends up getting published to the repository of model methodologies — a stream of those publishing events will go to some CI/CD process, which will then trigger the training and everything downstream that happens after that. So, okay, just checking in with myself here. We have covered the idea of the data registry, so I'm going to just call this.
Assuming that there's some mechanism baked into our ETL processing platform or pipeline such that, as we generate new data partitions, we register them with the data registry — okay, we have that here. The actual publishing of model iterations itself I'm not going to go too far into; a code review process is probably going to be in place, and things like that. In terms of actually publishing a new version of a model that I want to get trained, that would just be upstream of this models DB itself, so I'll just draw a line into that — there's some "publish" operation happening. Okay, let's see. So I feel like we've gone through most of the pretty high-level building blocks for getting to the point of the actual training happening. One thing we haven't covered is: where is the output of that training actually going? Where do the models themselves land? We're going to need some kind of — oh, that's not what I wanted, sorry — some kind of finished-model registry. The idea being that when training is complete, the actual data generated in the form of those models might need to be written to model storage itself. I hope the different shape encodings I'm using are relatively clear — circles for data storage. Instead of making this bigger, I'll just write right next to them. So the idea is that the computes will write their outputs and inform the training organizer — oh, what's — okay, how do I stop this? Oh boy. Is there a cancel button? Okay, there we go. Sorry. So when the training is actually completed, there's going to be some training output that constitutes the actual model: the artifacts of the data that's been computed over, in the weights of the model itself. And those weights are going to need to get published — maybe just to disambiguate, we'll call this "weight storage" here.
So rather than single-threading all of that through the individual organizer, I think as part of organizing the individual training we can establish some prefix — maybe a model revision number — for that training in particular, and the individual training compute units can be responsible for writing their partitions of the weights into the weight storage system. Then, at the completion of that, the training will register the model to indicate, hey, this is the trained model specifically. And so — okay. So now I do feel like we have been relatively end-to-end on what we need for the actual training itself. Importantly, we still have a few key elements missing here on the validation side: whether or not this model is going to be usable, or whether we need to go back and do another iteration, something like that. So how is that actually going to happen? We're going to need to add some kind of procedural validation, which is essentially going to be batch inference over some subset of the training data that was held back for the purposes of validation. And to actually do that, we're going to need to trigger it after the training itself has completed. I need to think a little carefully about this, because one thing coming to mind is: does it make sense to always register the trained model and publish the weights if it turns out the model itself is not remotely predictive? And I think the answer is yes, because we're probably going to want to introspect the nature of that lack of predictive power — look at the data, look at the weights that were produced, and maybe use that to guide some intuition about how we might evolve the model. So I was thinking for a minute: does it make sense to do the validation upstream of publishing to the model registry?
And I think what I'd rather do instead is evolve the idea of the model registry so that there are different states the models can be in, and we can establish guardrails: when it comes to serving, and publishing models from the registry into the actual customer-facing edge layer of the system, it would only do that for models that are either under active experimentation, or are known working and satisfy some kind of model evaluation criteria. So at this point I need to make sure we get through everything — let's go ahead and try to speed back up. Model validation, as we said, is going to be functionally similar to the actual training compute. So I think the underlying compute instances here could be used for both training and validation, because fundamentally the workloads are relatively similar. And so the training organizer itself could be a layer responsible for a couple of different types of operations: there could be a "train" operation and there could be a "validate" operation, and depending on which of those.
Occam's Chameleon: That is — sorry — depending on which operation we're doing, we'll either be reading from the raw data registry or from the feature encodings. Okay, so actually, one thing that I missed here; I know I'm talking in a couple of different directions at the same time. Some of this feature encoding ETL, like we said, is going to be generic. But the encodings that we generate as part of the training compute here — we may optionally be interested in publishing those back to the data registry, in a model-specific namespace or otherwise, to introduce some idea of checkpointing so we can speed up and make future training a little more efficient. So I'm just going to make this bidirectional for now, because I can conceive of circumstances where we would want encodings that are defined in the model code itself to still be publishable into the data registry in some way — in which case we would need to potentially write from the training computes into the feature stores as well as the actual weight storage. Okay, coming back to my original train of thought: the training organizer could be responsible for both issuing the training requests and also issuing validation. Validation usually uses a smaller subset of the data than training, but it could still be potentially very large if the overall data is really, really big and we hold back, say, 10% of it in order to do a relatively exhaustive validation. In that case, I think we would still want to shard it over multiple horizontally scalable compute instances. And in that circumstance, where would those outputs actually go? I think we would need some sort of reporting layer. So I'm going to — oops, sorry, wrong thing here.
So, some layer that's responsible for validation reporting, which will take the results from actually running the validation operations and publish them somewhere. These validation results, depending on the stage of maturity of the model, could be manually reviewed if we're still in an experimentation phase; or this could be happening as part of a CI/CD pipeline. If we have automated guardrails in place already, then we would be able — not need, but be able — to move ahead with the actual publishing, or with setting the status. So I'm going to say there are maybe a couple of different operations on the model registry here: we might want to register that a model is trained, and then we might also want to publish that it's validated. And "publish" itself might just be setting some state in the model registry, on the model itself, that indicates.
Occam's Chameleon: You know, one of a number of different states that it could be in — like, maybe we ran the validation criteria and had a miss, and so we might want to preserve those artifacts in the registry and the weight storage. Okay, so I feel like I've gone into sufficient detail to cover the basics of the actual training of the model, the versioning of models, the execution of the training, and the validation of that. Let's talk deployments. Skipping over for a minute how we decide when to do a deployment — which I think is important, and we can discuss it — let's just assume for a minute that we don't have to think about that. What does deployment actually entail for us here? There are probably a couple of different types of model use cases that I can think of. Some of them are going to be online models that are actually running in the client, or in response to client behavior at the edge of our system. And some of them are things that are maybe more expensive — the types of things where we might want to do offline batch generation of inferences and then load those batch inferences into some kind of cache. For the purposes of talking about model deployments, I'm going to focus only on the former of those two. The latter is really not all that different, conceptually, from the ETL we're already doing — it just produces inferences rather than facts. We might want to extend our data registry a little in some way to make sure we can differentiate between data points and tables that we know stem from facts and ones that are inferences. But batch inference, I think, is fundamentally pretty similar to ETL, so I'm going to gloss over it for the moment.
When it comes to the former of these two options, online models, we're going to need those models to be publishable to some sort of edge servers where the model can actually live. What I mean by that is that the actual weights themselves, for most of these, are probably going to need to be in memory in order to ensure a performant experience for customers. So the actual publishing operation is going to be some mechanism by which we get weights into caches: from the actual weights data storage, we publish. I'm drawing this as one cache, but you can consider it a distributed cache serving a bunch of different types of models. Depending on the actual server farm infrastructure, it might be one cache per model, or it might make sense to have one logical cache that holds a bunch of different entries, where the keys indicate which particular model the weights belong to. Then, importantly, whatever application is actually routing user requests to use these weights is going to need some kind of analog to service discovery that informs the application servers. So, like I said, I can't really scroll here easily; I'm going to zoom out.
Supersonic Seahorse: Okay.
Occam's Chameleon: So I'm going to assume there are application servers. There's probably an API gateway and a whole architecture upstream of this, but fundamentally there's some compute layer that's responsible for routing requests for customers. The actual requests themselves might interact with the weights cache to fetch some specific weights and then actually do an inference. So now we have kind of a pseudo end-to-end, but with a relatively glossed-over middle: what does the actual mechanism for that deployment look like? I think it's probably something similar to what we talked about for some of these other aspects. Ideally I'd like this to be a continuously deploying system, so that we can rapidly evolve it and seamlessly deploy new weights. The validation results would directly inform this: there are basically two outcomes, pass and fail. In the event of a pass, we plug back into our continuous integration. Now that approvals have been satisfied, we can proceed to the next step, which is responsible for interacting with the model registry to find where the weights are for a specific model. Ultimately those weights need to be published into the weights cache, so some system has to be responsible for writing the results from the model registry into the cache. Okay, let me go back through my checklist over here on the left. We've got our data registry, and we've got this idea of feature encodings writing back into the data registry. I don't think I actually put much in the design about model iterations and versioning of the models themselves.
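[Editor's note: the continuous-deployment step described above, validation pass leads to a registry lookup and a cache write, could be sketched as follows. The registry, weight store, and cache are modeled as plain dicts purely for illustration; field names like `validation` and `weights_uri` are assumptions, not from the design.]

```python
def deploy_if_validated(registry, weight_store, weights_cache, model_name, version):
    """CD step: on a validation pass, look up where the weights live via the
    model registry and push them into the serving cache; on a fail, stop and
    leave the artifacts in place for review. Returns True if published."""
    record = registry[(model_name, version)]
    if record["validation"] != "pass":
        return False  # nothing is published; artifacts are preserved
    weights = weight_store[record["weights_uri"]]
    # the cache key encodes which model the entry belongs to, matching the
    # "one logical cache, keyed by model" idea from the discussion
    weights_cache[f"{model_name}:v{version}"] = weights
    record["status"] = "published"
    return True
```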
That's something I would probably put on the data model for how we manage the versions of the model in the model registry, and the training organizer would be aware of it. We can maybe talk a little about that at the end, but I do feel like we have at least verbally discussed the ways we could care about model iteration and model versioning. Okay, so one thing I actually mentioned at the beginning that I haven't included here is online maintenance and monitoring of these models, specifically drift detection. After the models are deployed, they get stale over time, and user behavior evolves with respect to them. So we might want to be observing the actual inferences, the behavior of those inferences, and user behavior while the model itself is live. In addition to just fetching the weights and doing the inferences, we're probably going to want some sort of reporting system; I'll call it live model reporting for now. The application servers that fetch weights and do these inferences would record those inferences, and they would also record some information about the call context. I'm saying call context rather than just the user because, for example, if we see a massive spike in the positive rate coming out of the inferences for one specific browser, that would be really interesting for us, so we want to know as much as we can about the context of both the user and the environment when we're looking at these inferences. We can cross-correlate that with some of the telemetry data that fed into the data registry itself. Ultimately, we would want this live model reporting to feed into, I'll just call it model telemetry.
Telemetry is maybe a little bit of an abused concept here, but it's not too dissimilar to CloudWatch metrics publishing for the behavior of an application; the behavior of the model is just a different type of application. And to close the loop, what I would like to happen is that the model telemetry informs some process, maybe stream processing or a nightly batch, that produces graphs, updates dashboards that our operators and on-calls can actually look at, potentially triggers alarms if manual intervention is required, or even seamlessly re-triggers a training run automatically if we fall below certain thresholds or guardrails. I don't really want to draw a line through the entire diagram here, so I'm just going to draw a line off into space labeled: trigger retraining if performance drifts. Again, I think there are certain types of models that lend themselves well to that automatic retraining, and there are other models and inference types that may require manual intervention every time, or where we may always want a human in the loop. In addition to that, it might also, you know...
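[Editor's note: the drift-detection loop above, compare live inference behavior against a validation-time baseline, then retrain automatically or page a human, could be sketched like this. The positive-rate metric, the tolerance value, and the `human_in_loop` flag are illustrative assumptions.]

```python
def positive_rate(inferences):
    """Fraction of positive inferences in a recent window of 0/1 outcomes."""
    return sum(inferences) / len(inferences)


def drift_action(recent, baseline_rate, tolerance=0.05, human_in_loop=False):
    """Decide what the telemetry pipeline should do: nothing, auto-retrain,
    or page an on-call for model types that always need a human in the loop."""
    if abs(positive_rate(recent) - baseline_rate) <= tolerance:
        return "ok"
    return "page_oncall" if human_in_loop else "retrain"
```

In practice the window would be sliced per call context (browser, region, and so on), so a spike isolated to one segment, like the browser example above, is visible rather than averaged away.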
Supersonic Seahorse: Hey, Brian, maybe let's pause a little bit after this part, because we are at time, but I do have a couple of questions on my list that I want to ask you. Okay?
Occam's Chameleon: Yeah, totally.
Supersonic Seahorse: Great. Thank you.
Occam's Chameleon: So yeah, I feel like we can go ahead and put a pin in it for now and talk about any questions or considerations you have.
Supersonic Seahorse: Great. So, actually, let me start from the weight storage part; I'm clicking on the three circles here. I think this is a very important component, where we store the model weights. My question is: are you using a database or blob storage, and how should we store the weights here?
Occam's Chameleon: Yeah, so that's a really great question. What I'm thinking about there is just: what is the actual access pattern for the weights, and how are we going to be fetching them?
Occam's Chameleon: I'm a little out of my depth, to be honest, on the usual size of a set of weights. I know at a high level that there are some types of models that are very large; large language models, it's in the name, have huge numbers of weights. Storing all of the weights for something of that size in, say, a relational database is exceedingly unlikely to be performant or scalable at all. So fundamentally we're going to have to use some kind of big-data-appropriate storage for models past a certain size. What I'm imagining is that the relative cardinality of the weights is going to be pretty heterogeneous across the different types of models we run. In many cases the number of weights could be thousands or tens of thousands, which is a lot more tractable and could be accessed much more performantly in a relational database. So the way I would actually approach this would be to first survey our requirements a little: what types of models do we intend to deploy with this system, and do we need a one-size-fits-all backing data store for the weights? I can conceive of a world where part of the relationship between the training organizer and the model registry is some accounting of the number of weights that were generated, and that would inform the decision about where and how we publish that data into storage. If you made me pick one one-size-fits-all solution, I would probably use something like blob storage and just write these as files into S3 or something like that.
The downside there is the processing load of actually getting those into a cache and finding specific versions of specific weights: if you're not careful, that could increase the workload on your system compared with querying for a very specific ID. Now, you can do that with blob storage too, if you partition on the model identifier in some way. But there are scaling limitations on the number of small files you can produce. If a lot of your models are small and you're publishing a lot of versions and iterations of them, you might end up in a circumstance where you have many thousands of really small weights files, which is maybe a less effective fit for how we would want to store those.
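[Editor's note: the partition-on-model-identifier idea above could look like the following key scheme, which makes fetching a known version a point lookup while showing where the many-small-files problem creeps back in. The key layout is an assumption for illustration, not part of the whiteboard design.]

```python
def weights_key(model_id: str, version: int) -> str:
    """Partition blob keys by model id so a specific version is directly
    addressable, rather than found by listing and filtering many files."""
    return f"weights/{model_id}/v{version:06d}.bin"


def latest_version(keys, model_id):
    """Finding 'the latest version' still means listing the model's whole
    prefix, which is where lots of tiny per-version files start to hurt."""
    prefix = f"weights/{model_id}/v"
    versions = [int(k[len(prefix):-len(".bin")]) for k in keys if k.startswith(prefix)]
    return max(versions) if versions else None
```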
Supersonic Seahorse: Okay, I think that sounds good. Maybe let's switch gears; I want to ask you a little bit about scalability. Right now this is a very comprehensive design; we cover most everything here. So what would be the bottleneck in terms of scalability, and how would you address it?
Occam's Chameleon: Yeah, so let's think. I can see a few areas right now that are potential bottlenecks, so let's start with the superficial ones. There are a couple of areas where I just threw one instance in: for example, the API gateway, the model registry, and the data registry are right now drawn in the design as kind of monolithic instances. For the most part, the load on the system for registering data partitions, registering models, and publishing validation status is going to be several orders of magnitude lower than the cardinality of the data itself. So I don't think the API gateway for the training organizer, or, I didn't go into it too deeply, but there might be an API gateway for the data registry itself, are actual bottlenecks for the system. I think the main bottlenecks are going to be things like the weights cache, and actually loading generated weights into a deployable form: how can we make that deployment happen really fast, and how can we make it happen globally, to keep latency, not downtime, but latency, low for the inferences? Even if the model itself is really, really fast, if we have to go halfway across the world to run it, the speed of light is going to be a factor. So if I were to go back and introduce some additional elements here, I would assume this distributed cache is something we could regionalize, and we may actually have different trained versions of the model for different regions, because different regions have different user behavior.
So you can imagine a regionalized, point-of-presence-based architecture here. The generation of the model code, or the data registries for the offline data processing, might not necessarily have to be globally distributed. But everything from the actual post-training phase onward, I'm trying to select the whole right half of the screen here, up to basically the model registry, could be treated as a cell of the ML platform that we deploy in a number of different regions, in order to get the caches of the weights as close to the edge, and to our users, as possible. And the weights cache itself: I glossed over this a fair bit by calling it distributed and basically not putting a limit on its size. But the number of different models we have and the size of the weights are definitely going to be major factors in the low-level design. In particular, for something like large language models, just the first example that comes to mind, with billions of weights, even just loading all of those into memory would be a relatively slow process. So we need to think very carefully about what our deployment mechanism is, because it might be slow to revert. We might want to do some kind of blue-green deployment, for example, to make sure we always have a fast rollback to a previous version of the weights in cases where we get immediate alarms about a negative performance impact. That's something I would think about in a lower-level design: how to efficiently get the weights from storage into cache. That seems like a relatively important bottleneck to resolve.
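[Editor's note: the blue-green idea above, keep the previous weights resident so rollback is a pointer flip rather than a slow reload, could be sketched as a two-slot holder. This is a toy single-node model of the pattern; a real distributed cache would implement the same flip per region.]

```python
class BlueGreenWeights:
    """Keep two weight sets resident; serving reads go through `active`,
    so rollback is a pointer flip rather than a slow reload from storage."""

    def __init__(self, initial_weights):
        self.slots = {"blue": initial_weights, "green": None}
        self.active = "blue"

    def stage(self, new_weights):
        """Load new weights into the idle slot (the slow part happens here,
        off the serving path) and return which slot was staged."""
        idle = "green" if self.active == "blue" else "blue"
        self.slots[idle] = new_weights
        return idle

    def promote(self, slot):
        """Cut serving traffic over to the staged slot."""
        self.active = slot

    def rollback(self):
        """Instant revert: the previous weights are still in memory."""
        self.active = "green" if self.active == "blue" else "blue"

    def current(self):
        return self.slots[self.active]
```

The cost is double the memory footprint per model, which is exactly the trade-off that matters most for the billion-weight case.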
Other than that, the training compute and the validation compute should be pretty trivially horizontally scalable; it's kind of an embarrassingly parallel problem in the general sense. So horizontally scalable compute should keep that from being a bottleneck. We would want to right-size our allocation there to stay on a reasonable budget, so we would probably establish some kind of scaling factor based on memory usage and compute usage across the fleet of compute instances, to indicate whether we need to spin up new ones elastically or spin them down at times when there's less training happening. But again, I don't really conceive of that as a major bottleneck, because it's already horizontally scaled in the design.
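[Editor's note: the elastic right-sizing rule above, scale the fleet on memory and compute usage against a target, could be sketched as a single sizing function. The target utilization and the min/max bounds are assumed values standing in for the "reasonable budget".]

```python
import math


def desired_instances(current, cpu_util, mem_util, target=0.6, min_n=1, max_n=64):
    """Size the training fleet on whichever of CPU or memory is hotter,
    aiming for a target utilization, clamped to the budgeted range."""
    hottest = max(cpu_util, mem_util)
    want = math.ceil(current * hottest / target)
    return max(min_n, min(max_n, want))
```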
Supersonic Seahorse: Okay, yeah, let's pause a little bit here, because I think we're almost at time; I've made all my notes on my side. I want to make sure we have some freeform discussion at the end. Okay?
Occam's Chameleon: Yeah, sure, absolutely.
Supersonic Seahorse: Great. Brian, I always ask candidates on this platform: over the past maybe 55 minutes, you gave me a really good presentation, but from your perspective, what is the highlight and what is the lowlight of your presentation today?
Occam's Chameleon: Let's see. For highlights, I think the architecture I put together here is pretty comprehensive of the end-to-end lifecycle and many different aspects of machine learning model management. It was so comprehensive that maybe a lowlight for me is that I didn't really get to write much in the design that indicates my depth of expertise on some of the actual scaling issues and bottlenecks here. So, at a high level, what signal do I feel came across in this interview? I feel it came across that I'm able to think broadly about the problem, keep a large number of different facets of it in my head, and reason about complicated problems piece by piece. But I don't think I necessarily demonstrated a huge amount of deep technical expertise on the scaling of the system, because I spent a lot of time making sure that every aspect was comprehensively recorded. I guess it depends a little on what the interviewer is looking for, but I can imagine a world where the interviewer comes away from this thinking: this guy knows a lot about design documents; I don't know if he knows a lot about actually building scalable systems. I did my best to cover those bases verbally. Some of this is maybe unfamiliarity with the actual platform, the drawing-tool situation, and I need to practice that a little, because I think that was my speed limit a few times. I'll come up for air; that's my initial assessment of myself.
Supersonic Seahorse: Sounds great. I second most of the self-assessment you came up with; I think this was a very good presentation. I'm especially impressed by the right side: it's super comprehensive, and you actually offered way more than I expected for this scenario. This is an end-to-end machine learning platform, and you covered the data collection, the training, and the serving with tons of great details, and you tried to connect all the dots together, which I think is terrific. So the TL;DR: if I were conducting this as a real E5 system design interview, I would give this candidate the hire decision. But while you were showing your thinking process and calling out your solution, I kept thinking: you have tons of great material, and you can articulate your thinking process in a crystal-clear way, so what is the missing piece that keeps this from going from hire to strong hire? Hire is great nowadays, but I keep thinking about the headroom to go from hire all the way to strong hire. So here is my honest feedback on the presentation. You have tons of great content, and you compressed everything into the time we had. But think about this as a presentation, right? You are in front of real human beings, and I would want you to sometimes pause a little bit, so that you give some space back to the audience and they can ask their questions. You have tons of great content, and you're showing your thinking process and building everything together, but I actually had some quick questions I wanted to chime in with, and I couldn't, because I didn't want to break your flow.

But at some point that puts you in another dilemma: you're driving the conversation, but you have no idea whether what you're covering is the most important topic on the interviewer's side.
Occam's Chameleon: Okay, come up for air and check in a little bit more.
Supersonic Seahorse: Exactly, exactly, exactly. This is what I'm thinking, because I also serve as a hiring committee member. When I read through the feedback from different interviewers, a hire means we passed the bar, but for a strong hire decision, most of the feedback says the interviewers felt it was a very good discussion, not simply a presentation: they weren't just serving as the audience; it was a very good discussion between them and the interviewee. So what we can change a little here is that maybe at some point you pause and roll the ball back to the interviewer, to collect signals from them and see what the most important thing is that they want. My definition of a very good presentation or discussion is that we fit the right content, the most important content, to our stakeholders within the time budget. Okay. And another thing, I made another note on my side: I think we may still be missing something here, about the trade-off discussion. This is a workable solution, a great solution, don't get me wrong, but I keep thinking: what if reality is very complex? What if we build the system and it doesn't work as we expected? What would be plan B? The reason we really want candidates to talk about trade-offs, and especially at the staff level the trade-off discussion is mandatory, is that we want to see the breadth of knowledge on the candidate's side. We want to see: do you have a plan B? If the plan doesn't work, what is plan B? Do we have multiple options for one solution? This is something we want to know from you.
And like I said, for a strong hire decision, most of the time the interviewer will measure whether the candidate can proactively talk about trade-offs. So the question for you is: looking back, are there places where you think you could have talked about the trade-offs?
Occam's Chameleon: Yeah. So one we kind of discussed, though not until you prompted me, was the different trade-offs for how we might actually store the weights.
Supersonic Seahorse: Exactly. Exactly.
Occam's Chameleon: That's a big one. Another trade-off I can think of is the need for complexity in the architecture of something like the data registry, versus just relying on the individual models themselves to track the tables, versions, and data locations they care about. And actually, that raises a question for me. I do want to get back to trade-offs, but: how important do you feel it is to start with a simple architecture and evolve toward a finalized one as you run into speed bumps, versus trying to arrive at a finished, complex architecture from the start, anticipating issues and building them in? I feel like I spent more time on the latter, but I think I lose out on an opportunity to demonstrate my awareness of those issues by illustrating why one architecture wouldn't work and how evolving the architecture would help. Do you think that's important in a case like this?
Supersonic Seahorse: Great question. Each interviewer has their own preferences, and I can't speak for all of them. But looking back on the many interviews I've conducted: within this time budget, especially at Meta and Google, we only have 45 minutes, compared with companies that have an hour, 15 minutes more. So for Meta or Google interviews, I would suggest coming up with something simple first, a brute-force approach, a workable solution, and then coming back to optimize it. And given the time budget, I don't think we can optimize everything, so I would encourage you to talk with the interviewer to see what they would like you to focus on. If you collect the signals, feed them the right answers, and conclude the interview within the 45 minutes, I think that's the safe bet. Yeah. Cool.
Occam's Chameleon: That's really good feedback. Anyway, to get back to your question about areas that are candidates for trade-off discussion: I think the weight storage, and the actual storage mechanism there, is definitely a big one. Even the weights cache is something that's a little under-examined here; we talked a little about that at the end as one option for improving scalability, but there's definitely more I could have talked about or identified there. Maybe a third thing is trade-offs around the actual model registry. I gave a kind of canned example of how we might have different statuses for the models that inform whether we're ready to go broad with them, but there are any number of different designs we could have gone with for how we move from training to validation to deployment. So that's an area I could have talked through some trade-offs on. Those are the ones that come to mind; are there any you feel I'm missing?
Supersonic Seahorse: No, I think you covered more than I expected. The database or storage part is very natural, and I second you that the cache is another very obvious one. If we were talking about queues, maybe that's another place to discuss trade-offs. Typically, on my side, databases, queues, and caches are the three things where we can easily talk about trade-offs; there should be some discussion of those.
Occam's Chameleon: And you actually mentioned queues; I didn't even get into the idea of potentially triggering the training processes asynchronously.
Supersonic Seahorse: Exactly.
Occam's Chameleon: That's definitely another candidate area to go into.
Supersonic Seahorse: Yeah, for sure. Great discussion. Okay, so, Brian, like I said, I think you have already passed this round, but I've shared some very direct feedback with you; hopefully you can use it.
Occam's Chameleon: Yeah, absolutely.
Supersonic Seahorse: Yeah, a strong hire decision. And like you mentioned, you will have the real interviews in the coming round. With that, I wish you all the best.
Occam's Chameleon: Thank you very much. I appreciate all the feedback. Thanks.
Supersonic Seahorse: Thank you so much. Talk soon.