Watch a technical mock interview with a Google engineer
Intergalactic Avenger, a Google engineer, interviewed Space Pheonix in Systems Design
Summary

Problem type 

Distributed databases

Question(s) asked 

How would you organize a database so that you can add more machines once your current ones reach maximum capacity?

Feedback

Feedback about Space Pheonix (the interviewee)

Advance this person to the next round?
  Yes
How were their technical skills?
3/4
How was their problem solving ability?
N/A
What about their communication ability?
3/4
You had lots of practical knowledge of the field, which really came through. I did have some issues understanding exactly what some of your proposals were. You had the high-level details down, but when it came to specifics like exactly how data would be distributed or queries would flow through this hypothetical database, I wasn't exactly clear at times. Otherwise, great work!

Feedback about Intergalactic Avenger (the interviewer)

Would you want to work with this person?
  Yes
How excited would you be to work with them?
N/A
How good were the questions?
4/4
How helpful was your interviewer in guiding you to the solution(s)?
4/4
An open-ended question, but really an everyday problem that we run into. We had a great discussion exploring different ways of solving this problem, and the interviewer understood the depth of the problem well enough to discuss those approaches further. Nice discussion! Thanks!
Transcript
Intergalactic Avenger: Hello?
Space Pheonix: Hi.
Intergalactic Avenger: Hey, how's it going?
Space Pheonix: Good, how are you?
Intergalactic Avenger: Good, good, doing good. Alright, so if it's okay, I'll just jump right in with a technical question.
Space Pheonix: Yes, sure.
Intergalactic Avenger: Yeah, so I think we'll do not necessarily coding but more of a system design question today, just to keep it... I'm sure you've answered enough questions about hash tables for one day, so let's start with a distributed systems question. So the idea is that you've got a database that you're keeping in MySQL, some standard off-the-shelf database system like a SQL database, and we could even say for this purpose, for simplicity, there's just one table. You're getting more and more rows in that table and eventually you find there are more rows than can fit on one machine. So you have access to multiple machines and they're all networked together somehow. So can you think of a way to organize the database, given that you just have access to SQL, you don't have anything like... you know, any automated tools for distributing and... So just kind of generally describe what this is going to look like.
Space Pheonix: Okay sure, so the idea is I just have access to a SQL database on multiple machines where I can store the data. Right now it's all going to one machine and it cannot scale for the data that we are getting. I'm just making sure that I understood correctly. And somehow we need to partition this database, distributed across multiple machines, in a way that we can still query it in the future. So let me first ask a couple of questions, like before splitting the data, is it... so there are a few ways.
One is geographic: if the data has a country associated with it, for example a country code, we could potentially have a country or region code. We can split the world into multiple regions, like North America, South America, or Africa and Asia Pacific, probably three regions or so, and split across three different machines, so that any data related to a specific region goes to a specific machine. Probably one layer, one machine, that load balances these things. Somewhere we need to have a component that looks at the data and then decides which machine should store it.
Intergalactic Avenger: So the idea that... so yeah, that makes sense that there's going to be a lot of... if you have so much data, it might be coming in from different sources, so would the idea be that country code is just something that's in the database, or is that also relative to who's asking for that data? Like do the European users talk to the European data and the North American users talk to the North American data?
Space Pheonix: Okay, well...
Intergalactic Avenger: That kind of brings up the question of, is that how you were thinking about it?
Space Pheonix: That was probably my next question: how will the database be queried in the future, whether different regions will ask for other regions' data or will just look for their own region's data when they query. So that's a clarification question. If it's going to be like Asia Pacific people asking for data from Europe, then pulling the data from two different machines, merging it, and responding is going to be complicated, so we have to think of a different way of partitioning this. Probably any old data we can keep in a backup database, with one transaction database where you have only the current month's data.
Anything older, you go to a different machine and pull it from there. So it depends upon the query pattern: if quite often it is last week's, last month's, or the last six months' data that is being queried, that can be kept on the main machine where we do the transactions, and all the rest can go to, like, the analytics jobs that are... this is the kind of model I would think of.
Intergalactic Avenger: Okay, let me just, yeah... there are a couple of questions packed in there. So with respect to the geographical regions, that's definitely a very good way to start. If this was the kind of system where there was country-specific data that was being read from and written to just with respect to that country, and people in that region were the only ones doing it, then that's an excellent way to start. But let's say for the purposes of this exercise that that's not the case. Let's say that all the data is stored centrally in one data warehouse that's located in one geographic area, so that's just to cover the first one. So then that was an interesting point with respect to the recency, so can you just sort of sketch out for me an algorithm of how that would work? As I'm understanding it, there's going to be one machine in the front that is going to take in the query and look at the time range that you are asking about, and based on the time range it will farm that off to different machines. So how then does that work... let's say that you are going to have that split by day or, I guess you said month, so by month. So is the idea that every month you would add a new machine for that month, so that there's a machine for December 2015 and then for next month you'll need to make a new machine that is going to host the new data, or do you migrate the old data backwards?
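A minimal sketch of the front-machine routing described in the question above, assuming one "current" machine that holds roughly the last month of rows and one "archive" machine for everything older. The machine names, the 30-day cutoff, and the function name are illustrative assumptions, not details from the interview.

```python
from datetime import date, timedelta


def machines_for_range(start, end, today, hot_days=30):
    """Return the machines a query for the date range [start, end] must touch.

    Hypothetical layout: rows newer than the cutoff live on "current",
    everything older has been migrated to "archive".
    """
    if start > end:
        raise ValueError("start must not be after end")
    cutoff = today - timedelta(days=hot_days)
    targets = []
    if start < cutoff:
        targets.append("archive")  # part of the range is older than the cutoff
    if end >= cutoff:
        targets.append("current")  # part of the range is recent
    return targets


today = date(2015, 12, 31)
print(machines_for_range(date(2015, 12, 20), date(2015, 12, 25), today))
print(machines_for_range(date(2015, 11, 15), date(2015, 12, 10), today))
```

A range that straddles the cutoff needs both machines, which is where the merging cost discussed later in the interview comes from.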
Space Pheonix: Yeah, so migration is what I was thinking. I wasn't proposing to add a new machine every month. Since the transaction database is the bottleneck, where everyone's inserts and SQL queries for the current month go, that is going to be one machine, and for the rest... a machine can probably accommodate more than a month of data. So requests will go to probably one machine or two machines depending upon the size: if it can all be held on one machine, then it could be just one machine, or it could be more than one machine, depending upon the size of the database. That was what I was thinking.
Intergalactic Avenger: Okay, I like that idea. So this is the interesting point. So you have one machine that's kind of the current one, and then you have maybe multiple machines that are holding the older data. So how do you split that up? Let's say that right now you have one machine for December and then one machine for all past data. Then you're looking at this in the future and you say, oh whoops, the machine that right now has all the past data in it is about to be full. So now how do we split up that data?
Space Pheonix: Yes, so in that case... not in that case. We have to design to accommodate this. The one machine for all past data is definitely going to get full; it's not going to just hold everything for the next 10 or 20 years. So the design definitely has to allow for adding more machines as and when needed, so the backup database is served from multiple machines. So in a way I would rather say probably go for like a five-year, ten-year kind of database. So these are all a kind of parallel partition, right? There is a completely different concept, vertical partitioning, where a table can be like...
so I'm not technically completely aware of how the whole thing works, for example in the big data world, how they vertically partition a table: out of, say, 20 columns, the first ten columns would be stored on one machine and the remaining ten columns would be stored on a different machine. But I don't have the technical depth of knowledge to explain how that is completely implemented or how the whole thing works, the vertical partitioning.
Intergalactic Avenger: Let's not worry too much about vertical partitioning. I mean that's certainly one way you could do it, but let's just assume that there's a relatively small number of columns. It's just that there are more and more and more rows.
Space Pheonix: Yeah okay, so considering there are a number of backup databases, I would probably suggest something like the data up to five years old on one machine, and the data more than five years old on more than one machine. That's the kind of approach I would think of.
Intergalactic Avenger: So let me see if I have this correct, if I understand the scheme correctly. I guess the one piece that's unclear for me is how does the data then sort of move and get migrated from how it is now into the past? Like once you've decided that some row is old by some definition of old, maybe it's one month or one year, whatever, how do you decide where it goes? And how does the migration work? So my question is, when you do that migration, do you go over all the data in the entire history as you redistribute it, or do you just take the current data and push it somewhere?
Space Pheonix: Right, so what I would do... let's say this is my addition: the component which decides where to find the data, so it knows which server I need to go to for the data.
As for how it knows, the strategy is: let's say I have the current machine, which does the transactions and everything for just the last one month. That's all I'm going to keep on a separate machine.
Intergalactic Avenger: Just so you know, there's a little whiteboard if you want to draw, and there's also just the text if you want to draw little boxes in the text, so if that helps you to explain it or to think about it, you have those options open to you.
Space Pheonix: Yeah sure, let me try that. This is my second interview, just trying this one.
Intergalactic Avenger: Yeah, there's a little button that has like a pen on it, and that's a little whiteboard so you can draw stuff.
Space Pheonix: Oh yeah, so yeah. So let's call this one OLTP, like a transaction database. So this is the current data. And even before this, let's say we have a component called the load balancer or whatever, which decides where the data should be queried. So all the queries come in here. The load balancer decides where to go. The OLTP machine answers queries for transactions in the last month, so less than one month old here. And if we get a query for old data, data greater than one month old is going to be stored in this backup server. At the point where this is going to be filled completely, there should be a way... like I said, this is going to be filled up; for example, we have 10 years of data here now. So right now we've filled it up and there is no space in here. At this point, when I want to split this thing, I think my suggestion would be to just add one more server, where whatever new data we back up from this server will go into the new machine. Like I have one month of data here, now I'm in the second month, so that one-month-old data is what I will put into this new server.
Now I need to find a way to query from these two machines and then return the results, and that has to happen based upon the query, in the load balancer. Depending upon the query, I have to find out whether I need to go to just one backup server, or to backup server one, backup server two, and backup server three. Depending upon the query I would just add these up, so that is probably what I was thinking. There is one other way: I have the complete backup of ten years of data, and I have four machines now, so I could split it across all four machines. I wasn't thinking of that.
Intergalactic Avenger: Yeah, that's fine. So let me see if I understand the path here. I guess I'm a little bit confused... so the load balancer at the beginning looks to see if it's within a month or greater than a month, and so then it can choose one or the other, but what is this other piece in the middle, this one right here? I guess I'm confused as to what its role is... because it seems like the data is either split into less than one month or greater than one month... I guess I don't understand what...
Space Pheonix: Yeah, yeah, so let me clarify that... So this one, let's say for example I am in 2015 now, my first database server is filled up, so anything older than 2015 goes into this machine.
Intergalactic Avenger: I can help erase the old stuff if we want to erase it.
Space Pheonix: Yeah sure. Is there a way?
Intergalactic Avenger: Yeah, you just have to make the eraser a little bit bigger. It's fine, yeah, okay.
Space Pheonix: Okay, so this one contains the data from before 2015. And now we learn this database is filled up, and we are adding a new machine, and this will contain anything after 2015, like greater than or equal to 2016 for example.
So going forward from the OLTP machine, I'll just start backing up things every month into this database machine two. That's the kind of backup service that we will write, which will run in offline mode. So this load balancer itself could understand machine 1 and machine 2, and depending upon the query, we have to split the query. Let's say someone asks for something pretty recent, like two months of 2015 and one month of 2016; then this load balancer really has to parse the query and send it to the two machines. That way we get the rows from both, join them, and send them back. So that's the load balancer's job, and it can be configurable: we can say, okay, year 2015 is machine one and so on. You can probably put that in some XML configuration or something, where this strategy is easily changeable, so we can add more machines going forward. When we add machine three, we can just say 2017 and beyond, so that can be configured just like that in the configuration file here. So this will be written by a program or something like that.
Intergalactic Avenger: Okay, now that makes sense. So we started off with the idea of it being geographically partitioned, and that's good, that will work for some things. And this is partitioned by time, so in what type of scenario is this going to be ideal? Like what kind of query patterns of people querying the database are you going to see that this is optimal for?
Space Pheonix: So this is for something like a transaction database, like an Amazon order service, where you get plenty of orders every minute and every day. That's probably the kind of volume of data, where you get it all from the same country and you get a lot of orders. That's probably one scenario, where you get tons of records every day.
Intergalactic Avenger: Well, let me put it this way.
So for example, let's say we're talking about an Amazon-like database. And the things you're storing are things like orders, like who ordered what, right? And I could imagine that in this type of situation, people are looking up recent orders more often than they're looking up past orders. They're probably looking up something they just ordered this past week to see its status, but they're probably not looking up their old orders very often. It kind of feels like you're going to get a lot of traffic going to this one machine. Then all of the backup machines are going to be less utilized. So let's just keep going with this sort of Amazon-style database and say that after putting this into place you notice that this one machine here gets most of the action and becomes the bottleneck, and these older ones, people aren't really querying them that often, so they are less utilized. Can you think of some way that will better distribute the workload between all of these different machines so that you don't have this one as the bottleneck?
Space Pheonix: Yeah, sure, definitely. So the other way to distribute would be probably... since country wasn't an option, because every order is coming from the same country, let's say for example North America, in that case I would say just go with the last name. Usually people query by... well, the user identification, the user ID, is of course not the thing we can use. Probably I would go with last name: starting letters A to, let's say, P go to machine 1, Q to Z go to machine 2, and depending upon the number of machines I have, I can split evenly across all the machines. But the scenario I'm thinking about is just the orders; the approach we use to split the data between the machines can vary depending upon the data we are getting.
For example, right now we are assuming everything is orders, and every order is placed by someone with first name A and last name B. That's the scenario I'm talking about. So that way, when any user comes to query something, they go to just one machine. They will never hit multiple machines, and there is no question of joining queries and merging the data from two different machines and then serving the users. So that's probably one approach I would go with.
Intergalactic Avenger: Okay, no that's essentially... because that's definitely going to spread it out so that all of the machines are getting, you know, the current orders and the past orders. Another challenge for you here: if you split it up by letter, some letters are going to be more popular than others, so there are many more last names that start with the letter T than start with the letter W.
Space Pheonix: So I have a strategy for that. We really have to study the pattern of orders over the last few months and see where the orders are really coming from, and that's how we decide the number of letters that go to machine X and machine Y. We can't just assign 13 letters to one machine and the remaining 13 letters to the other machine. We definitely have to look at the pattern of the last couple of months and, depending upon that, say okay, probably the first 5 letters go to machine 1 and the remaining 21 letters to machine 2, something like that. And there is one more problem with this: since you are storing orders and everything here is split across multiple machines, there is the case where the same order database will be accessed by, say, the people who are fulfilling the orders.
In that case, they cannot really query by name or anything; they have to see all of today's orders, and then we definitely run into the issue of pulling the data from different machines, merging everything, and then returning all the data. So there is no way we can guarantee that you're always hitting one machine; that's not highly likely. We are solving the problem on just one side, but on the other side we still have the problem of querying multiple machines and combining the results, so yeah.
Intergalactic Avenger: That's a very good point. So just going back to that last issue, if you said that you wanted to find all the orders... so let's say we're going with the scheme you described before, with all the last names, and so that's now distributing the data a little bit better, but now the dates are essentially no longer stored centrally. So now you want to issue a query that says, you know, show me all of the orders today. Where is the bottleneck going to be in that case?
Space Pheonix: It's going to be the load balancer. The load balancer is the one that gets all the requests, and it has to decide where to go pick up the data. With machine one, machine two, machine three, if for example you want everything from today, it has to go to all the machines, get the data from all of them, and then it has the job of joining everything: you union all the results and then return them to the client. So the load balancer kind of becomes a bottleneck, yeah.
Intergalactic Avenger: So that's certainly true if you're doing some aggregation of all the data. Let's say that you wanted to sort all of today's orders; then you're right that the load balancer becomes the bottleneck because it would have to aggregate all of them and then sort them.
But what happens if they just ask "give me all of today's purchases in any order, I'm not concerned about the order"?
Space Pheonix: Okay, so in that case there could be one more possibility: the load balancer gets the client request, and it knows where all the data is, so we could potentially return the results directly from machine 1, machine 2, and machine 3 to the client. That works only if the load balancer acts like a, probably like a velocity server, where the client gets the information about where to go pick up the actual data. So it will return machine 1, machine 2, machine 3 to the client, and then the client will run something like three parallel queries, querying those machines directly, so the load is balanced completely between all the machines.
Intergalactic Avenger: That's an idea, I like that. Okay, so last one. These are all excellent ideas, so let's just dig a little bit deeper into the last name issue. So you're right, I really like that idea of looking over some past data, like usage patterns, to see who are the types of users that use it more often and that kind of thing, and that's certainly something you do dynamically. But let's try to think of something that you can do more statically, without looking at the usage patterns, to try to clear up this issue of more people having last names with the letter T than with the letter W. So is there some way that you can distribute the records so that even if... so let's just say for example that you have 100 machines, and hypothetically this would mean you would put four letters on each machine because there's basically 25 letters in the alphabet but...
Space Pheonix: You mean each letter on four different machines.
Intergalactic Avenger: Yeah, so you would put the first quarter of the letter A on one machine and the second quarter of A on the second machine, and so on. So that's one way to do it, but it turns out that you could fit all of the complete letter Z on one machine and actually all of letter Y too, because there are not very many people with last names starting with Y. So those two letters can actually fit on one machine, but the letter T needs ten machines. So one simple way of doing it would be to create a table with each individual row and which machine it is on, or you could create a range: you could say, well, this person to that person is on this one machine. But there's kind of a lot of bookkeeping to do there. Can you think of a simpler way so that when a query comes into this load balancer, it knows exactly where to go, very quickly and easily, for which machine is holding that person's data, and you don't need to keep any type of bookkeeping around? The bookkeeping is fine, except that, let's say you wanted to add someone to the database or add an order, well, you would kind of be constantly shifting around how they're all distributed if you wanted to optimize it. So can you think of a way to distribute all of these people so that no machine is overloaded, all of the machines have roughly the same number of records, and you don't need to do any type of heavy-duty bookkeeping or shifting around of data as more people come online or more orders are made?
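To make the bookkeeping cost in the question above concrete, here is a minimal sketch of the range-table idea ("this person to that person is on this one machine"). The class name, the boundaries, and the machine names are hypothetical and chosen only for illustration.

```python
import bisect


class RangeTable:
    """Maps last names to machines using sorted range boundaries."""

    def __init__(self, starts, machines):
        # starts[i] is the smallest lowercased last name owned by machines[i].
        # The list must be sorted and begin with "" so every name has an owner.
        if len(starts) != len(machines) or not starts or starts[0] != "":
            raise ValueError("need one start per machine, beginning with ''")
        if list(starts) != sorted(starts):
            raise ValueError("starts must be sorted")
        self.starts = list(starts)
        self.machines = list(machines)

    def machine_for(self, last_name):
        # Find the last boundary that is <= the name.
        idx = bisect.bisect_right(self.starts, last_name.lower()) - 1
        return self.machines[idx]

    def split(self, new_start, new_machine):
        # Adding a boundary is cheap here, but in a real system every row in
        # the new range would also have to be migrated to new_machine.
        if new_start in self.starts:
            raise ValueError("boundary already exists")
        idx = bisect.bisect_left(self.starts, new_start)
        self.starts.insert(idx, new_start)
        self.machines.insert(idx, new_machine)


# Illustrative layout: the popular letter T is spread over two machines.
table = RangeTable(["", "h", "p", "t", "tm"], ["m0", "m1", "m2", "m3", "m4"])
print(table.machine_for("Taylor"), table.machine_for("Turner"))
```

Lookups stay fast, but keeping the ranges balanced means calling something like `split` and moving rows whenever the name distribution shifts, which is the constant reshuffling the interviewer wants to avoid.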
Space Pheonix: Okay, so I was just quickly thinking of mapping the machine in the user table, where each record will say, okay, this user always goes to this machine. That makes it easy to return the machine, or whatever detail we need about where the data is stored, but in a way it becomes complicated: what happens if I have to bring one machine down? I have to go back, find all the users assigned to that machine, and update them with some new machine. Those are the things we have to think about, but let me think a little more... What I was thinking was giving some weightage, but again that sounded like the heavy-duty bookkeeping kind, where I say starting letter A gets weightage 1, and T, with a lot of users who order, gets weightage 10, something like that. So giving a weightage to each letter is again, I think, kind of high bookkeeping; I wouldn't suggest it.
Intergalactic Avenger: How would you pick the weight? Is the idea that the weight is the number of users with that letter, so the weight for the letter P is how many users' last names start with the letter P? That kind of thing? Well, there's something to that, that's an interesting one. I hadn't thought of that one before, but that makes some sense. So then what would you do with that weight?
Space Pheonix: Yeah, so let's say you have 100 machines; the weight will tell you what percentage of the machines you need for each letter. You just sum up the weights, find the weighted percentage for each letter, and then, depending upon the number of machines, you assign that many machines to that specific letter. But that's again more bookkeeping. The other approach I was thinking of, since we need to distribute evenly every order that comes in...
so let's say we want a very highly distributed model where order one goes to machine one, order two to machine two, order three to machine three, order n goes to machine n, and order n plus 1 goes to machine one again.
Intergalactic Avenger: I like that idea.
Space Pheonix: Yeah, but how we really achieve that is what becomes questionable. Let's say I get order one, order two, order three from one user. If order one from user 1 goes to machine one, I could assign user 1 to always go to machine one from then on, and the next order comes from user 2 and goes to machine 2, so I will assign that user to machine two. But again that becomes static afterwards, that's not dynamic anymore, so there should be a better way of distributing so that all the machines are always used equally, that's what I'm what...
Intergalactic Avenger: Actually let's explore this latest idea that you had, which I actually think is quite elegant because of its simplicity: as you're writing new things, the load balancer will just have a counter that will roll over, so it's super simple to figure out where each record is going to go. And so then the only question is how do you retrieve it?
Space Pheonix: Yeah, so the reporting again becomes complicated there...
Intergalactic Avenger: Um, well, maybe not. I mean, if you think about it, let's say that you say, well, give me all the orders from today for some certain user? That's what you want to query, right? What can the load balancer do to find that information, given that it has totally forgotten where it put that person? Because maybe it saw that person a million transactions ago and it didn't keep track of where it put that person. So how do you get the answer to that query of, you know, how many purchases did that person make today, right?
Space Pheonix: So one...
these are all reporting queries, after the fact, so we could potentially... I'm just thinking of these sources. There are two things, right? One is OLTP and the other is the analytics server, which is just used as a reporting server. Quite possibly we can have the transaction data from all these machines duplicated or pushed to some reporting servers, where this load balancer can always go to query the data... let me think, so let's say we have 100 machines here. The main reason we are distributing is that users are online and we need to serve them quickly; that's the main reason we are distributing all these transactions during insertion. But reporting can usually take a couple of seconds more. We could potentially have separate reporting servers, where all the data gets synced to the reporting servers and the load balancer knows this: any query that is just reading data for a report is always pointed to the reporting servers. That is one possibility, but let's think whether there is any other way that we can always pull this data from the same machines. I don't see one, because we lost the pattern: as you said, the load balancer forgot where the data is, and the load balancer is not keeping any mapping between the machines and the pattern of queries. Let me think a bit more.
Intergalactic Avenger: So one thing that you're optimizing for, which is generally a good thing, is that you're trying not to ask more machines than necessary. So for example with this first description that you have here, where it's sharded by time, you have the less than one month and then 2016, 2015, and you have the load balancer up at the front that decides I'm going to ask only one of these machines. That saves CPU cycles obviously, because you're only asking the one proper machine where the data is, but that's not necessarily a restriction.
You could ask more than one machine.
Space Pheonix: Yeah, definitely. So let's say we have a hundred machines and we store the data evenly from the load balancer, and now the query is to select something. I could go ask more machines, but my worry is that would slow things down when someone wants a report of today's orders or last week's orders or whatever. Querying more machines will slow things down if it all goes through the load balancer.
Intergalactic Avenger: Well, actually, if you think about it, if you are the reporting server and you are trying to get all of the day's orders, you still have to ask every single machine, because somebody with a last name starting with P is going to have an order and someone with a last name starting with Q is going to have an order, so you still have to ask all the machines when you're reporting. So I think that's actually not a big problem. I don't think it slows down the reporting aspect of it.
Space Pheonix: Okay, um, yeah, could be. So let's say it is just a select query; then probably we need to distribute the load balancer now. The problem is that all the insertions go through the load balancer, a single server balancing everything. If it finds a select query or a report query, it should just hand it back to the client and say go find the other load balancer on that other machine. The other load balancer is mainly the reporting load balancer, and this one is just the transaction load balancer. So all the reporting queries will go to the second load balancer, and it can easily query all the machines and then respond and serve the user. That way we could split the load between the load balancers, potentially.
But I would not really suggest having this one machine querying all the machines for reporting purposes as well as the same load balancer doing all the insertions to all the different machines. That will really become a bottleneck, I think. Intergalactic Avenger: So one question is, I wonder, in this iteration, there are going to be multiple load balancers; is each one of them going to query all of the machines or some subset of machines? I guess I don't understand the multiple levels of load balancing. And then the other question is why can't the original load balancer just ask all of the machines in parallel what all the results are? Space Pheonix: It can, it can. I'm just worried whether it will become a bottleneck for responding to the queries because it's doing all the insertions. As we are thinking about it, we are talking about all the data for sites like Amazon, with all the orders coming through this load balancer, and as soon as you are doing the reporting through the same load balancer, which is querying all the different machines, I am really worried that it will become a bottleneck and slow things down. So what I was proposing was: whenever this load balancer gets a reporting query, it knows, okay, this is just a report, and goes back to the client saying okay, this is just reporting, go to the other load balancer. So now the client hits load balancer two, which knows, okay, I deal with just the reporting part of it. That will still query all the machines, but not through the same load balancer. So load balancer one will just do the insertions and stuff like that, and all reporting will go through load balancer two, which will finally query from all the machines and then serve the user. Does it make sense? Intergalactic Avenger: Okay yeah, so I mean if I'm understanding that, so the... 
it would sort of be introducing like a priority or a tier for the different load balancers, and depending on what they were doing, they would sort of have a higher or lower priority to use up the resources. That's very smart, yeah, because you're right that the usage of the database is not going to be the same. You know, a user who's just logging into the site is going to want to see a small number of records very quickly, whereas the reporting server is looking at a very large number of records but can see them more slowly. So that's great, that makes sense. Yeah, that's kind of all I had. I know it's a very open-ended question, but I was just curious, wanted to just talk with you about that. I don't know if you've ever done any thinking about this; it seems like you must have thought about this some because, yeah, you have all these ideas about load balancing and OLTP and usage and sharding. It sounds like you know all of the concepts very well, so do you do this in your work already? I know that I normally don't see people that have this much background in this and all the varying context. Space Pheonix: Sure, I did some amount of this work in the past. I was working on really the same kind of ordering platform, with like a million-user kind of database I was dealing with, so we ran into similar kinds of issues when we were trying to do multiple computing things like this, and exploring it was all, again, learning and settling things down. There is no one best solution for everything. So that's why I was trying to understand these patterns a little bit, and making sure we balance things evenly is really important. But did you have any other better ideas, just in case? Intergalactic Avenger: So I do like the one idea of just having a round robin where you'd put each record, just over that kind of thing. The other way to do it, by the way, is with respect to the names. 
If you wanted to do it based on names, what you could do is, instead of it being based on the first letter, you could just take a hash of the entire name or the entire user ID or something like that. What that hash will do is take some space of, you know, a hundred thousand, a million, two million users and break it down into a number between one and however many machines you have. So just design your hash function so that it takes whatever your identifier is, be it their name or their ID or something like that, and remaps it into a space that is exactly how many machines you have. That way you avoid the problem of querying too many machines and having a lot of excessive network traffic. What you could do is just hash the person's identifier, and then you know that that identifier will go exactly to that one machine, and that way as you add more and more users, they just get randomly assigned according to that hash function. As you said, there's no one right answer. You know, the practical nature of your data will pick whatever the right one is, and I thought you had a lot of really good ideas, so that's great. Space Pheonix: Alright, yeah. That's good thinking, doing the hash function, yeah. That's something new I learned today. Intergalactic Avenger: Good, great. Alright, so you have a good day. Space Pheonix: Thank you so much, you too. Bye-bye. Intergalactic Avenger: Take care, bye.
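Editor's note: the two ideas that close the discussion, hashing a user identifier to pick exactly one machine and asking every machine in parallel for a report, can be sketched roughly as below. This is only an illustration under stated assumptions, not something either speaker wrote: the names `shard_for`, `Cluster`, and `NUM_SHARDS` are hypothetical, in-memory lists stand in for the separate SQL machines, and SHA-256 is just one reasonable choice of a hash that is stable across processes.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # assumption: number of database machines


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an identifier to a machine index in [0, num_shards)."""
    # hashlib is used instead of the built-in hash(), whose value for
    # strings changes between Python processes.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class Cluster:
    """Hypothetical stand-in: each list plays the role of one SQL machine."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [[] for _ in range(num_shards)]

    def insert(self, user_id: str, order):
        # A write touches exactly one machine, chosen by the hash.
        index = shard_for(user_id, len(self.shards))
        self.shards[index].append((user_id, order))

    def orders_for_user(self, user_id: str):
        # A per-user read also touches exactly one machine.
        shard = self.shards[shard_for(user_id, len(self.shards))]
        return [order for uid, order in shard if uid == user_id]

    def report(self, predicate):
        # A report has to ask every machine; do it in parallel and merge.
        def scan(shard):
            return [row for row in shard if predicate(row)]

        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            parts = list(pool.map(scan, self.shards))
        return [row for part in parts for row in part]
```

For example, `Cluster().insert("alice", 42)` followed by `orders_for_user("alice")` only looks at one shard, while `report(lambda row: True)` scans all of them. One caveat with plain modulo hashing: changing the number of machines changes where most identifiers map, so existing rows would need to be moved when machines are added.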
©2020 Interviewing.io Inc. Made with <3 in San Francisco.