Interview Transcript
Intergalactic Avenger: Hi.
The Incredible Hawk: Hey.
Intergalactic Avenger: How's it going?
The Incredible Hawk: Good, how are you?
Intergalactic Avenger: Doing good. Alright, you ready to jump on in?
The Incredible Hawk: Yeah, let's do this.
Intergalactic Avenger: Okay. Just really quick question, what's your background, like what... this isn't really too much... the question I have isn't too heavy in programming, but just what kind of programming languages are you comfortable with so I have an idea.
The Incredible Hawk: Python would be my favorite, a little bit of JavaScript.
Intergalactic Avenger: Okay, perfect. Alright, so today I'm going to ask you a question about data munging and... well, let's just stick with that. Last quick question, do you understand the the fundamentals of baseball? Like the mechanics of innings and batters and pictures and stuff like that?
The Incredible Hawk: I'm gonna go with no.
Intergalactic Avenger: No, okay.
The Incredible Hawk: Yeah, I know very little about baseball. Depends how fundamental, I know there's like nine innings and there's four bases and stuff like that.
Intergalactic Avenger: Okay, I think that's all we need. We certainly aren't going to get too crazy into it. But yeah, if you know that there's nine innings and four bases, that's all we need. So basically what's happened to set the tone... so you've just started to work at a new startup that's called Internet Baseball Database and their customers have started to complain that the features aren't very good and that the site is really slow, so you've been you've been told to come on in and kind of help clean everything up. So the idea is that this company gets a regular data feed from major league baseball that has data that they're going to want to display on their site. Can you read what I'm typing here?
The Incredible Hawk: Yep, I see hi.
Intergalactic Avenger: Yeah okay. So basically what they are sending every day is an update of what happened in each of the the baseball games. So it's a very simple text file with tabs and it's going to have just column fields and it's going to say the date that the game was played, the home team, the away team, and every column actually is going to be an at-bat, so it's going to have the batter and for right now we'll just say the results, like the the outcome. So an example here might be like today's game would be on the 26th. The home team was the Giants and the away team was A's and the batter was Cocoa Crisp and the outcome was that he got a single. So the outcomes will be a like... s is a single, which counts as a hit, d is a double, which is a hit, t is a triple, H home run is a hit, and then there's also... K is a strikeout, it's not a hit, and f would be a flyout, something like that. So there's sort of like a little code for what happened and what... how the site currently works is that you get this text file every day that contains all of these lines, and the owner of the sites is very paranoid, so he doesn't want to put it into a database because he thinks the database... people are out to get him... so he stores it on the file system, so he'll have sort of on the file system it'll say data and then he'll put the year and then he'll put the day and then he'll just have the date for that that day. And the other thing is that he also is very paranoid and doesn't trust file systems developers, so he invented his own file system that's very inefficient and can only hold a thousand files in every directory, so that's why he can't just put the date like this because at some point there'll be too many directories or files in each directory, so he has to kind of split it up this way. And when someone comes to the site, right now the only thing they can do is they can look up a certain date and look at all of the information for the games played on that day. But he started to get some comments that wouldn't it be nice to look up other things, so for example, wouldn't it be great if you could just look up a particular player and see some of the stats for that player, and as you might imagine, using this technique of storing the data, that's a bit inefficient. So, your task in that, sort of a larger, overall task, is to come up with a way, given these constraints that you know you have to store the data in flat files and a file system and no more than a thousand per directory and that kind of thing, find some way to represent the data that would enable searching for a particular player and getting some statistics basis.
The Incredible Hawk: Okay, quick question. If you were to reorganize them as datasets like... something like this... would that directory probably have more than a thousand entries at some point?
Intergalactic Avenger: It would, because this is all of baseball for the hundred years that they've been playing. So it's definitely too many.
The Incredible Hawk: Alright.
Intergalactic Avenger: And specifically just to... so that's just the overall thing, and then the the specific sort of feature that we'd like to do is we'd like to be able to calculate or show on the site their batting average, so the batting average is the number of times that they got a hit divided by the total times that they went up to bat.
The Incredible Hawk: Is that the only thing we'd like to find about these people?
Intergalactic Avenger: For now, I mean I think if you can get that that far, then I think the other ones would sort of come naturally and I think this brings out the the core of where the challenges are. So let's just for this exercise... just explain how you would calculate the batting average.
The Incredible Hawk: Okay, so... another question... is something like a file like this that would have each row being a player and his batting average... that would have a lot of rows, but is there any limitation on that the size of that file?
Intergalactic Avenger: No, that's okay. So I see where you're going with that. That's true that you could just create single file that has all the batting averages and then if that's all the things that you're looking for, then that's pretty good. The one thing that I might say about that is that... if you were going to create that file, it would have one row for every player, right?
The Incredible Hawk: Yes.
Intergalactic Avenger: So then how long would it take to look up a player in that dictionary?
The Incredible Hawk: The size of the file, like where if n is the size of the file, it would take O(n).
Intergalactic Avenger: Right, so if we have hundreds of thousands of players and maybe it's not a very efficient file system, maybe that's also going to be a little bit slow because you got to read in the entire file and then do a search on it and then... But yeah, that's a good start. I mean certainly in the database world you would just create like an aggregate table like batting averages or something and then you could just look it up right away, but I think for this exercise, do you think there's a better way to do it that wouldn't involve having to read in the entire file?
The Incredible Hawk: Okay, the next idea... it makes sense that's going to be a lot of players. It's kind of looking like a RESTful API where like each file, so there's like... you could look through all the players and so this is player1.txt, which isn't terribly informative and we'll have like a thousand players per each each of these files. And it's kind of ideally, I'd want to say sorted by name, then that doesn't work when you're inserting... you don't maintain the sort, so that's not good. I mean then you have to change everything potentially.
Intergalactic Avenger: Well so the updates are not necessarily such a big problem because... so you're getting one of these files every day and so it's possible to just do one big offline update, like once a day. That's not such a big problem. But I'm interested in how you are... so the idea would be that that you'd have a thousand text files and each text file would have the batting average let's say of the individual file. Okay.
The Incredible Hawk: Yeah, so each one of those... the players folder is... so we're trying to look up by batting average and then we have some directories like... ideally we'd want it all in one players folder, but you can only have a thousand sub files, so that's where the the one, two directly come from so then you get a thousand squared. Inside of here is a player name... like something like that. And then you get a thousand players in the first directory and a thousand players in second directory and then the speedup bonus comes up by if it was like alphabetized or... I don't know how many players there are but a decent way to do it would be to default to like 26 folders where each each one is a letter. And so it allows for like expansion if a new player joins the game with like another A, there's one like 500 players with first name or last name, depending how you served it, that start with A, so you can just like insert it into that same folder structure and it won't require like making something different.
Intergalactic Avenger: Okay, so instead of... I think you've written one and two here, so you're saying that you could do A B. Okay. So then what happens when there's more than a thousand people with the first letter A?
The Incredible Hawk: Yeah, so if that was an issue then you have to do something like A1 and A2 and you're going to search through all these files, which is kind of annoying to find a player that you're looking for.
Intergalactic Avenger: So if someone comes in and they're named Anthony, you first open A1... Oh you do a search through the directory to see in A1 to see if the name is there then you do a search the directory A2 and see if they're there.
The Incredible Hawk: Yeah, that's what I was going to do.
Intergalactic Avenger: Okay, there's even a little bit faster way to do that search, even in this way... so that's a linear sort of scan but there's even a little bit faster way to do that one.
The Incredible Hawk: Typically when you're saying faster than a linear I'm thinking binary and if it's sorted you can kind of do a binary search. So I mean yeah you can do that, you can still do a binary search with all the players on A and you know immediately whether it's in A1 by just checking the last value in A1 and then you can just go to A2 and find it there, so that'd be a faster way to do it.
Intergalactic Avenger: But that's not exactly a binary search. So like if you had like...
The Incredible Hawk: Sorry, so I didn't describe the binary search.
Intergalactic Avenger: If you had three, four, five, six, seven it still would be like I think if I understand what you described, so you would look at the last letter of the A1 and if that was too small then you would go on to A2 and look at the last player on A2 kind of thing?
The Incredible Hawk: Kind of, yeah.
Intergalactic Avenger: Okay. So technically a binary search would be going from A1 to A4 and then looking at A4 and then you would decide if you want to go back to A3 or forward to A5 kind of thing.
The Incredible Hawk: Yeah, you're right. Yeah, so I guess I mean I kind of considered like the number of A in this directory would be such a very small, constant number. I mean it's easy enough just to check the last value in each one and then if you're trying to make it faster, even in the lookup of just 1,000 names is relatively fast but if you wanted to make it faster than that, of those thousand you can do a binary search, so first check the middle and decide whether to go up or down. So yeah, it wasn't a true binary. You could do a true binary by checking A4 and going up and down, but you probably wouldn't even notice the time you saved.
Intergalactic Avenger: Right, no - that's a good point. So if we're probably thinking about somewhere on the order of hundreds of thousands of players then once you get down to these individual ones, probably not going to make a difference. But let's say for argument's sake that you know this guy who wrote his own file system, he did it terribly inefficiently and even just doing a search over all of the directories, like getting a list of all the subdirectories inside directory is really slow. So we wanted to still optimize that so that you are doing as few possible sort of linear scans of the of these directories. So it's basically okay to try to say I know this is the name of the directory and give me its content, but doing the search through all of them is a bit slow and also you know it does kind of touch a lot of different pieces of disk actually, you have to say well what's the name of this file. So is there a way... and basically this is kind of getting at a, it's an interesting hybrid because on the one hand the splitting up of it to A B C is sort of very targeted and going directly to a place that you want to go and then the second half is more like a linear scan kind of thing. Is there a way to get this so that it is all of the the first one and none of the second one?
The Incredible Hawk: Yeah, no linear scan. So we're definitely going to keep this sorted thing...
Intergalactic Avenger: Well do you though?
The Incredible Hawk: If we're not trying to do a linear scan?
Intergalactic Avenger: Right. So if you think about it, I mean you're really on the right path with the the A B C thing, so where did you get that and in general what is that technique sort of called when you when you have some kind of value and you're putting them into buckets according to some kind of attributes that they have? So in this case you have the players names, you have a bunch of different players names, and then you're sorting them into the buckets like A B C based on the first letter of their name. So can we generalize that a little bit and make it so that it works even like even after 26 letters and then you don't have to worry about you know the fact that there's very few players whose names start with Y but very many that start with T or something.
The Incredible Hawk: Yeah, I'm not sure what it's called but what we're doing is we're taking attributes that we know and kind of filtering on that and giving them buckets, I mean that's the idea of it, giving it buckets. Should we be looking for like a better bucket system?
Intergalactic Avenger: Well I mean, so this bucket system is not so bad, it's just that it still requires a some sort of post-processing you might say, like it's a little too fine grained or it gets a part of the way but doesn't take us all of the way. So how about this? How about we say in the spirit of divide and conquer... So when you first started, you had this big group of names and they all were different names and you put them into 26 buckets. Now let's go into a bucket. Now you can sort of think of it as - now it's the same problem again, you have a bunch of names, there's too many of them, how do you split them up?
The Incredible Hawk: By... there's too many names...
Intergalactic Avenger: Like the T's for example, it has too many, like there's more than a thousand players whose first name start with T. So now you have a bunch of names, they're all sort of relate a little bit in that they all start letter T, but there's more than a thousand of them. So you have to find some way to to find more buckets to put them in.
The Incredible Hawk: Yeah, so this is a little bit of a shift, but now I was thinking something like team name, player name. So that would narrow down our buckets so you probably wouldn't have a player name dot txt, you're probably good to go. The only thing is... the first thing comes about is like if a player switches teams, I don't know where they're going to be.
Intergalactic Avenger: But also like think about the UI, so in the UI we're just looking up the player name, we don't know what team they are necessarily. If this was the only data source that you had, how would you look up the team?
The Incredible Hawk: I would hope there'd be a function that maps player to team, but if that doesn't exist then yeah...
Intergalactic Avenger: You could certainly write one, but the the data we're looking at is the one source of data that you have and I mean you could add new files you can do some stuff like that, but there's no sort of magic like external source of data that's going to give you that. I mean you could obviously look it up, but it would probably be a very expensive operation using that original sort of data format to look up what team a player is on.
The Incredible Hawk: Got it. So other things that I can think of, like other attributes on the player I mean, it's their batting average that we're looking for. But I'm a little ambiguous of why that would be a good bidding structure.
Intergalactic Avenger: So in all the letter A's, so those are players that whose first name starts with A. What about the second letter in their name?
The Incredible Hawk: Oh, got it. Yeah that would be a fairly reasonable way to sort until you had less than a thousand and I mean reasonably so like you have the first two letters or maybe three letters, whatever it is, seems to work pretty. So yeah, now there's two ways to do this right? You could do like first letter and then second letter and the third letter as all different directories or you could just do all three letters as one directory. I think it's reasonable... So benefit... you said you want to minimize directory calls right, going down directories?
Intergalactic Avenger: Right.
The Incredible Hawk: Yeah it seems reasonable to like you know you could look up a player with his name, you could find a directory really fast with like this structure where there's multiple directories of two or three letter names. I'm not sure which would be better. And inside of here which would be player name.
Intergalactic Avenger: Okay, I mean so the the things that I'm thinking of so with the ANT, one of the benefits like you said is that you have fewer hops that you're going down and you're making calls to the system, but there's one slight disadvantage in that again it's possible because 26 letters, 26 to the third is greater than a thousand, so it's possible that again there might still be too many and then you know it might be one of those things where you you set something up that works at first and then at some point you know as the number of players grows over time, then you hit this weird bug that's kind of somewhat difficult to diagnose and maybe difficult to fix at that point, so there's sort of trade-offs in that but yeah that's basically the the idea. So, here's the the next challenge. So let's just say we're going to move forward with this and now we want to actually go ahead and compute this and we'll say for the time being that we're going to do everything in offline mode where every day we're going to get this new data file and we're going to recreate this players subdirectory like you have it here. And, granted the size of this file is, even all of these put together is probably not that extremely much, but let's just say that our data center is extremely resource poor and has low amounts of RAM and very little disk space and the CPUs are really slow. But we have a ton of them. What our fearless CEO did was he got bargain-basement prices on a ton of really crappy computers that are really slow. So how can we... can you think of a way that you could produce this file or this directory structure in such a way that it could be parallelized across all of these sort of low powered machines? You think after thinking it through that you might even want to change exactly how this is working, like how you want to do this is fine. And just to just to clarify, so when you have ANT player name text, so this is actually you'll have one file per person? Or are you talking about there would still be one file with multiple people in it or one file per person?
The Incredible Hawk: No, no this is one file per person.
Intergalactic Avenger: Per person, okay yeah.
The Incredible Hawk: Yeah, so really quickly... I'm not completely positive I understood the question actually but how do we paralyze the the lookup? Is that it?
Intergalactic Avenger: How do we paralyze the batch data processing job that's going to happen every night that will produce this players directory? So the players directory is only valid for that one day and then at the end of the day you get a text file with the next one and you need to update it. So there's two ways to do it. One way is you could update in sort of an online fashion where you could go and you could take every row of the of the file and go and update that thing or the other way is that you just take the sum total of all of the files and recreate that players directory from scratch. So let's just assume for right now that we're going to do the latter. We're going to throw away the players directory and we're going to just given that that first set of data files here and we're going to recreate the players directory for the for the next day.
The Incredible Hawk: Okay and we want to try and do this in with a parallelized... okay.
Intergalactic Avenger: This sort of general like I don't know if you're familiar with any of these like Hadoop or MapReduce kinds of things or just if you can think of some generic kind of way of parallelizing this across sort of many machines, either way.
The Incredible Hawk: Okay, so... so thinking in terms of like mapping and reducing, I mean reducing we're trying to reduce all of the pushes the same name and their batting average, assuming at least like independent like unique names. Mapping, we're mapping the player and whether they hit or didn't hit. Sorry, so batting averages is hits versus total attempts?
Intergalactic Avenger: Total attempts, yeah.
The Incredible Hawk: Yep, so if you were mapping all that you'd get the total attempts... you can map this to...
Intergalactic Avenger: So if you're not super familiar with like the MapReduce kind of methodology, you can just sort of explain it in generic terms, not necessarily... I mean MapReduce is certainly one common way to do it and that would be certainly be a valid way to do this, but it doesn't have to be limited to that.
The Incredible Hawk: Okay yeah, I'll go a little higher level then. So each computer... I guess we have these data files like this so each computer can grab a data file and it's trying to look for like all the instances of a player... because we have atomic by day. I guess I mean, so yeah the easiest way like with an army of computers is each army of computers can like operate on a single day and come up with like a a batting average per individual per that day and then we need to take that like intermediary and then we need to combine essentially all them on their unique name and then I guess averaging their batting average for each day and then once that operation is done you recreate that data structure or file system so yes I'm going to write in a pipeline or something. So a computer operates on a single day and it wants to try and compute all the players batting average for that day. Let's say like we save that to an intermediate file and so now we have... well now we have a lot of immediate files.
Intergalactic Avenger: That's fine, there could be a lot of intermediate files.
The Incredible Hawk: And each one will have... so now we need to combine on name, how can we do that parallelized? I guess maybe the best way to do it would be like save... so like once you've calculated a batting average for that day, you go to the place where that guy's is saved on disk and we updated batting average and if he doesn't exist we insert it and if he does exist then we update it, so updating will be an average of what what currently exists and the current one to the day. Then we also need to keep track of how many days we've updated him because like if he's done 99 days really well that should be weighted heavier. I don't like that last bit.
Intergalactic Avenger: Yeah, so there's... so this is really close, this is really like good progress. One question that I have is so like let's say that one guy has a batting average of 0.5 today and he has a batting average of 0.3 yesterday, is that necessarily like what's his batting average over the course of these two days? Is it the average of those two numbers or is it something else?
The Incredible Hawk: No it's total total hits versus total attempts. And so what we can store instead of a batting average is actually like hits over attempts. So it's person... like it'll be a mapping of person to hits over attempts, and then now we can have multiple processes saving it if it doesn't exist for a person and then if it does, just updating the hits over attempts values and then this can also after that's been updated, we can calculate like a new batting average.
Intergalactic Avenger: Okay, that sounds good. Now, so this gets into an interesting question having to do with... so you're sort of incrementally building up these players files, but all in parallel. Now if we... one thing that is not available in this file system and actually is that true for a lot of file systems is that locking, from locking the the shared file system from multiple machines can't be done reliably. So it's possible that if you're having multiple machines working at the same time and they're trying to update this file at the same time, that the data could get corrupted. So can you think of a way that you can build up these intermediate files, which is definitely the right way to go, in such a way that you're not going to have multiple machines trying to read and write from the same file at the same time.
The Incredible Hawk: Yeah, so you know back up here I had... it was just a single player name dot txt. So I guess like I won't necessarily be left with a thousand but like each file that I'm updating the single file, there's now a thousand files here and each to write, like this would restructure the directory to be unique, which then it would look a little more like a tree, so it's like A/N/T if that was his name and then his player name, player name and then like ah the day... one. Essentially each file is just saved to this directory structure. So at some point you're going to have to do a computation and combine them all or you could do that, yeah that's right at the end, you have to combine them all. And that would get rid of the locking problem.
Intergalactic Avenger: Yeah, no that's exactly it. And in fact you've just laid out the map and the reduced steps. So this step here where you create player name underbar date, that's the map step. And then the reduced step is after all the mapping is done for each one of these directories, you read all of them and combine the total hits versus the total attempts.
The Incredible Hawk: Got it yep, I see that.
Intergalactic Avenger: That's it! So yeah no, I thought that was really good. That was definitely along the lines of what I was thinking of for this. The only other thing I'll just point out here, this concept of the bucketing and taking like a value and putting them into buckets is generally called hashing and that technique is something that you'll you'll hear a lot of people asking questions about in these kinds of interviews, that hashing and hash tables and hash maps and all this kind of thing is something that comes up quite a lot and you were definitely on the right the right track with the A B C thing and and this sort of hierarchical thing is interesting and that's definitely the way that these things tend to go. Another way that you could do it just sort of to throw it out there is that there's some pretty standard techniques for turning strings, like people's names. into long series of numbers, like it would just take the the name, you know, Coco Crisp and it would turn it into the number like 7482 or some 64-bit number or some whatever, and those are called hash functions and that's another way that you could do it, that actually helps to alleviate this problem of the... like many players in the A and the T and very few players in the Y and the Z, so that way it kind of evens things out a little bit.
The Incredible Hawk: I'm fairly familiar with hashing but I never really thought about hashing and then saving on a file structure like...
Intergalactic Avenger: It's a little bit different way of thinking about it, but you'll actually see that this kind of technique shows up in all kinds of things, like where you most often see it in things like memory that like you know you're just trying to find a an index into array, that's sort of the most common one, but this is an example of another application of it where it can actually be quite handy, so. I thought you did great and yeah, did you have any last questions?
The Incredible Hawk: No, this is awesome, this is really interesting. Yeah, thanks so much for taking time to do this.
Intergalactic Avenger: Cool, alright, take care man.
The Incredible Hawk: Bye bye, man.
Intergalactic Avenger: Bye.