Samurai Razor: Hello?
Intergalactic Avenger: Oh hi.
Samurai Razor: Hi. Good to meet you.
Intergalactic Avenger: Nice to meet you too. Alright, so we'll just dive on into it. Really question, what computer language are you the most familiar and comfortable with?
Intergalactic Avenger: Okay, so you're more like a front-end?
Intergalactic Avenger: Okay, um so yeah I think I'm a little bit less familiar with with the ins and outs of that one, but I'm sure that this should be a general enough question that you can... you'll be able to make it work. So the kind of thing I'm thinking about here, the problem we're trying to solve is an efficient and... like a space efficient means of storing a large dictionary... like a key value store.
Samurai Razor: Okay, so...
Samurai Razor: Sure, so basically we're dealing with text just to make sure I have the premise correctly. So we're dealing with a bunch of sort of text files or text input where we know that there are sort of natural language patterns present, that kind of sound okay?
Intergalactic Avenger: Yeah, so just like words, you know. Like we're not looking at paragraphs and we're not doing anything crazy like that and we could even say that maybe they're all in even in the same language, they're all in English or they're all...
Samurai Razor: Right okay, let me add that to the description. So one language. Sure, so there's a pattern associated with them and we want to take advantage of those patterns. So let's see what kind of different kinds of patterns there might be. So there might be... I know that in translating one thing you might do is you have tuples of words or you have n grams, so two grams or three grams or four grams, where you might say that a certain combination of words appears most frequently, and so if know that if you can maybe do compression right? So if you think about the compression algorithms work, then you might be able to describe n-gram or four words that appear together in one phrase if that makes sense. So basically, so we were basically trying to summarize a lot of words into a smaller data structure, smaller than a key value store basically.
Intergalactic Avenger: Yes.
Samurai Razor: Okay, so would it be like the key is a word and the value is like a translation or the value is a larger set of words or?
Intergalactic Avenger: Yeah, I mean for our purposes, you could just think of there being some kind of relationship between the key and the value like plural to singular or misspelling, proper spelling or you know like what whatever it is.
Samurai Razor: Okay, got it.
Intergalactic Avenger: So if we really want to make it concrete, we might say that it's... let's say it's the the word to its singular form, so like dogs become dog and walks becomes walk and that kind of thing, but you know in general the kinds of techniques that show up should, you know, take advantage of some some kind of thing. So let's just work out this n-gram thing, it's an interesting idea. Can you flesh out that idea a little bit more? I think I see where you're going with it, but can you go on?
Samurai Razor: Yes, sure. So basically the initial idea would be to parse the entire corpus we have and look for any combinations of words that appear or like... so say you have the dog walks, that's two two random examples. Basically you would take the... you know these three words appeared together or two or three four words be together and you would map a specific... one thing you could do is you could map a specific abbreviation to that sequence to three words. So that way you compress the amount of space you need to store the entire corpus.
Intergalactic Avenger: Well let's say... so for the application that I'm thinking of, we're not necessarily storing the corpus... we're trying to say well, you know let's say I have a document the dogs walk and I want to transform that into the dog walk, singular. So I don't really have that much use for the multiple words next to each other because for my particular application, I'm just looking up individual words.
Samurai Razor: Gotcha, so you're not looking for patterns between the words or within the words. You're looking for meta patterns that can be sort of mappings that aren't inherent in the text but are inherent in language itself, transforms that are applicable to language as a language in general and can be applied to each of the words to kind of create a meta out of data.
Intergalactic Avenger: Exactly and to make this a little more concrete, they just take this over that you might say as input, you have a list of these words. So you'll have dogs dog and dog also goes dogs and walk walks and walks goes to walk kind of. So you'll have an input list that you'll want to be storing, but you want to store it in an efficient fashion so that you can sort of look them up. And certainly a corpus can help you with this because you might say, well you know I'm not looking up these words in a vacuum, I'm looking at these words in a certain context and I might want to find a more efficient way to store the more commonly occurring words for purposes of storage or something like that or for lookups. Ao from that perspective I would think... where I think you were going with that would be things like storing n-grams of letters that you might say something like, you know the pair of letters d o might show up very often in this input list and so then you could potentially have a very short sort of code for what d o stands for and then maybe w a doesn't show up very often in this input list, so you could have it be long, but that was sort of where I was thinking you were going with that, but yeah.
Samurai Razor: Okay I got it. Maybe we can do tries. I'm not sure whether that is applicable to this problem, but
Intergalactic Avenger: That's exactly where I was going with this problem and I was going to ask if you've heard of tries, because that is what I see as the canonical sort of best solution to this.
Samurai Razor: Exactly, within the try you reduce the source because the traverse like you for every word that starts with s, you use the same storage, so like you use the same s and you go down the tree. I think this is the idea for T-9 word completion and so on.
Intergalactic Avenger: Oh yeah.
Samurai Razor: So tries would be the way to go probably.
Intergalactic Avenger: Yeah okay, so let's just see if we can put one of these together really quick. It's not, you know, it doesn't necessarily have to be the most efficient implementation but in whatever language you like, maybe let's just make a little function that... and you know the interface here you can choose language up on top and it'll even give you a little way to test it out. And yeah, so like the interface the function would be it takes in a word and it returns the the mapping and for simplicity's sake we can even just say that you'll assume that the input word is always there, we don't have to do all of the verification.
Samurai Razor: Yeah. So you go from word to mapping... so I mean I'm visualizing like a trie and like I'm trying to think like... so you input a word and then you kind of walk down the tree and kind of follow where... I'm not sure in this case what a mapping...
Intergalactic Avenger: So it takes in the word dogs and it returns the word dog.
Samurai Razor: Gotcha okay. Okay so it returns that mapping okay.
Intergalactic Avenger: So it returns the value of that key value pair.
Samurai Razor: Got it, okay. Let's see, so I'm trying to think of how that would work.
Intergalactic Avenger: What type is the children?
Samurai Razor: What type? Oh the children are nodes too.
Intergalactic Avenger: So then, and where are you going to get the output from? So like if you... so as you see here, you might have a node for D, and then it points to a child that's O and it points to a child that is G, but what is the output? How do you know what to return at the end of all of this?
Samurai Razor: So basically you walk down... So what you determine basically?
Intergalactic Avenger: Where do you determine the output? So I'm seeing here like how you walk through this structure based on the input, but then how do you know what the appropriate output is?
Samurai Razor: Got it, let's see. So I think the... so you go D O G and then or... what if you go D O G S, it should match dog, yeah how does it match dog. One way to do it would be to have in the in the G or in the S have a pointer to a string that is where it maps to basically. So you go D O G and the G has a pointer to dog. When you go D O G S and the S has a pointer to dog.
Intergalactic Avenger: Okay, and how does that change your node class here?
Intergalactic Avenger: Perfect, yeah. Sounds good.
Samurai Razor: Okay, let me see. So you will go down the Trie. You go D O G, the G would have a property for the output. I think that should there should be some methods on the node so this should be a contains method probably or a map, so you should be able to ask a node take this word and map it to the output.
Intergalactic Avenger: Yep, or I mean either way, you could have it be on the node itself or you could sort of do it like it says up here where there's a function say hello and it takes in a node and the word or whatever it is, either way.
Samurai Razor: It could be it could be based on the word on the actual data or it could be based on node. Yeah totally. You probably don't need that okay. So here we go. So basically first I think we basically need... if you ask a node whether it contains a certain word okay. I think we're good. Do you see any other properties we need or?
Intergalactic Avenger: I actually... I'm curious if we even need the value, but let's work through it. Let's just write this up, see how it looks.
Samurai Razor: Sure, yeah. Okay so first we need a factory for nodes, so we do something like this.
Intergalactic Avenger: Let's even just simplify things for a little bit and say, let's assume that you have the node and you're just going to try to to walk it with some input.
Intergalactic Avenger: Ha okay, so with with this approach it's just an array and it has a set of nodes and so what then is the lookup time? So this is gonna be very efficient in space, but what is the lookup time now if you have to go and for each letter, you have to look up all of its children, like in an asymptotic notation kind of thing?
Samurai Razor: Sure, so basically I guess in this function we're right now on a node that has a particular letter, so if you're looking up dogs then you're on the node that has a value D and you're asking it if we start looking on this node for this string, what is the eventual output we get? So I would say that if you like have them if you visualize like a massive trie that has all the strings inside of it, this should be sort of on the order of O(log n) I think because you're like going on a trie, so like for every next letter in the string, you basically choose...
Intergalactic Avenger: What is it N in that? Like is that the length of the string, is that the finding out of the node? What is n?
Samurai Razor: N would be the length of the string. I'm just trying to think, in a binary tree you go left to right, so you can just do binary comparison. If you have say there are 10 children and if you just put them in an array, then you would have to go through each of the items in the array to find the next child or the next letter. So this probably has to be an object over here, so this with children should be an object, so a hashmap basically, where you can instantly look up the next letter in the string.
Intergalactic Avenger: Excellent, yeah. So let's just flesh that out a little bit. That's definitely the right direction to go in. If we said, if we make a couple assumptions here that n is the length of the string and M is the size of the alphabet, so like in English this is basically 26 but in Chinese it's like thousands. Using the children as an array, we're looking at big O of what? And children as an object, it's big O of what?
Samurai Razor: So the... if you look at the number of children in an array, you could potentially be the size of the alphabet because the next letter could basically be any of those letters, so you're looking at every level of the tree with a trie, you're looking at a potential of M different letters and you have to go N levels deep, so that n times m at each level m times, I want to say might be wrong with m times n probably, n times I could join the m letters.
Intergalactic Avenger: Just type it in there so we can see. And then when it's object?
Samurai Razor: So an object, the lookup of a specific next letter will be order 1 because you do a hashmap. And the... you would have to do that at n times, so it would be order n I think, because it n levels of the tree, you're going to have to look up in a table which is order of n.
Intergalactic Avenger: Yeah okay, so we're going to choose the the object one for obvious reasons, okay.
Samurai Razor: Order one look up.
Intergalactic Avenger: Cool okay, yeah let's fill out this this prototype that you have here.
Samurai Razor: Okay, so the get out function. Let's see, so what do we need to do. So we check whether... So right now we're just asking some node to check to return the mapping with the output of the given string. So first we to check whether the current node equals the current string character.
Intergalactic Avenger: So if you're using the children... if the children is a map from letters to nodes, then do you need to have a current value and do you need to make this check?
Samurai Razor: I guess we would have found the this current node bye actually just... even if you're at the top level when you're looking up the first letter in a hashmap, so you know by definition that we're on the current the current letter exists actually, that this node is for the current letter. So basically we would check whether there is a node has a output value. If so, return that. Else if there is a next character in the string then find that child and recurse, else return false or null. Sort of like that. So basically if we're on the current node and there is a mapping, we return the first mapping we find.
Intergalactic Avenger: So is that correct? So for example, the the word do, the prefix of dog, do, it if this is an English language thing, it will have a mapping and it will it will have the mapping of do, like do is the dictionary form of the word do. Is that right mapping for dog?
Samurai Razor: We might actually want to first use all the characters in the string and then see whether after the last one, when we're after the last character, whether at that point there's a mapping or that's an output, instead of looking for the first output we can find, so like reverse it actually. Otherwise we're going to find... so if we're at the end of the string that we're given, we just check what the output is and return that. If there's no output, we return null. Else if the string still has characters then we just go to the next child.
Samurai Razor: Yep. So let's see... so I always do string dot checking as opposed to double equals, I don't like the...
Intergalactic Avenger: All the different kinds of equals.
Samurai Razor: Yeah, I'd like to force new type checking and so on. I also like to do like nested if, so you could here for example do if strings of length is 0 and there's an output, but I like to sort of make the way the code flow is explicit and like indentation on the left, so it's easy to read. Again you could also do like falsey, but I liked it make it really explicit. So now we need to know where we are in the string. So what we'll do is we'll... so we need a way to pass the entire string obviously, but also the first letter, so we'll pass the first substring. So the substring will get smaller, so we'll take the first character in the set in the string and use that to find the child and then we'll pass the child the... like if the string has 3 characters then we'll pass to child the 4th and 5th character.
Intergalactic Avenger: Okay, that sounds good. What actually gets returned in that case? Because it looks like you're just calling the function. So how does the get output... yeah, what value gets returned?
Samurai Razor: Oh gotcha, yeah. So eventually we need to actually return the output to the back up, so... Let's see, so we do child dot get output. Eventually we reached the last, the string will be empty and we'll reach the node that has the output and then we return it up here I guess.
Intergalactic Avenger: That just returns it to the previous recursed function. What about like the code that calls this in the first place. Like if you call this with just like the letter D or something, then it will end up down here, and the previous one will return you know whatever the mapping is for D, but then it will continue on down and sort of get nothing down here.
Samurai Razor: I mean if there is no actual mapping.
Intergalactic Avenger: No like let's say there is a mapping. I'm just seeing that the top level that called this, I'm not convinced that it's going to ever see the output because these return statements over here only happen for one case and for the other cases, there's no return statement.
Samurai Razor: So you mean like if there is no... like if there is no output at all, then there won't be any return basically?
Intergalactic Avenger: Uh no. So let's say there is output and you put in dog and dog is in the map and you and you want to get that back out. I think that this implementation is not going to give you that output at the very end because you know child here is going to get called with dog and it's going to call the substring with og and then this will be called again with the substring of G and then up here it will finally get to returning the the output here. That just gets lost sort of here because nothing is actually... this return value kind of just gets thrown away.
Samurai Razor: Gotcha, I think that's exactly, I see what you mean. Yeah you do return it's like at the bottom leaf so to speak, but then it doesn't go actually up the chain. So I think we might need to do this. We at least need to make sure that the top level intermediate levels return the actual value that gets returned from the bottom. Yeah totally. Rookie mistake.
Intergalactic Avenger: No problem. So one other thing I'm noticing up in here. It's a very minor point, but do you see a way that this is a little bit more... this piece from sixty to sixty five is more convoluted than it needs to be?
Samurai Razor: I think we can probably simplify a little bit, let's see. So if the string dot length is zero and this output is null, then return output else return null. Well I mean one thing I can think of is we can combine the two if statements.
Intergalactic Avenger: So specifically speaking from sixty to sixty-five, that I think there's more lines of code there.
Samurai Razor: Okay um, we could just return output. Yeah, so output is either going to be a value we want or null and either way we want to return exactly that, so we basically could say... I'm just going to edit this here... basically we could turn this into that.
Intergalactic Avenger: Exactly. There you go, you've got a trie. Very good.
Samurai Razor: I like that they were doing a trie, I've never implemented that but it's a fun data structure.
Intergalactic Avenger: Yeah, and as you can see, the lookup function is very simple. It's just what? Five, six lines of code we have here. So here's an interesting question for you. If you take this function and copy and paste it and now turn it into put output. So let's say you want to add something to this trie, and instead of pulling out the value, put the value in. And what you'll notice, if you think about it for a sec, is that the two implementations are very similar with only one very small change.
Samurai Razor: Got it, let's write that. So I'm going to copy the entire thing. Put it down here, and call this add node.
Intergalactic Avenger: Or add value.
Samurai Razor: Yes. I'm going to add a string. So let's see...
Intergalactic Avenger: So when we add it, we're adding a mapping. So remember that what's happening is we're giving it as input basically this list of like dogs to dog and walked to walk and so forth, so we're actually going to be adding a mapping to it. So we're going to be adding a key and value.
Samurai Razor: Got it okay. So we're going to basically traverse down the trie up to the... I'm going to take key, go through it and at the end of key, can we assume that... maybe if we start by assuming that the key already exists, the entire like chain is present, and then at the end of key, the last node, add a mapping to the value. Does that make sense? So again, if string is length 0, then add value as output. We don't need this step and that step we do need. So basically we're going to do if string dot length equals 0, then this dot output equals value, else var child equals this children key 0. Not sure if we need to return... so this would be key of substring one and we could... I mean depending on what we want the output to look like we could return true for success, but...
Intergalactic Avenger: Let's just not have it return anything, it doesn't need it.
Samurai Razor: Yeah, it only goes down the tree and just performs the action.
Intergalactic Avenger: Alright, so you're definitely along the right the right track here. So we had renamed str to key, so you want to change that on line 71. This still says get output. And add value takes two parameters.
Samurai Razor: Yep, you're totally right. Let's see. So this would be the key and the value would be values, you have it.
Intergalactic Avenger: No that sounds good. So the interesting thing here is that it does make a rather stark assumption that the the key is already in place and so if the key is already there, then this will work. But let's add just a tiny little bit of code so that if the key is not in there, that it will go ahead and add it.
Samurai Razor: Cool, yeah so sure. So we're going to do that over here. Let's see, it should be good. This is children equals new node.
Intergalactic Avenger: So you just need to set child again.
Samurai Razor: Yes. This is like a double code. So basically all I need to do is if this is like you know repeatable piece of code, so all I need to do is say if the children do not yet have that next letter as a property then create a new node and add it. So all I need to do is if that's not the case, then we do this else do that.
Intergalactic Avenger: There we go so.
Samurai Razor: What we want to find is the next character. We say if among the children of this node there is no such characters yet and we make a new node over here and add it to the children and then once we have that child we can say we get that child and we recurse next char.
Intergalactic Avenger: The child was right the first time.
Samurai Razor: It was? Okay, thanks. Yes right. There we go.
Intergalactic Avenger: Alright, and now you've created a trie. Alright yeah that's looking really good. So did you have any last questions for me? I think that's pretty good.
Samurai Razor: Cool thanks, thank so much for time by the way. Yeah I mean, so for me like I'm sort of getting back into the game, so this is like one of my first... this afternoon I did like a whiteboard interview, so this is like very, you know. well shared coding. So it's been really interesting. I love that I'm getting back to tries, that's really good. So whatever feedback you have is totally welcome.
Intergalactic Avenger: Okay great, yeah I'll write it up in the little report.
Samurai Razor: Oh perfect. How does that work? Is there like a count or like is it automatically sent? Or something like that or?
Intergalactic Avenger: Yeah, at the end of it, like I write up a little thing and you can see the feedback and all that kind of stuff.
Samurai Razor: Perfect, cool. Thank you so much.
Intergalactic Avenger: Yup, take care.
Samurai Razor: Thanks man, I appreciate it.
Intergalactic Avenger: Yup, bye.