
ML Engineering Fundamentals Interview

Watch someone solve the ML Engineering Fundamentals (Medium) problem in an interview with a FAANG engineer and see the feedback their interviewer left them. Explore this problem and others in our library of interview replays.

Interview Summary

Problem type

The Problem: ML Engineering Fundamentals (Medium)

Interview question

This technical interview covers core machine learning concepts including supervised/unsupervised learning algorithms, model training techniques, overfitting prevention, neural network architecture, loss functions, optimization methods, and practical problem modeling. The candidate must demonstrate understanding of both theoretical concepts and their real-world applications, culminating in designing an ML solution for Amazon's product recommendation system.

Interview Feedback

Feedback about Spasmodic Gyroscope (the interviewee)

Advance this person to the next round?

How were their technical skills?

3/4

How was their problem solving ability?

2/4

What about their communication ability?

3/4

Strengths and what went well

* Clear high-level understanding of supervised vs. unsupervised learning; good example with K-means.
* Solid intuition on decision trees (splitting to reduce impurity), overfitting/underfitting, and why max_depth helps.
* Sensible diagnostics using train/validation/test splits to spot overfitting.
* Comfortable describing neural-net architecture, backprop at a conceptual level, and gradient descent with learning-rate scheduling.
* Mentioned L1 regularization and the idea of constraining model complexity.
* Communicated calmly, took feedback well, and corrected course when prompted.

Areas of improvement

* Mathematical precision: stumbled on Euclidean distance; couldn’t recall binary cross-entropy; didn’t state Gini impurity or information gain formally.
* Regularization toolkit felt narrow: little mention of L2, dropout, early stopping, weight decay, or data augmentation (CV).
* Optimizers and schedules: defaulted to “constant 0.1”; be ready to discuss SGD vs momentum vs Adam, cosine decay, step decay, warmups.
* Model selection framing: initial instinct to use clustering for recommendations; need to quickly frame it as supervised (user,item) → y with negatives/implicit feedback and ranking.
* Evaluation metrics: limited beyond generic accuracy/MSE; for classification (AUC, F1), regression (MAE/RMSE), and recommender systems (CTR, precision/recall@k, MAP, NDCG).
* Feature engineering vs dimensionality reduction: conflated at times; be crisp on when to create domain features vs apply PCA/autoencoders.
* Hyperparameter search and validation: no mention of k-fold CV, stratification, or Bayesian/random search; watch for data leakage.
* Class imbalance and sampling: didn’t cover weighting, focal loss, or negative sampling.
* Gradient descent wording: said “choose direction with highest gradient”; correct is step in the negative gradient (steepest descent).
* Practicality: could not quickly outline dataset construction for recommendations (user features + item features + interaction features; negative sampling; temporal splits).

Advice for future interviews

* Memorize and be able to write quickly:
  * Euclidean distance: ‖x−y‖₂ = sqrt(∑(x_i−y_i)²)
  * Gini impurity: G = 1 − ∑_c p_c²; Information Gain = G_parent − ∑_k (n_k/n) G_k
  * MSE/RMSE; Binary cross-entropy: −[y log p + (1−y) log(1−p)]
  * Softmax cross-entropy for multi-class
* Build practical reps:
  * Implement from scratch (NumPy): K-means, logistic regression with BCE, a tiny MLP with backprop. Then train scikit-learn/XGBoost baselines on 2–3 tabular datasets.
  * Do 1–2 small projects end-to-end (data → features → model → metrics → error analysis). Kaggle “Getting Started” comps are perfect.
* Recommender case (be fluent):
  * Problem framing: training rows are (user_id, item_id, features) → clicked/purchased (0/1). Use logistic regression/GBDT as baseline; consider matrix factorization or two-tower NN.
  * Data: user features (demographics/history/recency), item features (category/price/brand/text embeddings), interaction features (user×item, recency, popularity).
  * Training: negative sampling for implicit feedback; temporal train/val/test split.
  * Metrics: offline precision/recall@k, MAP/NDCG; online CTR/CVR.
* Regularization & overfitting fixes cheat-sheet:
  * Simpler model; L2/weight decay; dropout; early stopping; data augmentation; reduce depth/width; feature pruning; noise robustness; cross-validation.
* Optimization cheat-sheet:
  * Start with Adam (1e-3), try cosine decay or step decay with warmup; consider SGD+momentum when fine-tuning; explain why schedules shrink steps near minima.
* Answer structure (use this aloud):
  * Clarify → Define objective/loss/metrics → Describe data & features → Choose baseline → Training/validation protocol → Risks (leakage, imbalance) → Next iterations.
* Common interview traps to prepare:
  * Data leakage examples and prevention.
  * Handling class imbalance (class weights, resampling, threshold tuning, focal loss).
  * Error analysis loop: slice by cohort, calibration, precision–recall trade-offs.
* Practice whiteboard math:
  * Derive gradient of BCE for logistic regression; show one GD/SGD update; compute a Gini split on a toy table; write pseudocode for K-means and ID3/CART split search.
* Communication:
  * Slow down 5–10% and outline before answering. State assumptions. If unsure, propose a baseline and iterate.

With these tweaks, especially crisp formulas, supervised framing for real products, and stronger practical reps, you’ll move from “borderline” to a confident ML new-grad hire.

Feedback about Admiral Hex (the interviewer)

Would you want to work with this person?

Yes

How excited would you be to work with them?

4/4

How good were the questions?

4/4

How helpful was your interviewer in guiding you to the solution(s)?

4/4

Interview Transcript

Admiral Hex: Hey, how's it going? Hey, can you hear me? Hello, can you hear me?

Spasmodic Gyroscope: Yes, I can hear you.

Admiral Hex: Hey, so hi, welcome to your practice interview. We have one hour for the interview, um, and we'll do the mock in about 45 minutes, and then we can go over feedback towards the end. Before we get started, maybe you can briefly introduce yourself and mention which company you're interviewing for and when your next interview is.

Spasmodic Gyroscope: Yeah, so, uh, you know, my name is [REDACTED]. I'm a data science major at Wilkes University with a minor in statistics so far. And so I have a specific interest in computer vision. There's a little intro.

Admiral Hex: All right, and, uh, why are you doing this mock interview? Like, do you have some interviews scheduled or—

Spasmodic Gyroscope: Yeah, so I'm actually in a program called AI for All and they introduced to us this platform. So this is actually my first time doing this video on this, on this website. Um, I'm not sure if you've heard of AI for All.

Admiral Hex: Um, yeah, I did a bunch of interviews from that program.

Spasmodic Gyroscope: Yeah, yeah. So this is my first.

Admiral Hex: All right, great. Uh, and then what would you say your level of ML knowledge is so far? Like, have you done any courses in college apart from the AI for All, of course?

Spasmodic Gyroscope: So mainly it's been self-learning. I haven't done, I haven't completed any ML courses in college. So it's completely up to just some of the books I read and some personal projects. So I have like a brief high-level understanding of kind of the fundamental machine learning algorithms.

Admiral Hex: All right, cool. If you're ready, then we can start the interview. We'll just use the Coderpad. Feel free to type anything you want here. And you know, I find it good to like, you know, type a bit of context about your answers as you're answering them. But you don't have to do that if you're not comfortable. Just like, you know, it's there for you to use as you would like. All right, um, let's start with the basics. So do you know the difference between supervised and unsupervised learning?

Spasmodic Gyroscope: Yes.

Admiral Hex: Yeah, what's the difference?

Spasmodic Gyroscope: Do you want me to type it out, or you want me just to say it out loud, or—

Admiral Hex: You can say it out loud. If you want to type anything, feel free to do that as well.

Spasmodic Gyroscope: So, all right, so with a supervised dataset, we're going to provide the model with labels, a target, so they know what to— how to tune the parameters to achieve the target. And with unsupervised algorithms, they don't necessarily know what the right answer is. There's no labels, so they're just trying to figure out specific patterns and present that instead.

Admiral Hex: Alright.

Spasmodic Gyroscope: So an example of an unsupervised algorithm would be K-means clustering. Okay.

Admiral Hex: How does K-means clustering work?

Spasmodic Gyroscope: K-means clustering starts by placing a centroid, basically a dot, into this data space filled with data points. And it calculates the average distance from each point to the cluster. And the number of centroids you place is up to you, so you can ask to place like 2 clusters or 3 clusters. And so you calculate these distances to the clusters and you measure... sorry, I'm going a little too fast. Let me slow down a bit. You place the clusters and you figure out how much you need to move the cluster to reduce the average distance between the points and the nearest cluster. And you keep on doing that until the clusters don't move as much anymore. Once they stop moving, essentially, you've placed your clusters optimally in the data space. Um, yeah.
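The loop described in this answer can be sketched in NumPy. This is only a minimal illustration of the standard assign-then-update procedure (Lloyd's algorithm), not code from the interview; the function name `kmeans`, the choice to initialise centroids from random data points, and the two-blob toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Alternate between assigning points to centroids and moving centroids."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: pick k distinct data points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs (hypothetical values).
data_rng = np.random.default_rng(1)
blob_a = data_rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = data_rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
print(centroids)
```

The number of clusters `k` is chosen by the caller, which matches the point in the answer that you decide how many centroids to place.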

Admiral Hex: All right, so do you know what's the distance? Can you write the formula for the distance that you minimize?

Spasmodic Gyroscope: Yeah, so I believe that we can just use, um, a Euclidean distance formula. And we want to find the average distance of all the nearby clusters. So we can use Euclidean distance here. I'm going to write it E, uppercase E, square root. So now these points can have multiple dimensions. So to write the formula, I'll just say like x sub i, x sub j, x sub k, but there could be multiple dimensions when we're measuring the distance. So for, you know, the first point, our centroid here is going to be x sub i squared, sorry, plus x sub j squared plus x sub k squared. Okay, that would be the dimensions of the centroid. So this is the center. I could probably name these a little better, so I'm going to call it centroid, or I'll call it c, to abbreviate the centroid's point. This is a single point in the space. So here are the coordinates of the centroid, and then we're going to add to the centroid the neighbor, a nearby point. So we can call this n for neighbor, x sub i squared.

Admiral Hex: I'm not sure that's right, because what's the distance between two points in the space if you have n-dimensional points? Um, let's say x1 to xn is one point, and the other point is y1 to yn. What's the distance between these two points?

Spasmodic Gyroscope: Also, you'll find the— well, to get the distance, you want to find the difference between each dimension, right?

Admiral Hex: Right, and then square that and then square root, right?

Spasmodic Gyroscope: Oh, so I see what I did here. I made a little error. So it's actually going to be minus the, um, the neighbor's point as well.

Admiral Hex: I'm going to square that, and it's also going to be whole squared. So it's not x1 squared minus y1 squared, right? That's wrong. Instead it's x1 minus y1, whole squared.

Spasmodic Gyroscope: Oh, okay, I see.

Admiral Hex: So on each dimension, you calculate the differences, and the reason you do the squaring, uh, before you add and then do the square root is just to— it's just an easy way to handle like negatives.
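The correction made here is easy to check numerically. The snippet below is a small sketch (the helper name `euclidean_distance` and the sample points are made up for illustration) contrasting the correct form, where each per-dimension difference is squared, with the mistaken form that subtracts the squares.

```python
import numpy as np

def euclidean_distance(x, y):
    """sqrt(sum_i (x_i - y_i)^2): subtract first, then square, sum, and take the root."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

x = [1.0, 2.0]
y = [4.0, 6.0]

# Differences are (-3, -4), so the distance is sqrt(9 + 16) = 5.0.
correct = euclidean_distance(x, y)

# Mistaken form: x_i^2 - y_i^2 summed gives (1 - 16) + (4 - 36) = -47,
# which is negative and has no square root in the reals.
mistaken_sum = float(np.sum(np.square(x) - np.square(y)))

print(correct, mistaken_sum)
```

Squaring the difference keeps every term non-negative, which is the "easy way to handle negatives" mentioned in the explanation.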

Spasmodic Gyroscope: Yeah, yeah.

Admiral Hex: Okay, yeah, so the absolute difference, right? So, all right, um, yeah, so that's cool. Do you know any supervised learning algorithms?

Spasmodic Gyroscope: Um, so, you know, you have decision trees.

Admiral Hex: Okay, let's start there. What are decision trees?

Spasmodic Gyroscope: So yeah, decision trees are a supervised learning algorithm. The way you can describe it is that you make a decision at each step within the tree. And the way the decision is calculated is, you ask a question of the dataset. This can be like, OK, is this predictor value greater than or equal to 0.5? Or something like that. You ask a question and then you branch off to the left if it's false, and you branch off to the right if it's true. And essentially you recursively go down the tree asking more and more questions. And when you reach the end of the tree, your goal is to separate the data as much as you can, using a measure called Gini impurity.

Admiral Hex: All right, um, so yeah, that brings us to like, how do you actually train the decision tree, right? So you want to decrease the Gini impurity. Do you know the formula for that? Um, or okay, maybe the concept behind like that, uh, what does that measure? So it measures—

Spasmodic Gyroscope: So, to explain it a bit more easily, if we're talking about a classification context, right, Gini impurity is how imbalanced that subset of data is. Right? So your goal is to minimize it as much as you can; you want it to reach zero. Zero means that your subset of data all belongs to one class, right? So the higher your Gini impurity is, the more imbalanced it is. So you have like a mix of classes in your subset of data, right?

Admiral Hex: So you want to find the splits in the dataset that sort of are able to split the dataset into— let's say it's binary classification, so you want all the trues on one side, all the falses on the other side. And if like a split is, you know, giving you half trues and half falses, it's a bad split. If it's able to split, you know, all the trues on one side, all the falses on the other, then it's a good split, right?

Spasmodic Gyroscope: Yes.

Admiral Hex: And that's how you measure the quality of a split, using this Gini impurity function. All right, so, um, how would you actually train a decision tree? Like, would you just find different splits and keep doing that till you find the best split?

Spasmodic Gyroscope: Yes, so what I would do is I would, um, let me think here. I would go through each feature, right? I would, you know, iterate through each feature, find the best split, find the feature that gives you the best split. And then I would recursively do that again, like, keep on finding the best split for the features that are left over until, you know, you hit the user-specified parameters, like, you know, for how deep you want the tree to go. And, you know, for that specific feature, what I mean by find the best split is find the lowest Gini impurity of the feature. You have a whole selection of features, and you want to choose the feature that gives you the lowest Gini impurity right off the bat, and then you will recursively do that, you know, you will split off the branches, doing the best you can, like finding the best Gini impurity each step of the way.
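The split search described in this answer can be sketched as follows. This is a simplified illustration rather than a full CART implementation: `gini` and `best_split` are hypothetical helper names, thresholds are taken from the observed feature values, and rows go right when `feature >= threshold` and left otherwise, mirroring the description earlier in the conversation.

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_c p_c^2 for a non-empty array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_split(X, y):
    """Try every (feature, threshold) pair; return the one with the lowest
    size-weighted Gini impurity of the two children."""
    n_samples, n_features = X.shape
    best_feature, best_threshold, best_score = None, None, float("inf")
    for feature in range(n_features):
        for threshold in np.unique(X[:, feature]):
            go_right = X[:, feature] >= threshold
            left, right = y[~go_right], y[go_right]
            if len(left) == 0 or len(right) == 0:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best_score:
                best_feature, best_threshold, best_score = feature, float(threshold), score
    return best_feature, best_threshold, best_score

# Toy table: feature 0 separates the classes perfectly, feature 1 does not.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 10.0], [4.0, 20.0]])
y = np.array([0, 0, 1, 1])

print(gini(y))            # mixed 50/50 parent node
print(best_split(X, y))   # picks feature 0 with threshold 3.0
```

A tree builder would call `best_split` recursively on the left and right subsets until a stopping rule such as a maximum depth is reached.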

Admiral Hex: A common parameter used in decision trees is max depth. Why is that a good parameter?

Spasmodic Gyroscope: Uh, what do you say, max depth?

Admiral Hex: Yeah, yeah.

Spasmodic Gyroscope: So this kind of goes around the concept of, uh, um, what's the word, overfitting, right? Yeah, underfitting, overfitting. That's exactly what the parameter is used for. So if you have a high depth, that means that your tree is going to be large. It's going to be very large. It's going to have a lot of levels and a lot of leaf nodes. That means that your tree will be likely to overfit your data.

Admiral Hex: Right?

Spasmodic Gyroscope: So if you want to— If you want to regulate that, if you want to reduce the depth so that it's not as— it doesn't learn the data too much, right? You can reduce the depth to lower levels. Like, instead of 10, you can go to like 3. And it won't fit as closely to the data. It'll be more generalizable if you lower the parameter, the hyperparameter.

Admiral Hex: I think that makes sense, but can you explain a bit more why a higher depth would fit the data more? What do you exactly mean by that?

Spasmodic Gyroscope: Yeah, so what happens when you increase the depth, right? At a certain point, when you keep increasing that depth, it's eventually going to capture spurious relationships or specific patterns in the data that you don't want to capture, that don't really help you in terms of predicting the target. So you have to find a good balance between the max depth...

Admiral Hex: Yeah, I get that, but like why? That's what I'm trying to like just double-click on. Why would it learn patterns that you don't want it to learn, and what are those patterns?

Spasmodic Gyroscope: Like, yeah, so, um, so eventually it starts straying away off like the main, the main relationship in the data, and it starts picking up on noise, which is like outside that relationship. So when you increase that depth by too much, then it starts venturing in the noise of the data. And that noise is not a large contributor to predicting the target. And so then you start learning these spurious relationships that don't really help you in the end.

Admiral Hex: Yeah, that's the right answer, right? It learns the noise.

Spasmodic Gyroscope: Yeah.

Admiral Hex: Memorizes the noise, right? And then that hurts generalizability. So, um, how do you know if a model is overfitting?

Spasmodic Gyroscope: If it isn't overfitting?

Admiral Hex: Yeah, just generally outside, like decision trees, for any machine learning model, is there a way that you would know that it's overfitting?

Spasmodic Gyroscope: Yeah, so one of the main ways, um, when you have your dataset, you often split it into partitions, right? You have a training set, you have a, uh, a validation set, and you have a testing set. So signs of overfitting in a model are usually indicated in the testing set, where you have a much higher error rate in the testing set, but in the training set it's like extremely good, like a very low error rate in the training set and a much higher error rate in the testing set. That would be one way to see if your model is overfitting.
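The train-versus-held-out comparison described here can be illustrated with a small experiment. The sketch below swaps in polynomial regression for the decision tree, purely because polynomial degree is an easy complexity knob to vary with NumPy alone; the sine-plus-noise data and the degrees tried are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(20)    # small training set
x_held, y_held = make_data(200)     # held-out data the model never sees

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_held, np.polyval(coeffs, x_held)),
    )
    train_err, held_err = errors[degree]
    print(f"degree={degree}: train MSE={train_err:.3f}, held-out MSE={held_err:.3f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. A widening gap between the training and held-out columns is the overfitting signal described above; a high error on both is the underfitting signal.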

Admiral Hex: Okay, and what do you do if the model is underfitting? So if the model is underfitting, or like just before that, um, like how would you fix overfitting in general? Like, so with decision trees, we have this max depth that we use, right? But do you know any general concepts that we might use for any model? Like, what's the general concept behind, uh, like, techniques that you use to reduce overfitting?

Spasmodic Gyroscope: Yeah, so there's a lot of methods in terms of, uh, reducing overfitting in your model. Um, one of the things you can do is take a look at your features, right? If you have features that aren't really providing you with information relative to your target, they could introduce spurious relationships and noise that you don't want your model to capture, which will then make it overfit. So that, you know, that is called feature engineering, right? You're reworking your features so that they are as informative as possible. So that is one way you can help with overfitting. Like I just mentioned before, tuning hyperparameters so that your model doesn't get more complex than you want can help with overfitting. We also have other techniques. There are also methods like principal component analysis that can help you with dimensionality reduction as well. And when you reduce your dimensions, right, you're eliminating, uh, noise that can throw off your model and focusing on the central features of your model. So yeah, dimensionality reduction, which kind of coincides with feature engineering as well. Um, any other techniques that I can do. Um, let me see.

Admiral Hex: Other than hyperparameter tuning, uh, dimensionality reduction and features, uh, so what about the model itself? Like, I, I know you said hyperparameter tuning. Yeah, like what kinds of hyperparameters would be tuned? Like, uh, yeah, so if you were to think about the model in general, what would cause a model to overfit? And this is similar to what you said with the decision tree max depth, right?

Spasmodic Gyroscope: Yeah.

Admiral Hex: Um, like if you were to tune max depth, for example, for different problems, right?

Spasmodic Gyroscope: Mm-hmm.

Admiral Hex: In some cases, you might see that, you know, with max depth of 7, it starts overfitting, right? But with others, you might see that you actually need— it doesn't perform as well. It probably underfits with max depth 7. So what's going on there?

Spasmodic Gyroscope: Yeah, so, you know, it all comes down to the dataset and the problem you're working with. If it constitutes a more, a higher max depth because your model needs to be more complex, then you're gonna get better performance with a more complex model on a more complex problem. So, yeah, it just depends on the problem you have, you know.

Admiral Hex: Right, so model complexity, right? That's what you're saying. So you want to increase model complexity if it's underfitting and decrease it if it's overfitting. So, are you familiar with neural networks by any chance?

Spasmodic Gyroscope: Oh, sorry, can you repeat that?

Admiral Hex: Are you familiar with neural networks?

Spasmodic Gyroscope: Yeah, a little bit. Yep.

Admiral Hex: All right, so if you have deep neural networks, right? So firstly, can you just briefly describe, like, what's the architecture of a deep neural network model?

Spasmodic Gyroscope: Yeah, so a neural network, right, the components of it really are: you have your input layer, right, which is gonna be the input you would put to like a normal machine learning algorithm, like the features of your dataset, right? And then between the input layer and the output layer... Oh, sorry, the output layer would be the target of your model, what you want it to output. So in between the input and output layer, you have these layers called hidden layers, and essentially they transform the data. You know, through parameters learned in training, they transform the data at each step through the layers, all the way down, and funnel it into the output. So you can think of each layer in the hidden layers of a neural network as a transformation function.

Admiral Hex: So what's the analog of max depth here?

Spasmodic Gyroscope: What's the what, sorry?

Admiral Hex: Like max depth is for decision trees. Yeah, what's a similar property for neural networks?

Spasmodic Gyroscope: Yeah, so now that can be for your hidden layers, like the number of, uh, the number of neurons you assign to the layers.

Admiral Hex: Um, the number of layers itself, right?

Spasmodic Gyroscope: Yeah.

Admiral Hex: Okay, so how do you choose the right number of layers?

Spasmodic Gyroscope: Yeah, so number one, you need to take a look at what kind of problem you're dealing with. If it's a complex problem that is not really— it's nonlinear, it's— you're going to require more layers with that. Let me think here.

Admiral Hex: Like this empirically, like how would you go about the process, right? Because yeah, definitely you would start with some number of layers, and then what would you do? Like you would run the test, then what— oh, I see. How's the process go?

Spasmodic Gyroscope: Yeah, so I would start with an educated guess of the number of layers I would need for my problem. I would run the test, I would, you know, uh, run it against my test data, see my accuracy or whatever your metric is for your testing. And I'll see its performance on the training set as well, and see whether it's underfitting, in which case I need to add more layers. If it's overfitting, then I need to reduce the number of layers.

Admiral Hex: Makes sense.

Spasmodic Gyroscope: From the neural network.

Admiral Hex: All right, how are these neural networks trained?

Spasmodic Gyroscope: Yes, so initially all the parameters are set to some random numbers, and you pass through your input data through the neural network, and it propagates through, right? You get your result at the end, and then you use an algorithm called backpropagation. That essentially it finds the step in which it needs to adjust the parameter so that the target at the end will be closer to the true target. And you do that for each parameter using backpropagation.

Admiral Hex: What do you backpropagate?

Spasmodic Gyroscope: How do I backpropagate? You said?

Admiral Hex: Yeah, what, what do you backpropagate? Yes, so, so you, you put, you did the forward pass, you got an answer. Yeah, label. What next?

Spasmodic Gyroscope: Okay, yeah, then you go, uh, you go backwards, right? Hence the name. And you need to calculate the gradient...

Admiral Hex: The gradient of what?

Spasmodic Gyroscope: The gradient of the parameter, or the partial derivative of the parameter, with respect to your cost function, right, your error function.

Admiral Hex: Your loss function, right? And it's the other way around. You find the gradient of the loss function with respect to the parameter.
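To make "the gradient of the loss with respect to each parameter" concrete, here is a tiny hand-worked backward pass. Everything in it is an assumption chosen for illustration: a two-parameter network `y_hat = w2 * tanh(w1 * x)`, a squared-error loss, and made-up values. The analytic gradients from the chain rule are compared against finite-difference estimates.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden activation
    y_hat = w2 * h          # network output
    return h, y_hat

def loss_fn(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2

def backward(w1, w2, x, y):
    """Chain rule applied from the loss back to each parameter."""
    h, y_hat = forward(w1, w2, x)
    dL_dyhat = 2.0 * (y_hat - y)            # dL/dy_hat
    dL_dw2 = dL_dyhat * h                   # y_hat = w2 * h
    dL_dh = dL_dyhat * w2                   # propagate back into the hidden unit
    dL_dw1 = dL_dh * (1.0 - h ** 2) * x     # d tanh(u)/du = 1 - tanh(u)^2
    return dL_dw1, dL_dw2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, y)

# Finite-difference check of both partial derivatives.
eps = 1e-6
num_w1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
num_w2 = (loss_fn(w1, w2 + eps, x, y) - loss_fn(w1, w2 - eps, x, y)) / (2 * eps)

print(grad_w1, num_w1)
print(grad_w2, num_w2)
```

In both printed pairs the analytic and numerical values should agree closely, which is a common sanity check when implementing backpropagation by hand.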

Spasmodic Gyroscope: Mm-hmm.

Admiral Hex: What is the loss function?

Spasmodic Gyroscope: Loss function is going to essentially tell you how off your predictions are off from the target. That's what a loss— a typical loss function would tell you. So an example would be like mean squared error, which is just the squared distance— the squared difference of your target and your prediction.

Admiral Hex: Right. When would you use mean squared error, and when would you use something else as a loss function?

Spasmodic Gyroscope: Yes, so, um, let me think here. So, you know, you have different loss functions for different, um, different types of problems. Like, you have mean squared error and then you have cross-entropy loss, which are both for different types of problems. But another reason is that some of the loss functions have built-in regularization, which would help also with overfitting. So that's also one reason why you might choose one over the other. Also, some of them are more interpretable than others. Like you have root mean squared error and you have mean squared error. With mean squared error, you can't just look directly at the value and say, oh, okay, so this value is in scale with my data. If you want it to be interpretable and just look at the error and be like, oh, okay, that's off by like 3 windshares, if your target was windshares, you would use root mean squared error because then it's scaled back down to your data.
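The interpretability point about MSE versus RMSE can be seen on a few numbers. The values below are invented for illustration.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 9.0, 12.0])

# Squared errors are 1, 0, 4, 9, so MSE = 14 / 4 = 3.5 (in squared target units).
mse = float(np.mean((y_true - y_pred) ** 2))

# RMSE = sqrt(3.5), about 1.87, expressed in the same units as the target.
rmse = float(np.sqrt(mse))

print(mse, rmse)
```

Both metrics rank models identically, since the square root is monotonic; RMSE is simply easier to read against the scale of the target.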

Admiral Hex: If you had a binary classification problem, which loss function would you use?

Spasmodic Gyroscope: If I had a— sorry, sorry?

Admiral Hex: A binary classification.

Spasmodic Gyroscope: A binary classification problem, what would I use?

Admiral Hex: Yeah.

Spasmodic Gyroscope: Yeah, so you can just use the cross entropy loss. You can use that binary cross entropy loss.

Admiral Hex: And if you had a regression problem?

Spasmodic Gyroscope: A regression problem? You can use mean squared error if you want.

Admiral Hex: Right, so that's— That's the main difference, right? For regression, you would use something like mean squared error using the same argument that you said, because you care about the magnitude of the difference. For binary, you only care about like, oh, one said 0, the other said 1, is the 0. So do you know the formula for the binary cross-entropy loss?

Spasmodic Gyroscope: Uh, not on the top of my head.

Admiral Hex: Yeah, so basically it's very simple. It's like, um, minus P log P, right? Sum of— so like, uh, your binary classification algorithm gives you a probability. So you do like P log P if it's a positive example If it's a negative example, you calculate 1 minus p log 1 minus p, right? OK, so because like the general form is just a sum, right? You just add up because it's for classification in general. If it's binary classification, you only have one probability and like two terms here, and then one of these terms becomes 0. If the— if it's a positive example, the, you know, so if it's a positive example, this term would become 0. And if it's a negative example, then this term would become 0. Right. So technically, like, y into p log p, right? So yeah, that's the formula. And then you'll see that, you know, if it's a positive example and your probability is low, Because they're always predicting probability of positive. So if your probability is low, you know, the loss is lower— sorry, loss is higher. Whereas if it's a positive example and the probability is higher, then you know the loss is lower. So yeah, you can work that, work that out. But, uh, that's the kind of formula. And do you know any regularization terms that you would add to the loss function?
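Written out in full, the binary cross-entropy for a label y in {0, 1} and a predicted probability p of the positive class is −[y log p + (1−y) log(1−p)], which is the form given in the written feedback above. A minimal NumPy sketch follows; the function name and the clipping constant `eps` are assumptions added to avoid taking log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the examples."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# Positive example (y = 1): only the y*log(p) term survives.
confident_right = binary_cross_entropy([1], [0.9])   # -log(0.9), about 0.105
confident_wrong = binary_cross_entropy([1], [0.1])   # -log(0.1), about 2.303

# Negative example (y = 0): only the (1-y)*log(1-p) term survives.
negative_case = binary_cross_entropy([0], [0.1])     # -log(0.9), about 0.105

print(confident_right, confident_wrong, negative_case)
```

This matches the behaviour described in the explanation: for a positive example, a low predicted probability gives a high loss and a high predicted probability gives a low loss.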

Spasmodic Gyroscope: Yeah, so you can add a regularization term called L1, which essentially is just the absolute distance between... oh, sorry, that's just the sum of your parameters as a regularization term that would be added onto your loss.

Admiral Hex: Right. And so what's the idea behind that? Why would you add the weights, this L1 of the weights, to the loss?

Spasmodic Gyroscope: Yeah, so what that does essentially is it, it, it constrains your model to keep your weights as small as possible so you can achieve a small loss. Because if you're, if your parameters are large and you're summing that up, you're going to get a very large loss. So it forces your model to keep them close to like relatively zero.
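For reference, the L1 penalty is conventionally the sum of the absolute values of the weights, scaled by a strength hyperparameter, rather than a plain signed sum. The sketch below uses made-up weights, a made-up data loss, and an assumed strength `lam` to show why the absolute value matters.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lam * sum_i |w_i|"""
    return float(lam * np.sum(np.abs(weights)))

weights = np.array([0.5, -2.0, 0.0, 1.5])
lam = 0.01            # regularization strength (hypothetical value)
data_loss = 0.30      # stand-in for the unregularized loss on a batch

penalty = l1_penalty(weights, lam)   # 0.01 * (0.5 + 2.0 + 0.0 + 1.5) = 0.04
total_loss = data_loss + penalty     # 0.34

# Without the absolute value, positive and negative weights cancel:
signed_sum = float(np.sum(weights))  # 0.5 - 2.0 + 0.0 + 1.5 = 0.0

print(penalty, total_loss, signed_sum)
```

Because large weights of either sign increase the total, minimizing the regularized loss pushes weights toward zero, which is the constraint described in the answer.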

Admiral Hex: Makes sense. Um, all right. Have you heard the term gradient descent?

Spasmodic Gyroscope: Yes.

Admiral Hex: What is it?

Spasmodic Gyroscope: It is an optimization algorithm that is used to update the parameters in a way that reduces the cost function or your error function.

Admiral Hex: Right. And do you know the concept behind it?

Spasmodic Gyroscope: Yeah, so what we're doing is we're taking the derivative, right? And here there's going to be multiple dimensions, so we end up taking the partial derivative of the cost function with respect to each dimension. And we're essentially calculating the step by which we need to adjust the parameters so that they reach the minimum of the cost function, right? Or when the partial derivative is equal to 0 for that specific parameter.

Admiral Hex: And how do we decide which direction to take the step in?

Spasmodic Gyroscope: Yeah, so we're going to be taking the— We're going to be taking— we're going to be negating our step here, so then we can go down towards the optimum. Otherwise, we'd be going up.

Admiral Hex: Right, and then we could go in multiple directions. How do we choose a direction?

Spasmodic Gyroscope: So yeah, the direction is going to be towards the optimum minimum. So we essentially set it equal to 0, because that's our goal, so that the derivative of that parameter is equal to 0. And then we look at the step that would make it closer to 0.

Admiral Hex: Right, so we choose the direction that reduces the loss function the most.

Spasmodic Gyroscope: Yeah.

Admiral Hex: Yeah, so you find the gradient of the loss function and we choose the direction with the highest gradient. Right?

Spasmodic Gyroscope: Mm-hmm.
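
The direction question can be sketched in a few lines. The gradient points in the direction of steepest increase of the loss, so the update steps along the negative gradient. This is a minimal sketch on a hypothetical two-parameter quadratic loss; the loss, the starting point, and the names `grad` and `gradient_descent` are assumptions for illustration only.

```python
def grad(w):
    """Gradient of the assumed loss (w0 - 3)^2 + (w1 + 1)^2."""
    return [2 * (w[0] - 3.0), 2 * (w[1] + 1.0)]


def gradient_descent(w, alpha=0.1, steps=200):
    """Repeatedly step against the gradient, scaled by the learning rate alpha."""
    for _ in range(steps):
        g = grad(w)
        # Subtracting the gradient moves downhill; adding it would move uphill.
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w


w_final = gradient_descent([0.0, 0.0])
print(w_final)  # approaches the minimum at (3, -1)
```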

Admiral Hex: And what about the step? How do we know how much of a step to take?

Spasmodic Gyroscope: Yeah, so that is actually determined by a hyperparameter called alpha, or the learning rate.

Admiral Hex: Yeah. As you're training a model, how does the alpha change across time? Like, do you have one alpha that you use throughout your entire training run, or are there sophisticated ways of choosing alpha?

Spasmodic Gyroscope: Yes, so your alpha, your learning rate, is basically how big of a step you're making towards achieving the optimal parameters. So usually people keep it constant, where the step is always the same size, but you could actually adjust it with a learning-rate schedule, and it dynamically updates during your gradient descent.
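
One simple example of such a schedule is exponential decay, where the learning rate shrinks by a fixed factor each epoch. This is a minimal sketch, not the only option; the function name `exponential_decay` and the values 0.1 and 0.5 are assumptions for illustration.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate for a given epoch: initial_lr * decay_rate ** epoch."""
    return initial_lr * (decay_rate ** epoch)


# Larger steps early in training, smaller steps later.
schedule = [exponential_decay(0.1, 0.5, epoch) for epoch in range(4)]
print(schedule)
```

Other common shapes, such as step decay or cosine decay, follow the same pattern of mapping the epoch (or step count) to a learning rate.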

Admiral Hex: And how would you choose it? Like, what would be a reason? Like, any idea on what kind of schedule you could choose?

Spasmodic Gyroscope: Yeah, so a common learning rate is about 0.1, because it's a balance between taking too big of a step and taking too little of a step. So 0.1 is a harmonic type of value that a lot of people use to start off with. But if you notice—

Admiral Hex: Oh, sorry, like as a schedule, right? So maybe you could have a higher learning rate in the first epoch, or like the initial runs, and then gradually reduce it, right? Because when you initialize your model from random, initially you want to take like big steps because everything is very random. But after you've taken a bunch of these big steps, you've probably converged to something more reasonable already. So you don't want to lose that. You don't want to go off in a random direction again. So over time, you sort of reduce the learning rate, and then that helps it converge, right? Otherwise, it could just go into a different space altogether. Like, there's a risk that it could just miss the minimum and go off in a different direction, right? So if you come to the whiteboard for a second, uh, Oh, that's all the whiteboard. Okay, so this is like your gradient surface, right? And then let's say you're here, right? And then let's say you took like a big step, so you might miss this and you might just end up here because you took too big a step, right? Yeah, so if you keep doing that, you just keep oscillating like this. So when you're here somewhere, you want to take like big steps towards the minimum, because those big steps will get you closer and closer. But when you're here, you actually want to take smaller steps. Right, so how do you know you're here? Just by saying, I've seen more of the data, so I'm probably in a reasonable space, right? So I'll now take smaller steps. All right. What else is there?
Okay, just one sort of more practical question. Let's say you're Amazon and you're designing the front page, which has a list of products, right? Um, so, you know, a user opens the page, they see a list of products, and now obviously for this user we know their buying history and, you know, we have all the data about the interactions at Amazon and, you know, we have all the data about all the products used in Amazon as well. How would you model this as an ML problem?

Spasmodic Gyroscope: Yeah, so what you said was you have an Amazon front page and your goal is to, uh, use an ML model to suggest the most recommended products based on their buying history?

Admiral Hex: Yes.

Spasmodic Gyroscope: Yeah, so with this type of problem, clustering really shines. You can use an unsupervised algorithm like clustering to find similar products to what the consumer has bought. And—

Admiral Hex: But what if you want to model it as supervised learning? Because, you know, the problem with clustering is it's really not that complex. So if you go back to the underfitting, overfitting discussion, clustering models will almost certainly underfit, especially on a problem like this, which is very, very difficult. Clustering just doesn't have the generalizability or the power to represent the actual relationships between users and items at that scale. I mean, you will get some reasonable results, don't get me wrong, but firstly, clustering won't be personalized, in a way, because any user who's purchased a product will sort of fall in a similar category, right? It's not really personalized is what I'm saying, right? It's not really tuned to you as a person, because it's just like you are similar to a bunch of other users who all have very similar buying history. And then the other thing is, I've bought like 100 things in the past, and each one of those belongs to maybe different clusters. How do you use that to generate recommendations? So let's try to model it as a supervised learning approach. How would we do that?

Spasmodic Gyroscope: Let me think here. Um, so I don't know if this is viable, but you can use a linear regression to predict the probability they would like this specific product.

Admiral Hex: Okay.

Spasmodic Gyroscope: Or better yet, you know, instead of linear regression, use logistic regression, because if we're predicting a probability, we want to use logistic regression. And the way we can create a model like that is, we need to represent these products as a vector, a collection of values. So we need to take the main points of information from each product and standardize them into a vector. That way we can feed it into a model, and the model can train, learn the parameters that are specific to the buyer, and then output a probability. But that isn't really scalable if you have a large consumer base and you have to train a model for each consumer. But that is another way you can do it, supervised.

Admiral Hex: All right, uh, and how would the training data look for that logistic regression model? What is like one row in the training data?

Spasmodic Gyroscope: So for your input or your X, it's going to be the features of the product. So maybe one of the features will be a multi-class one, like whether it's used for sports or entertainment. But basically, you know, the features describing the product that could be applicable to all products on the Amazon website. That'll be, you know, a vector there. And then for your output, it's just going to be 1, representing that the consumer bought this product, and 0 would mean the consumer did not buy the product.

Admiral Hex: Right, but then it's not just the product features that are in there, but also the user features, right? So the input is a user and product pair, isn't it? And then the output is a probability. So you want some way to represent the user as well, right? Because you do this for all the users and all the products, all the possibilities of users and products. That's your dataset.
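
The (user, product) framing can be sketched as a single shared model rather than one model per consumer. This is a minimal sketch under stated assumptions: the two-dimensional user and product features, the toy purchase labels, and the helper names are all hypothetical. The cross features (user interest multiplied by the aligned product category) are an addition made here so that a linear model can express "this user likes this kind of product"; they were not discussed in the interview.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_row(user, product):
    """One training row: user features + product features + aligned cross features."""
    cross = [u * p for u, p in zip(user, product)]
    return user + product + cross


def train_logistic(rows, labels, alpha=0.5, epochs=2000):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - alpha * err * xi for wi, xi in zip(w, x)]
            b -= alpha * err
    return w, b


def predict(w, b, row):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)


# Hypothetical features: user = [likes_sports, likes_books],
# product = [is_sports, is_book]. Label 1 = bought, 0 = did not buy.
sports_fan, reader = [1, 0], [0, 1]
tent, novel = [1, 0], [0, 1]
rows = [make_row(sports_fan, tent), make_row(sports_fan, novel),
        make_row(reader, tent), make_row(reader, novel)]
labels = [1, 0, 0, 1]

w, b = train_logistic(rows, labels)
print(predict(w, b, make_row(sports_fan, tent)))   # high probability
print(predict(w, b, make_row(sports_fan, novel)))  # low probability
```

Because every row carries both user and product features, one set of parameters serves all users, which avoids the per-consumer scaling problem raised earlier.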

Spasmodic Gyroscope: Yeah, I see what you mean.

Admiral Hex: Okay.

Spasmodic Gyroscope: Yeah.

Admiral Hex: All right, let's, uh, we can end the interview here. So before I share my feedback, I want to understand from you how this interview went. What are the things you did well? And what are the areas for improvement?

Spasmodic Gyroscope: Yeah, so for me, I definitely can point out I have a lack of knowledge in machine learning. I can definitely brush up on a lot of the fundamentals, the algorithms, a little bit about why you would use one algorithm over the other. Um, so, you know, I need to work on my fundamentals. Also, I need to slow it down a bit. Sometimes I like to jump in and just answer the question without a kind of path in my mind of what I want to say exactly. I should slow down, try to mentally structure what I'm going to say before saying it so it comes out clear and understandable to the interviewer. So yeah, those two right now I could think of. As a strength, I think that when the words do come out, I can communicate loudly and somewhat clearly. It's a little bit of communication, though it could work. I think that communication is a little more my strong suit. And at the moment, that's all I can think of for advantages.

Admiral Hex: Yeah, so overall, you know, you actually did pretty well. I've been interviewing a lot of the people from AI for All, and I'll be honest, this is one of the better interviews that I've conducted. I think you did a good job with most of the concepts. And then, in terms of areas for improvement, I think the practical aspect is where I would say your biggest gap is. I think on theory you're doing pretty well, right? One thing on the theory, maybe you're missing a bit of depth there, which, you know, you would have realized through this interview as well. Like cross-entropy loss, you know, binary cross-entropy loss is a pretty simple formula. It's used for almost every binary classification, so it's worth knowing that, right? It's also worth knowing regularization. When I asked about overfitting, you didn't mention regularization, but when you were talking about loss functions, you did mention it. And then, you know, just being able to simply say that regression problems would use mean squared error and classification would use cross-entropy. I think that's a pretty fundamental thing which almost everyone who works in ML knows, right? Okay. So I think those were 2 or 3 things that were sort of missing. And then on the practical aspect, like, yeah, just how do you model a problem as an ML problem, right? So I think for you, where I would focus more is actually not too much on theory per se. I would actually encourage you to try running machine learning training in practice, right? So actually training these models and, you know, getting the sort of more practical experience. There are websites like Kaggle where you can try out various challenges that they post from time to time. Or, you know, you don't even have to do that. You can just train some models on your own machine, just see that underfitting, overfitting thing happening in practice. Actually work with constructing training data that you feed into the model yourself, right? That helps you understand, like, what are the challenges with the training data? Like, if you don't get your training data right, what happens? And then feature engineering. So you did mention some good feature engineering techniques, but, you know, being able to implement them in practice is a very valuable skill. But overall, like I said, you did pretty well. I would say it's borderline, like I wouldn't necessarily hire you if you were interviewing right now, but given that you're still in university and still learning, it's pretty good, right? It's one of the better performances I've seen. So I'm pretty confident that if you, you know, continue down the road that you are on, continue learning, continue practicing, you have a really good chance to actually become an ML engineer when you graduate. So yeah, good luck and congrats.
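
The two loss functions named in this feedback can be written down in a few lines. This is a minimal sketch of binary cross-entropy, -[y log p + (1 - y) log(1 - p)] averaged over examples, and of mean squared error; the function names and the clipping constant `eps` are assumptions added here to avoid taking log(0).

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so the logarithm stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred)) / len(y_true)


# Classification: labels are 0/1, predictions are probabilities.
bce = binary_cross_entropy([1, 0], [0.9, 0.1])
# Regression: targets and predictions are real numbers.
mse = mean_squared_error([1.0, 2.0], [1.5, 2.0])
print(bce, mse)
```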

Spasmodic Gyroscope: Yeah, yeah, thank you for coming out today and, uh, doing this with me. Um, yeah, that was, that was my first ever, like, I've done, like, you know, uh, software engineering interviews, but I never done, like, a machine learning one before, so that was definitely an eye-opener. Um, so yeah, I really appreciate you coming out and helping me for this.

Admiral Hex: So, all right, um, I'm happy to answer any questions you have.

Spasmodic Gyroscope: Yeah, um, so I'm actually like, um, though I just interviewed today for machine learning, I'm actually, you know, striving to become a data scientist. So I know that machine learning is more as a tool in that type of field. But it's still— I still think it's just as important to achieve this level of, like, to go as deep as I can with machine learning in terms of, you know, it's like a really good skill to have as a data scientist or as an ML engineer.

Admiral Hex: I would say that, you know, ML engineer is probably better in terms of your career opportunities. Not that data scientist is anything to sneeze at, but, you know, you just get paid more, I guess, as an ML engineer, and you have more opportunities to advance, I would say, especially in like big tech companies, like bank companies. But given that your background is more in statistics, data scientist is also a solid choice. So the main difference I would say is that ML engineers have better coding skills. They're better at coding is what I would say. Like, they can just write better Python code. So yeah, if you're good at coding, then I would actually encourage you to go down the engineering path. I think ML engineering is a really good role where you need deep statistics knowledge as well as coding knowledge. And then, you know, it is one of the most prestigious and highest-paying jobs that you can get out there, and it also sets you up well for the future.

Spasmodic Gyroscope: Mhm, yeah.

Admiral Hex: But at the same time, I won't force it, you know, like if you don't enjoy coding as much and you're more deep into statistics, data scientist is also like a very solid choice. But like you said, the data scientist, you need like the core stats knowledge much more than you need like the ML knowledge. Yeah. Okay, any more questions?

Spasmodic Gyroscope: Um, let's see, let me think here. I had a question in mind, but it isn't coming straight to mind.

Admiral Hex: Think about it, it's okay.

Spasmodic Gyroscope: Yeah, let me just think because I want to make sure I answer all I want to answer, you know. Um, so for a position like an ML engineer and like, and, you know, interviewing for ML engineers, how large of a portion you think like the neural networks would be for the— like for the interview? How large— how, how big of a part of the interview would be just dedicated to neural network knowledge?

Admiral Hex: Um, not a huge amount. I think conceptually, the concepts that you already know, based on your knowledge from this interview, in terms of the pure concepts, you would actually stand a good chance. The main gaps I would see are mainly around the practical aspects, which is things like, you know, I want to do product recommendation for Amazon. How do I model that as an ML problem? Right? Any ML engineer will just say in a couple of seconds that, oh, the first thing you do is take a user and a product and predict a probability, right? And then go into, okay, what are the user features? What are the product features? Whether you use logistic regression or a neural network doesn't matter in the scheme of things, because more or less it's a black box. But you do need to be able to understand concepts like overfitting, and then saying, if it's a neural network and I add more layers, that might cause more overfitting, things like that. It's more about how do you translate a business problem into an ML problem correctly, and then how do you construct the training data? How do you construct the features, right? And then reasonable choices on models, like, you know, obviously you won't use clustering, for example, right? You would use random forests or neural networks or, you know, GBDTs, or even logistic regression is a good choice to start with. And then it's about how do you evaluate the model, right? If something goes wrong, how do you know? How do you figure out what went wrong and how do you fix that? Those are more the practical aspects. And the way you learn that is just by doing it yourself, right? Just by training models yourself and trying out different things and seeing what happens. So learning by doing is the main thing I would focus on for you, right? Like, if you're able to do that in the next year or so, that sets you up really well with the theoretical knowledge that you already have.
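
On the evaluation point, one option for a recommender (listed in the written feedback for this interview) is precision@k: rank products by predicted probability and check how many of the top k the user actually bought. This is a minimal sketch; the product names, scores, and the function name `precision_at_k` are hypothetical.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k ranked items that are in the relevant set."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


# Hypothetical predicted purchase probabilities for one user.
scores = {"tent": 0.9, "novel": 0.2, "bike": 0.7, "lamp": 0.4}
ranked = sorted(scores, key=scores.get, reverse=True)

# Hypothetical ground truth: the user actually bought the tent and the lamp.
p_at_2 = precision_at_k(ranked, {"tent", "lamp"}, k=2)
print(ranked, p_at_2)
```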

Spasmodic Gyroscope: Okay.

Admiral Hex: All right, if that's— yeah, any— do you have any other questions?

Spasmodic Gyroscope: Uh, no, that's, that's all my questions.

Admiral Hex: All right, cool. Uh, all right, best of luck with your preparation and your studies, and hope you do well in the future. And, you know, have a nice day.

Spasmodic Gyroscope: Yep, you too. Thank you.