
Our business depends on having the best interviewers, so we built an interviewer rating system. And you can too.

By Aline Lerner | Published:

interviewing.io is an anonymous mock interview platform and eng hiring marketplace. Engineers use us for mock interviews, and we use the data from those interviews to surface top performers, in a much fairer and more predictive way than a resume. If you’re a top performer on interviewing.io, we fast-track you at the world’s best companies.

We make money in two ways: engineers pay us for mock interviews, and employers pay us for access to the best performers. Companies connect with our engineers through anonymous technical interviews — this is really important to us because we don’t want employers to be biased by where people went to school or where they previously worked. When employers use interviewing.io, they get great candidates, on the condition that they’re willing to talk to them anonymously. And it really is anonymous – they get zero information about the candidates ahead of time, and they agree to devote precious engineering time to interviewing people who are, from their perspective, basically randos off the internet.

To keep our engineer customers happy, we have to make sure that our interviewers deliver value to them by conducting realistic mock interviews and giving useful, actionable feedback afterwards.

To keep our employer customers happy, we have to make sure that the engineers we send them are way better than the ones they’re getting without us¹. Otherwise, it’s just not worth it for them.

This means that we live and die by the quality of our interviewers, in a way that no single employer does, no matter how much they say they care about people analytics or interviewer metrics or training. If we don’t have really well-calibrated interviewers, who also create great candidate experience, we don’t get paid.

Fortunately, we have two pieces of data that no single employer can collect: 1) honest candidate feedback and 2) real outcomes for candidates, from multiple interviews at multiple companies.

The former helps us figure out how effective our interviewers are at engaging candidates and creating great candidate experience. The latter helps us figure out how well-calibrated our interviewers are, with a high degree of confidence.

Over time, as we’ve used these metrics, our interviewer quality has gone way up.

In this post, we’ll explain exactly how we compute and use these metrics to get the best work out of our interviewers. In a follow-up post, we’ll talk about how to bring some of these approaches and metrics to your own interview process.

Before we get into all that, I’d like to provide some context about why these metrics are important, and why you can’t assume that just because an interviewer comes from a great company, they’ll be great at conducting interviews.

Interviewer quality generally sucks and isn’t something companies really care about

We host thousands of interviews a month, and that means we need a lot of interviewers. But, as you saw above, we live and die by interviewer quality, so we need those interviewers to be world-class.

When I first started interviewing.io, I assumed that if we were very picky about whom we let be an interviewer, then all of our interviewers would be great. These are the criteria we came up with and still use today:

Engineers from FAANG and FAANG-adjacent companies (e.g., Uber, Stripe, Dropbox)
At least 4 years of experience (though today the average on our platform is 8)
Conducted at least 20 interviews at FAANG or a FAANG-adjacent company

As we quickly learned, however, even these stringent criteria weren’t enough. We found ourselves facing two problems. First, our users would regularly tell us that their interviewer didn’t seem like they were paying attention, was a jerk, or didn’t teach them anything useful. Second, over time, as interviewing.io became more popular and gained more users, we saw that our candidates weren’t performing as well in real interviews as they previously had… which meant that our interviewers weren’t as well calibrated as we thought.

I could speculate at length about why just choosing interviewers from top companies isn’t good enough, but my suspicion is that interviewer training is not created equal.

Some companies invest considerable time and effort into teaching their interviewers how to be present during an interview, how to give hints, how to ask good follow-up questions, and how to make the candidate walk away feeling like they spent an hour working with a smart friend on a cool problem.

Others just throw new interviewers into the fray.

Some companies, like Opendoor (whose eng manager shared with us the notion of “superforecasters”), keep detailed metrics on which interviewers have the best eye for talent. But other companies shoot blind.

Even with great training, some interviewers won’t put in the work to stay present and engaged during interviews. Interviewing people is polarizing — some love it and some hate it — and if you don’t love it, it’ll show.

With this said, while our criteria might sound like a pretty high barrier to entry, the reality is that interviewer quality, before we instituted some formal metrics, was all over the place, both with respect to candidate experience and calibration.

Enter the money-making metrics!

The need for two metrics

As you saw, we care about two separate things: how good of a candidate experience our interviewers create and how well-calibrated they are.

Candidate experience matters for retention. We want engineers to complete as many mock interviews as they need on our platform. Some users will be ready after one mock, but others will need more practice. What we don’t want is for someone to leave us before they’re ready simply because they had a negative experience with their first (or second) interviewer.

We also care about retention on a multi-year timeline. We’re a hiring marketplace, so we know that — like a dating site — if we do our job well, our users will leave us. Unlike a dating site, though, people today aren’t married to their jobs. The average tenure of a software engineer in the US is 2.4 years. We hope that once our users are considering their next move, they’ll come back to us to practice and de-rust.²

Calibration matters just as much as candidate experience. I already talked about how the employers who hire through us are taking a leap of faith because they’re agreeing to spend eng time on our candidates, sight unseen. Calibration and candidate experience are also inextricably linked — feedback that’s too lenient might boost a candidate’s confidence in the short term, but it will also hurt their chances of doing well in interviews, which ultimately reflects poorly on us.

Candidate experience and calibration are related, but they’re different enough to warrant tracking separately. We ultimately decided to create a “candidate experience” score and a “calibration” score for each interviewer and track them over time, both individually and in aggregate across the whole platform.

First, how did we find the interviewers who drove customer success?

Metric 1: The candidate experience score

After each interview on interviewing.io, we collect feedback from both candidates and interviewers. As you may recall, interviews are fully anonymous, and you don’t see the other person’s feedback until you submit yours. This means that on our platform, candidates are incentivized to tell us honestly how things went.

This is what our candidate feedback form looks like:

Though we’d been qualitatively tracking candidate feedback for years, we recently quantified it and created a per-interviewer candidate experience score. Here’s what we did.

Ranking how well people engage customers is a wide-ranging task: similar questions could be asked about Uber drivers, Instacart shoppers, sales reps, or even schoolteachers.

A core issue in customer-facing platforms is that customers are too… nice. There’s grade inflation in the feedback. For example, Uber driver ratings are compressed at the top, with most drivers clustered around 4.8 or so. In absolute terms, a 4.8 seems not that different from a 4.7, but seasoned riders have learned the difference. Despite interviews being anonymous and feedback being mutually incentivized, it’s true in our data as well: a seemingly high average in absolute terms could mask a chronic underperformer, so it’s important to use relative rankings and concrete outcomes (e.g., customer churn) in addition to customer ratings.

What did our score need to do? We saw three overarching attributes that needed to be balanced:

Predictive: Scores should capture something persistent about the interviewer’s ability. Someone’s score today should tell you something about their performance tomorrow.
Fair: Scores shouldn’t feel arbitrary; they should be based on meaningful metrics and interviewers should be given space to improve.
Legible: Scores should be interpretable to both the interviewing.io ops team and to the interviewers themselves.

Our first task was to check whether an interviewer-based metric was predictive. This wasn’t guaranteed: most of the variation in the customer experience could be driven by the attitudes of the customers or extraneous factors like the weather that day. If this were the case, any ranking of interviewers would be entirely based on luck.

Instead, we found evidence that interviewers had stable skill levels — someone’s history of customer ratings was highly predictive of what the next customer would say about them. These results extended beyond just survey responses: customers who got paired with a lower-ranked interviewer were more likely to leave the platform entirely. A bad interviewer increased churn by almost 10 percentage points.

The graph below shows the result. If interviewers had no effect on churn, we would expect the fitted line to be flat. Instead, once we rank interviewers based on their past sessions, we can predict how likely a customer is to leave the platform after an interview with them. Around 30% of customers leave if they get an interviewer from the bottom fifth. But only 20% of customers leave if they get an interviewer from above the 80th percentile.
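If you want to run the same kind of check on your own data, here is a minimal sketch: rank interviewers by their average rating from earlier sessions, split them into fifths, and compute the churn rate for later sessions in each fifth. The record layout, the function names, and the toy numbers are all assumptions made for illustration, not our actual pipeline.

```python
from collections import defaultdict

def quintile_of(rank, n):
    """Map a 0-based rank among n interviewers to a quintile, 0 (bottom) to 4 (top)."""
    return min(4, rank * 5 // n)

def churn_by_quintile(prior_avg_rating, sessions):
    """
    prior_avg_rating: {interviewer_id: average rating from EARLIER sessions}
    sessions: list of (interviewer_id, customer_churned) for LATER sessions
    Returns {quintile: churn rate} for every quintile that has sessions.
    """
    # Rank interviewers from lowest to highest past rating.
    ranked = sorted(prior_avg_rating, key=prior_avg_rating.get)
    quintile = {iid: quintile_of(i, len(ranked)) for i, iid in enumerate(ranked)}

    churned = defaultdict(int)
    total = defaultdict(int)
    for iid, left in sessions:
        q = quintile[iid]
        total[q] += 1
        churned[q] += int(left)
    return {q: churned[q] / total[q] for q in sorted(total)}

# Toy data (hypothetical): five interviewers, sessions only for the worst and best.
ratings = {"a": 3.0, "b": 3.5, "c": 4.0, "d": 4.5, "e": 5.0}
later = [("a", True)] * 3 + [("a", False)] * 7 + [("e", True)] * 2 + [("e", False)] * 8
rates = churn_by_quintile(ratings, later)
```

Using only earlier sessions for the ranking and only later sessions for the churn rate is what keeps the comparison honest: if the per-fifth rates come out flat, the ranking is not predicting anything.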

So, interviewers had stable traits which, in the worst cases, could actually scare customers off the platform. Our next step was to condense this valuable information into a single score. Machine learning methods have one suggestion for this: use the interviewer identities as features in a model predicting a customer outcome like attrition or dissatisfaction. The coefficients could then be used as a measure of quality.
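To make that idea concrete, consider the simplest possible version: a linear model with no intercept and one indicator feature per interviewer. In that special case, the least-squares coefficient for each interviewer is just the mean outcome across their sessions, so it can be computed without any modeling library. The sketch below assumes a hypothetical list of (interviewer, churned) records; a real model would likely add controls and regularization, and this post does not specify which model was used.

```python
from collections import defaultdict

def interviewer_effects(sessions):
    """
    sessions: list of (interviewer_id, outcome), where outcome is 1 if the
    customer churned after the session and 0 otherwise.
    Returns {interviewer_id: coefficient}. With one-hot interviewer features
    and no intercept, each least-squares coefficient equals that
    interviewer's mean outcome.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for iid, outcome in sessions:
        sums[iid] += outcome
        counts[iid] += 1
    return {iid: sums[iid] / counts[iid] for iid in counts}

# Toy data (hypothetical): "a" loses half of their customers, "b" loses none.
effects = interviewer_effects([("a", 1), ("a", 0), ("b", 0), ("b", 0)])
```

A higher coefficient here means more churn, so a quality ranking would sort interviewers from the lowest coefficient to the highest.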

While we did estimate these, a ranking based on these numbers wouldn’t be very legible. At best, we could tell the interviewers that their quality scores were based on a well-calibrated model.

It turned out that averages of the more readily understood dimensions of customer feedback (how excited they are to work with them and the quality of the interview questions, hints, and communication) were closely correlated with the machine learning predictions. So to reduce the complexity in the score, we used the linear combination of the customer feedback factors that were most correlated with the output from the more complicated model.

Thus, we boiled down our optimal prediction into a simpler linear model that the interviewers could even calculate themselves, giving them insight into which components were holding them back. This would also help our ops team provide focused nudges to the interviewers.
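A score like this is simple enough to compute by hand. The sketch below shows the general shape: a weighted sum of an interviewer's average ratings on each feedback dimension, plus a helper that points at the weakest dimension. The dimension keys and the weights are hypothetical placeholders; the real weights are not published in this post.

```python
# Hypothetical weights, chosen only for illustration. They sum to 1 so the
# score stays on the same scale as the underlying ratings.
WEIGHTS = {
    "excitement": 0.4,
    "question_quality": 0.2,
    "hints": 0.2,
    "communication": 0.2,
}

def candidate_experience_score(avg_feedback, weights=WEIGHTS):
    """Weighted sum of an interviewer's average rating on each dimension."""
    return sum(weights[k] * avg_feedback[k] for k in weights)

def weakest_component(avg_feedback, weights=WEIGHTS):
    """Name of the scored dimension with the lowest average rating."""
    return min(weights, key=lambda k: avg_feedback[k])

# Toy example (hypothetical averages on a 1-5 scale).
avg = {"excitement": 4.5, "question_quality": 4.2, "hints": 3.1, "communication": 4.4}
score = candidate_experience_score(avg)
focus = weakest_component(avg)  # "hints"
```

Because every term is visible, an interviewer can see which dimension is dragging the total down, which is the legibility property the opaque model coefficients lacked.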

Was it fair? In general, the model’s out-of-sample predictions were more accurate when it was trained on each interviewer’s full history in the data. This is unsurprising, but it meant that veteran interviewers who had shown recent improvement might have to wait a year for this to be reflected in their score. This low-score purgatory would be too demoralizing.

Another fairness concern was that interviewers in hot water with objectively high averages could complain that they’d received only “Excellent” and “Great” ratings, much like a 4.6-star Uber driver. We emphasized to interviewers that these scores put them in the bottom 5-10% of interviewers, with real business consequences.

One early challenge we faced was setting the right observation window. Originally, we decided to look at interviewers who had conducted at least 5 interviews in a rolling 90-day period. But because some interviewers had 5 in that 90-day period and others had 150, this created very different sample sizes and made it possible for a single interview to derail someone’s score.

After measuring the predictiveness of different look-back periods, we eventually decided to limit it to the most recent 30 interviews. We saw that, while more data was always better, the curve flattened around 30. Sessions older than that would be ignored so that interviewers on an upward trajectory would have this reflected in their scores.
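The windowing itself is easy to sketch. Assuming each interviewer's ratings are stored oldest first (an assumption about the data layout, as is the function name), averaging only the most recent 30 looks like this:

```python
LOOKBACK = 30  # from the post: predictiveness flattened around 30 sessions

def recent_average(ratings_oldest_first, lookback=LOOKBACK):
    """Average of the most recent `lookback` ratings; None if there are none."""
    window = ratings_oldest_first[-lookback:]
    if not window:
        return None
    return sum(window) / len(window)

# A veteran with 50 weak sessions followed by 30 strong ones (toy numbers).
improved = [2.0] * 50 + [4.0] * 30
recent_average(improved)  # 4.0: the older sessions no longer count
```

Compared with a rolling 90-day window, a fixed count gives every interviewer the same sample size once they pass 30 sessions, so one bad session moves a busy interviewer's score about as much as a less busy one's.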

Our final result was a simple linear function of someone’s average customer feedback, with different weights according to how predictive that factor was of the machine learning estimates, and with the limited look-back period. Below, you can see our weighted (by number of interviews) aggregate candidate experience score across all the mock interviews on the platform. We track this as an internal health metric, and it’s a remarkably useful metric that shows us how happy our users are, at a glance (and correlates very directly with NPS).
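One plausible reading of "weighted by number of interviews" is that each interviewer's score counts in proportion to how many interviews they conducted, so the platform number reflects what a typical candidate experienced. Here is a sketch under that assumption; the input format and function name are hypothetical.

```python
def platform_score(per_interviewer):
    """
    per_interviewer: list of (score, n_interviews) pairs.
    Returns the interview-weighted mean score, or None if there were no interviews.
    """
    total = sum(n for _, n in per_interviewer)
    if total == 0:
        return None
    return sum(score * n for score, n in per_interviewer) / total

# One busy, strong interviewer and one occasional, weak one (toy numbers).
platform_score([(4.0, 30), (2.0, 10)])  # 3.5, versus an unweighted mean of 3.0
```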

Metric 2: The accuracy score

Outside of delivering great candidate experience, it’s critical to our business that our interviewers are well-calibrated. After all, if our candidates aren’t amazing, the companies who hire through us will stop giving precious eng time to interviewing our users. They’ll just go back to using resumes.

So how much better do our candidates have to be?

In a good funnel at a company with a high engineering bar, candidates pass the technical screen about 20-25% of the time. Over time and after a bunch of trial and error, we realized that we needed our candidates to pass interviews about 70% of the time.³ That’s about 3X what our customers see in their funnels, and it’s the threshold where working with us started to feel like magic. After all, when companies work with us, we’re removing their ability to choose who they talk to. They just have to trust us… and that makes the psychological burden of proof higher than you might expect. 50% pass rate, for instance, wasn’t enough to feel like magic.

In a nutshell, we had to ensure that our interviewers were well-calibrated enough to achieve a 70% passthrough rate for our candidates in real interviews.

Fortunately, we had the data to make this happen.

On interviewing.io, engineers do mock interviews, and top performers do real ones. Both sets of interviews have an identical feedback form that the interviewer completes after each interview:

To figure out how “accurate” our mock interviewers were — a measure of whether their judgments tended to be too strict, too lenient, or well-calibrated, relative to the hiring decisions of companies — we compared how they rated their candidates to how those candidates performed later, in real interviews.

In other words, the accuracy score rests on a principle similar to the candidate experience score’s, but it uses less subjective measures than customer feedback.

Using just the interviewers’ own ratings and decisions of hiring companies, we asked whether some interviewers regularly under- or overestimated their candidates compared to candidates’ subsequent interviews with real companies. The answer was yes, and this result persisted in out-of-sample testing.

Much like the customer feedback, our interviewers had a bigger problem with positivity: the overly positive interviewers indicated they would hire their candidates 75% of the time, while their candidates received a yes just 60% of the time in real interviews — a considerable gap. But a sizable portion of well-calibrated interviewers closely matched the judgments of companies, and we told them so.

Our accuracy calculations were a continuous score, but we wanted to condense this. We converged on a -4 to 4 scale, a small number of values that we could connect to a distinct message. People with a 0 are among the best-calibrated interviewers. People with 3s and 4s are definitely too positive: they’re almost twice as likely to want to hire someone compared to the company. People with negative values are similarly too negative.
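To illustrate the shape of such a score, here is a sketch that compares how often an interviewer said "hire" with how often those same candidates later passed real interviews, then condenses the gap onto a -4 to 4 scale. The -4 to 4 range comes from this post; the mapping itself (one point per 10 percentage points of gap, rounded and clamped) is a hypothetical stand-in, since the actual conversion is not described here.

```python
STEP = 0.10  # hypothetical: one scale point per 10 percentage points of gap

def hire_rate(decisions):
    """decisions: non-empty list of booleans (True = 'would hire' or 'passed')."""
    return sum(decisions) / len(decisions)

def accuracy_score(mock_decisions, real_outcomes, step=STEP):
    """
    mock_decisions: the interviewer's hire/no-hire calls on their candidates.
    real_outcomes: how those same candidates fared in later real interviews.
    Positive = too lenient, negative = too strict, 0 = well calibrated.
    """
    gap = hire_rate(mock_decisions) - hire_rate(real_outcomes)
    return max(-4, min(4, round(gap / step)))

# Toy example: says "hire" 80% of the time, candidates pass 60% of the time.
lenient = accuracy_score([True] * 8 + [False] * 2, [True] * 6 + [False] * 4)
```

Under this hypothetical mapping, the toy interviewer lands at 2: clearly on the lenient side, but not at the extreme end of the scale.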

How these scores play together, or why brutal honesty makes for great candidate experience

Once we built out both sets of scores, we could finally answer a question that interviewers new to our platform pose regularly: Is there a tension between delivering great candidate experience and being brutally honest with candidates about their performance?

After all, doesn’t candidate experience depend, in part, on NOT giving harsh, unvarnished feedback? And doesn’t failing people create a bad experience that makes them not want to return?

Fortunately, for hard questions like these, we have the data! Below is a graph of average candidate experience score as a function of interviewer accuracy, representing data from over 1,000 interviewers. As you can see, the candidate experience score peaks right at the point where interviewers are neither too strict nor too lenient but are, in Goldilocks terms, just right. It drops off pretty dramatically on either side after that.⁴

How we used these scores to significantly improve candidate experience over time

In the graph at the bottom of the “candidate experience score” section, you can see that starting in Q1 of 2022, our aggregate candidate experience score improved dramatically. This was intentional. Once we realized how well this metric predicted both candidate experience and attrition, we started to actively drive it up. Here’s what we did:

Systematically, ruthlessly paused interviewers who fell below the bar
Created full transparency and visibility around both scores
Built automatic throttling
Started running monthly onboarding meetings to explain our business model to interviewers and share how both candidate experience and accurate vetting are critical to interviewing.io’s success

Ruthlessly pausing interviewers below the bar

Once we built both scores, we started to use them to filter out underperforming interviewers.

For candidate experience, we chose a line below which candidate attrition was unacceptable. If an interviewer hovered near that line, we would reach out and encourage them to improve their performance. If an interviewer consistently fell below that line, then we paused their ability to do interviews.

We took a similar approach to accuracy. If an interviewer was too strict or too lenient (below a -1 or above a 1), we issued a warning and ultimately paused. In a pinch, we’d keep the strict interviewers and just let go of the lenient ones.
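Put together, the two rules amount to a small decision function. The sketch below uses the accuracy thresholds from this post (below -1 is too strict, above 1 is too lenient), while the candidate experience floor and the number of warnings before a pause are hypothetical parameters, since the exact values are not given here.

```python
def accuracy_bucket(accuracy):
    """Bucket an accuracy score using the -1 and 1 thresholds from the post."""
    if accuracy < -1:
        return "too strict"
    if accuracy > 1:
        return "too lenient"
    return "OK"

def next_action(experience_score, accuracy, floor, prior_warnings, max_warnings=2):
    """
    floor: hypothetical minimum acceptable candidate experience score.
    max_warnings: hypothetical number of warnings allowed before pausing.
    Returns "ok", "warn", or "pause".
    """
    below_bar = experience_score < floor or accuracy_bucket(accuracy) != "OK"
    if not below_bar:
        return "ok"
    if prior_warnings >= max_warnings:
        return "pause"
    return "warn"

# Toy examples with a hypothetical floor of 4.0.
next_action(4.5, 0, floor=4.0, prior_warnings=0)  # "ok"
next_action(4.5, 2, floor=4.0, prior_warnings=0)  # "warn": too lenient
next_action(3.5, 0, floor=4.0, prior_warnings=2)  # "pause": repeatedly below the floor
```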

It’s hard to frame letting people go in a really positive light, but it’s important to distinguish between the two abilities in question here. As often seen in sports, the best players are not necessarily the best coaches. Similarly, we were forced to cut some really stellar engineers who, regrettably, didn’t have the disposition or interpersonal skills to be great mentors.

As we started to roll out our new, ruthless policy of letting go of underperformers, we realized that we had made a mistake. Though interviewers’ scores were visible to us, they were not visible to interviewers, so when we started issuing warnings, they were often blindsided and not in a place where they could take the news well.

Transparency and visibility

To fix the problem and not blindside our interviewers, we made a dashboard where interviewers could see both their “candidate experience” and “accuracy” scores in near-real time (for accuracy, we bucketed the score into “too strict,” “OK,” and “too lenient”). We also shared how we calculate both scores.

Exposing the scores, as well as the logic behind them, helped our interviewer community buy into the scoring system and to see firsthand that it was built on data and not our subjective opinions. It also made it so no one was shocked when we reached out about their performance. Over time, we started encouraging interviewers to come to us if they saw their score slipping… before we had the chance to reach out to them.

Automatic throttling

Having our ops team manually review interviewer scores every week and manually pause underperforming interviewers obviously didn’t scale, so we decided to build a throttling system that would put our best interviewers in front of the most customers. That meant adjusting our interview scheduling algorithm to prioritize interviewers who maintained higher ratings. Those rated lower still got assignments but only if higher-rated interviewers weren’t available for the same type of interview or time slot. The net result is the top 20% of our interviewers administering 80% of practice interviews.
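The core of that prioritization can be sketched in a few lines: among the interviewers who can take a given interview type in a given slot, pick the one with the highest score. The record fields, slot labels, and function name below are hypothetical, and a real scheduler would also need tie-breaking, load limits, and room for new interviewers without a score yet.

```python
def pick_interviewer(interviewers, interview_type, slot):
    """
    interviewers: list of dicts with hypothetical keys
        "id", "score", "types" (set of interview types), "free_slots" (set of slots)
    Returns the id of the highest-scored interviewer who is available for this
    type and slot, or None if nobody is available.
    """
    available = [
        i for i in interviewers
        if interview_type in i["types"] and slot in i["free_slots"]
    ]
    if not available:
        return None
    return max(available, key=lambda i: i["score"])["id"]

# Toy roster (hypothetical).
roster = [
    {"id": "top", "score": 4.9, "types": {"algorithms"}, "free_slots": {"mon-10"}},
    {"id": "mid", "score": 4.2, "types": {"algorithms", "system design"},
     "free_slots": {"mon-10", "tue-14"}},
]
pick_interviewer(roster, "algorithms", "mon-10")     # "top"
pick_interviewer(roster, "system design", "tue-14")  # "mid": the only one available
```

Lower-rated interviewers still get picked, but only when nobody rated higher is free for that type and slot, which is how a small share of interviewers ends up with most of the volume.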

New interviewers still get the chance to prove their mettle, of course.

Monthly onboarding

As we spoke to more and more interviewers, it became clear that many of them didn’t actually know what interviewing.io did or why candidate experience and accuracy mattered so much to our bottom line.

When I first started the company, I’d spend 30-60 minutes on the phone with every new interviewer, explaining how hiring was broken and how they played a critical part in fixing it. Then, at some point, we started growing fast, and I stopped.

It was clearly time to start again. Though it was no longer sustainable for me to do 1:1 calls, I now run a 1.25-hour monthly onboarding session along with our ops team, and every interviewer who’s joined our community since the last session has to attend.

In these sessions, we discuss a bunch of the stuff covered in this post. We talk about the mission of the company, why the two metrics matter, how to deliver harsh feedback, how important it is not to sugarcoat things, and what it means to be a good interviewer.

Explaining what we do, why we do it, and how high our expectations are made a step function difference in the quality of our interviewer community. It also reaffirmed my faith in humanity. If you want people to do a great job, tell them what a great job looks like and why it matters. And if you’re fortunate enough to have talented people working for you who are committed to your mission, they will knock it out of the park.

In the next post (coming soon), we’ll talk about how to bring some of these approaches and metrics to your own interview process. Because you are a single company, not a dedicated interview platform, you may not be able to replicate what we did exactly, but you can get pretty close with just a little bit of work.

Big thank you to Maxim Massenkoff, Liz Graves, and Richard Graves for their contributions to and help with this post.

Footnotes

¹ Thankfully, our candidates are better. On average, interviewing.io top performers perform 3X better in technical interviews than candidates from other sources (70% pass rate compared to 20-25%, the latter being the industry standard at companies with a high bar).
² We’re proud to say that 70% of our users return to interviewing.io for their next job search.
³ While we could go higher than 70%, we didn’t want to have too many false negatives, which aren’t great for business or for candidate experience.
⁴ There are some notable peculiarities in the data as well. For instance, why do interviewers with an accuracy score of -3 get rated so much worse than interviewers who are even more strict? And why does candidate experience keep improving ever so slightly, as interviewers get more and more lenient, only to drop off again at the most lenient (accuracy = 4)? I expect these are functions of needing more data and that if we had, say, 10K interviewers in our ecosystem, the curve would be a lot more smooth.