How to Solve File Parsing

Written By
Adam Bhula

What is the File Parsing Problem?

The File Parsing problem asks us to design a file system for a baseball database website given a list of constraints. The challenge in this problem is identifying an appropriate naming convention to not exceed the 1000 file limit per directory as well as being able to handle large data inputs efficiently per day and distributing this workload across multiple computers.

An Example of the File Parsing Problem

Given an inefficient file structure, how would you store data to efficiently look up the query? Also how would you alter this if you had many computers available?

Some more context to the problem:

So you've just started to work at a new startup that's called Internet Baseball Database and their customers have started to complain that the features aren't very good and that the site is really slow, so you've been you've been told to come on in and help clean everything up. So the idea is that this company gets a regular data feed from major league baseball that has data that they're going to want to display on their site. Basically what they are sending every day is an update of what happened in each of the baseball games. It's a very simple text file with tabs, it's going to have just column fields and it's going to say the date that the game was played, the home team, the away team, it's going to have the batter and for right now we'll just say the outcome.

How the site currently works is that you get this text file every day that contains all of these lines, and the owner of the site is very paranoid, so he doesn't want to put it into a database because people are out to get him. so he stores it on a file system. So on the file system it'll have the data and then he'll put the year and then he'll put the day and then he'll just have the date for that day. The other thing is that he invented his own file system that's very inefficient and can only hold a thousand files in every directory, so that's why he can't just put any general date or names because at some point there'll be too many directories or files in each directory, so he split it up some way. When someone comes to the site, right now the only thing they can do is look up a certain date and look at all of the information for the games played on that day. The owner started to get some comments like "wouldn't it be nice to be able to look up other things?". For example, wouldn't it be great if you could just look up a particular player and see some of the stats for that player? As you might imagine, using this technique of storing the data, that's a bit inefficient.

So, your overall task is to come up with a way, given the constraints that you have to store the data in flat files and a file system with no more than a thousand per directory, find some way to represent the data that would enable searching for a particular player and getting some basic statistics.

How to Solve the File Parsing Problem

In the problem statement, the Internet Baseball Database is a startup that receives daily updates from Major League Baseball. These updates contain information about game outcomes, players, and their statistics. However, the site has been facing complaints about slow performance and limited features. To address these issues, we can propose a new architecture.

The daily update files will be stored in the following directory structure: /data/yyyy/mm-dd/data.txt. Each file contains the game outcomes with columns for date, home team, away team, batter, and outcome. This directory structure allows for easy management and retrieval of data.

To calculate player statistics, a MapReduce approach will be used. This involves processing the data in three steps: mapping, shuffling and sorting, and reducing. The mapping phase extracts relevant information like player IDs and outcomes. The shuffling and sorting phase then groups and sorts the data based on player IDs. Finally, the reducing phase calculates aggregated statistics for each player. This approach ensures accurate and efficient computation of player statistics. The final aggregated statistics are emitted for each player.

In order to efficiently organize player data, a hierarchical directory structure based on player names will be implemented. The player names will be broken down into multiple levels, such as the first letter of the last name or a combination of the first letter of the last name and the first letter of the first name.

For example, if we have a player named “Adams Arthur” whose last name starts with “A” and first name starts with “Ad”, their player data file will be stored at the location /data/players/A/Ad/Adams.txt (the data file can include their full name for better readability, but that depends on file name size restrictions with the system we are using).

This hierarchical structure allows for easy storage and retrieval of player data based on their names. By dividing the player names into multiple levels, we can manage a large number of players while keeping the number of files within each directory manageable. Each directory will contain a subset of players whose names share the same initials, ensuring that the directory size remains within the file system’s limits.

When a player lookup is requested, the system will determine the appropriate directory based on the first letter of the last name and the first two letters of the first name. It will then retrieve the player statistics file within that directory, load and parse the file, and provide the requested player’s statistics for the lookup.

Originally, our approach was to use a single letter from either the first name or last name; however, this could result in directories exceeding our limit. Storing player data in directories based on the first letter of their last name and first two letters of their names allows for a more balanced distribution of files across the directory structure.

There is a potential issue we could encounter in our system. In the event that two players have the same first and last name, this could create confusion in the storage and retrieval system. To address this issue, additional measures can be implemented to ensure uniqueness and distinguishability of player data. Some possible approaches include: assigning unique identifiers, requesting additional user input, providing contextual information such as team/year, etc., to narrow down the search. Alternatively, we can provide both results alongside their differentiating attributes to identify their differences and allow users to discern it themselves.

To ensure up-to-date statistics, a daily job will be scheduled to process the new update file. MapReduce will be used to calculate player statistics for the new update, and the player statistics files will be updated by appending or replacing the relevant statistics. This ensures that the player statistics are regularly updated to reflect the latest game data.

In conclusion, this redesigned architecture for the Internet Baseball Database addresses the issues of slow performance and limited features. By leveraging MapReduce for the player statistics calculation and organizing the data in a structured file system, the system can efficiently handle the daily data feed and provide enhanced player lookup capabilities. Users will have access to comprehensive and up-to-date player statistics, enhancing their overall experience on the site.

About interviewing.io

interviewing.io is a mock interview practice platform. We've hosted over 100K mock interviews, conducted by senior engineers from FAANG & other top companies. We've drawn on data from these interviews to bring you the best interview prep resource on the web.

We know exactly what to do and say to get the company, title, and salary you want.

Interview prep and job hunting are chaos and pain. We can help. Really.