What Is The Distributed Databases Problem?
The Distributed Databases Problem presents a scenario in which we have an SQL database and asks us to design a solution that allows us to add more machines once the first has reached capacity. In simple terms, we must find a way to distribute this database. The caveat is that we do not have access to any automated distribution tools. This problem demands consideration of the bottlenecks that may occur while manually distributing an SQL database.
An Example of the Distributed Databases Problem
How would you organize a SQL database like MySQL such that you can add more machines once your current ones reach maximum capacity, given that you do not have access to any automated tools for distributing the data?
How to Solve the Distributed Databases Problem
To handle growing capacity in our SQL-based database system without relying on automated distribution tools, we can take a manual sharding approach. This means we'll manually divide the data across multiple machines or servers, allowing us to scale the system and add more machines as needed.
SQL Databases
Let’s start by talking about SQL databases:
- SQL guarantees atomicity and isolation of transactions.
- SQL seamlessly supports multi-table queries (JOINs).
- SQL optimizes both indexes and queries at the DB level.
Before we go further, let’s clarify the functional requirements for this problem. Making an RDBMS (a relational database management system, the generalization of the loose term “SQL database”) horizontally scalable is, in general, a next-to-impossible task. For the purposes of scoping this [interview] conversation, let’s agree on what we should focus on, and what can be left outside the spotlight.
Specifically, there are several optimization directions:
1. Scale to support more data volume (capacity),
2. Scale to support more query throughput (TPS, transactions per second),
3. Scale to support faster queries (tail latency), and
4. Scale to enable broad cross-shard complex queries with JOINs and other cross-shard data transfer.
Option #4 is the holy grail, which we’d rather not touch today. Let’s assume, for this 45-minute discussion, that most high-throughput / low-latency queries only need to access the data that lives on one shard. The queries that require the data from multiple shards are allowed to be slow.
(A useful exception: if some small amount of metadata exists, we can keep this data copied on every single shard, duplicated N times.)
So let’s focus on #1 while considering the original problem statement of “you can add more machines once your current ones reach maximum capacity”.
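To make the metadata exception above concrete, here is a minimal sketch of keeping a small lookup table copied on every shard. The shard connections and the `countries` table are hypothetical, and in-memory SQLite databases stand in for what would really be separate MySQL servers.

```python
import sqlite3

# A minimal sketch: each "shard" is an in-memory SQLite database standing
# in for a separate MySQL server (hypothetical setup for illustration).
shards = [sqlite3.connect(":memory:") for _ in range(3)]

# Small, rarely-changing metadata (e.g. a country lookup table) is copied
# to every shard so that single-shard queries can JOIN against it locally.
METADATA_DDL = "CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)"
METADATA_ROWS = [("US", "United States"), ("DE", "Germany")]

for shard in shards:
    shard.execute(METADATA_DDL)
    shard.executemany("INSERT INTO countries VALUES (?, ?)", METADATA_ROWS)
    shard.commit()
```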
Sharding the database
Next, the question is: how do we go about sharding our database? First, we have to identify the sharding key: a column or attribute in our table that determines which shard each row lives on. If we can assume roughly even load, this key lets us distribute the data evenly across multiple machines. If some key values are hit much more often than others, we may instead need to distribute the data unevenly so that the load evens out. A common example of a sharding key is the user ID. For this question we aren’t given any information about the data we have, but it may be worthwhile to ask your interviewer for it.
Once we have identified the sharding key, we can employ consistent hashing as our sharding technique. Consistent hashing ensures that the distribution of data across shards is balanced and minimizes the amount of data that needs to be remapped when adding or removing machines from the system. With consistent hashing, we can achieve a scalable sharding strategy that can handle the increasing data volume.
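As an illustration, here is a minimal consistent-hashing sketch in Python. The shard names and the number of virtual nodes are made up for the example; a production ring would also have to handle machine weights, replication, and membership changes.

```python
import bisect
import hashlib

class HashRing:
    """A minimal consistent-hashing ring (illustrative, not production-ready)."""

    def __init__(self, machines, vnodes=100):
        # Each machine gets `vnodes` points on the ring so the keyspace is
        # split evenly and adding/removing a machine remaps only ~1/N of the data.
        self._ring = []  # sorted list of (hash, machine) points
        for machine in machines:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{machine}#{i}"), machine))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    def shard_for(self, sharding_key):
        """Walk clockwise from the key's hash to the first machine point."""
        idx = bisect.bisect(self._ring, (self._hash(sharding_key),)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["db-shard-1", "db-shard-2", "db-shard-3"])
print(ring.shard_for(42))  # the machine that owns sharding key 42
```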
When it comes to partitioning the data, we can split it based on our sharding key among the different machines in the network. Each machine will be responsible for storing a specific range or subset of data based on the sharding key value.
Keep in mind that we’ll have to adjust our SQL queries to include the sharding key in the WHERE clause. This ensures that the queries are routed to the appropriate machine holding the relevant data.
Now that we’ve determined the sharding strategy, we'll set up multiple instances of the SQL database, each running on a separate machine. Each instance will be responsible for storing and serving a specific shard of the data. We'll configure our application to interact with the appropriate database instance based on the sharding key, as in the sketch below.
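Here is a sketch of what that application-side routing could look like, reusing the `HashRing` from the earlier sketch. The `orders` table, the shard names, and the SQLite stand-ins are all hypothetical; the point is that the sharding key appears in the WHERE clause and picks exactly one shard.

```python
import sqlite3

# Hypothetical shard connections; in practice these would be MySQL servers.
connections = {name: sqlite3.connect(":memory:")
               for name in ("db-shard-1", "db-shard-2", "db-shard-3")}
for conn in connections.values():
    conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER, total REAL)")

def fetch_orders(user_id):
    # The WHERE clause filters on the sharding key (user_id), so the query
    # is answered entirely by the one shard that owns this user.
    shard = ring.shard_for(user_id)  # `ring` from the consistent-hashing sketch
    return connections[shard].execute(
        "SELECT id, total FROM orders WHERE user_id = ?", (user_id,)
    ).fetchall()

print(fetch_orders(42))  # routed to a single shard
```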
Load Balancer
To ensure that requests hit the correct shards, we introduce a load balancer into the system architecture. The load balancer acts as a traffic controller, receiving incoming requests and routing them to the correct database shards. In addition to routing the traffic correctly, if our consistent hashing approach keeps each data element on more than one shard, the load balancer can spread requests over these replicas, thus evening out the load and improving top-line performance.
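If we do keep each key on more than one shard, the load balancer's read-balancing step might look roughly like this. It assumes a replication factor of 2 (each key also lives on the next distinct machine along the ring) and reuses the `HashRing` sketch above, reaching into its internals for brevity.

```python
import bisect
import random

def replicas_for(ring, sharding_key, n=2):
    """Return the first n distinct machines clockwise from the key's hash."""
    start = bisect.bisect(ring._ring, (ring._hash(sharding_key),))
    machines = []
    for i in range(len(ring._ring)):
        machine = ring._ring[(start + i) % len(ring._ring)][1]
        if machine not in machines:
            machines.append(machine)
            if len(machines) == n:
                break
    return machines

def pick_read_shard(ring, sharding_key):
    # Writes go to the primary owner; reads are spread over the replicas.
    return random.choice(replicas_for(ring, sharding_key))

print(pick_read_shard(ring, 42))  # one of the two replicas of key 42
```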
Indexing
Even in a system with a distributed SQL database, optimizing query performance is still vital. We'll create appropriate indexes in our database schema to improve search and retrieval efficiency. As long as the requested data lives within one shard, a local DB index on this shard would improve query performance (e.g., looking for my WhatsApp messages to my friend Dima from July 2017, as long as the DB of messages is sharded by sender ID). For cross-shard queries, however, we will need far more sophisticated, custom, outside-the-DB indexes.
But what do we mean by indexing? In the context of an individual database (a single shard in our current design), indexing is the process of creating a data structure that enhances the speed and efficiency of data retrieval operations. An index is essentially a separate data structure that may contain a sorted copy of specific data fields, and/or reference these fields in other ways, which allows for faster lookup and retrieval of information.
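Within a single shard, this is just a regular database index. Here is a sketch using the WhatsApp-style example above: a hypothetical `messages` table sharded by sender ID, with SQLite standing in for one MySQL shard, and a composite index that the sender-to-recipient-by-month query can use.

```python
import sqlite3

# One shard's local database; the table is sharded by sender_id, so the
# "my messages to Dima from July 2017" query touches only this shard.
shard = sqlite3.connect(":memory:")
shard.execute("""
    CREATE TABLE messages (
        sender_id    INTEGER,
        recipient_id INTEGER,
        sent_at      TEXT,
        body         TEXT
    )
""")

# A local composite index: the DB can jump straight to one sender's
# messages to one recipient instead of scanning the whole table.
shard.execute(
    "CREATE INDEX idx_messages_sender_recipient_time "
    "ON messages (sender_id, recipient_id, sent_at)"
)

rows = shard.execute(
    "SELECT body FROM messages "
    "WHERE sender_id = ? AND recipient_id = ? "
    "  AND sent_at BETWEEN '2017-07-01' AND '2017-07-31'",
    (123, 456),
).fetchall()
```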
What about "indexing” the sharding key? Well, that’s a little different. Logically, it does mean creating an index specifically on the column or attribute that is used as the sharding key in a sharded database setup.
But we already decided to use consistent hashing for shard selection. So, yes, it is a custom data structure that maps sharding key values to the corresponding shard or machine, but it is not a database index; it’s “the ring” of consistent hashing. This “ring” data structure enables the load balancer to quickly determine the location of the data based on the sharding key value, which improves the efficiency of data retrieval operations and effectively acts as the index on this “column,” even though it lives not in the DB itself but inside the load balancers.
Here’s an example: if the sharding key is the user ID, and we want to retrieve all the data related to a specific user, querying the database using the user ID as a filter would be faster with the indexing in place. This allows the database system to quickly locate the shard or machine where the data for that user is stored, reducing the number of network requests and tail latency.
In summary, indexing the sharding key involves creating a separate data structure that helps optimize the process of locating and retrieving data stored in specific shards or machines based on the sharding key value.
Benefits
How does this benefit us? Realistically, this is a fairly advanced topic, as it will come into play once we are talking about cross-shard data retrieval. Indexing the sharding key in a sharded database setup primarily improves the performance of SELECT statements, specifically those that involve filtering or searching based on the sharding key. These select statements benefit from the index by enabling faster data retrieval from the relevant shards or machines. Note that, in a simple version, making use of this newly created index would require the user to write their SELECT statements while keeping in mind the very fact that the data is sharded; an “automagical” SQL query engine that runs queries cross-shards as needed is a far more complex problem for this margin to contain.
However, indexing the sharding key may not have a direct impact on the speed of UPDATE or MERGE statements. UPDATE and MERGE statements typically involve modifying or merging existing data rather than searching or filtering based on the sharding key. The performance of update and merge operations in a sharded database is influenced more by network latency, data synchronization, and other characteristics of the underlying database management system. Our design approach might be different if our optimization criterion had been faster data querying; since we are optimizing for capacity, however, the above should be good enough.
While this manual sharding approach requires more effort and management compared to automated distribution tools, it gives us flexibility and control over our database infrastructure. We can scale the system by adding more machines (up to a certain point) while still benefiting from the advantages of a SQL-based database. It’s important to note that the critical mass for this is quite low; after just a dozen or so machines, simply adding more instances may very well slow the whole system down, not speed it up. To resolve this, we rely on DBAs to not just add machines but to understand exactly what the query patterns are, configure the load accordingly, and tune the manual sharding to keep the data local for the most expensive and/or most frequent queries.
About interviewing.io
interviewing.io is a mock interview practice platform. We've hosted over 100K mock interviews, conducted by senior engineers from FAANG & other top companies. We've drawn on data from these interviews to bring you the best interview prep resource on the web.