What is the Design WhatsApp Problem?
The Design WhatsApp problem asks us to design a messaging app system and client that supports a number of requirements. The problem is broad by nature, so it's important not to get lost in all the features and spread yourself too thin when answering it. Because the question is so broad, pay extra attention to the requirements laid out by the interviewer. A key area to focus on will be ensuring that no messages are lost.
An Example of the Design WhatsApp Problem
Design a message app system and client (e.g., WhatsApp) supporting the following requirements:
- 1:1 messaging
- Broadcasting (a few people can write; others can read)
- Group messaging (everyone can write, but there are owners/moderators)
- Message send/receive confirmation (acknowledgment) to make sure no message is lost between clients
- Image/media/file sharing
How to Solve the Design WhatsApp Problem
Aside from basic topics such as the messaging server, there are many topics we can cover. To save time during a system design interview, let’s first focus on the functional requirements. This exercise will also help us clarify the problem statement. WhatsApp is a broad problem, and each interviewer will have different goals and priorities in mind, so it is even more important here than in other questions to first clarify what we are solving for.
User-facing features (close to the FRs in this case) to clarify include:
- Online status
- Offline messaging, and storing the message
- Push notifications
- Location sharing
- Registration
- Invite links
- SMS/email confirmation
- Group management
- Bio lines
- “Is typing …”
- Statuses
- Profile pictures
- Encryption / security / privacy
Note:
When it comes to storing the messages, even though for 1:1 communication it may be desirable to create a solution in which all messages live on users’ devices, it is hard to scale to channels and groups. So, unless the problem statement is specifically about designing an app like "Signal" and not WhatsApp, let’s assume our servers do store all the messages. This also makes connecting new user devices easier.
Moreover, if you are feeling ambitious, and/or if this is an L6+ conversation, there are deeper clarifications to make, and they may be worth making during the FR/NFR clarifications phase.
- Lost Messages: How do we handle lost messages? Is this sometimes acceptable or never okay?
- We would hope to not lose any messages. We surface the delivery state to the user with ticks: one tick means the message has been sent from the user to the servers, a second tick means it has been delivered to the recipient, and the ticks turning blue means it has been read. During an outage or under high load, we can display a loading symbol on the message that the user is attempting to send. If the message fails to reach the server, we display a cross symbol and give the user an option to retry.
- Message order: Should each user see the messages in the same order?
- Ideally, the order should be preserved. But it’s not the end of the world if in a busy group multiple users’ feeds do not 100% agree with each other about which message came first if they were sent around the same time. In 1:1 conversations, of course, we expect the order of messages to be preserved (although there might be some wiggle room for multiple devices, see below).
- History: Hot/Cold. Hot when a particular device was offline for a short while; Cold (long) when adding a new device or powering on a device that was offline for ~years.
- If the history is stored on our servers, delivering most recent conversations to a new user device should not be a problem. If the user wants to pull earlier messages, they can scroll up and it’s acceptable to show a loading symbol and a small wait of a few seconds to pull the next batch of messages (30 or so).
- Multiple devices: What if I have WhatsApp open on my phone, laptop (browser), and tablet? How do the messages appear everywhere?
- Synchronizing multiple devices later on is a separate challenge, but it boils down to sending my own messages to “myself” on other devices.
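The tick states discussed under "Lost Messages" can be thought of as a small state machine. The sketch below is one possible model, not WhatsApp's actual implementation; the `Status` names and the `advance` helper are hypothetical. It assumes that status events may arrive late or out of order, so a message never moves backwards.

```python
from enum import IntEnum


class Status(IntEnum):
    """Delivery states, ordered so that a larger value means further along."""
    FAILED = -1    # cross symbol: offer a retry
    PENDING = 0    # loading symbol: not yet confirmed by the server
    SENT = 1       # one tick: the server has the message
    DELIVERED = 2  # two ticks: the recipient's device has it
    READ = 3       # blue ticks: the recipient opened it


def advance(current: Status, event: Status) -> Status:
    """Apply a status event without ever moving a message backwards."""
    if current is Status.FAILED:
        # From a failed send, only a retry (back to PENDING) is allowed.
        return Status.PENDING if event is Status.PENDING else current
    if event is Status.FAILED:
        # Only a send that the server never confirmed can fail.
        return Status.FAILED if current is Status.PENDING else current
    return max(current, event)
```

With this model, a late "delivered" event that arrives after "read" is simply ignored, which keeps the ticks stable on every device.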
For non-functional requirements, it would be useful to mention scalability (can add machines to serve more users / more messages), durability, and fault tolerance (to keep delivering and to not lose messages even if our service is partially down).
High-level architecture
Here’s an overview of our high-level architecture. It’s useful to keep in mind that, although this represents a hierarchy for implementation, it does not represent an order of priority for discussion in a real interview.
Most interviewers would want to get down to the critical path, the “life of a message,” i.e., how the messages travel through the system to ultimately get delivered from user A to user B. This is the part of the conversation where message routing and message queues come into play, both of which we will cover below.
With this said, the high-level architecture — which most interviewers would only want to spend a few minutes talking about — would include approximately the following components:
- User Management
- Handle user registration and authentication.
- Maintain user profiles, including contact lists and group memberships.
- Group Management Service
- Admin controls.
- Remove users.
- Add users.
- Change group name/profile picture.
- Provide moderator privileges.
- Messaging Server
- Responsible for handling message routing and delivery.
- Supports both group messaging and 1:1 messaging.
- Manages message queues for offline users and ensures message delivery upon their availability.
- Connected Users Registry Service (message routing)
- Lets Messaging Servers know whether other users (the recipients of messages) are online at a given time.
- Note: Be sure to not go too deep into this component, as it’s a more low-level technical one compared to the ones below. A strong interviewer may derail the conversation here, and you’d lose points for not mentioning stuff like the Storage or the Notifications service. Roll on with whatever is more comfortable for you; mentioning this later on, once you get to the details, works too.
- Message Storage
- Stores messages temporarily for offline users.
- Provides message history for users to retrieve past conversations.
- Acknowledgment Mechanism
- Implements a confirmation mechanism to ensure message delivery and receipt acknowledgment.
- The server tracks acknowledgments from clients and resends messages if not acknowledged within a certain timeframe.
- Image/File Sharing
- Implements functionality to upload, store, and retrieve images and files.
- Assigns unique identifiers to each uploaded file for efficient referencing in messages.
- Notification Service
- Sends push notifications to mobile devices and real-time updates to web-based clients.
- Notifies users about new messages, message delivery, and other relevant activities.
- Client Applications
- Provide user interfaces for sending/receiving messages, managing contacts, and joining groups.
- Allow users to send text messages, images, and files.
- Implement real-time updates for incoming messages and notifications.
- Display message delivery status per communication with the Acknowledgements service.
Overall Flow
Here's how the overall flow of the system could look:
- User registration and authentication are handled by the User Management component. Users create accounts, provide necessary details, and authenticate themselves to access the messaging system.
- Once authenticated, users can send/receive messages using the client application. They can choose between 1:1 messaging or group messaging. For 1:1 messaging, users select a recipient from their contact list. For group messaging, users join specific group servers and send messages to all group members.
- When a user sends a message, the client application communicates with the Messaging Server. The server receives the message, stores it temporarily for offline users (in a queue), and routes it to the intended recipients or group members.
- The server tracks message acknowledgments from clients. Upon receiving an acknowledgment, the server marks the message as delivered. If an acknowledgment is not received within a specific timeframe, the server resends the message to ensure delivery.
- The Message Storage component stores messages temporarily for offline users and maintains message history for users to retrieve past conversations.
- Image/File sharing is supported by the system. Users can upload images/files, which are assigned unique identifiers. These identifiers are then included in messages, allowing recipients to download and view the shared content.
- The Notification Service sends push notifications to mobile devices and real-time updates to web-based clients. Users receive notifications about new messages, message delivery, and other relevant activities.
- Client applications continuously check for incoming messages and display real-time updates. When a new message arrives, the client application receives it from the server and displays it to the user. The application also shows the delivery status of sent messages, indicating whether they are sent, delivered, or read.
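The acknowledge-or-resend step in this flow could be sketched roughly as follows. This is a minimal in-memory illustration under assumptions: the `AckTracker` class, its method names, and the 5-second timeout are all hypothetical, and a real server would persist this state and cap the number of retries.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AckTracker:
    """Tracks unacknowledged messages and reports which are due for a resend."""
    resend_after: float = 5.0                      # seconds; an assumed value
    _pending: dict = field(default_factory=dict)   # message_id -> last send time

    def sent(self, message_id: str, now: Optional[float] = None) -> None:
        # Record (or refresh) the time at which the message was last sent.
        self._pending[message_id] = time.monotonic() if now is None else now

    def acked(self, message_id: str) -> None:
        # Duplicate acks are harmless: resends mean an ack can arrive twice.
        self._pending.pop(message_id, None)

    def due_for_resend(self, now: Optional[float] = None) -> list:
        now = time.monotonic() if now is None else now
        return [m for m, t in self._pending.items() if now - t >= self.resend_after]
```

Because a resend can race with a slow acknowledgment, the receiving client should deduplicate by message ID so that the same message is never displayed twice.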
Now let’s go into more detail about each component:
Notification system:
We want a scalable and reliable notification system that can handle delivery of notifications to millions of users. In practice, the vast majority of push notifications will be sent to Apple and Android devices, and both of these ecosystems have mature APIs. On our side, we just need to scale the fanout part, so that if a user is not presently online and in the app, a message to them does, by default, result in a push notification sent their way. Handling this correctly intersects a lot with the ultimate message delivery path. If we end up using some kind of a publish-subscribe paradigm, the service that sends out push notifications can be one of the services on the message consuming side, i.e., one of the services that is decoupled from the sender of the message. If this is our design, this particular consumer can be referred to as the Notifications Service, although it does not have to be factored out into its own dedicated service with alternate designs.
Image and file handling:
To handle images and files in the messaging app system, it is important to set certain parameters and implement appropriate storage mechanisms. First, a maximum file size limit should be defined, with separate limits for images, videos, and other files. For this problem we can assume a maximum image size of 5 MB. Additionally, it is crucial to define acceptable image formats and implement server-side validation to prevent the sharing of malicious files that could compromise the system's security.
To efficiently store and retrieve images and files, scalable and performant storage mechanisms should be implemented. One recommended approach is to keep the files in a distributed object store and, where read latency matters, serve them through a content delivery network (CDN). Among the available object stores, Amazon S3 (Simple Storage Service) stands out as a reliable and feature-rich service.
Amazon S3 provides a highly scalable, durable, and secure object storage service. Fronted by a CDN, it can deliver images and files quickly and reliably to users across different geographical locations. Later on, we only refer to images by their IDs, so that the message payloads themselves are kept small.
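As a rough illustration of the server-side validation and ID assignment described above, the sketch below checks the assumed 5 MB limit and the leading "magic" bytes of an upload before handing back an identifier. The function name `accept_image` and the PNG/JPEG allow-list are assumptions made for this example; signature checks are only a first line of defense and do not replace proper content scanning.

```python
import uuid

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # the 5 MB limit assumed in the text

# Leading "magic" bytes of the formats we choose to accept (an assumed allow-list).
ALLOWED_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def accept_image(data: bytes) -> tuple:
    """Validate an upload and return (media_id, detected_format).

    Raises ValueError for oversized files or unrecognized formats.
    """
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 5 MB limit")
    for signature, fmt in ALLOWED_SIGNATURES.items():
        if data.startswith(signature):
            # The ID, not the bytes, is what later travels inside messages.
            return uuid.uuid4().hex, fmt
    raise ValueError("unsupported or unrecognized image format")
```

The returned ID would then be used as the object key in storage and as the reference embedded in the message.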
Message Length and Size:
We can define a maximum message length to prevent abuse of the service and ensure efficient storage and transmission. For example, we can set a maximum limit of 4096 characters per message. The maximum size of a message payload is then the sum of the text, the metadata, and any attachment references. Because images (up to 5 MB each) are uploaded separately and referenced by ID, the message payload itself stays small: roughly the encoded text plus a modest amount of metadata.
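To make the arithmetic concrete, here is a small sketch that enforces the 4096-character limit and computes the worst-case size of the text on the wire. It assumes that "characters" means Unicode code points and that text is sent as UTF-8, which uses at most 4 bytes per code point; the `check_text` helper is hypothetical.

```python
MAX_MESSAGE_CHARS = 4096     # limit suggested in the text
MAX_UTF8_BYTES_PER_CHAR = 4  # UTF-8 encodes any code point in at most 4 bytes

# Worst-case size of the text body: 4096 * 4 = 16,384 bytes (16 KiB).
MAX_TEXT_BYTES = MAX_MESSAGE_CHARS * MAX_UTF8_BYTES_PER_CHAR


def check_text(text: str) -> int:
    """Return the UTF-8 encoded size in bytes, or raise if the text is too long."""
    if len(text) > MAX_MESSAGE_CHARS:
        raise ValueError("message exceeds the character limit")
    return len(text.encode("utf-8"))
```

Even in the worst case, the text is about 16 KiB, which is tiny next to a 5 MB image. That is the reason for sending attachments by reference.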
API Design for Sending Messages:
For the API design of a messaging app, you would typically have the following API endpoints for posting and deleting messages:
Post 1:1 message API:
{"sender": "sender_auth_token","recipient": "recipient_id", "message": "message_content", "client_timestamp": "message_timestamp"}
Note: Internally, we use server timestamps, but it is nice to know the client-side one. Obviously, we do not “trust” client-sent timestamps, because they can be anything.
For group messages:
{"sender": "sender_auth_token", "recipient_group": "group_id", "message": "message_content", "client_timestamp": "message_timestamp"}
Response:
{ "messageId": "unique_message_id", "status": "success", "server_timestamp": "message_timestamp"}
Note: the server may well return the server-set timestamp for the message, so that, moving on, any “seen X seconds/minutes/hours ago” label reflects reality, not the potentially faulty clock on the client.
Description: This API endpoint allows a user to post a new message to one or more recipients, which can be either individual users or a group (the server will then be responsible for figuring out who are the group members, as well as whether you have the permission to post to this group). The request body includes the sender’s auth token, the recipient ID or the ID of the group to send the message to, the message content, and the [client] timestamp. The API returns a unique message ID and a status indicating the success of the message reaching the server. We can display a loading symbol until the server has given a response and then display a checkmark; the “blue ticks” would appear later, as the confirmation of the message being read would come from a different service, at a later time.
This API design ensures that both one-on-one and group messaging capabilities are supported in the application.
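A server-side handler for this endpoint might look roughly like the sketch below. It only illustrates the request and response shapes given above: the function name `handle_send` and the `reason` field on errors are hypothetical, and token verification, permission checks, and persistence are deliberately left out.

```python
import time
import uuid
from typing import Optional


def handle_send(request: dict, now: Optional[float] = None) -> dict:
    """Validate a send request and build the response described above."""
    has_user = "recipient" in request
    has_group = "recipient_group" in request
    if has_user == has_group:
        # Either both or neither were supplied.
        return {"status": "error", "reason": "need exactly one of recipient / recipient_group"}
    if not request.get("message"):
        return {"status": "error", "reason": "empty message"}
    # The client timestamp is kept for reference only; the server clock is authoritative.
    server_ts = time.time() if now is None else now
    return {
        "messageId": uuid.uuid4().hex,
        "status": "success",
        "server_timestamp": server_ts,
    }
```

The client would show the loading symbol until this response arrives, then switch to a single tick on `"status": "success"`.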
Message Routing and Delivery:
First and foremost, we will have more than one Messaging Server in our system, and Messaging Servers will be talking to one another, because, at least in the case of 1:1 messaging when both parties are online, delivering the message directly is the best way to get the job done. This calls for the Connected Users Registry Service, which is effectively a database that records which Messaging Server each currently online user device has an open connection to. Push notifications aside, a client device, whether it’s the browser or a native app, can only receive data from the Messaging Server it is connected to. Therefore, if user A is connected to shard P of the Messaging Service, and user B is connected to shard Q, the message has to travel from P to Q in order to get delivered from A to B. And P has to figure out which other shard of the Messaging Service, if any, has an open connection to user B (i.e., which shard is Q).
Upon receiving a message from a sender, the Messaging Server determines the recipients. For 1:1 messaging, the server identifies the recipient's client device or devices and delivers the message through the Messaging Servers those devices are connected to. The message also needs to be stored in a database. Moreover, since the recipient may be offline, the notification service should receive the message too, so that the recipient can be notified.
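The registry lookup described above can be pictured as a mapping from users to their connected devices and the servers holding those connections. The class below is an in-memory stand-in with hypothetical names; a production registry would be a shared, replicated store with connection expiry.

```python
from collections import defaultdict


class ConnectedUsersRegistry:
    """In-memory stand-in for the registry: user -> {device_id: server_id}."""

    def __init__(self) -> None:
        self._connections = defaultdict(dict)

    def connect(self, user_id: str, device_id: str, server_id: str) -> None:
        self._connections[user_id][device_id] = server_id

    def disconnect(self, user_id: str, device_id: str) -> None:
        devices = self._connections.get(user_id, {})
        devices.pop(device_id, None)
        if not devices:
            # No devices left: the user is offline as far as routing goes.
            self._connections.pop(user_id, None)

    def servers_for(self, user_id: str) -> set:
        """Servers that must receive a message addressed to this user."""
        return set(self._connections.get(user_id, {}).values())
```

An empty result from `servers_for` is the signal that the recipient is offline and that only storage and push notification apply.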
We have a good use case for the pub-sub bus here, as it becomes crucial to separate the producer of the message from the consumers of it. Even in the case of 1:1 messaging where both users are online, things may go wrong; for example, the user’s device may have run out of battery at the very moment they were supposed to be seeing the message. And, in a well-designed system, things going wrong should not result in lost messages.
Thus, the publish-subscribe bus, or some logical equivalent, would be the backbone of our system. For group messages, the currently online group members should ideally receive the message right away, but the server that has originally received the messages may not be the best choice for doing the job of identifying the other ~hundreds of servers to which all the now-online group members are connected. Therefore, for group messages, it is better for the very message delivery to happen on the consumer side of this pub-sub bus, decoupled from the producer side.
For 1:1 messaging, while it is feasible for the messaging server that has received the message to locate the now-online recipient’s device(s) and directly deliver the message to them through the Messaging Servers to which those devices are connected, it is still the pub-sub bus that comes out on top when it comes to consistent message delivery. To begin with, the message will need to be stored in a database, and it will need to be delivered to the Notification Service, so that the user devices that have not received the message right away can be push-notified of it.
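The decoupling argument can be illustrated with a toy in-process bus on which storage, live delivery, and notifications are three independent consumers of the same published message. Everything here (the `Bus` class, the topic name, the `online` set) is hypothetical scaffolding; a real deployment would use a message broker rather than direct function calls.

```python
from collections import defaultdict
from typing import Callable


class Bus:
    """Toy in-process publish-subscribe bus; a real system would use a broker."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # The producer does not know or care who consumes the message.
        for handler in self._subscribers.get(topic, []):
            handler(message)


stored, delivered, notified = [], [], []
online = {"bob"}  # hypothetical presence data


def deliver_live(message: dict) -> None:
    if message["to"] in online:
        delivered.append(message)


def push_notify(message: dict) -> None:
    if message["to"] not in online:
        notified.append(message)


bus = Bus()
bus.subscribe("messages", stored.append)   # storage consumer
bus.subscribe("messages", deliver_live)    # live-delivery consumer
bus.subscribe("messages", push_notify)     # notification consumer

bus.publish("messages", {"id": "m1", "to": "bob", "text": "hi"})
bus.publish("messages", {"id": "m2", "to": "carol", "text": "hello"})
```

Both messages end up in `stored`, while `m1` is delivered live and `m2` triggers a push notification, all without the producer knowing which consumers exist.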
As an advanced idea, we can shard users across Messaging Servers with an algorithm like consistent hashing, which distributes connections and message load evenly across multiple servers or partitions. This also simplifies the work of the Registry service: shard P can compute which shard user B would be connected to and ask shard Q directly, “do you have user B connected to you?”, instead of asking the Registry, “if user B is online now, which shard(s) should I go talk to?”
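For readers who have not implemented it before, a consistent hash ring can be sketched in a few lines. This is a generic textbook version, not something prescribed by the problem: the `HashRing` class, the use of SHA-256, and the 100 virtual nodes per server are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # SHA-256 is used only for its even spread, not for security.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Maps user IDs to servers; adding a server moves only a fraction of users."""

    def __init__(self, servers, vnodes: int = 100) -> None:
        # Each server owns many points ("virtual nodes") to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash, wrapping around.
        index = bisect.bisect_right(self._points, _hash(user_id)) % len(self._ring)
        return self._ring[index][1]
```

The property that matters for routing is stability: when a shard is added, a user either stays on the same shard or moves to the new one, so most open connections are unaffected.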
Real-time Communication:
To enable real-time communication between the server and clients, we can utilize WebSockets between user devices and Messaging Servers. WebSockets provide full-duplex communication channels over a single TCP connection, enabling bidirectional real-time communication between clients and the server. By adopting WebSockets, the server can maintain persistent connections with clients, allowing for efficient and low-latency message delivery.
There should also be a response sent once a recipient has opened the message. We will record the time at which they read the message and consider this our “read_timestamp”. To promptly display the “two blue ticks” icon on the client device, we employ the same WebSocket that all connected users keep open with one of the shards of the Messaging Service. This, in part, implies that the Messaging Service is responsible for not only routing messages “inward,” “into the system,” but also for routing any and all system notifications “outward,” “from the system.” For instance, if a friend from your contact list has just joined WhatsApp, this notification is also something that would reach the user through the WebSocket that this user keeps open to some Messaging Server shard.
Note: At this point, be careful to not open more cans of worms, unless you are comfortable talking about them in detail. The truth is that the job of the Messaging Server has a lot more to do with the “outward” communication than with enabling users to send (post) messages. In particular, it may be beneficial to think of some “1:1 chat with the system user” that is “open” on every device of every user. Events, or “messages,” such as “your friend is online / your friend is offline” would go through this “system user chat,” and this “system user” will most likely be more chatty than all the other “human” users combined. Specifically, the “track the online status” problem alone is more difficult than routing all the 1:1 messages combined, in sheer traffic numbers. Other system notifications, such as the blue ticks, “is typing …”, etc., would also travel through this “system user” “chat.” This is a fairly advanced (L6+) topic though, hence the warning!
Message Queues and Publish-Subscribe Mechanisms:
To handle message distribution to multiple recipients and group members, we can implement message queues or publish-subscribe mechanisms. Message brokers such as RabbitMQ (an AMQP broker) provide reliable and scalable message buffering and delivery: messages are placed in queues and delivered to recipients when they come online. Publish-subscribe systems, such as Apache Kafka topics, allow for efficient broadcasting of messages to multiple subscribers: consumers subscribe to specific topics of interest and receive the messages published to those topics. In practice the two categories overlap to a large extent, and many products support both patterns. These mechanisms play a crucial role in decoupling the sender from the recipients, so that delivery remains reliable even when recipients are offline. Messages stay in the queue or topic until they are consumed (subject to the configured retention), which helps ensure that no messages are lost or missed.
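The "held until the recipient becomes available" behavior can be sketched as a per-recipient buffer from which a message is removed only once it has been acknowledged. The `OfflineQueues` class below is a hypothetical in-memory illustration, not the API of RabbitMQ, Kafka, or any other broker.

```python
from collections import defaultdict, deque


class OfflineQueues:
    """Per-recipient buffers that keep messages until they are acknowledged."""

    def __init__(self) -> None:
        self._queues = defaultdict(deque)

    def enqueue(self, user_id: str, message: dict) -> None:
        self._queues[user_id].append(message)

    def pending(self, user_id: str) -> list:
        """Messages to (re)send when the user comes online, oldest first."""
        return list(self._queues.get(user_id, ()))

    def ack(self, user_id: str, message_id: str) -> None:
        # Remove a message only after the recipient confirms receipt.
        queue = self._queues.get(user_id)
        if queue is None:
            return
        remaining = deque(m for m in queue if m["id"] != message_id)
        if remaining:
            self._queues[user_id] = remaining
        else:
            del self._queues[user_id]
```

Removing on acknowledgment rather than on send means that a connection dropping mid-delivery leaves the message in the buffer for the next attempt.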
There might be room on the diagram for another service, the “History Retrieval” service, so that hot and cold histories could be browsed seamlessly. The “cold,” long-term store is likely a set of S3 files packed into Zip archives per day / week / month / year. But these may take 10+ seconds to extract, and, when the users are scrolling up, 10+ seconds is too much. This is where the hot storage comes into play, so that the more recent messages are stored in a more readily available DB.
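A History Retrieval service along these lines might page through the hot store first and fall back to the cold archive only when the hot store runs out. The sketch below assumes that each message carries a per-conversation sequence number `seq`, that the hot store holds a contiguous run of the newest messages, and that `load_cold` is some hypothetical function that reads older messages, newest first, from the archive.

```python
PAGE_SIZE = 30  # "30 or so" messages per batch


def fetch_page(hot: list, load_cold, before_seq: int, page_size: int = PAGE_SIZE) -> list:
    """Return up to page_size messages with seq < before_seq, newest first.

    hot is a list of recent messages; load_cold(before_seq, limit) reads
    older messages from the (slow) archive.
    """
    page = sorted(
        (m for m in hot if m["seq"] < before_seq),
        key=lambda m: m["seq"],
        reverse=True,
    )[:page_size]
    if len(page) < page_size:
        # Hot store exhausted: top up from the slower cold archive.
        oldest = page[-1]["seq"] if page else before_seq
        page += load_cold(oldest, page_size - len(page))
    return page
```

Most scroll-ups are served from the hot store alone, so the multi-second archive extraction is only paid when a user digs deep into old history.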
Data replication, partitioning, and caching:
For data partitioning, we can implement strategies that distribute the message load across different servers or shards based on user identifiers or other relevant factors. This allows for horizontal scalability and enables the system to handle a large volume of messages and users. By partitioning the data, we can also optimize query performance and ensure that data retrieval is efficient.
For caching needs, Redis is a good option. It keeps frequently accessed data, such as user profiles and message metadata, in memory for very fast retrieval, and it can replicate that data across multiple servers so that reads stay fast and available. Redis is best treated as a cache in front of the primary store rather than as the durable source of truth, though: durability and fault tolerance for message history should come from the database and long-term storage discussed below.
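The usual way to put a cache in front of a database is the cache-aside pattern: look in the cache first, and on a miss load from the source of truth and remember the result for a while. The sketch below uses a plain dictionary with a time-to-live as a stand-in for Redis; the `TTLCache` class is hypothetical and does not mirror any Redis client API.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-process stand-in for a cache such as Redis (not its API)."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[str], object],
                    now: Optional[float] = None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit
        value = loader(key)          # cache miss: go to the source of truth
        self._entries[key] = (now + self._ttl, value)
        return value
```

The time-to-live bounds how stale a cached profile can get; explicit invalidation on profile updates would tighten that further.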
Database Design:
Our messaging app system utilizes a relational database to store and manage message data. The database schema is designed to efficiently store message content, metadata, sender/receiver information, and timestamps. The schema includes tables for messages, users, groups, and other relevant entities, along with appropriate relationships and constraints.
To ensure fast retrieval of messages, the database employs indexing techniques on frequently queried fields such as message IDs, sender/receiver IDs, and timestamps. This optimizes the database query performance and enables quick access to message data. Additionally, the database design includes proper data partitioning and sharding strategies to distribute the message data across multiple database nodes, ensuring scalability and improved performance.
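As a concrete but simplified picture of such a schema, the sketch below creates a users table, a 1:1 messages table, and a composite index for the "latest messages for this recipient" query. It runs on SQLite from the Python standard library purely so that it is self-contained; the table and column names are assumptions, and group tables are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE messages (
    id           INTEGER PRIMARY KEY,
    sender_id    INTEGER NOT NULL REFERENCES users(id),
    recipient_id INTEGER NOT NULL REFERENCES users(id),
    body         TEXT NOT NULL,
    server_ts    INTEGER NOT NULL   -- server-assigned, in milliseconds
);
-- Supports "latest messages for this recipient" without a full table scan.
CREATE INDEX idx_messages_recipient_ts ON messages (recipient_id, server_ts);
""")

conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO messages (sender_id, recipient_id, body, server_ts) VALUES (?, ?, ?, ?)",
    [(1, 2, "hi", 1000), (1, 2, "are you there?", 2000), (2, 1, "yes", 3000)],
)

# Newest-first page of messages addressed to bob (user 2).
latest_for_bob = conn.execute(
    "SELECT body FROM messages WHERE recipient_id = ? ORDER BY server_ts DESC LIMIT 30",
    (2,),
).fetchall()
```

Putting `recipient_id` first and `server_ts` second in the index matches the shape of the query: filter by recipient, then read in time order.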
When it comes to managing the messaging app's data, PostgreSQL as the database management system is a solid choice. It’s a mature and reliable option that makes sure our data is stored securely and consistently. PostgreSQL supports partitioning, which allows us to divide a large table into smaller, more manageable chunks based on specific criteria. This improves query performance and maintenance operations by efficiently organizing the data.
One important thing to note is that the relational SQL database should not be used as the cold storage for the contents of the messages that users send to one another. Metadata, such as users, groups, friends lists, etc., may well live in a SQL DB in the long run. The “hot storage” of messages may or may not be in an RDBMS (some Cassandra or Redis or DynamoDB could do just fine). The “cold storage” should be far closer to the “Zipped archives on S3” approach than to using the DB.
To handle a growing number of users and messages, we've designed the database to scale effectively. We use techniques like data partitioning and sharding, which involve dividing the data into smaller, more manageable chunks. This allows us to distribute the message data across multiple database nodes, enabling faster processing and improved performance as the system grows. PostgreSQL has built-in features for partitioning, or we can employ specialized methods like consistent hashing. In terms of services, we can take advantage of managed database solutions like Amazon RDS for PostgreSQL or Google Cloud SQL for PostgreSQL. These services handle backups, scaling, and high availability for us, so we can focus on building the messaging app without worrying about the infrastructure. They offer robust performance and durability, and both are trusted and widely used in the industry.
It’s important to note, however, that PostgreSQL does not natively support automatic sharding, where data is automatically distributed across multiple database nodes based on a sharding key. If automatic sharding is a critical requirement for our messaging app, we should consider using distributed database solutions like Apache Cassandra or Google Cloud Spanner.
Similarly, choosing PostgreSQL as our database solution does involve some other trade-offs. While it provides strong consistency and reliability, it may face challenges when handling extremely large amounts of data or highly write-intensive workloads. In those cases, NoSQL databases like Apache Cassandra or Amazon DynamoDB might be more suitable.
In summary, our database design combines the power of PostgreSQL with techniques like partitioning and sharding to ensure efficient data management and scalability. With managed database services like Amazon RDS or Google Cloud SQL, we can offload the infrastructure management and focus on building a reliable and high-performing messaging app for our users. For the purposes of this design, we assume that manual sharding is sufficient; if it turns out not to be, the distributed databases mentioned above are the fallback.
About interviewing.io
interviewing.io is a mock interview practice platform. We've hosted over 100K mock interviews, conducted by senior engineers from FAANG & other top companies. We've drawn on data from these interviews to bring you the best interview prep resource on the web.