MEDIUM
SYSTEM DESIGN

How to Solve Image Filter Service

Written By
Adam Bhula

What Is The Image Filter Service Problem?

The Image Filter Service problem asks us to design and implement a service that lets users upload images, apply filters to transform them, and receive a link to download the filtered results. Our objective is to provide this without letting the system become overloaded, which would lead to long processing times or undelivered download links. Through a combination of efficient image processing and effective handling of user requests, we can provide a reliable service. One of the challenges in developing a service like this is optimizing the image processing itself. Image processing tasks can be computationally intensive, especially when applying complex filters or working with large images. Ensuring that the service can handle a high volume of concurrent image processing requests while maintaining acceptable response times is the crux of this problem.

An Example of the Image Filter Service Problem

Create a service that allows users to upload an image and apply filters to it, and that then sends them a link to download the filtered image.

How to Solve the Image Filter Service Problem

To create an efficient image processing service, capable of handling a large number of requests and providing timely notifications, we need to consider the following steps:

Image Uploads and Storage:

Users will be able to upload one photo at a time, supporting popular image file formats like JPEG, PNG, and GIF. For an optimized storage solution we can use a hybrid approach. Store the actual image files in an object storage service such as Amazon S3, Google Cloud Storage, or Azure Blob Storage, benefiting from their scalability and durability. Simultaneously, capture and store the associated metadata in a database, including details like image ID, upload timestamp, user ID (if applicable), and file information. This hybrid storage approach combines the advantages of object storage for efficient file management with database storage for organized metadata retrieval and queries. It ensures a reliable and scalable system architecture for the image processing service.
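The hybrid layout described above can be sketched as a metadata record that points at an object key. This is a minimal illustration, not a definitive schema: the `uploads/` key prefix, the field names, and the `build_metadata` helper are assumptions made for the example, and the actual upload to S3 or another blob store is left out.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Supported formats from the text, mapped to a file extension (mapping is illustrative).
ALLOWED_TYPES = {"image/jpeg": "jpg", "image/png": "png", "image/gif": "gif"}


@dataclass(frozen=True)
class ImageMetadata:
    """Row stored in the metadata database; the bytes themselves live in object storage."""
    image_id: str
    object_key: str          # where the file lives in the blob store (hypothetical layout)
    content_type: str
    size_bytes: int
    uploaded_at: datetime
    user_id: Optional[str] = None


def build_metadata(content_type: str, size_bytes: int,
                   user_id: Optional[str] = None,
                   now: Optional[datetime] = None) -> ImageMetadata:
    """Validate the upload's format and derive the object key and metadata record."""
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported image type: {content_type}")
    image_id = uuid.uuid4().hex
    object_key = f"uploads/{image_id}.{ALLOWED_TYPES[content_type]}"
    return ImageMetadata(
        image_id=image_id,
        object_key=object_key,
        content_type=content_type,
        size_bytes=size_bytes,
        uploaded_at=now or datetime.now(timezone.utc),
        user_id=user_id,
    )
```

Because the image ID is generated per upload, two uploads of the same file simply produce two records, which matches the observation that occasional double-uploads are harmless.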

This part of the system, image upload, deserves an architectural diagram. As all requests are independent, and as there’s no harm in occasional, very infrequent, double-uploads of the same image, this part of the system can be sharded horizontally with no coordination between shards. The diagram for this part would include a load balancer, a frontend cluster to accept uploaded images (an “inverse CDN”), some autoscaling, some monitoring, and access to AWS S3 or other cheap blob storage. Of course, the same machines that accept uploaded images can also serve some application business logic. It is worth noting, however, that the image upload part alone can be implemented with little more than an nginx server and a lightweight post-upload job triggered by a Lua script.

Filter Options:

We can develop a modular filter system that allows for easy addition of new filters in the future. Filters can be implemented as predefined algorithms or user-defined equations. We can build on image processing libraries or frameworks that provide a wide range of filter options and flexibility. When choosing between filter implementations, we should weigh the trade-offs in performance, accuracy, and computational complexity.
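One way to make the filter system modular is a name-to-function registry, so that adding a filter means adding one decorated function. The sketch below is an illustration under simplifying assumptions: images are plain nested lists of RGB tuples rather than objects from a real imaging library, and the `register_filter` and `apply_filters` names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # rows of RGB pixels (toy representation)
Filter = Callable[[Image], Image]

FILTERS: Dict[str, Filter] = {}


def register_filter(name: str) -> Callable[[Filter], Filter]:
    """Decorator that adds a filter function to the registry under `name`."""
    def decorator(fn: Filter) -> Filter:
        FILTERS[name] = fn
        return fn
    return decorator


@register_filter("grayscale")
def grayscale(image: Image) -> Image:
    # Simple channel average; real libraries typically use weighted luminance.
    out: Image = []
    for row in image:
        new_row = []
        for r, g, b in row:
            avg = (r + g + b) // 3
            new_row.append((avg, avg, avg))
        out.append(new_row)
    return out


@register_filter("invert")
def invert(image: Image) -> Image:
    return [[(255 - r, 255 - g, 255 - b) for r, g, b in row] for row in image]


def apply_filters(image: Image, names: List[str]) -> Image:
    """Apply the named filters in order, failing fast on an unknown name."""
    for name in names:
        if name not in FILTERS:
            raise KeyError(f"unknown filter: {name}")
        image = FILTERS[name](image)
    return image
```

Each filter here returns a new image instead of mutating its input, so running the same job twice against the stored original yields the same result.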

Would we run into any issues in double-processing? In other words, would double uploads or double filter applications cause us problems? The answer is no, so the requirements for the Job Queue itself aren’t as strict.

Efficient Image Processing and Job Queue System:

Optimize the image processing algorithms to reduce latency by utilizing parallel processing techniques and hardware acceleration, such as GPU computing. Consider implementing a job queue system, such as RabbitMQ or Apache Kafka, to manage and distribute the processing workload across multiple workers. For our system, let’s go with Kafka.

Kafka is a distributed streaming platform known for its high throughput and fault-tolerant design. It is commonly used for real-time data streaming and processing. Kafka's architecture allows for scalable and fault-tolerant message processing, making it suitable for high-volume and distributed systems. In the context of image processing, Kafka can be used as a job queue system by treating image processing requests as events or messages. These events can be published to Kafka topics, and multiple worker instances can subscribe to these topics and process the messages concurrently. Kafka's partitioning and replication mechanisms ensure fault tolerance and scalability, making it suitable for handling large volumes of image processing requests. Overall, Kafka offers robust job queue capabilities, allowing for efficient distribution and parallel execution of image processing tasks.

Note that this logic is entering a very deep pocket of the SysDesign interview conversation, and it is highly unlikely that navigating it is a requirement for an L4/L5 SD interview. For L4/L5 it would suffice to say that, since occasional, very rare, double-processing is not a problem, a RabbitMQ-based solution is also a very feasible option, and, in practice, both Kafka and RabbitMQ (or even a PostgreSQL table with an “autoincrement” column that would store the set of tasks to complete!) are perfectly defensible here.

The downside of using Kafka here could be that messages within a partition are processed sequentially: a partition is consumed by one worker in a consumer group at a time, in order. Thus, the Kafka-natural way may result in less than full utilization of the workers that perform the jobs (i.e., that apply the desired filters to images). Overall, since in this particular problem there is no harm in occasionally re-running the same job more than once, one can argue that a RabbitMQ-based solution might be more effective.

The “failure mode” of Kafka would be “fast” tasks stuck in the queue behind a “slow” one in a particularly unlucky Kafka partition. A possible solution might be to use separate Kafka topics for “fast” and “slow” tasks. This approach makes it relatively painless to scale the service to large images/videos in future, which will clearly require far more processing time. This example just highlights that, in production, when prompt job completion is required, Kafka may not be the best solution in the long run — but in an interview setting it is certainly good enough for an L4/L5 conversation.
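The fast/slow split can be sketched as a small routing function that the producer calls before publishing a job. Everything specific here is an assumption for illustration: the topic names, the 5 MiB size threshold, and the set of "expensive" filter names are hypothetical, and the actual publish call to Kafka is omitted.

```python
from typing import Iterable

# Hypothetical topic names and thresholds; tune them from real latency measurements.
FAST_TOPIC = "filter-jobs-fast"
SLOW_TOPIC = "filter-jobs-slow"
SLOW_SIZE_THRESHOLD_BYTES = 5 * 1024 * 1024
SLOW_FILTERS = frozenset({"large_radius_blur"})  # illustrative "expensive" filter


def choose_topic(size_bytes: int, filters: Iterable[str]) -> str:
    """Route a job to the slow topic if the image is large or any filter is expensive."""
    if size_bytes > SLOW_SIZE_THRESHOLD_BYTES:
        return SLOW_TOPIC
    if any(name in SLOW_FILTERS for name in filters):
        return SLOW_TOPIC
    return FAST_TOPIC
```

With this split, a slow job can only delay other slow jobs in its partition, and each topic's consumer group can be scaled independently.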

The job queue system enables efficient and scalable processing by decoupling image processing from the API endpoint, allowing tasks to be processed asynchronously. We provide users with a mechanism to retrieve their filtered images, such as generating unique download links or associating the results with unique identifiers.
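The decoupling described above can be illustrated with an in-process stand-in for the queue. This is only a sketch: it uses Python's standard `queue` and `threading` modules instead of Kafka or RabbitMQ, the `run_workers` helper is hypothetical, and the "processing" step is a placeholder string transformation. Results are keyed by job ID, so a job delivered twice just overwrites its own result.

```python
import queue
import threading
from typing import Dict, Iterable, Optional, Tuple

Job = Tuple[str, str]  # (job_id, payload); payload stands in for an image reference


def run_workers(jobs: Iterable[Job], num_workers: int = 4) -> Dict[str, str]:
    """Process jobs on a pool of worker threads and return results keyed by job ID."""
    job_queue: "queue.Queue[Optional[Job]]" = queue.Queue()
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            job = job_queue.get()
            if job is None:            # sentinel: no more work for this worker
                job_queue.task_done()
                return
            job_id, payload = job
            processed = payload.upper()  # placeholder for applying filters
            with lock:
                results[job_id] = processed  # duplicate deliveries overwrite harmlessly
            job_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)            # one sentinel per worker
    job_queue.join()
    for t in threads:
        t.join()
    return results
```

In the real service the API endpoint would only enqueue the job and return immediately, while separate worker processes consume from the broker.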

Expiration of Download Links:

Implement a mechanism to track the creation timestamp of each download link. Regularly check the expiration status of the links and remove expired links from the system. This ensures that download links for filtered images remain valid only for the three-month period this design assumes.
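A periodic cleanup pass over link records could look like the sketch below. It assumes that "three months" is approximated as 90 days and that links are held in a simple mapping from link ID to creation time; in practice this would be a query against the metadata database, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

LINK_TTL = timedelta(days=90)  # assumption: three months approximated as 90 days


def is_expired(created_at: datetime, now: datetime) -> bool:
    """A link is expired once its age reaches the TTL."""
    return now - created_at >= LINK_TTL


def purge_expired(links: Dict[str, datetime], now: datetime) -> Dict[str, datetime]:
    """Return only the links that are still valid at time `now`."""
    return {
        link_id: created_at
        for link_id, created_at in links.items()
        if not is_expired(created_at, now)
    }
```

The same `is_expired` check can also run at download time, so a link stops working on schedule even if the cleanup job is delayed.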

Scalability and Performance:

We design the system to handle roughly 100k requests per day, which is slightly over one request per second on average, maybe ~three requests per second at peak. Utilize horizontal scaling by deploying the service on multiple servers or by utilizing cloud-based infrastructure that can automatically scale based on demand. Implement load balancing techniques to distribute incoming requests evenly across available resources. When it comes to letting users download images, utilize caching mechanisms, such as a content delivery network (CDN) or in-memory caching, to improve end-user latency and reduce processing time for frequently accessed images. CDNs also help us to save on costs when data blobs are large.
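The request-rate estimate above is simple arithmetic, sketched here for checking. The daily volume comes from the text; the peak multiplier of 3 is an assumption chosen to land in the same ballpark as the "~three requests per second" figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
REQUESTS_PER_DAY = 100_000       # from the stated requirement
PEAK_MULTIPLIER = 3              # assumption: peak traffic is about 3x the average


def average_rps(requests_per_day: int) -> float:
    """Average requests per second over a full day."""
    return requests_per_day / SECONDS_PER_DAY


avg = average_rps(REQUESTS_PER_DAY)
peak = avg * PEAK_MULTIPLIER
print(f"average: {avg:.2f} rps, estimated peak: {peak:.2f} rps")
```

At this volume the request rate itself is modest; the per-image processing cost is what drives the number of workers needed.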

Notification System:

Implement a notification system that sends notifications to users once their images are processed. Utilize scalable messaging services or queues, such as Amazon SNS, Google Cloud Pub/Sub, or Azure Service Bus, to efficiently handle the delivery of notifications. Store user preferences for notification channels (e.g., email, push notifications, SMS) and deliver notifications accordingly. Note: If you have reached this part of the interview, congratulate yourself — you most likely have nailed it by this point.
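Routing a "your image is ready" notification by stored preference can be sketched as a dispatch table. This is illustrative only: the sender functions return strings instead of calling a real email, push, or SMS provider, the default of email when no preference is stored is an assumption, and all names are hypothetical.

```python
from typing import Callable, Dict, List

Sender = Callable[[str, str], str]


def send_email(user_id: str, link: str) -> str:
    return f"email to {user_id}: {link}"   # placeholder for a real email call


def send_push(user_id: str, link: str) -> str:
    return f"push to {user_id}: {link}"    # placeholder for a real push call


def send_sms(user_id: str, link: str) -> str:
    return f"sms to {user_id}: {link}"     # placeholder for a real SMS call


SENDERS: Dict[str, Sender] = {"email": send_email, "push": send_push, "sms": send_sms}
DEFAULT_CHANNELS = ["email"]  # assumption: fall back to email if nothing is stored


def notify(user_id: str, link: str, preferences: Dict[str, List[str]]) -> List[str]:
    """Send the download link over each of the user's preferred channels."""
    channels = preferences.get(user_id, DEFAULT_CHANNELS)
    return [SENDERS[ch](user_id, link) for ch in channels if ch in SENDERS]
```

In the full design, `notify` would be invoked by a consumer of the messaging service once a worker reports that the filtered image is stored.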
