AWS Simple Queue Service (SQS)

Scope: Region

Features

Apps PULL messages from a queue, process them, and DELETE them. If a message is not deleted by a consumer, it is returned to the queue after the visibility timeout expires.

  • FIFO queues preserve order and ensure exactly-once processing. They also support deduplication.
  • Visibility timeout
  • Retention period - max 14 days, by default 4 days.
  • Delivery delay
  • Redrive allow policy
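
The pull-process-delete loop described above can be sketched with a boto3-style client. This is a minimal sketch, not the official consumer pattern: the queue URL and handler are hypothetical, and `sqs` is assumed to be anything exposing the standard `receive_message`/`delete_message` calls.

```python
def consume(sqs, queue_url, handle, max_batches=1):
    """Pull messages, process each one, and delete it on success.

    `sqs` is assumed to be a boto3 SQS client (or any object with the
    same receive_message/delete_message interface); `handle` is the
    app's processing callback. Messages whose handler raises are NOT
    deleted, so they reappear after the visibility timeout expires.
    """
    processed = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,   # up to 10 messages per call
            WaitTimeSeconds=20,       # long polling
        )
        for msg in resp.get("Messages", []):
            try:
                handle(msg["Body"])
            except Exception:
                # Leave it in flight; it returns to the queue after
                # the visibility timeout expires.
                continue
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed
```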

Redrive

  • Moves messages from the DLQ back to the main queue for re-processing.
  • A low value of maxReceiveCount can move messages to the dead-letter queue before consumers of the main queue get enough attempts at processing them.
  • Set it high enough to allow for retries in the main queue before a message is moved to the DLQ.
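
The redrive policy is configured on the main queue as a JSON attribute. A minimal sketch, assuming a hypothetical DLQ ARN; with boto3, this JSON string would be passed to `set_queue_attributes` as the `RedrivePolicy` attribute:

```python
import json

# Hypothetical DLQ ARN for illustration.
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"

# After maxReceiveCount failed receives on the main queue, SQS moves
# the message to the DLQ. Keep the count high enough to allow
# in-queue retries first.
redrive_policy = json.dumps({
    "deadLetterTargetArn": dlq_arn,
    "maxReceiveCount": 5,
})

# With boto3:
# sqs.set_queue_attributes(QueueUrl=main_queue_url,
#                          Attributes={"RedrivePolicy": redrive_policy})
```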

Visibility Timeout

Ideally, set the visibility timeout to roughly the time it takes a consumer to process and delete a message.
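
When a message turns out to take longer than expected, a consumer can extend its invisibility window mid-processing via the ChangeMessageVisibility API. A sketch, assuming a boto3-style client:

```python
def extend_visibility(sqs, queue_url, receipt_handle, seconds):
    """Keep an in-flight message invisible for `seconds` more from now.

    `sqs` is assumed to be a boto3 SQS client. This resets the
    message's visibility clock rather than adding to it; the effective
    total cannot exceed SQS's 12-hour maximum.
    """
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,
    )
```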


Short Polling vs Long Polling

In short polling, SQS samples only a subset of its servers to fetch messages. It returns any messages found, or an empty response if none were found on the sampled servers. The next invocation samples other servers, finds the missed messages, and returns them. Eventually all messages get delivered, but empty (or false empty) responses are returned frequently.

WaitTimeSeconds=0 specified on the ReceiveMessage call, or ReceiveMessageWaitTimeSeconds=0 as a queue attribute, triggers short polling.

In long polling, SQS queries all of its servers before sending the response back, so an empty or false-empty response is rare. This also reduces the overall cost of using SQS queues.

WaitTimeSeconds>0 specified on the ReceiveMessage call, or ReceiveMessageWaitTimeSeconds>0 as a queue attribute, triggers long polling. The maximum long-polling wait time is 20 seconds.

WaitTimeSeconds on the ReceiveMessage call, if specified, takes precedence over the queue attribute.
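
Both modes go through the same ReceiveMessage call. A sketch (queue URL hypothetical, `sqs` assumed to be a boto3-style client) showing how the per-call WaitTimeSeconds selects the mode:

```python
def receive(sqs, queue_url, wait_seconds=20):
    """Fetch up to 10 messages from the queue.

    wait_seconds=0 forces short polling (a subset of servers is
    sampled, so a false-empty response is possible); wait_seconds>0
    (max 20) forces long polling. Either way, this per-call value
    takes precedence over the queue's ReceiveMessageWaitTimeSeconds
    attribute.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,
    )
    return resp.get("Messages", [])
```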

Auto-scaling

.. based on SQS queue depth (the ApproximateNumberOfMessagesVisible CloudWatch metric). This metric does not handle the extremes of auto-scaling quite as well - too few messages or too many messages.

In this case, a custom metric can be created, such as backlog per resource: messages in the queue divided by the number of current resources. A cron-based Lambda executing every 60s can compute and publish it. This custom metric then enables more accurate auto-scaling of EC2 or ECS.
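
A minimal sketch of that backlog-per-resource metric. The namespace and metric names here are made up, and `cloudwatch` is assumed to be a boto3-style CloudWatch client; a scheduled Lambda would call `publish_metric` every 60s:

```python
def backlog_per_worker(queue_depth, worker_count):
    """Messages in the queue divided by current resources."""
    # Guard against division by zero when the fleet has scaled to zero.
    return queue_depth / max(worker_count, 1)

def publish_metric(cloudwatch, queue_depth, worker_count):
    """Publish the custom metric so a scaling policy can target it."""
    cloudwatch.put_metric_data(
        Namespace="Custom/SQS",          # hypothetical namespace
        MetricData=[{
            "MetricName": "BacklogPerWorker",
            "Value": backlog_per_worker(queue_depth, worker_count),
            "Unit": "Count",
        }],
    )
```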

Backend Architecture

This blog post from AWS provides some insights into the SQS backend.

It consists of several micro-services, two of the most important being the front-end and the storage-backend.


Pre 2024

The front-end service accepts API calls, handles authorization, and forwards requests to the storage-backend service.

The system consists of multiple clusters. Each cluster has multiple hosts and handles multiple customer queues. A customer queue can also be assigned to multiple clusters.

So, hosts of the front-end service connect to multiple hosts of the storage-backend service. Using connection pools helps reuse some of these connections; however, given the scale of messages being transferred (100 million/sec at peak times), this can hit hardware limits on how many connections can be open at a time.

Post 2024

A new transfer protocol, frame1, was deployed between these two services. It uses multiplexing and can handle multiple requests/responses over a single connection. This reduced the load on the system, which can now handle 20% more requests with the same specs. Performance also improved, reducing the time a message spends “in the backend” by almost 20%.

Questions

  1. How does auto-scaling based on message age in queue behave?
