AWS Serverless WebSockets — Introduction Around the Pitfalls

Nov 15, 2019

About a year ago at AWS re:Invent 2018, AWS published the long-awaited API Gateway WebSocket APIs, enabling a brand new class of applications to be built on the Serverless technologies powered by AWS Lambda. The WebSocket APIs complement the pre-existing REST API functionality, so applications with more demanding distributed real-time communication requirements can now easily be built on AWS without provisioning servers or managing containers. This was possible to some degree even before by utilizing AWS AppSync or AWS IoT, though both are slanted towards specific approaches with their own idiosyncrasies: GraphQL and massive numbers of devices, respectively. The new approach is more suitable as a drop-in replacement for WebSockets as they have been used in traditional web and mobile development.

Stateful connections to transient serverless actors? How?

You might know Serverless as a good solution for applications that only require short processing times with little to no memory-stored state. At a glance, WebSockets are antithetical to this approach, as they require maintaining a connection over a prolonged period of time. The solution that AWS uses is to have API Gateway act as the stateful proxy between the client and message processing. The WebSocket connection itself extends only as far as API Gateway, whereas forming the connection, disconnecting, and each message sent by the client trigger a Lambda function that runs only as long as necessary to handle that one task. Each WebSocket connection then has an individual callback URL in API Gateway that is used to push messages back to its corresponding client. In practical terms this push channel is abstracted away and handled by the AWS SDK.
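As a rough illustration of this split, the sketch below routes the three kinds of events by route key and derives the endpoint used for pushing messages back. The `ConnectionStore` and `PushFn` abstractions are hypothetical and injected to keep the example self-contained. In a real handler the push would typically go through the AWS SDK (for example `postToConnection` on `ApiGatewayManagementApi`), and the `https://{domainName}/{stage}` endpoint format assumes the default API Gateway domain rather than a custom one.

```typescript
// Minimal slice of the event API Gateway passes to a WebSocket Lambda handler.
interface WebSocketEvent {
  requestContext: {
    routeKey: string; // "$connect", "$disconnect", "$default" or a custom route
    connectionId: string; // identifies the client connection held by API Gateway
    domainName: string;
    stage: string;
  };
  body?: string | null;
}

// Hypothetical abstractions (not AWS APIs), injected so the sketch runs on its own.
interface ConnectionStore {
  add(connectionId: string): Promise<void>;
  remove(connectionId: string): Promise<void>;
}
type PushFn = (endpoint: string, connectionId: string, data: string) => Promise<void>;

// Endpoint through which messages are pushed back to connected clients.
function callbackEndpoint(event: WebSocketEvent): string {
  const { domainName, stage } = event.requestContext;
  return `https://${domainName}/${stage}`;
}

function createHandler(store: ConnectionStore, push: PushFn) {
  return async (event: WebSocketEvent): Promise<{ statusCode: number }> => {
    const { routeKey, connectionId } = event.requestContext;
    switch (routeKey) {
      case "$connect":
        await store.add(connectionId); // remember the connection for later pushes
        break;
      case "$disconnect":
        await store.remove(connectionId);
        break;
      default:
        // Any other frame: here we simply echo it back to the sender.
        await push(callbackEndpoint(event), connectionId, event.body ?? "");
    }
    return { statusCode: 200 };
  };
}
```

Each invocation handles exactly one event and then exits; the only state that survives between invocations is whatever the store persists.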

Connection authorization

Serverless WebSocket APIs utilize the API Gateway authorization functionality in the same way as RESTful API Gateway APIs do. The WebSocket connection request must include some sort of credentials, which can be handled by a Lambda authorizer function that returns an AWS IAM policy describing the access rights of the connection. The authorizer can also create a request context filled with user metadata that is automatically included in all subsequent communication for the connection. This can reduce the required workload in the WebSocket frame handlers.

However, this approach requires that the authorization happens directly with the WebSocket connection request, with credentials usually tacked on to a header or query string. In other words, it’s not possible to accept all incoming connections and upgrade them to authenticated ones with a WebSocket frame later.
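A minimal sketch of such an authorizer follows. It reads a credential from the connection request and returns an IAM policy plus a context object. The `token` query parameter name and the `verify` function are assumptions made for illustration; the actual credential check depends entirely on your identity setup.

```typescript
// Minimal slice of the event passed to a request-based Lambda authorizer.
interface AuthorizerEvent {
  methodArn: string;
  queryStringParameters?: Record<string, string | undefined> | null;
  headers?: Record<string, string | undefined> | null;
}

interface AuthorizerResult {
  principalId: string;
  policyDocument: {
    Version: "2012-10-17";
    Statement: { Action: "execute-api:Invoke"; Effect: "Allow" | "Deny"; Resource: string }[];
  };
  // Metadata made available to later handlers for this connection.
  context?: Record<string, string | number | boolean>;
}

// Hypothetical credential check: returns user data, or null if the token is invalid.
type VerifyToken = (token: string) => { userId: string; role: string } | null;

function authorize(event: AuthorizerEvent, verify: VerifyToken): AuthorizerResult {
  // Assumption: the client sends ?token=... or an Authorization header on connect.
  const token = event.queryStringParameters?.token ?? event.headers?.Authorization ?? "";
  const user = token ? verify(token) : null;
  return {
    principalId: user ? user.userId : "anonymous",
    policyDocument: {
      Version: "2012-10-17",
      Statement: [
        { Action: "execute-api:Invoke", Effect: user ? "Allow" : "Deny", Resource: event.methodArn },
      ],
    },
    ...(user ? { context: { userId: user.userId, role: user.role } } : {}),
  };
}
```

Frame handlers can then read the metadata from `event.requestContext.authorizer` instead of re-validating credentials on every message.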

Persistence and reacting to change

A natural Serverless approach for making sure clients have timely up-to-date information is to decouple data modifications from pushing the updated data to the appropriate clients. Your data might be modified from multiple sources: WebSocket frame handlers, scheduled tasks, or even directly from a persistence console or command line scripts. You want your clients to get data updates pushed to them without having to duplicate the update propagation logic everywhere. An easy way to implement this is to use Amazon DynamoDB, AWS’s flavor of a NoSQL document database, as your persistence layer. This approach is also favored by the examples in the API Gateway WebSocket API feature announcement and the Serverless framework API Gateway WebSocket introduction. The most useful feature of DynamoDB in this context is DynamoDB streams, which can be set up to automatically trigger Lambda functions for all data modifications. This provides a natural avenue for propagating data updates to WebSocket clients.
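The propagation step could look roughly like the sketch below: a stream-triggered handler converts each record into a message and pushes it to every known connection. It assumes the stream is configured to include new item images, handles only string, number, and boolean attributes, and relies on hypothetical `listConnections` and `push` functions that are injected rather than taken from any AWS API.

```typescript
// DynamoDB's typed JSON, restricted to three attribute types for brevity.
type AttributeValue = { S: string } | { N: string } | { BOOL: boolean };
type Plain = Record<string, string | number | boolean>;

// Minimal slice of a DynamoDB Streams record as delivered to Lambda.
interface StreamRecord {
  eventName: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb: { Keys: Record<string, AttributeValue>; NewImage?: Record<string, AttributeValue> };
}
interface StreamEvent {
  Records: StreamRecord[];
}

interface UpdateMessage {
  action: "upsert" | "delete";
  key: Plain;
  item?: Plain;
}

// Convert typed attribute values into a plain object.
function unmarshal(image: Record<string, AttributeValue>): Plain {
  const out: Plain = {};
  for (const [name, value] of Object.entries(image)) {
    if ("S" in value) out[name] = value.S;
    else if ("N" in value) out[name] = Number(value.N);
    else out[name] = value.BOOL;
  }
  return out;
}

function toUpdateMessage(record: StreamRecord): UpdateMessage {
  const key = unmarshal(record.dynamodb.Keys);
  if (record.eventName === "REMOVE") return { action: "delete", key };
  const image = record.dynamodb.NewImage; // absent if the stream omits new images
  return image ? { action: "upsert", key, item: unmarshal(image) } : { action: "upsert", key };
}

// listConnections and push are hypothetical, injected dependencies.
function createStreamHandler(
  listConnections: () => Promise<string[]>,
  push: (connectionId: string, data: string) => Promise<void>
) {
  return async (event: StreamEvent): Promise<void> => {
    const connectionIds = await listConnections();
    for (const record of event.Records) {
      const payload = JSON.stringify(toUpdateMessage(record));
      await Promise.all(connectionIds.map((id) => push(id, payload)));
    }
  };
}
```

Because the handler reacts to the table rather than to a particular writer, updates made from a script or a scheduled task reach clients through the same path as updates made by frame handlers.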

Overall, DynamoDB has steadily improved its feature toolbox in critical areas: on-demand pricing cuts the need for cumbersome autoscaling configuration, and Point In Time Recovery makes its backup solutions more robust. It has become a formidable alternative to traditional persistence solutions in production systems. It’s also quite synergistic with the Serverless framework, since you can fully provision your database as infrastructure code directly in your Serverless framework configuration file, so the tables end up being part of the same CloudFormation stack as the code that requires them.

Do or do not, there is no retry — real time data processing

There are a couple of possible pitfalls when using these kinds of Lambda function chains for handling real-time events. A big one revolves around automatic Lambda function retries. Usually, if a Lambda function fails as a result of an unhandled exception, it is retried automatically several times, using a strategy that varies by the action that triggered the function. More importantly, this also happens if a Lambda function fails due to a timeout. While the maximum Lambda invocation time is 15 minutes, a more practical limit is often somewhere around 30 to 120 seconds (30 seconds being the maximum timeout for API Gateway REST APIs; WebSocket handling is more relaxed in this sense). These retries happen with identical input parameters, which can easily contain data that is already stale. What’s even worse is that DynamoDB streams batch update events whenever possible, so if your batch of 10 events contains a single one that consistently makes the Lambda handler fail, the side effects of the 9 other update events might get propagated repeatedly. Another aggravating factor is that at least the early versions of the Node.js API Gateway WebSocket client communication API could occasionally hang indefinitely for certain connections, causing Lambda timeouts.

You need to take these things into account in your implementation. If you’re using DynamoDB streams, make the actual stream handler a very lightweight one that simply invokes additional Lambda functions, one per event in the batch, so that one bad event cannot affect the whole application. Additionally, have circuit breakers in place that prevent Lambda function timeouts, and handle these cases explicitly. Note that by default a Node.js Lambda function does not finish executing while the event loop is not empty, even after you have called the callback (a non-empty event loop usually means looping background execution or hanging promises), so extra care is required when trying to implement foolproof circuit breakers.
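One possible shape for these two safeguards is sketched below: a `withTimeout` wrapper that acts as a simple circuit breaker, and a `processBatch` helper that handles every record independently so that a failing or hanging record does not fail the batch. Both names are illustrative. The `worker` argument stands in for whatever handles a single event, for instance an asynchronous invocation of another Lambda function. This is a sketch under those assumptions, not a complete retry strategy.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer once the race settles so it does not linger in the event loop.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Hypothetical per-record worker, e.g. a function that invokes another Lambda.
type RecordWorker<R> = (record: R) => Promise<void>;

// Process each record in isolation and report how many succeeded and failed.
async function processBatch<R>(
  records: R[],
  worker: RecordWorker<R>,
  timeoutMs: number
): Promise<{ ok: number; failed: number }> {
  const outcomes = await Promise.all(
    records.map(async (record) => {
      try {
        await withTimeout(worker(record), timeoutMs);
        return true;
      } catch {
        // Deliberately swallowed: the batch as a whole still succeeds,
        // so the healthy records are not retried along with the bad one.
        return false;
      }
    })
  );
  const ok = outcomes.filter(Boolean).length;
  return { ok, failed: outcomes.length - ok };
}
```

A timed-out `worker` promise is abandoned, not cancelled, so whatever it started may still be pending. In a Node.js Lambda function, setting `context.callbackWaitsForEmptyEventLoop = false` is one way to let the invocation end anyway, at the cost of leaving that work unfinished. Failed records should also be logged or routed somewhere visible, since swallowing them silently trades retries for data loss.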

Basic architecture

The foundation of a simple Serverless WebSocket service is as follows: forming connections and sending frames invoke frame handler Lambda functions, and database modifications invoke stream and event handlers, which push data back to the clients.

This results in a high number of Lambda invocations, but they are all very short-lived and should not be computation intensive. AWS Lambda pricing lends itself well to this kind of approach: the per-request rate is very cheap, while the rate for actual computation time (function runtime) is very expensive compared to alternatives. With serverless architectures, always aim for a high number of short tasks instead of a lower number of long-running tasks. If your use case requires long-running tasks (think longer than a couple of seconds), you should strongly consider more serverful approaches such as containers in an AWS Fargate or Kubernetes cluster.
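To make the trade-off concrete, here is a small back-of-the-envelope estimator. The rates in the usage example are illustrative assumptions that should be checked against current AWS pricing, and the calculation ignores the free tier and billing granularity.

```typescript
// Rates are inputs, not constants: check current AWS pricing before relying on them.
interface LambdaRates {
  perMillionRequests: number; // USD per one million invocations
  perGbSecond: number; // USD per GB-second of compute
}

function estimateCost(
  invocations: number,
  avgDurationMs: number,
  memoryMb: number,
  rates: LambdaRates
): number {
  const requestCost = (invocations / 1e6) * rates.perMillionRequests;
  const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
  return requestCost + gbSeconds * rates.perGbSecond;
}

// Illustrative rates (assumed, not authoritative).
const exampleRates: LambdaRates = { perMillionRequests: 0.2, perGbSecond: 0.0000166667 };

// One million invocations at 128 MB: short tasks versus long tasks.
const shortTasks = estimateCost(1e6, 100, 128, exampleRates);
const longTasks = estimateCost(1e6, 5000, 128, exampleRates);
```

With these assumed rates, a million 100 ms invocations come to roughly 0.41 USD, while a million 5 second invocations come to roughly 10.62 USD. The request charge is identical in both cases; the difference comes entirely from compute time.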

Another consideration when dealing with a high number of successive Lambda invocations is the traditional Serverless boogeyman: cold starts. Cold starts occur when a specific Lambda function is invoked for the first time in a while, or when more parallelization is required. In both cases AWS infrastructure needs to provision a new handler from scratch, which can take up to several seconds in extreme cases. The situation was especially bad for functions provisioned in a custom VPC, where the cold start required an Elastic Network Interface to be created, which could bump the total cold start time to over 10 seconds. The situation has steadily and markedly improved though, with a major breakthrough being VPC networking improvements that mostly eliminate this problem, a feature rolled out to roughly half the AWS regions at the time of writing. Anecdotally, I’ve supervised a production system built completely on Serverless WebSockets for over 6 months, and request processing latency has not been a problem at all, even before the latest performance improvements (Node.js 8 runtimes, in a private VPC).

Conclusion

AWS Serverless WebSockets are a robust and production-ready way to implement real-time web and mobile functionality, and are as close to a drop-in replacement for traditional WebSockets as is possible on a serverless architecture. There is a learning curve regarding pitfalls and differences, but if your application requires real-time communication with a usage pattern of a high number of short requests, and you want pay-as-you-go pricing, they are a very attractive technology. Scaling out will never be easier (up to a point, anyway).

The author is a full stack developer who has used serverless technologies for years, both for things they are good at and for things they are very bad at, which coincidentally is a good way to learn the best use cases.