Guest post originally published on Linkerd’s blog by Eliza Weisman

Linkerd 2.11 is here and with it are some cool new updates. One I am particularly excited about (full disclosure: I worked on it), is retries for HTTP requests with bodies. Linkerd has supported HTTP retries since version 2.2, but until now, we would only retry requests without bodies. Retrying requests with bodies is especially important for anyone using Linkerd with gRPC. Since all gRPC requests are HTTP/2 POST requests with bodies, this feature enables retries to be configured for gRPC traffic.

Retrying a request with a body may sound simple (just send the body again, right?), but it’s not that straightforward. In order to send a body again, the entire body has to be buffered in memory until the original request completes. This means that proxies will have to use more memory to store those bodies, and buffering the body increases latency. We would like to be able to retry these requests while keeping the impact on latency and proxy memory usage as low as possible.

Additionally, some requests, like client-streaming requests in HTTP/2 and Transfer-Encoding: chunked requests in HTTP/1.1, can have long bodies that are sent in multiple pieces. If the proxy were to buffer the entire request body before forwarding the request to the server, it would have to wait for the body to complete — potentially introducing significant latency. And, in many cases, the server might expect to process those request bodies chunk-by-chunk. Imagine uploading a multi-gigabyte video file or a client-streaming gRPC request where the client pushes events to the server as they occur. Waiting to buffer the entire body before forwarding it could break the behavior that the server expects. Instead, we want to be able to forward each chunk as it’s received while also buffering it in the proxy in case the request needs to be retried.

To reduce the overhead of buffering body data, we also want to minimize copying data from one buffer to another (i.e. memcpy calls). This can be quite time-consuming when there’s a lot of data and we want to avoid unnecessary memory allocations (i.e. malloc calls).

In addition, we want to make sure we’re correctly handling potential edge cases. What if the server responds with an error before the body stream ends? We might have to retry the request before we’ve received the whole body from the client. Or, what if the client sends a longer body than the request’s Content-Length header? A standards-compliant client shouldn’t do this, but it might happen due to a bug or (if the request came from an untrusted client outside the cluster) an HTTP request smuggling attack. If this occurs, we want to ensure that the proxy doesn’t use potentially unbounded amounts of memory.

The bottom line is that there are a lot of potential challenges and edge cases involved in buffering and re-sending request bodies. Fortunately, we can use some excellent libraries from the Rust ecosystem to help solve them. In particular, the bytes and http-body crates (both written by emeritus Linkerd maintainers), were vital parts of our implementation.

Buffering

The bytes crate provides an implementation of reference-counted byte buffers. This is very useful because it allows us to clone chunks of body data without having to copy all the bytes into a new array.

The Rust standard library’s growable array type, Vec, is represented simply as a pointer to an array in memory, plus a length. This representation is simple and lightweight. However, it means that if we want to clone a Vec of bytes representing an HTTP body chunk, we have to do this by allocating a new array and copying all the bytes from the existing array into it. This is quite time-consuming when there’s a large amount of data.

Cloning buffers: the slow way

To solve this problem the bytes crate provides the Bytes type, a reference-counted byte buffer. Where a Vec is a pointer to an array and a length, a Bytes is a pointer to an array, a length, and an atomic reference count. This means that multiple owning references to the byte buffer can exist at the same time, as it will only be deallocated when all those references go away. Now, cloning buffers only requires incrementing a reference count and copying a pointer. That’s much faster than copying all the data — 100% malloc and memcpy free!

Cloning buffers: the fast way

Hyper, the Rust HTTP implementation used by the proxy, can be configured to represent HTTP body data using the Bytes type. This means that we can take a chunk of data received from the client, clone it by incrementing its reference count, and send one clone to the server, while retaining the other for a potential retry. But because both clones of the buffer are just pointers to the same array in memory, we don’t have to copy all of the bytes.

This solves half of the problem. But, recall that a body might be streaming: we might receive several chunks of data over time. How do we add new data to our buffer?

The bytes crate’s BytesMut::extend_from_slice method appends the data from a slice of bytes to a mutable byte buffer. But, because we are using the shared, reference-counted Bytes type, we can’t use this. Mutating the contents of the byte buffer while it might be referenced elsewhere could result in a data race, so the shared Bytes type does not provide this API. We could copy the data from the Bytes into a new buffer that we can freely mutate, but this would mean allocating and copying the actual array of bytes, which defeats the purpose of using reference-counted buffers in the first place.

Instead, our solution was to implement a new type called BufList. A BufList is a Vec of multiple Bytes buffers, in the order that they were received. Now we can append a new chunk of data to the buffered body by simply cloning the Bytes and appending it to the BufList’s vector. By doing this, we can avoid copying the bytes, and (most of the time) avoid allocating as well. If the BufList’s Vec array is at capacity it may need to be resized to append a new chunk, but because it consists only of a pointer to each byte buffer, rather than all the bytes received as part of the request body, the array that needs to be allocated and copied is quite small, reducing the overhead of the allocation significantly.

Appending multiple buffers to a BufList

The http-body crate contains Rust traits providing interfaces that can be implemented types that represent HTTP bodies and HTTP body data chunks. Implementing these traits allows us to provide our BufList type to hyper as body data when forwarding the request. Although the byte buffers are not contiguous in memory, we can still send them on the network with a single system call by using vectored writes.

Retrying requests

Now that we have an efficient strategy for buffering body data, how do we actually retry requests? First, we wrote a new type, called ReplayBody, that implements http-body’s Body trait. A ReplayBody exists in one of two states: it is either receiving the initial request body from the client or playing back a buffered body for a retry. When we are receiving the initial body from the client, we lazily append each chunk to a BufList while forwarding that data to the server. If the request fails and we have to retry it, we switch to the “replay” state and play back the data buffered in the BufList. After replaying from the buffer, if the original request body has not completed (i.e. the server returned an error before we received the end of the body), we can switch back to the initial state, and continue forwarding from the received body while buffering new data. This means that from the client’s perspective, everything is fine — the retry is performed completely transparently, even when the body has not yet completed.

Next, we need to determine whether a given request can be retried. The proxy already has logic for determining whether a request is retryable based on the ServiceProfile configuration and retry budget. Previously, that logic would always determine requests that have bodies to be non-retryable. All we do is modify this logic to allow retrying requests with bodies. To avoid potentially unbounded buffering, we set a maximum Content-Length that the proxy will buffer for retries. Requests with Content-Length headers over 64 KB are never considered retryable.

Additionally, as a safeguard against situations where the Content-Length header is incorrect and the body is longer than its advertised length, we added a check in the ReplayBody type that stops buffering if the buffer ever exceeds the limit. If this occurs, any previously buffered data is discarded and the request will no longer be retried. This means that the proxy cannot run out of memory due to a bug in the client, or a malicious request from the outside world.

Summary

Retries are one of Linkerd’s most important reliability features. Before Linkerd 2.11, though, only requests without bodies could be retried, limiting the cases where this feature could be used. In 2.11, however, we added support for retrying requests with bodies.

Retrying requests with bodies involves some interesting implementation challenges, especially when we take streaming bodies into consideration.

In this post, we looked at how the Linkerd proxy minimizes the performance overhead of buffering request bodies by reducing copying and allocation. We also discussed how the proxy determines which requests can be retried, and some of the edge cases that had to be taken into consideration.

We hope you’re as excited as we are for this new Linkerd feature. You can try it out yourself by upgrading to Linkerd 2.11 (if you haven’t already) and enabling retries on routes with request bodies!