Recently, I have had the opportunity to work on some APIs that implement rate limiting for users. The idea we started with was a limit algorithm that resets after a period of time, e.g., one hour. While researching rate limiting a bit more, I found another algorithm that I wanted to learn more about: the generic cell rate algorithm (GCRA).
According to Wikipedia, this algorithm was designed for the traffic scheduler and shaper of Asynchronous Transfer Mode networks. It has since been repurposed as a way to limit the rate of traffic to services and APIs. In researching the algorithm, I found some of the explanations a bit too terse to understand from the start; I had to read some implementations to get a sense of how it worked before the explanations began to make sense. As a result, I'm writing this blog post to serve as a (hopefully) easier-to-understand introduction to the algorithm, broken into easier-to-consume pieces.
Sliding Window Technique
One technique that is useful to keep in mind is the sliding window, used for processing chunks of data from some sequence. There's more nuance to the technique than I will go into here, but a simple explanation is that, given a sequence, e.g., the alphabet {a, b, c, d, ..., x, y, z}, we can process groups of N items at a time, where each group overlaps with the previous one.
So let's say we're processing the alphabet 3 letters at a time; our windows would look like:
- {a, b, c}
- {b, c, d}
- {c, d, e}
- {d, e, f}
- ...
- {w, x, y}
- {x, y, z}
In Python we could do this like so:
import string
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar('T')

def simple_sliding_window(seq: Sequence[T]) -> Iterator[Tuple[T, T, T]]:
    yield from zip(seq, seq[1:], seq[2:])

for window in simple_sliding_window(string.ascii_lowercase):
    print(window)
# Prints
# ('a', 'b', 'c')
# ('b', 'c', 'd')
# ('c', 'd', 'e')
# ('d', 'e', 'f')
# ('e', 'f', 'g')
# ('f', 'g', 'h')
# ('g', 'h', 'i')
# ('h', 'i', 'j')
# ('i', 'j', 'k')
# ('j', 'k', 'l')
# ('k', 'l', 'm')
# ('l', 'm', 'n')
# ('m', 'n', 'o')
# ('n', 'o', 'p')
# ('o', 'p', 'q')
# ('p', 'q', 'r')
# ('q', 'r', 's')
# ('r', 's', 't')
# ('s', 't', 'u')
# ('t', 'u', 'v')
# ('u', 'v', 'w')
# ('v', 'w', 'x')
# ('w', 'x', 'y')
# ('x', 'y', 'z')
Naturally, we could make this more complex by having it produce windows of arbitrary size. We could also have it rely on conditions and move endpoints arbitrarily.
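To make the arbitrary-size variant concrete, here is one way we might sketch it (the function name and the deque-based approach are my own, not from any particular library):

```python
import string
from collections import deque
from collections.abc import Iterable, Iterator


def sliding_window(seq: Iterable, n: int) -> Iterator[tuple]:
    """Yield overlapping windows of size n over seq."""
    window = deque(maxlen=n)
    for item in seq:
        window.append(item)
        # Only start yielding once the window is full.
        if len(window) == n:
            yield tuple(window)


windows = list(sliding_window(string.ascii_lowercase, 4))
# The first window is ('a', 'b', 'c', 'd') and the last is ('w', 'x', 'y', 'z').
```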
This is important to keep in mind, because GCRA uses a windowed view of time to determine if a particular request is limited or not.
Leaky Bucket Limiters
A leaky bucket limiter works much like it sounds: users performing operations fill the bucket, while a hole allows the bucket's capacity to regenerate by leaking out used operations.
Leaky bucket algorithms, however, tend to rely on a separate process to leak used capacity from the bucket. If that process fails, the bucket may fill up and users may be blocked from performing any action. Herein lies the beauty of GCRA: it doesn't rely on a separate process, just an accurate clock, yet it still acts like a leaky bucket.
Our sliding window provides the leak, and time is what we're windowing into (rather than the alphabet).
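To illustrate the contrast, here is a naive leaky bucket sketch (the class and its names are hypothetical, not from a real library). Its weakness is exactly the one described above: something external has to keep calling leak():

```python
class LeakyBucket:
    """A naive leaky bucket limiter."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.level = 0

    def allow(self, quantity: int = 1) -> bool:
        """Fill the bucket by quantity, or refuse if it would overflow."""
        if self.level + quantity > self.capacity:
            return False
        self.level += quantity
        return True

    def leak(self, amount: int) -> None:
        # In practice a background job runs this periodically; if that
        # job dies, the bucket stays full and every request is limited.
        self.level = max(0, self.level - amount)


bucket = LeakyBucket(capacity=3)
results = [bucket.allow() for _ in range(4)]  # the fourth request is limited
bucket.leak(1)
after_leak = bucket.allow()  # capacity regenerated, so this is allowed
```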
GCRA's Windows
The fundamental part of understanding GCRA is knowing that you're calculating the times for the start and end of a window, and that window is when requests are valid. This realization made understanding the rest of GCRA easier for me. Now let's walk through figuring out the windows that GCRA looks for.
Variables Used
There are some variables we should define before going further. Specifically, let's take a short step back to think about rate limiting.
A rate limit is usually defined as the number of operations you can perform within a period of time, e.g., 5,000 HTTP requests per hour. Some services allow burst capacity for their operations, so they may advertise 20,000 operations per hour and give you an extra 10% to burst past that, making your true rate limit 22,000 operations per hour. Let's call the true limit LIMIT and the period of time it covers PERIOD, e.g., LIMIT = 22000 and PERIOD = 3600 (= 60 min/hour * 60 sec/min). Next, let's define KEY as the name of the "bucket". Finally, we'll define QUANTITY as the number of operations being requested and ARRIVED_AT as the time the operation request was received. This might look like:
LIMIT = 22000
PERIOD = 3600
KEY = "operationA/user@example.com"
QUANTITY = 1
ARRIVED_AT = NOW()
From these variables we can derive more information that we need for GCRA. One thing we need to know is the interval at which used capacity is "emitted" (leaked) from the bucket, often called the "emission interval", which we'll call EMISSION_INTERVAL. This can be derived like so:
EMISSION_INTERVAL = PERIOD / LIMIT
We also need to know what the capacity of the bucket is for the window bounds (and remember we're windowing based on time). In this case, it will be our capacity multiplied by our emission interval, which means
DELAY_VARIATION_TOLERANCE = LIMIT * EMISSION_INTERVAL # or DELAY_VARIATION_TOLERANCE = PERIOD
Next, we need to move our window forward each time we handle a request to perform an operation. We move the window based on how costly the operation is which can be determined (loosely) by the QUANTITY of operations being requested and the EMISSION_INTERVAL. This window movement is crucial to our sliding window technique that is being used to leak from our bucket. We can define the INCREMENT as
INCREMENT = EMISSION_INTERVAL * QUANTITY
If we did have actual costs associated with operations, we could factor in a modifier here as well, e.g., EMISSION_INTERVAL * QUANTITY * COST_MODIFIER, but for most web applications today there is not much of a difference in costs.
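With the running example's numbers, the derived values can be computed like this (a sketch; KEY and ARRIVED_AT are omitted since they aren't needed for the derivation):

```python
LIMIT = 22_000  # true limit: operations per period, including the 10% burst
PERIOD = 3_600  # seconds per hour
QUANTITY = 1    # operations being requested

# Each operation's worth of capacity leaks back roughly every 0.16 seconds.
EMISSION_INTERVAL = PERIOD / LIMIT

# The bucket's capacity expressed in time; equal to PERIOD
# (up to floating-point rounding).
DELAY_VARIATION_TOLERANCE = LIMIT * EMISSION_INTERVAL

# How far the window moves forward for this request.
INCREMENT = EMISSION_INTERVAL * QUANTITY
```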
Now, we can focus on the pieces that tend to be where everyone else starts. The first important variable is the "theoretical arrival time", which is commonly referred to as TAT. Based on our TAT and our INCREMENT we determine the new theoretical arrival time, which we'll call NEW_TAT:
NEW_TAT = TAT + INCREMENT
This NEW_TAT defines the end of our new window. To find the start of our window, we need to find the time our next request will be allowed. We calculate this from our DELAY_VARIATION_TOLERANCE because that is the capacity of our bucket and helps us find the starting bound of our window.
ALLOW_AT = NEW_TAT - DELAY_VARIATION_TOLERANCE
Okay, so now we have the five pieces of information we need to find the remaining amount of our limit:
- ARRIVED_AT
- ALLOW_AT
- NEW_TAT
- TAT
- EMISSION_INTERVAL
From here we can calculate the remaining limit as the floor (largest integer that is less than or equal to the value) of:
((ARRIVED_AT - ALLOW_AT) / EMISSION_INTERVAL) + 0.5
In Python this would look like:
import math

remaining = math.floor(
    ((arrived_at - allow_at) / emission_interval) + 0.5
)
The implication of this calculation is that the closer the arrival time of the requests are to ALLOW_AT, the less capacity there is remaining. The further we are from ALLOW_AT (in the future), the more capacity we have remaining.
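To make that implication concrete, here is a small sketch using the running example's emission interval (the helper function is my own):

```python
import math


def remaining(arrived_at: float, allow_at: float,
              emission_interval: float) -> int:
    """Remaining capacity per the GCRA formula above."""
    return math.floor(((arrived_at - allow_at) / emission_interval) + 0.5)


EMISSION_INTERVAL = 3600 / 22_000  # from the running example

# Arriving exactly at ALLOW_AT leaves no spare capacity ...
at_boundary = remaining(arrived_at=100.0, allow_at=100.0,
                        emission_interval=EMISSION_INTERVAL)

# ... while arriving three emission intervals later leaves
# room for three more operations.
later = remaining(arrived_at=100.0 + 3 * EMISSION_INTERVAL, allow_at=100.0,
                  emission_interval=EMISSION_INTERVAL)
```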
So now that we know how to calculate all of this, let's get into some of the nuanced pieces of our algorithm.
Finding Theoretical Arrival Time for the First Request
On the first request ever, we won't have a pre-existing bucket for a user, which means we don't have a pre-existing theoretical arrival time (TAT). In this case, we will use ARRIVED_AT (e.g., NOW()) as our TAT. In effect, our variables will look like:
ARRIVED_AT = NOW()
TAT = ARRIVED_AT
NEW_TAT = ARRIVED_AT + INCREMENT = TAT + INCREMENT
ALLOW_AT = NEW_TAT - DELAY_VARIATION_TOLERANCE
Visually speaking that would look like:
+--------------------------------------------------------------> Time
     |                     |                    |
  ALLOW_AT            ARRIVED_AT             NEW_TAT
     +------- DELAY_VARIATION_TOLERANCE --------+
This request is allowed. Next we need to save NEW_TAT so that it becomes TAT for the next request.
A Request with a Pre-existing Theoretical Arrival Time
There are two cases to handle when there's an existing TAT value:
- TAT is older than ARRIVED_AT
- TAT is newer than ARRIVED_AT
  - There's a case where the requested operation is allowed
  - There's another case where the requested operation is limited
The first case is handled by having your implementation choose the newest timestamp, e.g., MAX(ARRIVED_AT, TAT). In this case, our NEW_TAT can be calculated like so:
NEW_TAT = MAX(ARRIVED_AT, TAT) + INCREMENT
And if TAT < ARRIVED_AT then we essentially have the case for the first request.
The second case is the case where ARRIVED_AT may still be within our window or it may be outside the window. In this case TAT should actually be in the future. So our window may look like:
+--------------------------------------------------------------> Time
       |                 |
   ARRIVED_AT           TAT
       +------- ∆ -------+
In this case we'll then add INCREMENT to TAT and find our ALLOW_AT value to determine the bounds of our window. There are two cases here:

CASE 1 (allowed)

+--------------------------------------------------------------> Time
    |               |               |        |
 ALLOW_AT       ARRIVED_AT         TAT    NEW_TAT
                +------ ∆ ---------+-- I --+
    +------ DELAY_VARIATION_TOLERANCE -----+

CASE 2 (limited)

+--------------------------------------------------------------> Time
    |                   |                   |        |
 ARRIVED_AT         ALLOW_AT               TAT    NEW_TAT
    +---------------- ∆ -------------------+-- I --+
                        +- DELAY_VARIATION_TOLERANCE -+
In the first case, we will still calculate a positive remaining capacity, and we'll save NEW_TAT for future requests. In the second case, however, (ARRIVED_AT - ALLOW_AT) will be negative. We can exit our algorithm early at this point without saving NEW_TAT, and we can calculate two time differences:
The time after which there will be capacity, a.k.a., "retry after"
ALLOW_AT - ARRIVED_AT
The time after which the capacity will be reset, a.k.a, "reset after"
TAT - ARRIVED_AT
And at this point we're done with our algorithm.
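Putting all of the pieces together, here is one possible sketch of the full algorithm in Python. The in-memory dict standing in for shared storage, the function name, and the result type are my own inventions; a production implementation would need atomic reads and writes against shared storage (e.g., Redis):

```python
import math
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateLimitResult:
    limited: bool
    remaining: int
    retry_after: float  # seconds until there is capacity again (0 if allowed)
    reset_after: float  # seconds until the bucket is completely empty


# In-memory stand-in for shared storage, mapping KEY -> TAT.
_tats: dict = {}


def gcra(key: str, limit: int, period: float, quantity: int = 1,
         arrived_at: Optional[float] = None) -> RateLimitResult:
    arrived_at = time.monotonic() if arrived_at is None else arrived_at
    emission_interval = period / limit
    delay_variation_tolerance = period
    increment = emission_interval * quantity

    # First request ever: fall back to ARRIVED_AT. Otherwise take the
    # newest of TAT and ARRIVED_AT, so an old TAT behaves like a fresh
    # bucket (the "TAT is older than ARRIVED_AT" case).
    tat = max(_tats.get(key, arrived_at), arrived_at)
    new_tat = tat + increment
    allow_at = new_tat - delay_variation_tolerance

    diff = arrived_at - allow_at
    if diff < 0:
        # Limited: exit early without saving NEW_TAT.
        return RateLimitResult(
            limited=True,
            remaining=0,
            retry_after=allow_at - arrived_at,
            reset_after=tat - arrived_at,
        )

    # Allowed: save NEW_TAT so it becomes TAT for the next request.
    _tats[key] = new_tat
    return RateLimitResult(
        limited=False,
        remaining=math.floor((diff / emission_interval) + 0.5),
        retry_after=0.0,
        reset_after=new_tat - arrived_at,
    )
```

For example, with limit=5 and period=5.0 (one operation leaking per second), five requests arriving at the same instant are allowed, the sixth is limited with a one-second retry-after, and a request one second later is allowed again.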
Other Resources
- https://brandur.org/rate-limiting is a blog post that goes into more detail about the benefits of different kinds of rate limiting algorithms and has a terser explanation of GCRA.
- https://github.com/throttled/throttled is an implementation in Go
- https://github.com/sigmavirus24/rush has a pure Python implementation