System Design

Rate Limiting Algorithms: Token Bucket, Sliding Window, and Leaky Bucket

Rate limiting is how you protect an API from abuse, enforce fair usage across clients, and prevent a single misbehaving consumer from degrading service for everyone else. The algorithm you choose determines how traffic bursts are handled, how much memory you consume per user, and how easy the implementation is to reason about.

Published June 30, 2026

Every public API needs rate limiting. Without it, a buggy client in an infinite retry loop, a competitor scraping your data, or a targeted DDoS attack can exhaust your servers in seconds. Rate limiting shapes inbound traffic into a pattern your infrastructure can reliably serve. The challenge is picking an algorithm that enforces your policy accurately without penalizing legitimate users who send a brief burst of legitimate requests. This connects naturally to broader caching strategies and error handling patterns like returning 429 Too Many Requests with a Retry-After header.

Token Bucket

The token bucket is the most commonly used algorithm and the one deployed by AWS API Gateway, Stripe, and Twilio. The bucket holds a maximum of capacity tokens. Tokens are added at a fixed refill rate. Each incoming request consumes one token. If the bucket is empty, the request is rejected. If the bucket is full and no requests arrive, tokens accumulate up to the cap but no further.

This naturally permits short bursts (up to the full bucket capacity) while enforcing a long-run average equal to the refill rate. A client that has been idle for a while arrives with a full bucket and can handle a burst of work immediately.

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: max tokens (also the max burst size)
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

    def allow(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Example: allow 10 requests per second, burst up to 20
bucket = TokenBucket(capacity=20, refill_rate=10)
for i in range(25):
    status = "OK" if bucket.allow() else "RATE LIMITED"
    print(f"Request {i+1}: {status}")

Fixed Window Counter

The simplest algorithm: divide time into fixed windows (e.g., each minute) and count requests per client per window. When the count exceeds the limit, reject further requests until the window resets.

import time
from collections import defaultdict

class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(lambda: [0, 0])  # [count, window_start]

    def allow(self, client_id: str) -> bool:
        now = int(time.time())
        window_start = (now // self.window) * self.window
        record = self.counts[client_id]

        if record[1] != window_start:
            record[0] = 0
            record[1] = window_start

        if record[0] < self.limit:
            record[0] += 1
            return True
        return False

The fatal flaw is the boundary burst problem: a client can send limit requests at 00:59 and another limit requests at 01:00, effectively doubling the allowed rate across any one-second boundary span. For loose limits this may be acceptable; for strict enforcement it is not.

Sliding Window Log

Store a timestamp for every request a client makes within the last window period. On each new request, purge timestamps older than one window, count what remains, and allow or reject based on the count.

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.logs: dict[str, deque] = {}

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        cutoff = now - self.window

        if client_id not in self.logs:
            self.logs[client_id] = deque()

        log = self.logs[client_id]
        while log and log[0] < cutoff:
            log.popleft()

        if len(log) < self.limit:
            log.append(now)
            return True
        return False

This is the most accurate algorithm — it eliminates the boundary burst problem entirely. The cost is memory: you store one timestamp per request per client, which becomes expensive at high throughput. At 1000 requests per minute per user across 10,000 active users, that is 10 million stored timestamps.

Sliding Window Counter

A practical hybrid: track the count in the current fixed window and the previous fixed window, then compute a weighted estimate of requests in the rolling period.

estimated_count = (
    prev_window_count * (1 - elapsed_fraction_of_current_window)
    + current_window_count
)

This approximation is accurate to within about 0.1% in practice, requires only two counters per client, and eliminates the boundary burst spike of the fixed window approach. It is the algorithm Cloudflare documented using in their distributed rate limiter.

Leaky Bucket

Incoming requests are placed into a queue (the bucket). A worker drains the queue at a fixed rate, processing one request every 1/rate seconds. If the queue overflows, new requests are rejected. Unlike the token bucket, the leaky bucket enforces a perfectly smooth output rate — there is no burst allowed, only steady throughput.

This is useful when the downstream system cannot handle bursts at all (e.g., a rate-limited third-party API you are proxying). The downside is added latency for queued requests and the complexity of managing the queue itself.

Comparison Table

Algorithm	Burst handling	Accuracy	Memory per client	Implementation complexity
Token bucket	Allows bursts up to capacity	High	O(1) — 2 values	Low
Fixed window	Boundary burst double-spend	Low	O(1) — 2 values	Very low
Sliding window log	No burst; exact rolling count	Exact	O(requests in window)	Medium
Sliding window counter	Approximates rolling count	Very high (~99.9%)	O(1) — 2 values	Low
Leaky bucket	No burst; smoothed output	High	O(queue depth)	High (requires queue)

Distributed Rate Limiting with Redis

In-process rate limiting (as shown above) does not work when your service runs across multiple servers: each instance has its own counters and clients can exceed limits by spreading requests. Redis solves this with atomic operations shared across all instances.

# Sliding window counter in Redis using two keys
import redis
import time

r = redis.Redis()

def is_allowed(client_id: str, limit: int, window: int) -> bool:
    now = int(time.time())
    current_window = now // window
    prev_window = current_window - 1
    elapsed = (now % window) / window

    current_key = f"rl:{client_id}:{current_window}"
    prev_key = f"rl:{client_id}:{prev_window}"

    pipe = r.pipeline()
    pipe.get(prev_key)
    pipe.get(current_key)
    prev_count, current_count = pipe.execute()

    prev_count = int(prev_count or 0)
    current_count = int(current_count or 0)

    estimated = prev_count * (1 - elapsed) + current_count

    if estimated < limit:
        pipe = r.pipeline()
        pipe.incr(current_key)
        pipe.expire(current_key, window * 2)
        pipe.execute()
        return True
    return False

Each key expires after two window lengths, keeping memory usage bounded. The pipeline ensures the read and write happen with minimal round-trips. For stricter atomicity, use a Lua script executed via EVAL, which runs as a single atomic operation on the Redis server.