Rate Limiting a Geocoding Pipeline: Token Bucket vs Leaky Bucket vs Sliding Window

Three rate-limiting algorithms for geocoding: token bucket, leaky bucket, sliding window. Trade-offs, working code, and which fits your provider.

May 15, 2026

Your geocoding pipeline starts hitting 429 Too Many Requests. The provider's documented limit is "1,000 requests/minute." You're sending in bursts of 20 across 30 workers, but you're also re-trying on errors, and the math doesn't quite add up to "1,000/min." Welcome to client-side rate limiting — the part of the pipeline you only think about after a production incident.

This post is a practical comparison of token bucket, leaky bucket, and sliding window: what they actually do, where each one fits, and the working code that turns "we got 429s for an hour" into "we almost never get 429s." By the end you should have a rate limiter you trust and the math to size it.

Why client-side rate limiting matters

Server-side rate limiting (the 429 response) is a defense. Your pipeline shouldn't depend on it. Three reasons:

Latency penalty. Every 429 costs you a round trip and a Retry-After wait. A pipeline that hits 100 429s for every 1,000 calls wastes 10% of its calls, so it runs at no better than 90% of its theoretical throughput, and the Retry-After waits push it lower still.
Bill surprise. Some providers charge for 429s. Most don't, but you don't want to find out which after the fact.
Coordination across workers. 30 workers each "self-limiting" without a shared counter will burst-collide. The right shape is one shared limiter.

Client-side rate limiting puts the brakes inside your code, before the request leaves the box. The 429s never happen because the call never goes out faster than your declared rate.

The three algorithms

Token bucket

Think of a bucket with capacity B tokens. Tokens drip in at rate R per second. Every request consumes 1 token. If the bucket is empty, the request waits or is rejected.

Properties:

Allows bursts. If the bucket is full and 100 requests arrive, all 100 fire instantly (assuming B ≥ 100); after that, requests go out only as fast as tokens refill, R per second.
Average rate is `R/sec`. Over a long window, you can't exceed it.
Trivial to implement with two numbers: the current token count (fractional, because tokens refill continuously) and the last refill time.

The right shape for most APIs because providers usually advertise rates in this form ("100 RPS sustained, burst to 200"). Token bucket maps directly onto that model.
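The "can't exceed it" property has a precise form that matters when you size B. For a bucket that starts full, the number of requests N admitted in any interval of length T is bounded by the burst plus the refill over that interval. Plugging in the usage example's numbers (capacity 100, 1,000 calls/min refill) for a one-minute interval:

```latex
N(T) \le B + R \cdot T
\qquad
B = 100,\; R = \tfrac{1000}{60}\,\text{s}^{-1},\; T = 60\,\text{s}
\;\Rightarrow\;
N(60\,\text{s}) \le 100 + 1000 = 1100
```

So if the provider counts calls in a strict rolling 60-second window rather than with its own token bucket, a worst-case minute can overshoot "1,000/min" by the full burst. One way to rule that out is to choose B and R so that B + 60·R stays at or below the per-minute limit.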

```python
# token_bucket.py
import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)        # start full
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now

    def acquire(self, tokens: int = 1, block: bool = True) -> bool:
        """Acquire `tokens` tokens. Returns True if acquired, False if non-blocking and unavailable."""
        if tokens > self.capacity:
            raise ValueError("requested more tokens than the bucket can ever hold")
        # Loop rather than recurse: another thread may take the refilled tokens while we sleep
        while True:
            with self.lock:
                self._refill()
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return True
                if not block:
                    return False
                # Sleep just long enough for the shortfall to refill, then re-check
                wait_seconds = (tokens - self.tokens) / self.refill_per_sec
            time.sleep(wait_seconds)

# Usage: 1000 calls/min sustained, burst to 100 at any instant
limiter = TokenBucket(capacity=100, refill_per_sec=1000 / 60.0)

def geocode(addr):
    limiter.acquire()      # blocks if needed
    return call_api(addr)  # call_api: your own HTTP call to the geocoder
```
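Because `TokenBucket` reads `time.monotonic()` itself, its burst-then-pace behavior is awkward to unit-test. A minimal sketch of the same refill arithmetic with the clock passed in makes it easy to check. The class `ClockedTokenBucket` and its `try_acquire(now)` method are hypothetical test helpers, not part of the code above:

```python
class ClockedTokenBucket:
    """Token bucket whose caller supplies the current time (seconds, monotonic)."""

    def __init__(self, capacity: int, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # start full, like TokenBucket
        self.last_refill = now

    def try_acquire(self, now: float, tokens: int = 1) -> bool:
        # Refill for the elapsed time (never negative), capped at capacity, then try to spend
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False


bucket = ClockedTokenBucket(capacity=5, refill_per_sec=1.0)
burst = [bucket.try_acquire(now=0.0) for _ in range(6)]
print(burst)                        # [True, True, True, True, True, False]
print(bucket.try_acquire(now=0.5))  # False: only half a token has refilled
print(bucket.try_acquire(now=1.0))  # True: a full token is back
```

The same pattern (inject the clock, keep the arithmetic pure) works for the other two algorithms and lets you test sizing decisions without real sleeps.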

Leaky bucket

Same bucket metaphor, but the bucket *leaks* at rate R. Requests fill the bucket; if it overflows, requests are rejected.

The practical difference from token bucket: leaky bucket is smoothing. Bursts get queued and released at the leak rate. There's no "burst allowance."

```python
# leaky_bucket.py
import time
from threading import Lock

class LeakyBucket:
    """Leaky bucket as a queue: requests leave at exactly leak_per_sec; overflow is rejected."""

    def __init__(self, capacity: int, leak_per_sec: float):
        self.capacity = capacity
        self.interval = 1.0 / leak_per_sec   # spacing between departures
        self.next_free = time.monotonic()    # earliest time the next request may leave
        self.lock = Lock()

    def acquire(self) -> bool:
        """Wait for this request's slot and return True, or return False if the bucket is full."""
        with self.lock:
            now = time.monotonic()
            start = max(now, self.next_free)
            # Water level = requests already scheduled ahead of this one
            level = (start - now) / self.interval
            if level + 1 > self.capacity:
                return False                 # overflow: reject
            self.next_free = start + self.interval
            wait_seconds = start - now
        time.sleep(wait_seconds)
        return True
```
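The smoothing is easier to see as a schedule than as threads and sleeps. A small deterministic simulation: given sorted arrival times, it returns when each request would leave the bucket, or `None` if it overflowed. The function name `leaky_schedule` is hypothetical; it tracks only the next free departure slot instead of storing a real queue:

```python
from typing import List, Optional


def leaky_schedule(arrivals: List[float], capacity: int,
                   leak_per_sec: float) -> List[Optional[float]]:
    """Departure time for each (sorted) arrival, or None if the bucket overflowed."""
    interval = 1.0 / leak_per_sec
    next_free = float("-inf")  # nothing scheduled yet
    departures: List[Optional[float]] = []
    for t in arrivals:
        start = max(t, next_free)
        level = (start - t) / interval  # requests still queued ahead of this one
        if level + 1 > capacity:
            departures.append(None)     # overflow: rejected
            continue
        departures.append(start)
        next_free = start + interval
    return departures


# A burst of 5 requests at t=0 into a bucket that holds 3 and leaks 2/sec
print(leaky_schedule([0.0] * 5, capacity=3, leak_per_sec=2))
# [0.0, 0.5, 1.0, None, None]
```

Compare with a token bucket of the same size: it would send the first three at t=0. The leaky bucket spaces them 0.5 s apart, which is exactly what a provider that enforces "no more than N per second" strictly wants to see.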

Best fit when:

The provider's limit is "exactly N/sec, no bursting." Some legacy APIs are like this.
You want to spread API load evenly across time, not in bursts. Useful when the provider's per-second limit is enforced strictly and a burst causes 429s even if average rate is below limit.

In practice, token bucket beats leaky bucket for most geocoding use cases because providers want to be friendly to bursty workloads (CSV uploads come in waves), and modern APIs (CSV2GEO included) accept bursts as long as the running average is bounded.

Sliding window

Track the timestamps of all requests within the last N seconds; once the count reaches the limit, wait (or reject) until the oldest one ages out of the window.

Simpler to reason about ("I made M calls in the last 60 seconds, my limit is 1,000/min, so I have 1,000 - M more this minute"). Slightly more memory because you store timestamps:

```python
# sliding_window.py
import time
from threading import Lock
from collections import deque

class SlidingWindow:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # send times of requests still inside the window
        self.lock = Lock()

    def acquire(self) -> None:
        """Block until one more request keeps us within `limit` per `window`."""
        while True:
            with self.lock:
                now = time.monotonic()
                cutoff = now - self.window
                # Drop timestamps that have aged out of the window
                while self.timestamps and self.timestamps[0] <= cutoff:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.limit:
                    self.timestamps.append(now)
                    return
                # Full: wait until the oldest request leaves the window, then re-check
                wait_seconds = self.timestamps[0] + self.window - now
            time.sleep(wait_seconds)

# 1000 calls / 60 seconds
limiter = SlidingWindow(limit=1000, window_seconds=60)
```
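The "I have 1,000 - M more this minute" reasoning maps directly onto the timestamp log. A minimal sketch, assuming the current time is passed in explicitly so it can be inspected and tested; the class `SlidingWindowLog` and its `remaining` / `seconds_until_slot` methods are hypothetical helpers, not part of `SlidingWindow` above:

```python
from collections import deque


class SlidingWindowLog:
    """Sliding-window log with an injected clock, for inspection and tests."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def _expire(self, now: float) -> None:
        # Drop requests that are at least one full window old
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

    def remaining(self, now: float) -> int:
        """Calls still allowed right now: limit minus calls in the last window."""
        self._expire(now)
        return self.limit - len(self.timestamps)

    def seconds_until_slot(self, now: float) -> float:
        """0 if a call may go out now, else time until the oldest call ages out."""
        if self.remaining(now) > 0:
            return 0.0
        return self.timestamps[0] + self.window - now

    def try_acquire(self, now: float) -> bool:
        if self.remaining(now) > 0:
            self.timestamps.append(now)
            return True
        return False


log = SlidingWindowLog(limit=3, window_seconds=10.0)
print([log.try_acquire(t) for t in (0.0, 1.0, 2.0, 5.0)])  # [True, True, True, False]
print(log.seconds_until_slot(5.0))  # 5.0: the call at t=0 leaves the window at t=10
print(log.remaining(10.0))          # 1
```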

Best fit:

When you want to match the provider's reported limit format exactly ("1,000 per minute" → SlidingWindow(1000, 60)).
When debugging is more important than memory (you can dump the timestamps and see the exact request pattern).

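Memory cost: O(limit). For "1,000/min" that's 1,000 timestamps × 8 bytes = 8 KB of raw data; in CPython, where each float is a separate 24-byte object referenced from the deque, it is closer to 32 KB. Negligible either way.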

When you have multiple workers (the shared-state problem)

Single-process is easy. Multi-worker (BullMQ on multiple boxes, SQS workers, multiple Python processes) is where the algorithms get interesting because each worker can't have its own private bucket — they'd collectively exceed the limit.

Two options:

Option A — Distribute the budget statically

If you have 10 workers and a 1,000/min limit, each worker gets 100/min. Simplest, but inefficient when load is uneven (one worker busy, nine idle = wasting 900/min of budget).
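The static split is simple arithmetic, but the burst needs splitting too: ten workers each holding a full burst of 100 could fire 1,000 calls at once. A sketch, assuming every worker knows the total worker count (the function name `per_worker_bucket` is hypothetical):

```python
def per_worker_bucket(limit_per_min: float, workers: int, burst: int) -> dict:
    """Split a provider-wide per-minute limit and burst evenly across fixed workers."""
    if workers < 1:
        raise ValueError("need at least one worker")
    share_per_min = limit_per_min / workers
    return {
        "capacity": max(1, burst // workers),    # each worker's slice of the burst
        "refill_per_sec": share_per_min / 60.0,  # each worker's slice of the sustained rate
    }


print(per_worker_bucket(limit_per_min=1000, workers=10, burst=100))
# capacity 10, refill_per_sec ≈ 1.667 (100 calls/min per worker)
```

Each worker would pass these values to its own `TokenBucket`. Note that `max(1, ...)` lets the total burst slightly exceed the provider's when there are more workers than burst tokens.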

Option B — Shared limiter via Redis

A Redis-backed token bucket that all workers consume from, made atomic by running the check-and-decrement as a Lua script:

```lua
-- token_bucket.lua
-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill_per_sec
-- ARGV[3] = now (unix epoch with subseconds)
-- ARGV[4] = requested tokens

local capacity     = tonumber(ARGV[1])
local refill       = tonumber(ARGV[2])
local now          = tonumber(ARGV[3])
local req          = tonumber(ARGV[4])

local data = redis.call('HMGET', KEYS[1], 'tokens', 'last')
local tokens = tonumber(data[1]) or capacity   -- missing key: start full
local last   = tonumber(data[2]) or now

-- Refill; math.max ignores negative elapsed time from a worker whose clock is behind
tokens = math.min(capacity, tokens + math.max(0, now - last) * refill)

if tokens >= req then
  tokens = tokens - req
  -- HSET takes multiple field/value pairs (Redis >= 4.0); HMSET is deprecated
  redis.call('HSET', KEYS[1], 'tokens', tokens, 'last', math.max(last, now))
  redis.call('EXPIRE', KEYS[1], 3600)
  -- Lua numbers are truncated to integers in replies, so return tokens as a string
  return {1, tostring(tokens)}
else
  return {0, tostring(tokens)}
end
```

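```python
# redis_limiter.py
import time

import redis

r = redis.from_url('redis://localhost:6379/0')
with open('token_bucket.lua') as f:
    # register_script uses EVALSHA and reloads the script if Redis reports NOSCRIPT
    token_bucket = r.register_script(f.read())

def acquire(key='geocode:limit', capacity=100, refill=1000 / 60.0, tokens=1):
    if tokens > capacity:
        raise ValueError('requested more tokens than the bucket can hold')
    while True:
        now = time.time()  # wall clock, so worker clocks should be NTP-synced
        ok, remaining = token_bucket(keys=[key], args=[capacity, refill, now, tokens])
        if ok == 1:
            return
        # Sleep roughly as long as the shortfall takes to refill, then retry
        wait = (tokens - float(remaining)) / refill
        time.sleep(wait)
```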

All workers across all boxes hit the same Redis key, so total throughput across the cluster is bounded by the bucket's refill rate (plus the initial burst). The Lua script makes the check-and-decrement atomic, so two workers can never spend the same token.
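The Lua script is the hardest piece to test, so it can help to keep a pure-Python port of its refill-and-spend arithmetic for unit tests. This is a sketch: the function name `lua_token_bucket` is hypothetical, a plain dict stands in for the Redis hash, and it also ignores negative elapsed time (a guard against worker clocks that disagree). It does not replace testing against a real Redis.

```python
from typing import Dict, Tuple


def lua_token_bucket(state: Dict[str, float], capacity: float, refill: float,
                     now: float, req: float) -> Tuple[int, float]:
    """Pure-Python mirror of the token_bucket.lua logic; `state` stands in for the hash."""
    tokens = state.get("tokens", capacity)  # missing hash: start full
    last = state.get("last", now)
    # Refill for elapsed time, ignoring a clock that appears to run backwards
    tokens = min(capacity, tokens + max(0.0, now - last) * refill)
    if tokens >= req:
        tokens -= req
        state["tokens"], state["last"] = tokens, max(last, now)  # persist only on success
        return 1, tokens
    return 0, tokens


state: Dict[str, float] = {}
print(lua_token_bucket(state, capacity=2.0, refill=1.0, now=100.0, req=1))  # (1, 1.0)
print(lua_token_bucket(state, capacity=2.0, refill=1.0, now=100.0, req=1))  # (1, 0.0)
print(lua_token_bucket(state, capacity=2.0, refill=1.0, now=100.5, req=1))  # (0, 0.5)
```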

Performance: each acquire is one Redis round trip (~500 μs on a LAN) plus the Lua execution (~50 μs). At 1,000 RPS the limiter adds ~0.5 ms per call, negligible next to the geocoding API call itself.

What to set the limits to

Three numbers come from the provider's docs:

Sustained rate (per minute or per second). Set refill_per_sec to this, converted to calls per second.
Burst rate (max in a short window). Set capacity to this.
Concurrent requests cap, if any. Keep (number of workers) × (in-flight requests per worker) at or below this.

For CSV2GEO's free tier (3,000/day): capacity=20, refill_per_sec=3000/(24*60*60) ≈ 0.035. The bucket fills slowly but allows small bursts. For Pro tier (50,000/day = ~35/min sustained): capacity=100, refill_per_sec=35/60 ≈ 0.58.

For higher tiers (250K/day Volume): capacity=500, refill_per_sec=250000/(24*60*60) ≈ 2.9.

Always size slightly under the provider's stated limit. If the provider says "1,000/min", set your client limiter to 950/min. The 5% headroom absorbs clock skew and timing edge cases that would otherwise push you over the line.
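The unit conversions above and the headroom rule fit in a few lines. A sketch that turns a documented "limit per period" into `TokenBucket` arguments; the function name `bucket_params` and its arguments are hypothetical, and applying headroom only to the sustained rate (not the burst) is an assumption:

```python
DAY, MINUTE = 24 * 60 * 60, 60


def bucket_params(limit: float, per_seconds: float, burst: int,
                  headroom: float = 0.05) -> dict:
    """Convert 'limit calls per per_seconds' into TokenBucket arguments, minus headroom."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    sustained_per_sec = limit / per_seconds
    return {
        "capacity": burst,
        "refill_per_sec": sustained_per_sec * (1.0 - headroom),
    }


print(bucket_params(3_000, DAY, burst=20, headroom=0))   # refill_per_sec ≈ 0.0347 (free tier)
print(bucket_params(1_000, MINUTE, burst=100))           # refill_per_sec ≈ 15.83 (950/min)
```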

Honoring `Retry-After` and `X-RateLimit-*` headers

Even with a good client-side limiter, occasional 429s slip through (clock skew, leftover requests from before a config change, provider-side throttle adjustments). The right response: read the headers, sleep, retry.

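```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import requests

def geocode_with_429_handling(addr, max_attempts=5):
    for _ in range(max_attempts):  # a loop: Python does not eliminate tail calls
        limiter.acquire()
        r = requests.get(API_URL, params={'q': addr}, headers={'X-API-Key': KEY}, timeout=10)
        if r.status_code != 429:
            r.raise_for_status()
            return r.json()['results'][0]
        # Retry-After is either delta-seconds ("120") or an HTTP date
        ra = r.headers.get('retry-after', '5').strip()
        wait = int(ra) if ra.isdecimal() else (
            parsedate_to_datetime(ra) - datetime.now(timezone.utc)).total_seconds()
        time.sleep(max(0, wait))
    raise RuntimeError(f'still rate-limited after {max_attempts} attempts')
```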
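Because `Retry-After` may be either a number of seconds or an HTTP date, a bare `int()` breaks on the date form. A standalone parser for both, using only the standard library; the name `parse_retry_after`, the 5-second default, and falling back to that default for an unparseable value are assumptions:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], default: float = 5.0,
                      now: Optional[datetime] = None) -> float:
    """Seconds to wait for a Retry-After value given as delta-seconds or an HTTP date."""
    if value is None:
        return default
    value = value.strip()
    if value.isdecimal():
        return float(value)                  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):          # older Pythons raise TypeError here
        return default
    if when.tzinfo is None:                  # treat a zone-less date as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))                                            # 120.0
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))  # 60.0
print(parse_retry_after("soon"))                                           # 5.0
```

Passing `now` explicitly keeps the date branch testable; in production you would leave it unset.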

CSV2GEO sends three rate-limit headers on every response:

X-RateLimit-Limit — your total allowance per minute
X-RateLimit-Remaining — what's left in the current window
X-RateLimit-Reset — when (unix epoch) the counter resets

Use X-RateLimit-Remaining to detect drift in your local limiter and adjust if it's reading consistently lower than your client thinks:

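```python
def adjust_local_limiter(response):
    """Clamp the in-process TokenBucket to the server's view of the remaining budget."""
    header = response.headers.get('x-ratelimit-remaining')
    if header is None:
        return
    remaining = int(header)
    with limiter.lock:  # read and write tokens under the same lock acquire() uses
        limiter._refill()
        # If server says we have less than client thinks, slow down
        if remaining < limiter.tokens:
            limiter.tokens = float(remaining)
```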
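`X-RateLimit-Reset` is useful in the other direction: when the remaining count reaches zero, it tells you how long to pause. A sketch, assuming the headers behave as described above (Reset as a unix epoch in seconds) and that `headers` is a `requests`-style mapping; `seconds_until_reset` is a hypothetical helper:

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """0 if the server reports calls remaining, else seconds until X-RateLimit-Reset."""
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None or int(remaining) > 0:
        return 0.0  # no headers, or budget left: don't pause
    now = time.time() if now is None else now
    return max(0.0, float(reset) - now)


hdrs = {"X-RateLimit-Limit": "1000", "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": "1700000030"}
print(seconds_until_reset(hdrs, now=1_700_000_000.0))  # 30.0
```

With `requests`, `response.headers` is case-insensitive, so the exact capitalization the server uses does not matter.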

Backoff is not rate limiting

A common mistake: implementing exponential backoff and calling it rate limiting. They're complementary, not substitutes.

Rate limiting prevents you from sending too many requests *per unit time*.
Exponential backoff spaces out *retries after a failure*.

You need both. A pipeline with exponential backoff but no rate limiting will still get 429-stormed every time it sees high success rates and ramps up. A pipeline with rate limiting but no backoff will retry into the same wall on transient errors. The two together produce a pipeline that auto-paces under all conditions.
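To make the division of labor concrete, here is a minimal sketch of the backoff half: it retries only transient failures, with capped exponential delays and full jitter (each delay drawn uniformly between 0 and the exponential ceiling). The names `call_with_backoff` and `TransientError`, and the base and cap values, are assumptions. In this split, 429s go through the `Retry-After` path instead, and each attempt inside `fn` still passes through the rate limiter.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Raised by the wrapped call for 5xx responses and network timeouts."""


def call_with_backoff(fn: Callable[[], T], attempts: int = 5, base: float = 0.5,
                      cap: float = 30.0, sleep: Callable[[float], None] = time.sleep,
                      rand: Callable[[], float] = random.random) -> T:
    """Retry fn on TransientError with full-jitter exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise                                # out of attempts: surface the error
            ceiling = min(cap, base * 2 ** attempt)  # 0.5, 1, 2, 4, ... capped at 30 s
            sleep(rand() * ceiling)                  # full jitter: uniform in [0, ceiling)


calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503")
    return "ok"

waits = []
print(call_with_backoff(flaky, sleep=waits.append, rand=lambda: 1.0))  # ok
print(waits)  # [0.5, 1.0]
```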

The full backoff playbook is in Exponential Backoff.

Cost of getting it wrong

Real numbers from a pipeline I watched:

Before client-side rate limiting: 1M calls/month, 12% 429 rate. Effective successful throughput: 880K calls. Wall-clock time wasted on 429-then-retry: ~6 hours/month.
After token bucket: 1M calls/month, <0.1% 429 rate. Effective throughput: 999K calls. Wall-clock waste: under 5 minutes/month.

The cost was 30 lines of Python. The savings showed up as faster batch completion times — customers got their geocoded CSVs back in 12 minutes instead of 18.

Frequently Asked Questions

Which rate-limiting algorithm should I use for a geocoding pipeline?

Token bucket for 95% of cases. It allows bursts up to a configurable capacity, refills at a steady rate, and is the easiest of the three to implement. Use Redis as the backend if you have multiple workers; use an in-process bucket if you are single-worker.

What is the difference between token bucket and leaky bucket?

Token bucket allows bursts — you save up tokens during quiet periods and can spike when needed. Leaky bucket smooths output to a strict per-second rate, no bursts allowed. Use leaky only if your provider explicitly forbids bursts (rare); otherwise token bucket maps better to real-world geocoding traffic.

How do I tune token-bucket capacity and refill rate?

Set capacity to the provider's documented burst allowance (often 10–50 calls). Set refill_per_sec to the sustained per-minute rate divided by 60 (e.g. 3,000 calls/minute → 50 calls/sec). Start conservative, monitor the 429 rate, increase until you see occasional 429s, then back off 10%.

What should I do when I get a 429 from the geocoder?

Honor the Retry-After header literally. Most providers return seconds; some return an HTTP date. Wait the indicated time, then retry the same call. Do not wrap 429s in your generic exponential backoff — the provider knows when capacity will be available and is telling you exactly when to come back.

Do I need both rate limiting and exponential backoff?

Yes. Rate limiting prevents you from causing 429s in the first place; backoff handles the transient errors that happen anyway (5xx, network timeouts). Skip rate limiting and your pipeline will 429-storm under load. Skip backoff and you fail on every transient error. Combined they produce a pipeline that auto-paces under any conditions.

Summary

| Algorithm | Best for | Memory | Implementation |
|---|---|---|---|
| Token bucket | Most APIs (default) | O(1) | Easy |
| Leaky bucket | Strict per-second limits, no bursts | O(capacity) | Medium |
| Sliding window | Matches "X per Y seconds" framing | O(limit) | Easy |

For 95% of geocoding pipelines: token bucket, with a Redis backend if you have multiple workers. Set capacity to the provider's burst allowance and refill_per_sec to the sustained per-minute rate divided by 60. Read Retry-After on the rare 429 that gets through. Pair with exponential backoff for retry-after-error semantics. Done.

Rate limiting is one of those topics where a small amount of upfront thinking saves a recurring class of incidents. The pipelines that "just work" are the ones where it was designed in from the start.