Exponential Backoff for Geocoding: When to Retry, When to Stop
Exponential backoff for geocoding: which errors to retry, which to fail fast, jitter strategies, and the math that prevents thundering herds.
A geocoding pipeline that retries everything aggressively is the same pipeline that quadruples its API bill during a brief upstream outage and then hits the provider's circuit breaker, making the outage look longer than it actually was. A pipeline that doesn't retry at all is the pipeline that loses 5% of its rows on a 30-second blip and you find out two days later from a customer.
The right shape is exponential backoff with jitter, gated by what kind of error you got. This post is the practical version: which errors deserve retries, how many, with what intervals, and how to add jitter so you don't thundering-herd the recovering provider. Working code in Python.
The errors you'll actually see
Not all errors are retryable. Categorize first, then decide:
| Status / Error | Retry? | Why |
|---|---|---|
| 200 OK | n/a | success |
| 400 Bad Request | no | input is wrong; retrying won't help |
| 401 Unauthorized | no | API key issue; fix config |
| 403 Forbidden | no | permission issue; same |
| 404 Not Found | no | endpoint doesn't exist; URL bug |
| 408 Request Timeout | yes | network blip; transient |
| 429 Too Many Requests | yes (with Retry-After) | you're going too fast; slow down |
| 500 Internal Server Error | yes | server-side bug or overload |
| 502 Bad Gateway | yes | proxy/load-balancer issue |
| 503 Service Unavailable | yes (with Retry-After) | server overloaded |
| 504 Gateway Timeout | yes | upstream slow; retry |
| Network: ECONNREFUSED | yes | server down briefly |
| Network: ETIMEDOUT | yes | network blip |
| Network: ECONNRESET | yes | connection drop |
| getaddrinfo ENOTFOUND | yes, slowly | DNS failure (transient or config) |
The classification rule of thumb: 5xx and network errors are transient; 4xx (except 408 and 429) are permanent. Permanent errors burn API quota on retries that won't help.
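The rule of thumb can be written down as a small predicate. This is a minimal sketch, not a provider-specific implementation: the function name `is_retryable` and its `network_error` flag are hypothetical, and the status set simply mirrors the table above.

```python
from typing import Optional

# Status codes marked "yes" in the table above.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def is_retryable(status: Optional[int] = None, network_error: bool = False) -> bool:
    """Return True when the failure is worth retrying.

    `status` is the HTTP status code, or None when no response arrived.
    `network_error` marks connection-level failures (refused, reset,
    timed out, DNS).
    """
    if network_error:
        return True
    if status is None:
        return False
    if status in RETRYABLE_STATUSES:
        return True
    # Rule of thumb: any other 5xx is transient, everything else is not retried.
    return 500 <= status <= 599


print(is_retryable(503))                 # True
print(is_retryable(400))                 # False
print(is_retryable(network_error=True))  # True
```

Keeping the decision in one function means the retry loop never has to know about individual status codes.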
The basic algorithm
```python
import time
import random


class RetryableError(Exception):
    """Raised by `fn` for transient failures (5xx, 408, 429, network errors)."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """
    Retries `fn` with exponential backoff + jitter.
    `fn` must raise RetryableError on a transient failure; any other
    exception propagates immediately.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # exhausted
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay)  # full jitter
            time.sleep(jitter)
    # Only reachable when max_attempts < 1
    raise ValueError("max_attempts must be at least 1")
```

What's happening:
- `base_delay * (2 ** attempt)` = doubling pattern: 1s, 2s, 4s, 8s, 16s.
- `min(..., max_delay)` caps the wait so you don't sleep for hours.
- `jitter = random.uniform(0, delay)` spreads retries across time. Without this, every client retries at the same instant and DDOSes the recovering server.
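To see the doubling and the cap together, here is a small sketch that prints the upper bound of each sleep window before jitter is applied. The helper name `backoff_windows` is made up for illustration; the defaults match the function above.

```python
def backoff_windows(attempts: int, base_delay: float = 1.0, max_delay: float = 60.0):
    """Upper bound of the sleep window for each attempt index (before jitter)."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(attempts)]


print(backoff_windows(8))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

From the seventh window onward the cap takes over, which is what keeps a long retry chain from sleeping for hours.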
Why jitter matters (the thundering herd)
Imagine 1,000 clients all retry exactly 1 second after a brief 503. They all arrive at the recovering server at the same instant. The server, just back online, gets 1,000 simultaneous requests and falls over again. This is "thundering herd" and it's why naive exponential backoff makes outages last longer than they need to.
Jitter spreads those 1,000 retries across the backoff window (1, 2, or 4 seconds, depending on the attempt). Over a 4-second window the server sees a steady ~250 RPS instead of 1,000 requests arriving at once. It stays up.
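A quick simulation makes the difference visible. This is a toy model under stated assumptions: 1,000 clients, a 4-second backoff window, and load measured as retries per 100 ms bucket. The helper `peak_load` is hypothetical.

```python
import random
from collections import Counter


def peak_load(retry_times, bucket_seconds=0.1):
    """Largest number of retries landing in any single time bucket."""
    buckets = Counter(int(t / bucket_seconds) for t in retry_times)
    return max(buckets.values())


rng = random.Random(42)  # seeded so the run is repeatable
clients = 1000
window = 4.0  # seconds; the backoff window when attempt == 2 and base_delay == 1.0

no_jitter = [window for _ in range(clients)]
full_jitter = [rng.uniform(0, window) for _ in range(clients)]

print(peak_load(no_jitter))    # 1000: every client lands in the same bucket
print(peak_load(full_jitter))  # far lower; the average bucket holds 25 retries
```

The exact jittered peak depends on the seed, but it stays in the same order of magnitude as the 25-per-bucket average rather than anywhere near 1,000.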
Three jitter strategies:
Full jitter (recommended)
```python
jitter = random.uniform(0, base_delay * (2 ** attempt))
```

The retry happens somewhere in [0, exponential_delay]. Maximum spread.
Equal jitter
```python
delay = base_delay * (2 ** attempt)
jitter = delay/2 + random.uniform(0, delay/2)
```

Half the delay is fixed, half is random. Bounds the "earliest possible retry" while still spreading load.
Decorrelated jitter (AWS recommendation)
```python
prev_delay = ...  # tracked across retries
delay = min(max_delay, random.uniform(base_delay, prev_delay * 3))
```

Generates non-monotonic delays: sometimes longer, sometimes shorter than the previous attempt. Best for highly contended retry storms.
For 95% of geocoding pipelines, full jitter is the right default. Use decorrelated jitter only if you're seeing measurable thundering herds in your metrics.
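For a side-by-side feel, the three strategies can be wrapped as functions and run over a few attempts. This is a sketch: the function names are made up, and seeding `decorrelated_jitter` with `base_delay` on the first retry is an assumption, since the snippet above leaves the initial `prev_delay` open.

```python
import random


def full_jitter(attempt, base_delay=1.0, max_delay=60.0):
    window = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, window)


def equal_jitter(attempt, base_delay=1.0, max_delay=60.0):
    window = min(base_delay * (2 ** attempt), max_delay)
    return window / 2 + random.uniform(0, window / 2)


def decorrelated_jitter(prev_delay, base_delay=1.0, max_delay=60.0):
    # prev_delay is the previous sleep; this sketch starts it at base_delay.
    return min(max_delay, random.uniform(base_delay, prev_delay * 3))


prev = 1.0
for attempt in range(5):
    prev = decorrelated_jitter(prev)
    print(f"attempt {attempt}: full={full_jitter(attempt):.2f}s "
          f"equal={equal_jitter(attempt):.2f}s decorrelated={prev:.2f}s")
```

Run it a few times: full jitter will occasionally retry almost immediately, equal jitter never goes below half the window, and decorrelated jitter wanders between `base_delay` and the cap.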
Honoring Retry-After
When the server explicitly tells you when to retry, respect it. Always:
```python
def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    # Assumes RateLimited and RetryableError are exception classes raised by `fn`,
    # and that RateLimited carries an optional `retry_after_seconds` attribute.
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as e:
            if attempt == max_attempts - 1:
                raise  # exhausted; don't fall out of the loop silently
            if e.retry_after_seconds:
                # Server told us when to retry; don't override
                time.sleep(e.retry_after_seconds + random.uniform(0, 1))
            else:
                # No header: fall back to exponential
                delay = min(base_delay * (2 ** attempt), max_delay)
                time.sleep(random.uniform(0, delay))
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, delay))
```

`Retry-After` is the server's authoritative signal. Adding 0–1s of jitter on top is fine (it avoids a thundering herd if the server returns the same `Retry-After` to many clients), but otherwise just trust it.
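One practical wrinkle: HTTP allows `Retry-After` to be either a number of seconds or an HTTP date, so a bare `int()` can fail on some servers. Here is a hedged sketch of a parser that handles both. The name `parse_retry_after` is hypothetical, and whether your provider ever sends the date form is something to check in its docs.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header value to seconds, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    try:
        return max(0.0, float(int(value)))  # delay-seconds form
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(parse_retry_after("120"))   # 120.0
print(parse_retry_after("soon"))  # None
print(parse_retry_after(
    "Wed, 21 Oct 2015 07:28:00 GMT",
    now=datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc),
))                                # 60.0
```

A `None` return is the cue to fall back to exponential backoff, the same way the function above does when the header is missing.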
A complete geocoding worker with backoff
Putting it together with proper error classification:
```python
import requests
import time
import random

API_KEY = 'YOUR_API_KEY'  # placeholder; load from your own config

PERMANENT_STATUSES = {400, 401, 403, 404, 422}
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def geocode(addr, max_attempts=5):
    base_delay = 1.0
    max_delay = 60.0
    for attempt in range(max_attempts):
        last_attempt = attempt == max_attempts - 1
        try:
            r = requests.get(
                'https://api.csv2geo.com/v1/geocode',
                params={'q': addr},
                headers={'X-API-Key': API_KEY},
                timeout=10,
            )
            # Permanent error: fail fast, don't burn retries
            if r.status_code in PERMANENT_STATUSES:
                return {'error': f'permanent_{r.status_code}', 'attempts': attempt + 1}
            # Transient error (including 429)
            if r.status_code in TRANSIENT_STATUSES:
                if last_attempt:
                    return {'error': f'transient_{r.status_code}_exhausted', 'attempts': attempt + 1}
                if r.status_code == 429:
                    # Rate limited: honor Retry-After (seconds form), default 5s
                    try:
                        wait = int(r.headers.get('Retry-After', '5'))
                    except ValueError:
                        wait = 5
                    time.sleep(wait + random.uniform(0, 1))
                else:
                    # Exponential backoff with full jitter
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    time.sleep(random.uniform(0, delay))
                continue
            # Any remaining 4xx/5xx raises requests.HTTPError; otherwise success
            r.raise_for_status()
            results = r.json().get('results', [])
            return results[0] if results else None
        except (requests.ConnectionError, requests.Timeout) as e:
            # Network errors are always retryable
            if last_attempt:
                return {'error': f'network_{type(e).__name__}_exhausted', 'attempts': attempt + 1}
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, delay))
    return {'error': 'unreachable', 'attempts': max_attempts}
```

Three things this does that simpler implementations miss:
- Distinguishes permanent from transient. A 400 Bad Request burns one attempt and stops. A 503 burns up to 5 attempts.
- Honors `Retry-After` separately from exponential. Rate limiting and outages are different problems; treat them differently.
- Returns structured errors. Caller knows whether to log/alert (`network_*_exhausted`) or just record the row as failed (`permanent_400`).
When to stop
Three stopping conditions:
Attempt count
`max_attempts=5` is the default for a reason. The backoff windows double (1, 2, 4, 8, 16 seconds), and because the loop gives up after the fifth failure without sleeping again, the worst-case cumulative wait is 1 + 2 + 4 + 8 = 15 seconds before jitter, on top of the time spent in the requests themselves. If the provider hasn't recovered by then, you're in a real outage, not a blip. Time to give up on this row and move on.
For latency-critical pipelines (real-time API serving), use `max_attempts=3` with a shorter `base_delay=0.5` to fail fast and fall back to a queue.
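When picking these numbers it helps to compute the wait budget rather than guess. A sketch, with a made-up helper name `backoff_budget`: it sums the windows for a given number of sleeps and halves the total for the full-jitter expectation, since each sleep is drawn uniformly from [0, window]. Note that a loop which gives up on its final attempt sleeps at most `max_attempts - 1` times.

```python
def backoff_budget(waits: int, base_delay: float = 1.0, max_delay: float = 60.0):
    """Return (worst_case_seconds, expected_seconds_with_full_jitter) for `waits` sleeps."""
    windows = [min(base_delay * (2 ** i), max_delay) for i in range(waits)]
    worst_case = sum(windows)
    expected = worst_case / 2  # mean of uniform(0, window) is window / 2
    return worst_case, expected


print(backoff_budget(5))                  # (31.0, 15.5)
print(backoff_budget(4))                  # (15.0, 7.5)
print(backoff_budget(2, base_delay=0.5))  # (1.5, 0.75)
```

The last line is the `max_attempts=3`, `base_delay=0.5` configuration: two sleeps, at most 1.5 seconds of waiting in total.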
Total elapsed time
Some pipelines need a hard latency budget. "If we can't geocode this in 30 seconds, return an error to the user."
```python
def call_with_deadline(fn, deadline_seconds=30):
    # Assumes RetryableError and DeadlineExceeded are exception classes defined elsewhere.
    start = time.monotonic()
    attempt = 0
    while time.monotonic() - start < deadline_seconds:
        try:
            return fn()
        except RetryableError:
            remaining = deadline_seconds - (time.monotonic() - start)
            # First window is 1s; never sleep past the deadline
            delay = min(1.0 * (2 ** attempt), remaining)
            if delay <= 0:
                break
            time.sleep(random.uniform(0, delay))
            attempt += 1
    raise DeadlineExceeded()
```

Circuit breaker
If 50% of recent calls have failed, stop calling for 30 seconds. Lets the provider recover without you piling on:
```python
class CircuitBreaker:
    def __init__(self, threshold=0.5, window=60, cooldown=30):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.results = []  # [(timestamp, success_bool), ...]
        self.opened_at = None

    def allow(self) -> bool:
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                return False
            # Cooldown ended; try again with a clean slate so failures recorded
            # before the trip don't immediately reopen the circuit
            self.opened_at = None
            self.results = []
        # Drop old results
        cutoff = now - self.window
        self.results = [(t, s) for t, s in self.results if t > cutoff]
        if len(self.results) >= 10:
            failure_rate = 1 - sum(1 for _, s in self.results if s) / len(self.results)
            if failure_rate >= self.threshold:
                self.opened_at = now
                return False
        return True

    def record(self, success: bool):
        self.results.append((time.monotonic(), success))
```

Use:
```python
breaker = CircuitBreaker()


def geocode_with_breaker(addr):
    if not breaker.allow():
        return {'error': 'circuit_open'}
    result = geocode(addr)  # function with backoff
    # geocode() returns None for "no match", which still counts as a successful call
    failed = isinstance(result, dict) and 'error' in result
    breaker.record(not failed)
    return result
```

When the circuit is open, calls return immediately with `circuit_open`. The pipeline should queue these for retry after the cooldown.
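The queueing step might look like the sketch below. `process_batch` and `fake_geocode` are illustrative names, not part of any library; in a real pipeline you would pass in the breaker-wrapped geocoding function and re-run the deferred list once the cooldown has passed.

```python
from typing import Callable, List, Tuple


def process_batch(addresses: List[str], geocode_fn: Callable[[str], object]) -> Tuple[list, List[str]]:
    """Run geocode_fn over addresses; return (finished, deferred)."""
    finished, deferred = [], []
    for addr in addresses:
        result = geocode_fn(addr)
        if isinstance(result, dict) and result.get('error') == 'circuit_open':
            deferred.append(addr)  # try again after the breaker's cooldown
        else:
            finished.append((addr, result))
    return finished, deferred


# Stand-in for the breaker-wrapped function: pretends the circuit opens on the third call.
calls = {'n': 0}


def fake_geocode(addr):
    calls['n'] += 1
    if calls['n'] >= 3:
        return {'error': 'circuit_open'}
    return {'lat': 0.0, 'lng': 0.0}  # dummy coordinates


finished, deferred = process_batch(['a', 'b', 'c', 'd'], fake_geocode)
print(len(finished), deferred)  # 2 ['c', 'd']
```

Deferred rows never touched the API, so re-running them later costs nothing extra.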
Don't retry permanent errors
A pipeline that retries 400s is a pipeline that calls the API 5× for every bad input row. On a batch with 5% bad inputs, the four extra calls per bad row add 20% to your call volume for nothing.
The most common subtle mistake here: treating "no match" (200 OK with results: []) as an error and retrying. It's not an error — the address legitimately couldn't be geocoded. Retrying produces the same empty result and burns API quota. Treat empty results as a final state and write them as such ({ matched: false }).
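In code, that means mapping every outcome to a final row state instead of feeding empty results back into the retry loop. A minimal sketch, assuming a geocoding function that returns `None` for "no match", a dict with an `error` key for failures, and the first result otherwise; `to_row` is a hypothetical helper.

```python
def to_row(address, result):
    """Map a geocoding return value to a final row state."""
    if result is None:
        # 200 OK with an empty result list: final, not retryable
        return {'address': address, 'matched': False}
    if isinstance(result, dict) and 'error' in result:
        return {'address': address, 'matched': False, 'error': result['error']}
    return {'address': address, 'matched': True, 'result': result}


print(to_row('nowhere 99999', None))
# {'address': 'nowhere 99999', 'matched': False}
```

Keeping "no match" and "failed" distinguishable in the output also makes it easy to re-run only the rows that actually errored.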
Idempotency makes backoff free
Per Idempotent Geocoding, if every call has an idempotency key, retries don't double-charge. That changes the backoff calculus: you can be more aggressive with retries because the worst case is just latency, never billing.
The combination "exponential backoff + idempotency key + cache" is the magic triple that turns a brittle pipeline into one that recovers automatically from any provider issue short of multi-hour outages.
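Here is a rough sketch of the key-plus-cache part of that triple. Everything in it is an assumption for illustration: the key is derived by hashing a normalized address, the cache is an in-memory dict, and `geocode_fn(address, key)` stands for whatever function sends the request along with the key. How the key is transmitted (header name, parameter) depends on your provider.

```python
import hashlib

_cache = {}


def idempotency_key(address: str) -> str:
    """Stable key: same address (ignoring case and extra spaces) gives the same key."""
    normalized = ' '.join(address.lower().split())
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()


def geocode_cached(address, geocode_fn):
    key = idempotency_key(address)
    if key in _cache:
        return _cache[key]
    result = geocode_fn(address, key)
    # Cache only final answers; errors should stay retryable
    if not (isinstance(result, dict) and 'error' in result):
        _cache[key] = result
    return result


def fake_provider(address, key):
    # Stand-in for a real request that would send `key` to the provider
    return {'lat': 0.0, 'lng': 0.0}  # dummy coordinates


first = geocode_cached('350 Fifth Ave, New York', fake_provider)
second = geocode_cached('350  fifth ave,  new york', fake_provider)
print(first is second)  # True
```

The second call never reaches the provider, which is the point: retries and reruns of the same row resolve locally.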
Observability matters
Without logging the right counters, you can't tell if your backoff is working. Track at minimum:
- Per-attempt success rate. Attempt 1 should succeed >95% of the time; attempt 5 should succeed <30%. If attempt 1 success drops, your provider has a problem.
- Total retry count per call. P50 should be 0 (most calls succeed first try). P99 might be 2–3 in normal conditions; if it climbs to 5 you're in incident territory.
- Time spent in backoff vs API call. Backoff time should be a single-digit % of total wall-clock. If it's 30%, your pipeline is fighting the provider — fix rate limiting first.
- Status code distribution. A sudden spike in 503s vs 504s tells you whether the provider's servers are overloaded (503) or its network is slow (504). Different fixes.
The full observability playbook is in Observability for Geocoding Pipelines.
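As a starting point, the first two counters can be kept in memory with a small class like the one below. This is a sketch with made-up names (`RetryMetrics`, `record_call`); in production you would likely push the same numbers to your metrics system instead.

```python
from collections import Counter


class RetryMetrics:
    def __init__(self):
        self.attempts_started = Counter()    # attempt number -> calls that reached it
        self.attempts_succeeded = Counter()  # attempt number -> calls that succeeded on it
        self.retries_per_call = []           # retries used by each finished call

    def record_call(self, attempts_used: int, succeeded: bool):
        for n in range(1, attempts_used + 1):
            self.attempts_started[n] += 1
        if succeeded:
            self.attempts_succeeded[attempts_used] += 1
        self.retries_per_call.append(attempts_used - 1)

    def success_rate(self, attempt: int) -> float:
        started = self.attempts_started[attempt]
        return self.attempts_succeeded[attempt] / started if started else 0.0

    def retry_percentile(self, pct: float) -> int:
        ordered = sorted(self.retries_per_call)
        if not ordered:
            return 0
        index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
        return ordered[index]


metrics = RetryMetrics()
for _ in range(97):
    metrics.record_call(attempts_used=1, succeeded=True)
for _ in range(3):
    metrics.record_call(attempts_used=3, succeeded=True)

print(metrics.success_rate(1))       # 0.97
print(metrics.retry_percentile(50))  # 0
print(metrics.retry_percentile(99))  # 2
```

With 97 first-try successes out of 100 calls, attempt 1 sits at 97%, P50 retries is 0, and P99 is 2, which matches the healthy profile described above.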
Frequently Asked Questions
What is full jitter and why does every retry need it?
Full jitter means the actual delay is a random value between zero and the current exponential backoff window, not a fixed value. Without it, every client retries at the exact same delay after an outage and the recovering server is immediately thundering-herded back into failure. With jitter, retries spread evenly across the backoff window and the server can absorb them.
How many retries should I attempt before giving up?
Five attempts or 30 seconds total elapsed, whichever comes first. Beyond that, the failure is no longer transient — surface it to the user or queue for later batch retry. Infinite retries waste compute and can amplify a partial outage into a worse one.
Which HTTP errors should I retry versus fail fast?
Retry: 408 (timeout), 429 (rate limit — honor Retry-After), 500/502/503/504 (server-side transient). Fail fast: 400 (bad request), 401 (auth), 403 (forbidden), 404 (not found), 422 (invalid input). Retrying a 400 or 401 wastes calls — the input is broken and will not fix itself.
What metrics should I track to know if my backoff is working?
Per-attempt success rate (attempt 1 should be >95%), P50 and P99 retry count per call, time spent in backoff versus API call (should be single-digit % of wall clock), and status-code distribution (sudden 503 spike = server overload, 504 spike = network slow). If attempt-1 success drops below 95%, the provider has a problem — pause and check their status page.
Why is idempotency essential for safe retries?
Without it, every retry risks double-charging or double-writing. With an idempotency key, a retry that arrives at the API after the original silently-succeeded gets the cached result and never re-runs the operation. Idempotency turns retries from a risky trade-off into a free reliability win.
Summary
Three rules:
- Classify errors. Permanent → fail fast. Transient → exponential backoff with full jitter. Rate-limited → honor `Retry-After`.
- Always jitter. Without it, retries thundering-herd the recovering server and you're the cause of the outage extending.
- Stop at 5 attempts or 30 seconds. Whichever comes first. Beyond that, queue for later or surface the failure to the user.
Pair with client-side rate limiting to prevent retries-causing-retries cascades, and idempotency to make every retry safe. The three together produce a pipeline that handles real-world API turbulence without you noticing.
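For the client-side rate limiting piece, a token bucket is one common approach. The sketch below is an illustration, not a recommendation of specific limits: the class name, the rate, and the injectable `clock` parameter (there so it can be tested without sleeping) are all assumptions.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_second: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; refill based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_second=5, capacity=5)
allowed = sum(1 for _ in range(20) if bucket.try_acquire())
print(allowed)  # at least 5: the initial burst, plus any tokens refilled during the loop
```

Call `try_acquire()` before every request, retries included, and wait briefly when it returns `False`. That way a burst of retries can never exceed the rate you have budgeted for.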