Idempotent Geocoding: Why and How to Make Calls Safe to Retry
Design a geocoding pipeline where every call is safe to retry — idempotency keys, cache-as-dedup, patterns that prevent double charges.
A geocoding pipeline that isn't idempotent is a pipeline that double-charges you, double-writes results to your warehouse, and silently corrupts data the day a worker crashes between "API call succeeded" and "result written to DB." Idempotency isn't optional — it's the property that lets you retry on every failure without consequences. Done right, it's invisible. Done wrong, you find out about it when the AWS bill triples or a customer notices their address was geocoded twice.
This post is the practical playbook: what idempotency means for geocoding specifically, how to design the keys, where the boundary lives, and the four code patterns that make every call safe to retry. Working code in Python throughout. By the end you should be able to retry any geocoding call with confidence.
What "idempotent" actually means
A function is idempotent if calling it N times produces the same effect as calling it once. For geocoding, that means:
- Same input address → same coordinates returned (obviously).
- Same input address called twice → only one API call made (or zero if cached).
- Same input address called twice → one row in the result store, not two.
- Same input retried after a worker crash → no double-charge, no orphan partial state.
The first one is automatic — the geocoder is deterministic. The other three are not. They require explicit design.
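The "one API call, one row" properties can be demonstrated in a few lines. This is a minimal sketch with a fake in-memory geocoder standing in for the paid API; `fake_geocode`, the call counter, and the naive `strip().lower()` key are all illustrative placeholders:

```python
calls = {"count": 0}
cache = {}

def fake_geocode(addr: str) -> dict:
    calls["count"] += 1  # stands in for one billable API call
    return {"lat": 40.0, "lng": -75.0}

def geocode_once(addr: str) -> dict:
    key = addr.strip().lower()  # oversimplified idempotency key
    if key not in cache:        # do the work only on first sight
        cache[key] = fake_geocode(addr)
    return cache[key]

a = geocode_once("123 Main St")
b = geocode_once("123 MAIN ST ")  # a retry, or a duplicate input
assert a == b                     # same result both times
assert calls["count"] == 1        # but only one billable call
```

Two invocations, one effect: that is the whole contract, and everything below is machinery for preserving it across crashes and concurrency.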
Why retries matter
Networks fail in three ways for HTTP calls:
- Request fails to send. No bytes left your machine. You can retry safely — the API never saw the call.
- Request succeeded; response failed. API processed the call, but your client never saw the response. A naive retry double-processes.
- Response succeeded; downstream write failed. API responded; your code crashed before persisting. Retry processes again, double-charges, possibly creates a duplicate result row.
Cases 2 and 3 are where idempotency matters. They're not common — maybe 0.1% of calls in normal conditions, 1–5% during a network or queue problem. But on a million-call month, "0.1% double-charges" is 1,000 duplicate calls. At $0.0005 each that's $0.50. Annoying but not fatal. The real damage is downstream: 1,000 duplicate rows in your warehouse that look subtly wrong.
The fix is to make every call idempotent so a retry is harmless.
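Retries themselves are usually driven by an exponential-backoff wrapper. A hedged sketch of one, only safe to use this bluntly when the wrapped operation is idempotent, which is exactly the property the rest of this post builds:

```python
import random
import time

def retry(fn, attempts: int = 5, base: float = 0.5):
    """Call fn, retrying on any exception with exponential backoff plus jitter.

    Blind retry like this is only correct when fn is idempotent;
    otherwise cases 2 and 3 above turn retries into double-charges.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base * (2 ** i) * random.uniform(0.5, 1.5))
```

With an idempotent `fn`, the retry count becomes a pure reliability knob with no correctness consequences.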
The idempotency key
The mechanism is simple: every logical operation gets a key. Before doing the work, check if a result for that key already exists. If so, return it. If not, do the work, persist the result keyed by the key.
For geocoding, the natural key is a hash of the normalized address — the same hash you'd use for caching and deduplication. One concept, three uses:
```python
import hashlib
from postal.parser import parse_address  # libpostal Python bindings

def normalize_postcode(postcode: str, country: str) -> str:
    """Minimal postcode normalizer: strip whitespace, uppercase.
    Real pipelines need per-country rules (UK spacing, US ZIP+4, etc.)."""
    return postcode.strip().upper()

def idempotency_key(raw_address: str, country: str = 'US') -> str:
    """Stable key derived from normalized address components."""
    components = {label: value for value, label in parse_address(raw_address)}
    parts = [
        components.get('house_number', ''),
        components.get('road', '').lower(),
        normalize_postcode(components.get('postcode', ''), country),
        country.upper(),
    ]
    return hashlib.sha256('|'.join(parts).encode()).hexdigest()[:32]
```

The key has three properties:
- Deterministic. Same input → same key. Always.
- Normalized. "123 Main St" and "123 main street" produce the same key.
- Privacy-preserving. The hash is one-way; no PII leaks if logs are exposed.
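The normalization step is what makes the key survive spelling variants. Since libpostal is heavyweight to install, here is a deliberately simplified stand-in (a regex tokenizer plus a tiny abbreviation table, not a substitute for `parse_address` in production) that shows the principle:

```python
import hashlib
import re

# Illustrative only: real normalization should use libpostal's parse_address.
ABBREV = {"street": "st", "avenue": "ave", "road": "rd", "boulevard": "blvd"}

def simple_key(raw: str, country: str = "US") -> str:
    tokens = re.findall(r"[a-z0-9]+", raw.lower())      # case/punct-insensitive
    tokens = [ABBREV.get(t, t) for t in tokens]          # collapse abbreviations
    canon = "|".join(tokens + [country.upper()])
    return hashlib.sha256(canon.encode()).hexdigest()[:32]

assert simple_key("123 Main St") == simple_key("123 main STREET")   # normalized
assert simple_key("123 Main St") != simple_key("124 Main St")       # distinct
```

Different spellings of the same address hash to one key; genuinely different addresses never collide into one.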
Pattern 1 — Cache-as-dedup (the easy 95% case)
The simplest pattern: put a cache in front of the geocoder. The cache key IS the idempotency key. A retried call hits the cache, returns the same result, makes zero API calls.
```python
import requests

def geocode(raw: str, cache: dict, country: str = 'US') -> dict | None:
    key = idempotency_key(raw, country)
    if key in cache:
        return cache[key]  # idempotent return — no API call
    r = requests.get(
        'https://api.csv2geo.com/v1/geocode',
        params={'q': raw, 'country': country},
        headers={'X-API-Key': API_KEY},  # API_KEY defined elsewhere
        timeout=10,
    )
    r.raise_for_status()
    result = (r.json().get('results') or [None])[0]
    cache[key] = result
    return result
```

This handles case 1 (network failure before send) automatically — no cache write happens, retry is a fresh attempt. It handles case 2 (response lost) on the retry — the API call executes again but the cache is updated; from the *caller's* perspective the operation is consistent.
Limitation: it doesn't dedupe within a single in-flight retry burst. If two workers grab the same job at the same instant, both miss the cache, both call the API. Pattern 2 fixes that.
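Within a single process, that gap can also be closed client-side with a "single-flight" lock per key: concurrent workers that miss the cache for the same key wait on one shared call instead of each hitting the API. A sketch (the per-key lock table and `fake_api` are illustrative; cross-process dedup still needs Pattern 2 or a shared lock service):

```python
import threading

_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()
cache: dict[str, dict] = {}

def _lock_for(key: str) -> threading.Lock:
    with _locks_guard:  # guard makes the per-key lock creation race-free
        return _locks.setdefault(key, threading.Lock())

def geocode_single_flight(key: str, do_geocode) -> dict:
    if key in cache:
        return cache[key]
    with _lock_for(key):       # only one worker per key proceeds
        if key in cache:       # double-check: a peer may have filled it
            return cache[key]
        cache[key] = do_geocode()
        return cache[key]

# Eight concurrent workers, one underlying call:
calls = []
def fake_api():
    calls.append(1)            # stands in for one billable request
    return {"lat": 40.0, "lng": -75.0}

threads = [threading.Thread(target=geocode_single_flight, args=("k", fake_api))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(calls) == 1
```

The double-check inside the lock is what turns eight simultaneous cache misses into a single API call.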
Pattern 2 — Server-side idempotency keys
For tighter guarantees, send an explicit idempotency key with the request. The server stores the key with the result; if the same key arrives within a TTL window, the server returns the cached result without re-running the operation.
The CSV2GEO API supports this via the Idempotency-Key header (Stripe-style):
```python
def geocode_with_key(raw: str, country: str = 'US') -> dict | None:
    key = idempotency_key(raw, country)
    r = requests.get(
        'https://api.csv2geo.com/v1/geocode',
        params={'q': raw, 'country': country},
        headers={
            'X-API-Key': API_KEY,
            'Idempotency-Key': key,
        },
        timeout=10,
    )
    r.raise_for_status()
    return (r.json().get('results') or [None])[0]
```

Server semantics:
- First request with key X: process normally, store `(X, result)` for the TTL window.
- Subsequent requests with key X inside the window: return the stored result, don't bill.
- Same key arrives concurrently while first is in-flight: queue the second behind the first, return the first's result to both.
This handles all three failure cases at the server level. The client doesn't need a local cache for safety, only for performance.
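For intuition, the server-side bookkeeping can be sketched as a TTL'd key-to-result store. A real service would back this with Redis (SET NX plus expiry) and queue concurrent duplicates; this in-memory version only shows the semantics:

```python
import time

class IdempotencyStore:
    """Illustrative key -> (timestamp, result) store with expiry."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, dict]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, result = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]   # expired: treat as a fresh operation
            return None
        return result              # replay: return without re-billing

    def put(self, key: str, result: dict) -> None:
        self._store[key] = (time.monotonic(), result)
```

On a hit, the server answers from the store and skips both the geocode and the billing event; on a miss or expiry it processes normally and records the result.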
Pattern 3 — Idempotent writes downstream
The geocoder is one half of the pipeline; persisting the result is the other. A typical anti-pattern:
```python
def process(addr):
    result = geocode(addr)
    db.execute('INSERT INTO geocoded_rows (addr, lat, lng) VALUES (?, ?, ?)',
               addr, result['lat'], result['lng'])
```

Crash between geocode and INSERT: no row was written, so the retry re-geocodes (cheap, via Pattern 1/2) and inserts cleanly. Crash after the INSERT but before the queue acks: the retry inserts a duplicate row.
The fix is to make the write idempotent. Two options:
Option A — UPSERT with the idempotency key as the unique constraint:
```sql
CREATE TABLE geocoded_rows (
    idempotency_key text PRIMARY KEY,
    addr_raw text NOT NULL,
    lat double precision NOT NULL,
    lng double precision NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);
```

```python
def process(addr):
    key = idempotency_key(addr)
    result = geocode_with_key(addr)
    db.execute('''
        INSERT INTO geocoded_rows (idempotency_key, addr_raw, lat, lng)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (idempotency_key) DO NOTHING
    ''', key, addr, result['lat'], result['lng'])
```

UPSERT on the idempotency key collapses retries into a no-op. No duplicate rows.
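The no-op behavior is easy to verify end to end. Here is a runnable demonstration using SQLite (whose `ON CONFLICT ... DO NOTHING` mirrors the Postgres form, available since SQLite 3.24); the schema is the table above with SQLite types:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE geocoded_rows (
        idempotency_key TEXT PRIMARY KEY,
        addr_raw TEXT NOT NULL,
        lat REAL NOT NULL,
        lng REAL NOT NULL
    )
""")

def write_result(key, addr, lat, lng):
    db.execute(
        "INSERT INTO geocoded_rows VALUES (?, ?, ?, ?) "
        "ON CONFLICT (idempotency_key) DO NOTHING",
        (key, addr, lat, lng),
    )

write_result("abc", "123 Main St", 40.0, -75.0)
write_result("abc", "123 Main St", 40.0, -75.0)  # retry: silently a no-op
count = db.execute("SELECT count(*) FROM geocoded_rows").fetchone()[0]
assert count == 1
```

Two writes, one row: the primary key on the idempotency key does all the work.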
Option B — Two-phase commit with the queue:
If you're using a queue (see Designing a Batch Geocoding Queue), the queue's "ack" itself becomes the commit. The pattern:
- Worker pulls message from queue.
- Worker geocodes (idempotent via Pattern 1/2).
- Worker writes result to DB (idempotent via UPSERT).
- Worker acks the message.
If step 4 fails or the worker crashes before it, the message reappears in the queue (SQS visibility timeout, BullMQ retry config). The retry runs steps 2–4 again. Step 2 hits cache → no double-charge. Step 3 is no-op via UPSERT → no duplicate row. Step 4 succeeds → message acked.
End-to-end idempotent. The cost is one extra UPSERT per call; trivial compared to the API call cost.
Pattern 4 — Idempotent batch operations
Single-row idempotency is one thing; batch operations are harder. A naive batch geocode:
```python
def batch(addrs):
    results = csv2geo_batch(addrs)  # POST /v1/geocode with all addrs
    for addr, result in zip(addrs, results):
        db.insert(addr, result)
```

If the batch call succeeds but the loop crashes mid-write, retrying re-runs the entire batch — refunding nothing, double-charging everything.
The fix is to make the BATCH itself idempotent:
```python
def batch(addrs, batch_id):
    """batch_id is stable per logical batch (e.g. uploaded CSV's hash)."""
    # Check if batch already completed
    if db.exists('SELECT 1 FROM batches WHERE batch_id = ? AND status = ?',
                 batch_id, 'complete'):
        return  # idempotent — already done

    # Mark batch in-progress (atomic: insert-or-update)
    db.execute('''
        INSERT INTO batches (batch_id, status)
        VALUES (?, 'in_progress')
        ON CONFLICT (batch_id) DO UPDATE SET status = 'in_progress'
    ''', batch_id)

    # Process each row idempotently
    for addr in addrs:
        key = idempotency_key(addr)
        if db.exists('SELECT 1 FROM geocoded_rows WHERE idempotency_key = ?', key):
            continue  # already done as part of an earlier retry
        result = geocode_with_key(addr)
        db.execute('''
            INSERT INTO geocoded_rows (idempotency_key, addr_raw, lat, lng, batch_id)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT (idempotency_key) DO NOTHING
        ''', key, addr, result['lat'], result['lng'], batch_id)

    db.execute('UPDATE batches SET status = ? WHERE batch_id = ?',
               'complete', batch_id)
```

This pattern lets you retry the entire batch any number of times. The first call processes everything; subsequent calls skip rows that already have an idempotency key in the result table; the final UPDATE marks the batch complete.
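The stability of `batch_id` is what makes the top-level short-circuit work. One way to derive it, suggested by the docstring above, is to hash the uploaded file's bytes so re-uploading the same CSV maps to the same batch (the filename is deliberately excluded, since the same file re-uploaded under a new name is still the same logical batch; this helper is a sketch of that idea):

```python
import hashlib

def batch_id_for(file_bytes: bytes) -> str:
    """Stable batch id: same CSV content, same batch, every upload."""
    return hashlib.sha256(file_bytes).hexdigest()[:32]

csv_a = b"addr\n123 Main St\n456 Oak Ave\n"
assert batch_id_for(csv_a) == batch_id_for(csv_a)          # deterministic
assert batch_id_for(csv_a) != batch_id_for(csv_a + b"X")   # content-sensitive
```

A user who uploads the same file twice, or a retry that re-submits it, lands on the completed batch and returns immediately.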
When idempotency goes wrong
Three failure modes I've watched cost teams real money or trust:
Idempotency key includes the timestamp
```python
# WRONG — different key every call
import hashlib
from datetime import datetime

def idempotency_key(addr):
    return hashlib.sha256(f'{addr}|{datetime.now()}'.encode()).hexdigest()
```

Looks defensive ("freshness!") but breaks the property: every retry hashes to a new key, so every retry executes the operation again. You've reinvented the bug you were trying to fix.
The right thinking: the key represents the logical operation, not the call attempt. Two calls for "geocode 123 Main St" should have the same key whether they happen 1 second apart or 1 day apart.
Idempotency key includes the user_id
```python
# WRONG (depending on intent)
def idempotency_key(addr, user_id):
    return hashlib.sha256(f'{addr}|{user_id}'.encode()).hexdigest()
```

If User A and User B both geocode "123 Main St", the result is the same — the building hasn't moved. Including user_id means you re-geocode for every user, defeating cache hits across users. The geocoded result is shared infrastructure.
Exception: if your business logic requires per-user audit trails, store the user_id on the WRITE side (in geocoded_rows) but not in the idempotency key.
Idempotency window too short
If the server-side TTL on idempotency keys is 60 seconds, a retry that arrives 65 seconds later is treated as a fresh call. Set the TTL to at least 24 hours; ideally to "forever" for geocoding because addresses don't change.
CSV2GEO's idempotency key TTL is 7 days. Long enough to cover every retry pattern we've seen, short enough that key storage costs don't grow unbounded.
Cost math
Idempotency saves money in two places:
- Skipped duplicate API calls. On a typical batch with 5% network/transient errors triggering retries, that's 5% saved on the API bill.
- Skipped duplicate downstream writes. Free at the application level; a few extra UPSERTs cost nothing.
For a million-call/month pipeline at $0.0005/call:
- Without idempotency, with 5% retry rate: 1,050,000 API calls = $525
- With idempotency: 1,000,000 API calls = $500
Saved: $25/month. Small in dollars; large in trust. The data-quality wins (no duplicate warehouse rows, no orphan state) are uncountable but real — they're the difference between "the geocoding pipeline just works" and "let me check the duplicate count this morning."
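The arithmetic above fits in a tiny calculator. The per-call price and retry rate are this post's illustrative figures, not real pricing:

```python
def monthly_cost(calls: int, price_per_call: float, retry_rate: float,
                 idempotent: bool) -> float:
    """Billed calls per month: retries are free once the pipeline is idempotent."""
    billed = calls if idempotent else int(calls * (1 + retry_rate))
    return round(billed * price_per_call, 2)

assert monthly_cost(1_000_000, 0.0005, 0.05, idempotent=False) == 525.0
assert monthly_cost(1_000_000, 0.0005, 0.05, idempotent=True) == 500.0
```

Plug in your own volume and retry rate; the dollar delta scales linearly, while the data-quality delta does not show up in this formula at all.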
Putting it together
A complete geocoding worker that's idempotent at every level:
```python
def worker(message):
    addr = message.body['addr']
    batch_id = message.body['batch_id']
    key = idempotency_key(addr)

    # If we already wrote this row, no-op (handles retry after partial crash).
    if db.exists('SELECT 1 FROM geocoded_rows WHERE idempotency_key = ?', key):
        message.ack()
        return

    # Geocode (server-side idempotency via header)
    result = geocode_with_key(addr)
    if not result:
        message.ack()  # no-match is a final state
        return

    # UPSERT — collapses concurrent duplicate writes
    db.execute('''
        INSERT INTO geocoded_rows (idempotency_key, addr_raw, lat, lng, batch_id)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (idempotency_key) DO NOTHING
    ''', key, addr, result['lat'], result['lng'], batch_id)
    message.ack()
```

A dozen lines of business logic, idempotent at the cache layer (Pattern 1, implicit), the API layer (Pattern 2, via the Idempotency-Key header), the DB layer (Pattern 3, via UPSERT), and the queue layer (Pattern 4, via the early return on an existing key).
A worker that crashes anywhere in this function can be safely re-invoked. The queue's at-least-once delivery becomes effectively-once processing — the strongest property a distributed system can give you without distributed transactions.
Frequently Asked Questions
What makes a geocoding call idempotent?
An idempotency key derived from the input — typically the SHA-256 of the normalized address components. The key represents the logical operation, not the call attempt. A retry with the same key returns the cached result instead of calling the API again, so a worker crashing mid-batch can be safely replayed without double-charging.
Why should the idempotency key never include a timestamp?
Because then every retry produces a different key, defeating the purpose. A retried call must look identical to the cache, the API, and the database. Hash only stable input — country, normalized street, postcode, house number — never timestamps, request IDs, or user IDs that change between attempts.
How does the queue layer become idempotent?
Check the idempotency key against the destination table before processing the message. If the row already exists (UPSERT with ON CONFLICT DO NOTHING), ack the message and return. At-least-once delivery becomes effectively-once processing — the strongest guarantee a distributed system can give you without distributed transactions.
Does idempotency cost me anything in performance?
Almost nothing. The idempotency check is a single index lookup before each operation — sub-millisecond at any reasonable scale. The savings from skipped duplicate API calls and avoided double-writes dwarf the overhead by orders of magnitude.
What is the relationship between idempotency, caching, and retries?
They are the same machinery wearing different hats. The cache makes retries free (re-issue → cache hit). The idempotency key prevents double-charges if the cache misses. Exponential backoff handles transient failures because retries are guaranteed to converge. Skip any one of the three and the pipeline breaks under real load.
Summary
Three principles:
- The idempotency key represents the logical operation, not the call attempt. Hash the normalized input, never the timestamp.
- Idempotency is end-to-end. API call, downstream write, queue ack — all need to be safe to retry. One weak link breaks the chain.
- Free idempotency comes from the cache. If you've already built the cache for cost reasons, you've already built most of the idempotency machinery. Wire it together.
When everything is idempotent, retries become free and rate limits become tractable. It's the foundation that lets the rest of the pipeline be aggressive about reliability.
Use our batch geocoding tool to convert thousands of addresses to coordinates in minutes. Start with 100 free addresses.
Try Batch Geocoding Free →