Sync vs async batch geocoding: a 1 M-row decision tree
When does synchronous geocoding break down and async batch take over? A decision tree, latency budgets, and working code for 1 M-row logistics pipelines.
Most geocoding integrations start synchronous. A driver app resolves one stop address before rendering the map pin. A customer-facing order tracker converts a depot address on page load. These are fine. The sync path is simple, debuggable, and — for one address at a time — entirely appropriate.
The trouble arrives the day a logistics operations team uploads a shipment manifest with 900,000 rows and expects the geocoded output by morning. Or the day the warehouse batch-imports a quarter's worth of supplier addresses. Or the day the routing engine needs to re-geocode every active stop after a national boundary update. At that point the synchronous pattern does not just slow down — it fails categorically. Connections time out, rate-limit buckets overflow, memory pressure builds because you are holding the entire pending dataset in RAM, and the retry logic that was fine for a hundred rows now creates thundering-herd collisions.
This post gives you the decision tree for choosing between sync and async batch, the latency-budget arithmetic that makes the boundary obvious, and the production patterns for running async batch jobs against the CSV2GEO API for datasets up to and well past 1 M rows. All code is plain HTTP — curl, Python requests, and Node fetch. SDKs exist and are documented, but every pattern here works without them and without version pinning.
The decision tree
Five questions. Answer them in order and you will land in the right pattern before you write a line of code.
1. Is a human waiting for the result right now? If yes — a user is staring at a spinner — you want sync. Async batch is not suitable for interactive latency budgets. Even a well-optimised async job adds seconds of queue overhead. The user-facing path should stay synchronous, cached, and as close to the edge as possible.
2. How many addresses per invocation? Below roughly 50 addresses per trigger event: sync with per-request batching (the /api/v1/geocode endpoint accepts up to 100 addresses per call). Between 50 and ~5,000: sync with a small concurrent worker pool. Above 5,000 per invocation, or any pipeline that triggers more than a few times per minute: async batch, no exceptions. The arithmetic is in the next section.
3. Does your compute node need to stay alive for the full duration? A Lambda, a Cloud Run container, a Fargate task — most serverless runtimes impose a maximum execution time between 5 and 15 minutes. A million-row geocoding job at reasonable throughput takes longer than that. Async batch decouples the submission from the wait, so your orchestrator can submit the job, checkpoint the job ID, and poll or receive a webhook when results are ready. The compute node does not need to babysit.
4. Are your addresses already deduplicated and cached? If a large fraction of addresses repeat — which is true in every logistics context (the depot addresses, the distribution-centre addresses, and the top-100 postcodes appear thousands of times a week) — cache aggressively before deciding you need batch at all. A 70% cache-hit rate on a 200,000-row dataset reduces your effective uncached rows to 60,000, which a sync worker pool handles comfortably. See Caching Geocoding Results — 90% Cost Reduction before provisioning any batch infrastructure.
5. Does the consuming system need results atomically or row-by-row? A routing engine that cannot begin planning until all stops are geocoded needs atomic delivery — submit everything, wait for the complete result set, then proceed. A warehouse system that enriches rows as they arrive and writes them to a database as they complete is row-by-row. Async batch handles both: the job status endpoint tells you when the full set is ready, but you can stream partial results from the output URL as they land.
If you answered "async batch" above, read on. If you landed on sync, the concurrency tuning post has the worker-pool patterns you need.
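The volume-related questions above can be sketched as a small function. This is an illustration rather than part of the CSV2GEO API: the names `Workload` and `choose_pattern` are hypothetical, the 50 and 5,000 thresholds come from question 2, and the cut-off of three triggers per minute is an assumed reading of "more than a few times per minute". Questions 3 and 5 are not modelled because they shape the orchestration, not the sync/async choice.

```python
from dataclasses import dataclass

# Thresholds from question 2; treat them as starting points, not hard limits.
SYNC_SINGLE_CALL_MAX = 50      # below this, one batched sync call is enough
SYNC_POOL_MAX = 5_000          # above this, prefer async batch
MAX_TRIGGERS_PER_MINUTE = 3    # assumption: "more than a few times per minute"


@dataclass
class Workload:
    human_waiting: bool          # question 1
    addresses: int               # question 2: rows per invocation
    triggers_per_minute: float   # question 2: how often the pipeline fires
    cache_hit_rate: float = 0.0  # question 4: fraction in [0, 1]


def choose_pattern(w: Workload) -> str:
    """Return 'sync', 'sync-pool', or 'async-batch' for a workload."""
    if w.human_waiting:
        return "sync"
    if w.triggers_per_minute > MAX_TRIGGERS_PER_MINUTE:
        return "async-batch"
    # Question 4: only uncached rows count towards the volume decision.
    uncached = round(w.addresses * (1.0 - w.cache_hit_rate))
    if uncached < SYNC_SINGLE_CALL_MAX:
        return "sync"
    if uncached <= SYNC_POOL_MAX:
        return "sync-pool"
    return "async-batch"
```

Applying the cache-hit rate before the volume check is a design choice: it keeps a heavily cached workload from being pushed to batch on raw row count alone.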
Latency-budget arithmetic
This is the maths that makes the boundary concrete.
Assume a realistic synchronous geocoding call takes somewhere between 80 ms and 300 ms end-to-end, including DNS resolution, TLS handshake on a warm connection, server processing, and response parsing. Call it 150 ms as a median planning figure for a co-located service with a persistent connection pool. (Your actual number will vary; measure it — see P99 Latency: Why the Average Lies for the right way to do that.)
At 150 ms per call, with a batch size of 100 addresses per call, you are doing 1,000 calls per 100,000 addresses. With a single-threaded synchronous client, that is 1,000 × 0.15 s = 150 seconds per 100,000 addresses. For 1,000,000 addresses: 10,000 calls, 1,500 seconds, about 25 minutes, assuming zero failures and perfect throughput.
In practice you have retries, rate-limit backoffs, and occasional 5xx spikes. Real-world single-threaded wall-clock time for a million rows is typically 30–60 minutes. That is fine if your SLA is "done by tomorrow morning." It is not fine if operations need the routed manifest before the first driver shift starts at 06:00.
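If you want to rerun this arithmetic with your own measured latency, a few lines are enough. The function name `sync_duration_s` is hypothetical, and the model is deliberately idealised: it ignores retries, rate limits, and connection setup, so treat the output as a lower bound.

```python
import math


def sync_duration_s(rows, batch_size=100, call_latency_s=0.15, workers=1):
    """Ideal wall-clock seconds for a sync client, ignoring retries and rate limits."""
    calls = math.ceil(rows / batch_size)
    # Each worker issues its calls back to back; workers run in parallel.
    calls_per_worker = math.ceil(calls / workers)
    return calls, calls_per_worker * call_latency_s


for rows in (100_000, 1_000_000):
    calls, seconds = sync_duration_s(rows)
    print(f"{rows:>9,} rows: {calls:,} calls, {seconds:,.0f} s ({seconds / 60:.1f} min)")
```

Raising `workers` shows how far a sync worker pool can shrink the window before rate limits, which this model does not include, become the real constraint.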
The async batch path changes the constraint. Instead of one thread holding a connection open for 30 minutes, you:
- Submit the full dataset in one or a few HTTP calls.
- Release the thread immediately.
- Poll or receive a webhook when results are ready.
- Fetch the result file.
The server side can parallelise across many workers. Your client side is idle during the processing window. Your compute node can be a short-lived Lambda that submits and exits, with a separate lightweight poller that wakes up every 30 seconds. The architecture reduces from "one long-lived connection holding state" to "a few short calls and a wait."
The break-even is roughly the point where managing concurrency, backoff logic, and connection-pool tuning on the sync path becomes more operational overhead than simply using async batch. In practice that is somewhere around 5,000–10,000 addresses per job, depending on your team's appetite for plumbing.
What the async batch API looks like
CSV2GEO's batch geocoding flow has three steps: submit, poll (or wait for webhook), and fetch.
Submit
    curl -s -X POST "https://csv2geo.com/api/v1/batch/geocode" \
      -H "Content-Type: application/json" \
      -d '{
        "api_key": "'"$CSV2GEO_KEY"'",
        "addresses": [
          {"id": "stop_001", "q": "15 Westfield Ave, Chicago IL 60601"},
          {"id": "stop_002", "q": "900 N Michigan Ave, Chicago IL 60611"}
        ],
        "webhook_url": "https://your-system.example.com/hooks/geocode-complete"
      }'

The response is immediate: the server acknowledges the job and returns a job ID. No addresses have been geocoded yet.

    {
      "job_id": "geo_7f3a291c",
      "status": "queued",
      "submitted": 2,
      "estimated_seconds": 1
    }

For large submissions, send the addresses in the addresses array or upload a CSV file using the multipart form variant; the documentation covers both. For a million rows, chunking into multiple job submissions of 50,000–100,000 addresses each is sensible; it gives you finer-grained progress reporting and means a single transient failure invalidates a smaller fraction of your work.
Poll
If you did not supply a webhook_url, poll the status endpoint:
    curl -s "https://csv2geo.com/api/v1/batch/geocode/geo_7f3a291c/status?api_key=$CSV2GEO_KEY"

    {
      "job_id": "geo_7f3a291c",
      "status": "processing",
      "processed": 44821,
      "total": 100000,
      "pct_complete": 44.8,
      "result_url": null
    }

When status is complete, result_url is populated with a signed URL from which you can fetch the result file. Poll on a sensible interval: 30 seconds is appropriate for a 100,000-row job; 5 minutes is appropriate for a million rows. Polling every second is not sensible and will eat your free-tier quota. A simple Python poller:
    import time
    import requests
    import os

    API = "https://csv2geo.com/api/v1"
    KEY = os.environ["CSV2GEO_KEY"]

    def poll_until_complete(job_id, interval_s=30, timeout_s=3600):
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            r = requests.get(
                f"{API}/batch/geocode/{job_id}/status",
                params={"api_key": KEY},
                timeout=15,
            )
            r.raise_for_status()
            body = r.json()
            status = body["status"]
            pct = body.get("pct_complete", 0)
            print(f" {job_id}: {status} ({pct:.1f}%)")
            if status == "complete":
                return body["result_url"]
            if status == "failed":
                raise RuntimeError(f"Job {job_id} failed: {body.get('error')}")
            time.sleep(interval_s)
        raise TimeoutError(f"Job {job_id} did not complete within {timeout_s}s")

Fetch results
    def fetch_results(result_url, out_path):
        # Stream the file to disk in 64 KiB chunks instead of loading it into memory.
        r = requests.get(result_url, timeout=120, stream=True)
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=65536):
                f.write(chunk)
        print(f" Results written to {out_path}")

The result file is a JSON Lines file: one JSON object per input address, in the same order as submitted, keyed by the id you supplied. Each object carries the geocoded lat, lng, confidence, formatted_address, and any includes you requested (elevation, reverse address, etc.). If an individual address failed to geocode, its entry has "status": "no_match" rather than coordinates; your downstream code handles this the same way it would handle a low-confidence sync result.
The Node equivalent of the same fetch:
    import { createWriteStream } from 'node:fs';
    import { pipeline } from 'node:stream/promises';

    const KEY = process.env.CSV2GEO_KEY;

    async function fetchResults(resultUrl, outPath) {
      const r = await fetch(resultUrl);
      if (!r.ok) throw new Error(`fetch failed: ${r.status}`);
      const dest = createWriteStream(outPath);
      await pipeline(r.body, dest);
      console.log(`Results written to ${outPath}`);
    }

Step-by-step: a 1 M-row logistics manifest
This is the full pattern for a logistics team receiving a shipment manifest every night and needing all stops geocoded before the routing engine runs at 05:00.
Step 1: Deduplicate and cache-check first
Before submitting a single row to the batch API, run your address list through your local geocoding cache. In a typical last-mile network, the top 200 addresses (distribution centre, major retail hubs, common residential postcodes) account for 30–50% of all manifest rows. Geocoding them once and caching the result is the highest-leverage move available.
    import csv
    import sqlite3

    def split_cached_uncached(manifest_path, db_path):
        """Split a manifest into rows we already know and rows we need to geocode."""
        conn = sqlite3.connect(db_path)
        cached, uncached = [], []
        # newline="" is what the csv module expects for files it reads.
        with open(manifest_path, newline="") as f:
            for row in csv.DictReader(f):
                # Normalise the same way the cache writer does, so keys match.
                addr = row["delivery_address"].strip().lower()
                cur = conn.execute(
                    "SELECT lat, lng, confidence FROM geocode_cache WHERE address = ?",
                    (addr,)
                )
                hit = cur.fetchone()
                if hit:
                    row["lat"], row["lng"], row["confidence"] = hit
                    cached.append(row)
                else:
                    uncached.append(row)
        conn.close()
        print(f" Cache: {len(cached)} hits, {len(uncached)} misses")
        return cached, uncached

Even a 40% hit rate on a million-row manifest saves 400,000 address lookups and shaves meaningful time off the job.
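The cache check handles addresses you have seen on previous nights; repeats within tonight's manifest are a separate saving. A minimal sketch of that deduplication step follows. The helper names `normalise` and `dedupe_rows` are hypothetical, and the column names `stop_id` and `delivery_address` are assumed to match the manifest used in this walkthrough.

```python
from collections import defaultdict


def normalise(address):
    """Collapse case and whitespace so trivially different spellings share a key."""
    return " ".join(address.lower().split())


def dedupe_rows(rows, address_field="delivery_address", id_field="stop_id"):
    """Group rows by normalised address.

    Returns (unique, fanout): `unique` holds one {"id", "q"} entry per distinct
    address, and `fanout` maps that entry's id to every stop id sharing it.
    """
    groups = defaultdict(list)
    first_seen = {}
    for row in rows:
        key = normalise(row[address_field])
        groups[key].append(row[id_field])
        first_seen.setdefault(key, row[address_field])
    unique, fanout = [], {}
    for key, stop_ids in groups.items():
        rep_id = stop_ids[0]  # the first stop represents the whole group
        unique.append({"id": rep_id, "q": first_seen[key]})
        fanout[rep_id] = stop_ids
    return unique, fanout
```

Submit `unique` to the batch API, then use `fanout` to copy each result back onto every stop that shares the address. Note that `normalise` also collapses internal whitespace, which is slightly stricter than the `.strip().lower()` used for cache keys; use one normalisation everywhere so keys stay comparable.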
Step 2: Chunk the uncached rows into batch submissions
The batch endpoint handles large arrays well, but individual submissions of 50,000–100,000 rows give you better progress granularity and a smaller blast radius if a network hiccup occurs mid-submission.
    def chunk(lst, n):
        for i in range(0, len(lst), n):
            yield lst[i:i+n]

    # Uses requests, API, and KEY as defined for the poller above.
    def submit_batches(uncached_rows, chunk_size=50_000):
        job_ids = []
        for i, batch in enumerate(chunk(uncached_rows, chunk_size)):
            addresses = [
                {"id": r["stop_id"], "q": r["delivery_address"]}
                for r in batch
            ]
            r = requests.post(
                f"{API}/batch/geocode",
                json={"api_key": KEY, "addresses": addresses},
                timeout=60,
            )
            r.raise_for_status()
            job_id = r.json()["job_id"]
            job_ids.append(job_id)
            print(f" Submitted chunk {i+1}: {len(addresses)} rows → job {job_id}")
        return job_ids

With a million uncached rows chunked at 50,000, you make 20 submission calls. Each returns immediately. Your orchestrator records all 20 job IDs and proceeds to the poll phase.
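A submission call can still fail on a network hiccup, so it is worth wrapping it in a retry with jittered backoff. The helper below is a generic sketch, not CSV2GEO-specific; `with_backoff` and its parameters are hypothetical names, and the delays are illustrative defaults.

```python
import random
import time


def with_backoff(fn, max_attempts=5, base_s=1.0, cap_s=60.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds, sleeping with full jitter between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: a random delay up to the capped exponential bound,
            # so parallel submitters do not retry in lockstep.
            bound = min(cap_s, base_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))
```

In the submit loop you would wrap the `requests.post` call in a small function and pass it in, with `retry_on` narrowed to transient errors such as `requests.ConnectionError` and `requests.Timeout`. One caveat: if a request reached the server but the response was lost, a retry may create a second job for the same chunk, so keep track of which job ID you intend to use for each chunk.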
Step 3: Poll all jobs in parallel
Polling 20 jobs one at a time is fine. Polling them in a small thread pool is slightly more efficient and gives you a single "all complete" signal.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def await_all_jobs(job_ids, interval_s=60, timeout_s=7200):
        # Each worker blocks on one job until it finishes, so at most
        # max_workers jobs are polled at once; the rest start as workers free up.
        with ThreadPoolExecutor(max_workers=5) as pool:
            futures = {
                pool.submit(poll_until_complete, jid, interval_s, timeout_s): jid
                for jid in job_ids
            }
            result_urls = {}
            for f in as_completed(futures):
                jid = futures[f]
                result_urls[jid] = f.result()  # re-raises if the job failed or timed out
                print(f" Job {jid} complete")
        return result_urls

The entire million-row manifest, split across 20 jobs of 50,000 rows each, should complete well within any overnight batch window.
Step 4: Fetch and merge results
Fetch each result file, parse the JSON Lines, write geocoded rows back to the manifest, and populate the cache for future runs.
    import json

    def fetch_and_merge(result_urls, cached_rows, db_path):
        conn = sqlite3.connect(db_path)
        all_rows = list(cached_rows)  # start with cached hits
        for job_id, url in result_urls.items():
            local_path = f"/tmp/{job_id}.jsonl"
            fetch_results(url, local_path)
            with open(local_path) as f:
                for line in f:
                    if not line.strip():
                        continue  # tolerate blank lines in the JSON Lines file
                    obj = json.loads(line)
                    if obj.get("status") == "no_match":
                        obj["lat"] = obj["lng"] = None
                    else:
                        # Write to cache. Assumes each result echoes the input
                        # address as "query" and that geocode_cache columns are
                        # (address, lat, lng, confidence), as in Step 1.
                        conn.execute(
                            "INSERT OR REPLACE INTO geocode_cache VALUES (?,?,?,?)",
                            (obj["query"].strip().lower(),
                             obj["lat"], obj["lng"], obj["confidence"])
                        )
                    all_rows.append(obj)
        conn.commit()
        conn.close()
        return all_rows

Step 5: Handle no-match rows before routing
Not every address geocodes. Common failure modes in logistics manifests: malformed addresses from e-commerce checkout forms, addresses in territories outside our 461M-address coverage for that country, or genuine data-entry errors. Production routing engines must not receive null coordinates — they will either crash or silently route to (0, 0) somewhere off the coast of Africa.
Build the fallback before you go live. A sensible policy for a last-mile operator:
- Confidence ≥ 0.8: accept, pass to router.
- Confidence ≥ 0.5 and < 0.8: flag for dispatcher review in the exceptions queue; hold the stop until manually resolved or approved.
- Confidence < 0.5 or `no_match`: escalate immediately to the operations team; do not route.
The dispatch console post covers the UI pattern for that exceptions queue in detail — it is the same queue whether the failure came from a sync call or a batch job.
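The three-tier policy above can be sketched as a small triage function over the result objects. The names `triage` and `split_for_routing` are hypothetical; the field names (`status`, `lat`, `lng`, `confidence`) are the ones described for the result file. Treating a missing field as a failure is an added assumption, chosen so that null coordinates can never reach the router.

```python
def triage(result, accept_at=0.8, review_at=0.5):
    """Classify one result object as 'route', 'review', or 'escalate'."""
    if result.get("status") == "no_match":
        return "escalate"
    lat, lng = result.get("lat"), result.get("lng")
    confidence = result.get("confidence")
    # Treat missing values as failures so null coordinates never reach the router.
    if lat is None or lng is None or confidence is None:
        return "escalate"
    if confidence >= accept_at:
        return "route"
    if confidence >= review_at:
        return "review"
    return "escalate"


def split_for_routing(results):
    """Bucket result objects by triage outcome."""
    buckets = {"route": [], "review": [], "escalate": []}
    for result in results:
        buckets[triage(result)].append(result)
    return buckets
```

Only the `route` bucket goes to the routing engine; the other two feed the exceptions queue with different urgency.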
Failure modes specific to async batch
Sync geocoding fails fast and loudly — you get a 4xx or 5xx and your retry loop kicks in. Async batch introduces failure modes that are quieter and more insidious.
Silent partial failures. A batch job can complete with status: complete while containing thousands of no_match rows. Always check the no_match_count in the job status response and alert if it exceeds your expected baseline. A manifest that normally has 0.3% no-match rows suddenly showing 8% no-match is a data-quality incident, not a geocoding incident — and you want to know before the router runs.
Stale job IDs. Job results are retained for a fixed period after completion. If your poller crashes, restarts two days later, and tries to fetch the result URL, the signed URL may have expired. Build job ID persistence into your orchestrator state: record each chunk in a database row before you submit it, and write the job ID to that row as soon as the submit call returns, rather than holding it in memory. The retry path is re-submitting the relevant chunk, not panicking.
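One way to make job state survive a restart is a small SQLite table that records each chunk before submission and attaches the job ID once the API returns it. This is a sketch under assumptions: the table name `batch_jobs`, the state values, and all function names are hypothetical, and a production orchestrator may prefer its existing database.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS batch_jobs (
    manifest_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    job_id      TEXT,
    state       TEXT NOT NULL DEFAULT 'pending',
    PRIMARY KEY (manifest_id, chunk_index)
)
"""


def open_state(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def plan_chunk(conn, manifest_id, chunk_index):
    """Record the intent to submit a chunk before any HTTP call is made."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO batch_jobs (manifest_id, chunk_index) VALUES (?, ?)",
            (manifest_id, chunk_index),
        )


def record_job(conn, manifest_id, chunk_index, job_id):
    """Attach the job ID as soon as the submit call returns."""
    with conn:
        conn.execute(
            "UPDATE batch_jobs SET job_id = ?, state = 'submitted' "
            "WHERE manifest_id = ? AND chunk_index = ?",
            (job_id, manifest_id, chunk_index),
        )


def mark_fetched(conn, job_id):
    """Mark a job whose result file has been downloaded and merged."""
    with conn:
        conn.execute("UPDATE batch_jobs SET state = 'fetched' WHERE job_id = ?", (job_id,))


def unfinished(conn, manifest_id):
    """Chunks a restarted poller still has to resume or re-submit."""
    return conn.execute(
        "SELECT chunk_index, job_id, state FROM batch_jobs "
        "WHERE manifest_id = ? AND state != 'fetched' ORDER BY chunk_index",
        (manifest_id,),
    ).fetchall()
```

After a crash, `unfinished` tells the poller which chunks have a job ID to resume polling and which never got one and need re-submitting.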
Webhook delivery failures. If you use webhooks rather than polling, your webhook endpoint can be temporarily unavailable when the job completes. The API will retry delivery with backoff, but design your endpoint to be idempotent — receiving the same completion event twice should not trigger two downstream routing jobs. The idempotent geocoding patterns post applies to webhook receivers as much as to API callers.
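A minimal sketch of an idempotent receiver, independent of any web framework, is shown below. The payload shape is an assumption (a parsed JSON body containing at least `job_id`), and `open_inbox`, `handle_completion`, and the `webhook_seen` table are hypothetical names. The idea is to let a primary key absorb duplicate deliveries.

```python
import sqlite3


def open_inbox(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS webhook_seen (job_id TEXT PRIMARY KEY)")
    return conn


def handle_completion(conn, payload, start_routing):
    """Process a completion event once; return True if this call did the work.

    `payload` is the parsed webhook body. Its shape is an assumption: a dict
    with at least "job_id". `start_routing` is whatever kicks off downstream work.
    """
    job_id = payload["job_id"]
    with conn:  # commits on success, rolls back if start_routing raises
        # The primary key turns a duplicate delivery into a no-op insert.
        cur = conn.execute(
            "INSERT OR IGNORE INTO webhook_seen (job_id) VALUES (?)", (job_id,)
        )
        if cur.rowcount == 0:
            return False  # already handled: acknowledge and do nothing
        start_routing(job_id)
    return True
```

Because the insert and the downstream trigger share one transaction, a failure in `start_routing` rolls the marker back, so a later redelivery of the same event is processed rather than ignored.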
Oversized submissions. Submitting a single job with 2 million rows means a JSON body of 100 MB or more, which exceeds the 100,000-address per-submission limit, risks hitting request-size limits, and makes debugging opaque. Chunk at 50,000–100,000 rows. Smaller jobs also give operations teams a progress bar that means something: "14 of 20 chunks complete, 70%" is a real status. "1 of 1 chunk, 0%" is not.
Cost arithmetic for logistics at scale
A million-row nightly manifest, of which 40% is cache hits, leaves 600,000 addresses to geocode.
600,000 addresses at 1 credit each = 600,000 credits per night.
The CSV2GEO pricing page (csv2geo.com/pricing/api) publishes the tiered rates. At paid brackets, the per-call cost falls as volume rises. At the entry tier ($54/month for 100,000 calls), a million-row-per-night operation consuming 600,000 credits per night would exhaust the included allowance on the first night and bill overages thereafter. The right bracket for a daily manifest of this size is one of the higher-volume tiers; the pricing page has the live numbers and there is no minimum commit or quote process.
The cache point bears repeating. If you invest a day in a proper address-normalisation layer and an LRU cache backed by a small Postgres table, you can often push the cache-hit rate to 60–70% on a mature logistics network where the same zip codes appear daily. That directly reduces your monthly API spend by the same percentage.
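To see how the cache-hit rate moves the bill, the credit arithmetic can be put in a short function. The name `credits_needed` is hypothetical; the model assumes 1 credit per uncached address, as stated above, and a 30-night month.

```python
def credits_needed(rows_per_night, cache_hit_rate, nights=30, credits_per_address=1):
    """Credits consumed when only cache misses reach the API."""
    uncached = round(rows_per_night * (1.0 - cache_hit_rate))
    per_night = uncached * credits_per_address
    return per_night, per_night * nights


for hit_rate in (0.4, 0.6, 0.7):
    per_night, per_month = credits_needed(1_000_000, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {per_night:,} credits/night, {per_month:,} over 30 nights")
```

Multiply the monthly figure by the per-credit rate of your bracket from the pricing page to get a spend estimate.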
Observability for batch jobs
A batch geocoding pipeline that runs silently in the dark is a liability. Instrument these four numbers and alert on them:
- Rows submitted per job — deviations from the expected count indicate upstream data pipeline failures.
- No-match rate per job — baseline this over two weeks and alert at 2× the rolling average.
- Job duration — a job that takes 3× longer than its historical average is a signal worth investigating before the router misses its slot.
- Rows with confidence < 0.8 — separately from no-match, low-confidence rows are candidate errors that did geocode to something, just not reliably. They deserve a separate alert threshold.
The observability for geocoding pipelines post has the full instrumentation pattern including the Prometheus metric names and Grafana dashboard JSON if you want to start there.
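As a starting point before you wire up a metrics stack, the no-match alert can be sketched in plain Python. The function names and the `min_history` guard are assumptions; the 2× rolling-average threshold is the one suggested in the list above.

```python
from statistics import mean


def no_match_rate(results):
    """Fraction of result objects whose status is 'no_match'."""
    if not results:
        return 0.0
    misses = sum(1 for r in results if r.get("status") == "no_match")
    return misses / len(results)


def should_alert(current_rate, history, multiplier=2.0, min_history=5):
    """True when the current rate exceeds `multiplier` times the rolling average."""
    if len(history) < min_history:
        return False  # not enough data for a baseline yet
    return current_rate > multiplier * mean(history)
```

Here `history` is the list of per-job no-match rates from your baseline window, for example the last two weeks of nightly runs.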
Frequently Asked Questions
How large a single batch submission can I make?
The API accepts batches of up to 100,000 addresses per submission call. For larger datasets, chunk into multiple jobs and track the job IDs in your orchestrator state. There is no limit on the total number of concurrent jobs per API key, so submitting 20 jobs of 50,000 each is idiomatic.
Does the batch API cost the same per address as the sync API?
Yes. Each geocoded address costs 1 credit regardless of whether it was resolved via a synchronous call or as part of a batch job. The economic benefit of batch is not per-address cost reduction — it is the elimination of the orchestration overhead (connection management, concurrency tuning, backoff logic) that the sync path would require at scale.
What is the right polling interval for a 500,000-row job?
Poll every 60–120 seconds. The job will take several minutes to complete; polling every 5 seconds burns quota and adds no useful information. If you are using webhooks, set the polling interval to 5 minutes as a fallback in case webhook delivery fails.
Can I include elevation or reverse geocoding in the batch output?
Yes — the batch endpoint accepts the same include parameter as the sync endpoint. include=elevation adds a ground elevation reading to every geocoded result in the job output, with no additional API call required per row. You pay 1 elevation credit per address that successfully geocodes. For a logistics use case, including elevation in the batch job is cheaper than a separate elevation enrichment pass.
What happens to addresses that fail to geocode in a batch job?
They appear in the result file with "status": "no_match" and null coordinates. The job as a whole is still marked complete. Always check no_match_count in the job status response before passing results to a router, and build an exceptions queue for those rows. Do not assume that a complete job means 100% of rows have coordinates.
Our free-tier quota is 3,000 calls per day. Can we prototype batch geocoding on the free tier?
Yes. At 100 addresses per sync call and up to 100,000 rows per batch submission, the free tier (3,000 calls/day) supports meaningful prototype runs — a 3,000-call day covers up to 300,000 address lookups if you fill each call to capacity. For a genuine million-row production pipeline you will want a paid tier, but the free tier is more than enough to validate the integration end to end.
Does the async batch API have a different base URL or auth model?
No. Same base URL (https://csv2geo.com/api/v1), same API key, same api_key query parameter or Authorization: Bearer header. The batch endpoints are /batch/geocode (submit), /batch/geocode/{job_id}/status (poll), and the signed result_url returned when the job completes.
Related Articles
- Running a dispatch console at 5,000 stops per day — the UI and API patterns for the exceptions queue that receives your no-match and low-confidence rows
- Benchmarking geocoding APIs — honest numbers — how to measure sync and batch throughput accurately before you commit to an architecture
- Caching geocoding results — 90% cost reduction — the cache layer that should precede any batch submission to reduce effective volume
- Exponential backoff — when to retry, when to stop — retry policy for batch submission calls and webhook receivers
- Concurrency tuning for geocoding pipelines — if you land on the sync path, this is the worker-pool tuning guide
---
*I.A. / CSV2GEO Creator*