Geocoding 1 Million Addresses: From 8 Hours to 12 Minutes

Real case study: how we got a 1M-row geocoding job from 8 hours to 12 minutes. Concurrency, batching, caching math.

April 26, 2026

A logistics customer ran a nightly geocode of roughly one million delivery addresses. The job started after the 22:00 cutoff for next-day routing and was supposed to finish before drivers logged in at 06:00. For months, it usually did. Then volume crept up, the job started bleeding into the morning, and one bad night it finished at 09:15 — by which point dispatch had already cut routes by hand. The eight-hour pipeline had become a business problem.

Five changes later, the same nightly job runs in twelve minutes. Nothing exotic. No new vendor, no rewrite. Just bounded concurrency, a batch endpoint, a cache layer, a streaming pipeline, and a normalization pre-pass — applied in the order that gave each one room to matter. This post is the play-by-play, with the actual numbers we measured and the math behind why each step did what it did.

The starting point: 8 hours

The original loop was the kind of code that looks reasonable in a code review and behaves like a brick at scale. One worker, sequential calls, nothing async:

# original.py — the eight-hour version
import csv, requests

session = requests.Session()  # one connection pool, reused across every call

with open('addresses.csv') as f:
    rows = list(csv.DictReader(f))

results = []
for row in rows:
    r = session.post('https://api.csv2geo.com/v1/geocode', json=row, timeout=30)
    results.append(r.json())

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=results[0].keys())
    writer.writeheader()
    writer.writerows(results)

The end-to-end time per row averaged a shade under 30 ms — TLS amortized away by the session's connection pool, server-side geocode in 5-15 ms, network both ways. Call it 30,000 seconds for a million rows, or eight hours and twenty minutes of theoretical wall clock; with CSV parse and write included, the measured job came in at roughly eight hours fifteen.

There was no headroom. A 50 ms blip in p99 latency added an hour. A backoff-and-retry loop on a few thousand 429s could add hours more. Worse, the loop loaded all rows into memory before the first request went out, which meant a 400 MB resident set on a small worker box and zero progress visible to monitoring until the job either finished or fell over.

The first thing to internalize: a single-threaded loop is doing one thing the whole time — *waiting on the network*. The CPU is idle. The TCP socket is idle for most of the round trip. The eight hours is almost entirely sleep. That is exactly the workload concurrency exists for.

Step 1: bounded concurrency (8 hours → 90 minutes)

We introduced a semaphore-bounded worker pool at concurrency 16. Sixteen in-flight requests at any moment, each averaging 30 ms, gave a theoretical throughput of ~533 rows per second — call it 1,875 seconds for a million rows. Measured: 1,920 seconds, or 32 minutes pure compute. With CSV I/O and the long tail, the run came in at about 90 minutes end-to-end.

# step1.py — bounded concurrency
import asyncio, aiohttp, csv

CONCURRENCY = 16

async def geocode_one(session, sem, row):
    async with sem:  # caps in-flight requests at CONCURRENCY
        async with session.post(
            'https://api.csv2geo.com/v1/geocode', json=row,
            timeout=aiohttp.ClientTimeout(total=30),
        ) as r:
            return await r.json()

async def main():
    with open('addresses.csv') as f:
        rows = list(csv.DictReader(f))
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[
            geocode_one(session, sem, row) for row in rows
        ])

results = asyncio.run(main())

Why 16 and not 64? Because the curve flattens. We benchmarked at 1, 8, 16, 32, 64, and 128 in-flight. The numbers:

| Concurrency | Throughput (rows/sec) | p99 latency (ms) | Effective speedup |
|---:|---:|---:|---:|
| 1 | 33 | 35 | 1x |
| 8 | 250 | 38 | 7.5x |
| 16 | 533 | 42 | 16x |
| 32 | 760 | 78 | 23x |
| 64 | 870 | 210 | 26x |
| 128 | 880 | 540 | 26x |

Throughput plateaus around 32-64 because we are saturating the upstream's per-account fairness budget. The p99 column tells the real story: past 16, we trade marginal throughput for tail latency that triggers our retry logic — and retries are pure waste. Sixteen was the knee of the curve. The deeper version of this analysis is in the post on finding your geocoder's concurrency sweet spot.
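
If you want to reproduce the sweep on your own workload, a minimal harness looks something like this — a sketch, assuming the geocode_one coroutine from step 1 is in scope and that sample.csv is a placeholder for a ten-thousand-row slice rather than the full million:

# sweep.py — measure throughput and p99 per concurrency level (sketch)
import asyncio, aiohttp, csv, time

async def timed(session, sem, row, latencies):
    start = time.perf_counter()
    result = await geocode_one(session, sem, row)  # coroutine from step 1
    latencies.append(time.perf_counter() - start)
    return result

async def sweep(path='sample.csv'):
    with open(path) as f:
        rows = list(csv.DictReader(f))
    for concurrency in (1, 8, 16, 32, 64, 128):
        sem = asyncio.Semaphore(concurrency)
        latencies = []
        async with aiohttp.ClientSession() as session:
            t0 = time.perf_counter()
            await asyncio.gather(*[timed(session, sem, r, latencies) for r in rows])
            elapsed = time.perf_counter() - t0
        latencies.sort()
        p99 = latencies[int(len(latencies) * 0.99)] * 1000
        print(f'{concurrency:4d} in-flight: {len(rows)/elapsed:6.0f} rows/s, p99 {p99:.0f} ms')

asyncio.run(sweep())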

The intermediate result was useful: the eight-hour job was now ninety minutes, and we had not changed anything about the addresses, the API, or the pricing. A sixteen-line change moved the bottleneck from "wall clock" to "API call cost per row."

Step 2: batch endpoint (90 min → 35 min)

The single-row endpoint has fixed per-request overhead — TLS, HTTP framing, JSON envelope, server-side request setup. At 30 ms per call, somewhere between 5 and 10 ms of that is overhead the geocoder is not doing useful work for. Across one million calls, that is 80-170 minutes of aggregate waste.

The batch endpoint takes up to 1,000 rows per POST. Per-batch latency was 320 ms — meaning per-row amortized cost dropped to 0.32 ms. With concurrency 16, in-flight batches gave a theoretical 50,000 rows/sec ceiling, well above what any other component could feed.

# step2.py — batch posts
BATCH_SIZE = 1000

def chunked(iterable, n):
    """Yield lists of up to n items from iterable."""
    buf = []
    for x in iterable:
        buf.append(x)
        if len(buf) == n:
            yield buf
            buf = []
    if buf:
        yield buf

async def geocode_batch(session, sem, batch):
    async with sem:  # same semaphore as step 1, now bounding in-flight batches
        async with session.post(
            'https://api.csv2geo.com/v1/geocode/batch',
            json={'addresses': batch}, timeout=60,
        ) as r:
            return (await r.json())['results']
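
For completeness, the driver is the same gather shape as step 1, just one task per thousand-row batch instead of per row — a sketch, reusing CONCURRENCY, chunked, and geocode_batch from the snippets above:

# step2 driver (sketch)
import asyncio, aiohttp, csv

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    with open('addresses.csv') as f:
        async with aiohttp.ClientSession() as session:
            batches = await asyncio.gather(*[
                geocode_batch(session, sem, batch)
                for batch in chunked(csv.DictReader(f), BATCH_SIZE)
            ])
    # flatten per-batch result lists back into one stream of rows
    return [res for batch in batches for res in batch]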

Measured wall time for the geocode phase dropped from 32 minutes to 8 minutes. Add CSV I/O and the run came in at 35 minutes end-to-end. The lesson: when per-call overhead is a meaningful fraction of total latency, batching is not a 2x optimization, it is a 3-4x optimization, and the larger your batch the closer you get to the theoretical ceiling.

The corollary nobody mentions: batch endpoints change your error model. A failed batch is not "one row failed" — it is "1,000 rows are now in an ambiguous state." Idempotent retries become non-negotiable. We covered that pattern in designing a batch geocoding queue, and it is worth reading before you swap from singles to batches in production.

Step 3: cache layer (35 min → 18 min)

We instrumented the input distribution and found something obvious in retrospect: this customer's "one million addresses" was actually about 380,000 distinct addresses. Drivers visit the same buildings on rotation. Standing orders. Recurring deliveries. The hit rate against an empty cache on night one was 0%; by night three it stabilized at ~62%.
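
The instrumentation does not need to be fancy — one pass over the input, counting distinct normalized keys, puts a ceiling on what a warm cache can ever give you. A sketch, with the field list mirroring the cache key we ended up using:

# instrument.py — measure the duplicate rate of the input (sketch)
import csv

FIELDS = ('country', 'postcode', 'city', 'street', 'house_number')

def cache_ceiling(path):
    seen, total = set(), 0
    with open(path) as f:
        for row in csv.DictReader(f):
            total += 1
            seen.add('|'.join(str(row.get(k, '')).lower().strip() for k in FIELDS))
    # the duplicate fraction is the best-case steady-state hit rate
    print(f'{total:,} rows, {len(seen):,} distinct -> hit-rate ceiling {1 - len(seen)/total:.0%}')

cache_ceiling('addresses.csv')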

We added a Redis-backed cache in front of the batch endpoint, keyed on the SHA-256 of a normalized country|postcode|city|street|house_number string. Cache hits served in roughly 0.4 ms (a single Redis GET, no JSON parse beyond the cached blob). Cache misses fell through to the batch endpoint as before.

# step3.py — cache wrapper
import hashlib, json, os
from redis import Redis

r = Redis.from_url(os.environ['REDIS_URL'])

def key(addr):
    parts = [str(addr.get(k, '')).lower().strip()
             for k in ('country', 'postcode', 'city', 'street', 'house_number')]
    return 'geo:v1:' + hashlib.sha256('|'.join(parts).encode()).hexdigest()

async def cached_batch(session, sem, batch):
    keys = [key(a) for a in batch]
    cached = r.mget(keys)  # one round trip for the whole batch
    misses = [(i, batch[i]) for i, c in enumerate(cached) if c is None]
    fresh = await geocode_batch(session, sem, [a for _, a in misses]) if misses else []
    pipe = r.pipeline()
    for (idx, _), result in zip(misses, fresh):
        pipe.setex(keys[idx], 30 * 86400, json.dumps(result))  # 30-day TTL
    pipe.execute()
    out = [json.loads(c) if c else None for c in cached]
    for (idx, _), result in zip(misses, fresh):
        out[idx] = result
    return out

At a 60% steady-state hit rate, only 400,000 of the million rows actually called the upstream geocoder. That cut the geocode phase from 8 minutes to about 3.5 minutes. Add Redis lookups (still pipelined, ~0.4 ms × 1,000 batches = 400 ms total) and CSV I/O, and the end-to-end run was 18 minutes.

We covered the full TTL strategy, normalization, and version-prefix invalidation pattern in how to cache geocoding results. The short version: cache forever for forward geocoding, normalize before keying, hash for PII safety, and absolutely cache no_match results — your customer's typos will haunt you forever otherwise.

Step 4: streaming pipeline (18 min → 14 min)

The pipeline still loaded the input CSV into a list before the first request went out, then accumulated all results in memory before writing. At a million rows that was 400 MB resident, peaking around 700 MB during the result-merge step. The job ran on a 1 GB worker. We had survived this far on luck.

We replaced the load-all/process-all/write-all shape with a streaming generator pipeline: read a row, batch it with up to 999 others, hand the batch off to an async worker, write each row's result the instant it came back. Bounded buffers between stages applied backpressure when downstream stalled.

# step4.py — streaming pipeline
import asyncio, aiohttp, csv

async def producer(path, queue, batch_size=1000):
    with open(path) as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                await queue.put(batch)  # blocks when the queue is full: backpressure
                batch = []
        if batch:
            await queue.put(batch)
    await queue.put(None)  # sentinel: no more input

async def worker(in_q, out_q, session, sem):
    while True:
        batch = await in_q.get()
        if batch is None:
            await out_q.put(None)
            break
        results = await cached_batch(session, sem, batch)
        await out_q.put((batch, results))

async def writer(out_q, path):
    with open(path, 'w', newline='') as f:
        w = None
        while True:
            item = await out_q.get()
            if item is None:
                break
            batch, results = item
            for inp, res in zip(batch, results):
                row = {**inp, **(res or {'status': 'no_match'})}
                if w is None:  # header comes from the first merged row
                    w = csv.DictWriter(f, fieldnames=row.keys())
                    w.writeheader()
                w.writerow(row)

async def main():
    in_q = asyncio.Queue(maxsize=4)   # bounded buffers between stages
    out_q = asyncio.Queue(maxsize=4)
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            producer('addresses.csv', in_q),
            worker(in_q, out_q, session, sem),
            writer(out_q, 'out.csv'),
        )

Wall-clock improvement was modest: 18 minutes to 14 minutes. The compute phase did not change much — it was already fast. What changed was that resident memory dropped from 700 MB to 35 MB, the worker no longer needed a 1 GB box, and we could now run multiple jobs concurrently on the same machine. More importantly, the pipeline started writing the first results inside the first 30 seconds of the run, so monitoring saw progress and on-call no longer panicked at minute four when the job appeared frozen.

The pattern matters more than the time saved. A pipeline with bounded buffers between stages is one that scales linearly with input size; a load-all/process-all pipeline does not. The full discussion of backpressure, queue sizing, and OOM avoidance is in streaming geocoding at scale.

Step 5: pre-cleaning low-confidence rows (14 min → 12 min)

Instrumentation showed about 8% of rows came back as no_match on the first pass and were retried by an outer loop with a wider geocoder fallback. Looking at the inputs, most of the no-matches were not bad addresses — they were dirty inputs the geocoder could not parse. Things like "123 Main St., Apt 4B, NYC, NY 10001" where the unit number bled into the street field, or "123 main street ny ny" with no separators.

We added a libpostal-style normalization pass before the cache lookup. Strip unit numbers into their own field. Collapse street-suffix spellings to their USPS abbreviations. Move bleed-through state codes out of city names. Pre-clean took about 0.05 ms per row — negligible — and dropped the no-match rate from 8% to about 0.6%.

# step5.py — normalization pre-pass
import re

UNIT_PATTERN = re.compile(r',?\s+(?:apt|unit|ste|suite|#)\s*[\w-]+', re.I)
SUFFIX_PATTERN = re.compile(r'\b(street|avenue|boulevard|road|drive|lane)\b')
USPS_ABBR = {'street': 'st', 'avenue': 'ave', 'boulevard': 'blvd',
             'road': 'rd', 'drive': 'dr', 'lane': 'ln'}

def normalize_row(row):
    street = row.get('street', '')
    m = UNIT_PATTERN.search(street)
    if m:
        # pull the unit designator out of the street field entirely
        row['unit'] = m.group(0).strip(', ')
        street = UNIT_PATTERN.sub('', street)
    # collapse suffix spellings to USPS abbreviations on word boundaries
    row['street'] = SUFFIX_PATTERN.sub(
        lambda m: USPS_ABBR[m.group(1)], street.lower()).strip()
    return row

Eliminating the retry loop saved about two minutes of wall time on a typical run. More importantly, the rows that previously fell into the wider fallback (slower, more expensive) now hit the primary geocoder on the first try — pulling them out of a 200-300 ms latency tier and into the 5-15 ms tier. The full taxonomy of what to normalize and what not to touch is in address parsing before geocoding.

The cumulative result, twelve minutes for one million rows, is approximately 1,389 rows per second sustained. Forty times faster than where we started. Same hardware, same upstream API, same dataset.

What didn't help

For honesty's sake, three things we tried that did not move the needle:

HTTP/3. We swapped the client to a QUIC-capable HTTP library expecting fewer round-trip stalls on cold connections. Measured difference at our concurrency level: about 4 ms per batch, which is noise across the full run. The connection pool was already warm — HTTP/2 multiplexing was doing the work HTTP/3 would have done. If you are running short-lived workers on cold pools, your mileage may vary.

gRPC. The geocoding API offered a gRPC endpoint with protobuf payloads. Theoretically faster — smaller wire format, no JSON parse. Measured: 2-3 ms per batch faster, but an extra week of integration work, a worse error model on partial batch failures, and worse debuggability. The cost-benefit didn't justify it for batch geocoding. (For low-latency single-row interactive use cases, gRPC is more compelling.)

Regional pinning. The customer's workers ran in us-east-1, the API has a multi-region front door, and the assumption was forcing requests to the nearest backend would shave RTT. We measured 1-2 ms improvement, no statistically significant throughput change at our concurrency. The variance from concurrent load on the multi-tenant pool was bigger than the gain. Worth it for single-digit-ms latency budgets, not for batch.

The pattern in all three: optimizations that look attractive in microbenchmarks often disappear in production traffic at the concurrency where the system actually runs. Measure first.

Cost per million

The eight-hour pipeline cost roughly $500 per million rows: one million billed lookups at $0.50 per thousand. After all five optimizations:

| Phase | Billed lookups | API spend |
|---|---:|---:|
| 1M raw rows | 1,000,000 | $500 |
| After cache (60% hit) | 400,000 | $200 |
| After pre-clean (no_match drop, fewer fallbacks) | ~370,000 | $185 |

The bill went from $500 to $185 per million — a 63% reduction in API spend, on top of the 40x speedup. Per year, on the customer's 30M-row monthly volume, that is $113,400 saved ($180,000 → $66,600) without a single contract renegotiation. The cache TTL strategy and the math at higher hit rates are detailed in the caching post.

A modest engineering investment — call it three weeks across two engineers — pays for itself in six weeks at the customer's rate, and continues compounding.

Frequently Asked Questions

Why didn't you start with the cache?

Because cache hit rate is empirical. With a single-threaded 30 ms loop, you cannot generate enough request volume to even measure your hit rate within a useful timeframe. We needed concurrency first to observe traffic patterns, then batching to make the cache lookup amortized cost negligible relative to the upstream call. The order matters: each step opened the budget for the next.

What concurrency should I start with?

Sixteen is a reasonable default. Benchmark up from there in powers of two, watching p99 latency. Stop when p99 starts climbing faster than throughput improves — that is the point where you are creating queue depth at the upstream and your retries are eating your gains. The full method is in the concurrency tuning post.

Is 60% cache hit rate normal?

For repeat-visit workloads (logistics, field service, recurring B2B), yes, often higher. For one-shot list-cleaning workloads (a marketing list you geocode once), the hit rate against your own cache will be near zero — but the per-job hit rate within a single batch can still be 5-15% on duplicates, and it is still worth caching for the next run. Your hit rate is a property of your traffic, not the cache.
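
Capturing that intra-job 5-15% does not even require Redis — dedupe in memory, geocode each distinct address once, and fan the results back out. A sketch, reusing key, chunked, geocode_batch, and BATCH_SIZE from the steps above:

# dedupe.py — intra-job dedup for one-shot workloads (sketch)
import asyncio

async def geocode_deduped(session, sem, rows):
    distinct = {}  # key -> first row seen with that address
    for row in rows:
        distinct.setdefault(key(row), row)
    batches = await asyncio.gather(*[
        geocode_batch(session, sem, b)
        for b in chunked(list(distinct.values()), BATCH_SIZE)
    ])
    flat = [res for b in batches for res in b]
    by_key = dict(zip(distinct.keys(), flat))  # same insertion order as values()
    return [by_key[key(row)] for row in rows]  # every duplicate gets the shared result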

What if my geocoder doesn't have a batch endpoint?

The first three optimizations (concurrency, cache, streaming) get you most of the way without batching. We measured the comparable pipeline with single-row calls only: 22 minutes instead of 12. Still 22x faster than the starting point. If you have any influence over the upstream, ask for a batch endpoint — it is the highest-leverage feature they could ship for high-volume customers.

Did you parallelize CSV reads?

No. CSV parse is fast — at a million rows it was about 8 seconds with the standard library. Parallelizing it is a footgun: two readers on the same file produce duplicate or out-of-order rows unless you split the file first, which adds complexity for a small win. The right move is to make sure the reader streams (does not load all rows into memory) and feeds the worker pool steadily. Bottlenecks were always downstream of CSV.

How do you handle partial batch failures?

Retry the failed rows individually with idempotency keys. The batch endpoint returns per-row status, so a 3-row failure inside a 1,000-row batch is identifiable. Send those three through a single-row retry path with exponential backoff. If you do not have idempotency keys, you risk double-billing yourself on retries — a real concern that can quietly erase your cache savings if you ignore it.
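
A sketch of that retry path — note that the per-row status values and the Idempotency-Key header are assumptions about the API's shape, so check your provider's contract before copying:

# retry.py — retry failed rows from a batch individually (sketch)
import asyncio, aiohttp, uuid

async def retry_failures(session, batch, results, max_tries=4):
    out = list(results)
    for i, res in enumerate(results):
        if res.get('status') in ('ok', 'no_match'):     # assumed status values
            continue
        idem_key = str(uuid.uuid4())                    # one key, stable across attempts
        for attempt in range(max_tries):
            async with session.post(
                'https://api.csv2geo.com/v1/geocode', json=batch[i],
                headers={'Idempotency-Key': idem_key},  # assumed header name
                timeout=aiohttp.ClientTimeout(total=30),
            ) as r:
                if r.status != 429 and r.status < 500:  # don't retry client errors
                    out[i] = await r.json()
                    break
            await asyncio.sleep(2 ** attempt)           # exponential backoff
        # retries exhausted: the original error stays in place for triage
    return out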

What did monitoring look like before vs after?

Before: a single counter for "job done." On-call learned about failures from a missing CSV in S3 the next morning. After: per-stage throughput counters (rows in, rows cached, rows geocoded, rows written), p50/p95/p99 latency on the upstream call, cache hit/miss rates, no_match rates, and a heartbeat that fires every 5 seconds while the pipeline is alive. The streaming pipeline made all of this possible — there were no more "frozen for four minutes then dump everything at once" moments to confuse the dashboards.
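
The heartbeat itself is a dozen lines once the pipeline is staged — a sketch, with a shared Counter standing in for whatever metrics client you actually emit to:

# heartbeat.py — live per-stage progress (sketch)
import asyncio, time
from collections import Counter

stats = Counter()  # stages increment: rows_in, cache_hits, geocoded, rows_written

async def heartbeat(interval=5):
    start = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        elapsed = time.monotonic() - start
        rate = stats['rows_written'] / elapsed
        print(f"[{elapsed:5.0f}s] in={stats['rows_in']:,} hits={stats['cache_hits']:,} "
              f"geo={stats['geocoded']:,} out={stats['rows_written']:,} ({rate:.0f} rows/s)")

# start alongside the pipeline, cancel once the writer drains:
#   hb = asyncio.create_task(heartbeat())
#   await asyncio.gather(producer(...), worker(...), writer(...))
#   hb.cancel()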

Closing

Forty times faster, sixty-three percent cheaper, the same upstream API, the same dataset, the same hardware. Five optimizations applied in the order each one earned. No silver bullet. No new vendor. No rewrite. Bounded concurrency, batch endpoint, cache layer, streaming pipeline, normalization pre-pass — that is the entire trick.

The deeper meta-lesson, for anyone running a similar pipeline: a 30-ms-per-row loop is doing one thing the entire time, and that thing is waiting. The eight hours was sleep. Once you stop the program from sleeping, everything else is amortizing fixed costs and avoiding repeat work. Measure your traffic, find the longest sleep, fix that one. Then look again.

I.A. / CSV2GEO Creator


Ready to geocode your addresses?

Use our batch geocoding tool to convert thousands of addresses to coordinates in minutes. Start with 100 free addresses.

Try Batch Geocoding Free →