Benchmarking Geocoding APIs: Methodology, Pitfalls, and Honest Numbers

How to benchmark a geocoder without lying to yourself: cold/warm caches, regional bias, address-difficulty stratification.

April 26, 2026

"I ran 1,000 addresses through 5 geocoders, here's the winner" is, almost without exception, the wrong methodology. The post that follows will show beautiful charts. The numbers will be precise to two decimals. The conclusion will be confident. And it will be wrong, because at least three load-bearing assumptions in the experiment were never examined.

The three assumptions, in order of how much damage they do: warm-cache bias (the test set is small enough that everything is cached after the first run, and "winner" is really "fastest cache lookup"), regional sampling bias (the addresses came from a single country, usually the US, and providers with strong reference data outside that country never had a chance to show it), and address-difficulty bias (the sample was random, which means clean urban addresses dominated, and the hard tail that actually breaks production was statistically invisible).

This post is the methodology that fixes those three things. It is also a working benchmark harness in Node, the cost math that turns latency and match-rate numbers into a unit-economics decision, and the six anti-patterns that show up over and over again in benchmark posts written by people who genuinely meant well. The goal is a benchmark you can defend to an auditor, a CFO, and your own future self when you have to explain why you picked the provider you did.

What you're actually measuring

Before you write a single line of harness code, decide what question you are asking. The four common ones answer different decisions, and a benchmark optimised for one of them is misleading for the others.

| metric | answers | when it matters |
|---|---|---|
| Match rate | "what fraction of my addresses get a usable result?" | data quality, downstream pipeline reliability |
| p99 latency | "how slow is the slowest 1% of requests?" | user-facing realtime, SLOs |
| Rooftop precision | "what fraction land within 10m of the actual building?" | insurance, last-mile, emergency response |
| Cost per 1k | "what does this actually cost at my volume?" | finance, contract negotiation |

Match rate matters most for batch pipelines that load a CRM or feed a BI dashboard — if 8% of addresses come back empty, that 8% is a recurring data-quality bug regardless of how fast the API is. p99 latency matters most for realtime UX where a user is waiting on an autocomplete or a delivery quote — see why your average lies for the full argument. Rooftop precision matters when coordinates feed a downstream geometry calculation — flood-zone lookups, drive-time isochrones, school-district boundaries. Cost per 1k matters when you have an actual volume and an actual budget and you need to justify which provider goes into the contract.

You will rarely measure only one. But you should rank them for your use case, because every benchmark involves tradeoffs and "balanced" is a synonym for "good at nothing."

The cold-vs-warm cache trap

Every geocoder runs an internal cache. Some cache aggressively across all customers; some cache per-key; some cache with regional partitioning. Whatever the scheme, the first time an address is queried, the response is "cold" — the geocoder has to consult its primary index, sometimes a fallback provider, sometimes an ML model. Subsequent queries within the cache TTL are "warm" — they come back in microseconds from a hot in-memory layer.

If you run a benchmark by hitting the same 1,000 addresses three times in a row and averaging the latency, you are measuring cache lookup speed for runs two and three. That is not what production looks like. Production traffic is a mix: maybe 60% cold (new addresses), 30% warm (recently-seen), 10% hot (most-popular). The exact mix depends on your business — a retail address validator sees a long tail of unique addresses; a delivery platform re-queries the same fleet of warehouses every hour and is mostly warm.

The fix is to measure both modes explicitly and report both numbers. Concretely:

  1. Generate a stratified sample (more on that below) of N addresses, where N is large enough to exceed any reasonable cache window — 5,000 to 10,000 is usually safe.
  2. Run the cold pass: query each address exactly once, record latency, match status, and confidence.
  3. Wait 60 seconds (so any short-TTL cache is irrelevant) and run the warm pass: query the same addresses again.
  4. Report p50_cold, p99_cold, p50_warm, p99_warm separately. Compute a blended number for your expected mix (a sketch of the blend follows the results table below).

A typical result looks like this. We ran 5,000 addresses through our own pipeline last month:

| metric | cold | warm |
|---|---|---|
| p50 latency | 84ms | 4ms |
| p99 latency | 380ms | 18ms |
| match rate | 96.4% | 96.4% |

The match rate is the same in both modes — it should be, because cache hits return cached results. The latency numbers differ by an order of magnitude. A benchmark that only reported warm numbers would imply our p99 is 18ms; the honest answer for a customer with mostly-cold traffic is closer to 380ms.
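The blended number from step 4 needs a little care: you cannot average percentiles directly. Instead, pool the raw per-request latencies from the cold and warm passes, weight them by your expected traffic mix, and recompute percentiles over the pooled set. A minimal sketch, assuming a 60/40 cold-to-warm mix with the hot fraction folded into warm:

// blended-percentiles.mjs: illustrative helper, not part of the harness below
function weightedPercentile(samples, p) {
  // samples: [{ value, weight }]; sort by value, walk cumulative weight until it crosses p
  const sorted = [...samples].sort((a, b) => a.value - b.value);
  const total = sorted.reduce((sum, s) => sum + s.weight, 0);
  let cum = 0;
  for (const s of sorted) {
    cum += s.weight;
    if (cum / total >= p) return s.value;
  }
  return sorted[sorted.length - 1].value;
}

function blendedLatency(coldLatencies, warmLatencies, coldShare = 0.6) {
  const samples = [
    ...coldLatencies.map(v => ({ value: v, weight: coldShare / coldLatencies.length })),
    ...warmLatencies.map(v => ({ value: v, weight: (1 - coldShare) / warmLatencies.length })),
  ];
  return { p50: weightedPercentile(samples, 0.5), p99: weightedPercentile(samples, 0.99) };
}

Report the blended numbers alongside the pure cold and warm figures, not instead of them.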

Stratified sampling by difficulty

Random samples are biased toward whatever dominates the input list. For most North American CRM exports, that is clean urban addresses with a five-digit ZIP. Those addresses achieve match rates above 99% with almost any provider. They are not the addresses that break production.

The addresses that break production are the long tail: rural routes with no street name, apartment buildings with 200 units, addresses in countries with non-Latin scripts, recently-built developments, malformed inputs from CSV exports gone wrong. Those represent maybe 5-10% of a typical input list — but they are the difference between a 99% benchmark match rate and a 91% production match rate.

A stratified sample fixes this by deliberately oversampling the hard buckets. Here is the partition we use, with typical per-bucket match rates measured on our own production traffic.

| bucket | fraction in random sample | fraction in stratified sample | typical match rate | reasoning |
|---|---|---|---|---|
| Clean US urban | 60% | 20% | 99.4% | overrepresented in random; easy case |
| US rural (RR/PO Box) | 5% | 15% | 78.2% | sparse reference data, missing house numbers |
| US apartment (multi-unit) | 8% | 15% | 91.6% | unit numbers confuse parsers |
| EU (mixed countries) | 12% | 20% | 94.8% | umlauts, voie types, postcode variants |
| LATAM/APAC | 5% | 15% | 84.1% | non-Latin scripts, sparser data |
| Malformed (typos, missing fields) | 10% | 15% | 62.3% | tests the parser/validator path |

The stratified sample over-represents the hard buckets so they have statistical weight. When you compute the aggregate match rate, you weight the buckets by their actual fraction in your real data — not in the sample. So the stratified sample lets you measure each bucket precisely, and the weighted average reflects your real workload.
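In code, the re-weighting is a one-line weighted average. A minimal sketch, reusing the per-bucket match rates from the table above and, as a stand-in for a real audit, the random-sample fractions as weights (the bucket keys are shorthand for the table's rows):

// weighted-match-rate.mjs: per-bucket rates from the stratified benchmark,
// weights from your own data audit (they must sum to 1)
const perBucketMatchRate = {
  clean_us_urban: 0.994, us_rural: 0.782, us_apartment: 0.916,
  eu_mixed: 0.948, latam_apac: 0.841, malformed: 0.623,
};
const realWeights = {
  clean_us_urban: 0.60, us_rural: 0.05, us_apartment: 0.08,
  eu_mixed: 0.12, latam_apac: 0.05, malformed: 0.10,
};
const aggregate = Object.keys(realWeights)
  .reduce((sum, bucket) => sum + realWeights[bucket] * perBucketMatchRate[bucket], 0);
console.log(`weighted match rate: ${(aggregate * 100).toFixed(1)}%`); // ≈ 92.7%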

If you do not know your actual bucket distribution, audit a sample. Take 10,000 rows from your real input, classify them by bucket using simple regex rules (postcode pattern, country code, presence of unit number, missing fields), and you have your weights. Without this step you are guessing.
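Those rules do not need to be clever to be useful. A rough first-pass classifier might look like the sketch below; the patterns are illustrative and the column names (address, country) are assumptions about your export format.

// classify-bucket.mjs: crude first-pass rules for the audit step; tune them to your own data
function classifyBucket(row) {
  const addr = (row.address || '').trim();
  const country = (row.country || 'US').toUpperCase();
  if (!addr || addr.length < 8) return 'malformed'; // missing or obviously truncated input
  if (country !== 'US' && country !== 'USA') {
    const eu = ['AT', 'BE', 'CH', 'DE', 'DK', 'ES', 'FR', 'GB', 'IE', 'IT', 'NL', 'PL', 'PT', 'SE', 'UK'];
    return eu.includes(country) ? 'eu_mixed' : 'latam_apac';
  }
  if (/\b(?:RR|HC)\s*\d|\bP\.?\s?O\.?\s*BOX\b/i.test(addr)) return 'us_rural';
  if (/\b(?:APT|UNIT|STE|SUITE)\b|#\s*\w+/i.test(addr)) return 'us_apartment';
  return 'clean_us_urban';
}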

Regional bias

Every geocoder has regional strengths. One provider has excellent UK coverage from a Royal Mail PAF derivative; another has rooftop precision in Germany via a Deutsche Post feed; a third dominates in Brazil because they invested in a CEP licence years ago. Run a US-only benchmark and you will never see any of this.

The fix mirrors the difficulty stratification: build a region-stratified test set that proportionally reflects the regions you actually serve. If 70% of your traffic is US, 20% EU, 10% APAC, your benchmark should sample at those rates — but again, with enough volume per region (200+ addresses) that your per-region match rate is statistically meaningful. A 20-address EU sample tells you nothing.

For each region, source the addresses from a known-good list rather than synthesising them. Open-data sources include OpenAddresses for many countries, Geonames for cities, and national open-data portals (data.gov, data.gov.uk, data.gouv.fr, dados.gov.br) for verified addresses. Avoid Wikipedia, avoid social media, avoid anything where the address might be wrong — your benchmark needs ground truth.

For the multi-country case specifically, see 200 countries' address formats for the format conventions. A benchmark that sends a US-formatted address to a Japanese geocoder is testing the wrong thing.

p99 over p50

Mean latency lies. p50 lies a little less. p99 is the only honest summary statistic for a latency distribution that has a tail — and geocoder latency distributions always have tails, because the cold path is slow, the fallback path is slower, and the network is variable.

A geocoder with 50ms mean and 80ms p99 is a steady, predictable system. A geocoder with 50ms mean and 1,200ms p99 has the same average but will look terrible in production — every hundredth user waits over a second. Both will report the same "average latency" in a sloppy benchmark.

Always report p50, p95, and p99. Always compute them per bucket (cold/warm × difficulty × region) — aggregating percentiles across heterogeneous workloads is mathematically suspect. And always show the histogram, not just the summary numbers, because shape matters: bimodal distributions often hide a fallback-provider path that is much slower than the primary, and you want to see that.
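You do not need a plotting library to see the shape. A few lines of bucketing make a bimodal distribution visible straight from the terminal; the bucket edges in this sketch are arbitrary.

// latency-histogram.mjs: coarse text histogram for spotting bimodal latency shapes
function printHistogram(latencies, edges = [10, 25, 50, 100, 250, 500, 1000, 2500]) {
  const counts = new Array(edges.length + 1).fill(0);
  for (const ms of latencies) {
    const i = edges.findIndex(edge => ms < edge);
    counts[i === -1 ? edges.length : i]++;
  }
  const scale = Math.max(1, Math.ceil(Math.max(...counts) / 60)); // keep bars under ~60 chars
  edges.forEach((edge, i) =>
    console.log(`${('< ' + edge + 'ms').padStart(9)} | ${'#'.repeat(Math.round(counts[i] / scale))}`));
  console.log(`${('>= ' + edges.at(-1) + 'ms').padStart(9)} | ${'#'.repeat(Math.round(counts.at(-1) / scale))}`);
}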

For the long argument with the math, see p99 latency in geocoding. For the operational implications, see observability for geocoding pipelines.

Cost per 1k as the bottom line

The metric that matters at contract time is cost per 1,000 successfully geocoded addresses, accounting for both the gross API cost and the cost of failed matches that have to be retried, manually fixed, or absorbed as data-quality debt.

The formula is straightforward:

effective_cost_per_1k = (gross_cost_per_1k) / (match_rate × confidence_pass_rate)

Where match_rate is the fraction of addresses that come back with any result, and confidence_pass_rate is the fraction of those results above your acceptance threshold (see confidence scores explained for how to set the threshold).

Worked example. Provider A charges $0.50 per 1,000 calls, returns 96% match rate, 92% of those above your 0.85 confidence threshold. Provider B charges $0.30 per 1,000 calls, returns 91% match rate, 78% above 0.85.

| provider | gross $/1k | match × confidence | effective $/1k |
|---|---|---|---|
| A | $0.50 | 0.96 × 0.92 = 0.883 | $0.566 |
| B | $0.30 | 0.91 × 0.78 = 0.710 | $0.423 |

Provider B looks 40% cheaper at the gross level (30¢ vs 50¢). At the effective level, it is only 25% cheaper ($0.423 vs $0.566) — and only if you are willing to absorb the lower confidence-pass rate as a data-quality cost. If your downstream cost of a wrong match is $5 per row, the math reverses: Provider B's extra 14 percentage points of low-confidence results cost you more than the API savings.

Always do the multiplication. Always factor in the downstream cost of failures. Always check the math at multiple volume tiers — most providers have step pricing, and the cheapest-per-1k at 100K/month is rarely the cheapest at 10M/month.
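The same arithmetic in code, so it can sit next to the benchmark results instead of in a spreadsheet. A minimal sketch: the $5-per-row downstream figure is illustrative, and the model pessimistically charges that cost for every unmatched or below-threshold row.

// effective-cost.mjs: effective $/1k plus downstream handling cost for unusable rows
function costPer1k({ grossPer1k, matchRate, confidencePassRate }, downstreamCostPerBadRow = 0) {
  const usable = matchRate * confidencePassRate;           // fraction of rows you can actually use
  const effectivePer1k = grossPer1k / usable;              // $ per 1k usable rows
  const downstreamPer1k = (1 - usable) * 1000 * downstreamCostPerBadRow;
  return { effectivePer1k, totalPer1kInput: grossPer1k + downstreamPer1k };
}

console.log(costPer1k({ grossPer1k: 0.50, matchRate: 0.96, confidencePassRate: 0.92 }, 5));
// → { effectivePer1k: ~0.57, totalPer1kInput: ~584.5 }
console.log(costPer1k({ grossPer1k: 0.30, matchRate: 0.91, confidencePassRate: 0.78 }, 5));
// → { effectivePer1k: ~0.42, totalPer1kInput: ~1451.3 }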

A working benchmark harness in Node

Here is a complete harness that implements everything above: stratified loading, cold pass, 60-second wait, warm pass, per-bucket metrics, and a results CSV. Just under 100 lines.

// benchmark.mjs — node benchmark.mjs sample.csv results.csv
import { createReadStream, createWriteStream } from 'node:fs';
import { parse } from 'csv-parse';
import { stringify } from 'csv-stringify/sync';
import { setTimeout as sleep } from 'node:timers/promises';

const ENDPOINT = 'https://csv2geo.com/api/v1/geocode';
const KEY = process.env.CSV2GEO_KEY;
const CONCURRENCY = 16;

async function loadSample(path) {
  const rows = [];
  for await (const r of createReadStream(path).pipe(parse({ columns: true }))) {
    rows.push(r); // expects columns: address, bucket, region
  }
  return rows;
}

async function geocodeOne(address) {
  const t0 = performance.now();
  try {
    const res = await fetch(`${ENDPOINT}?q=${encodeURIComponent(address)}`, {
      headers: { Authorization: `Bearer ${KEY}` },
    });
    const data = await res.json();
    const top = (data.results || [])[0];
    return {
      latency_ms: performance.now() - t0,
      matched: !!top,
      score: top?.accuracy_score ?? null,
      accuracy: top?.accuracy ?? null,
      ok: res.ok,
      error: null,
    };
  } catch (e) {
    // return the same columns on failure so the results CSV stays rectangular
    return { latency_ms: performance.now() - t0, matched: false, score: null, accuracy: null, ok: false, error: e.message };
  }
}

async function runPass(rows, label) {
  const out = [];
  let i = 0;
  async function worker() {
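    // each worker repeatedly claims the next index; JS is single-threaded, so i++ cannot race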
    while (i < rows.length) {
      const idx = i++;
      const row = rows[idx];
      const r = await geocodeOne(row.address);
      out[idx] = { ...row, pass: label, ...r };
    }
  }
  await Promise.all(Array.from({ length: CONCURRENCY }, worker));
  return out;
}

function percentile(arr, p) {
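  // nearest-rank percentile on a sorted copy; adequate at benchmark sample sizes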
  const s = [...arr].sort((a, b) => a - b);
  return s[Math.floor((s.length - 1) * p)];
}

function summarise(results, pass) {
  const byBucket = {};
  for (const r of results.filter(r => r.pass === pass)) {
    (byBucket[r.bucket] ||= []).push(r);
  }
  for (const [bucket, rs] of Object.entries(byBucket)) {
    const lat = rs.map(r => r.latency_ms);
    const matchRate = rs.filter(r => r.matched).length / rs.length;
    console.log(
      `${pass.padEnd(5)} | ${bucket.padEnd(20)} | n=${String(rs.length).padStart(4)} | ` +
      `match=${(matchRate * 100).toFixed(1)}% | ` +
      `p50=${percentile(lat, 0.5).toFixed(0)}ms | ` +
      `p99=${percentile(lat, 0.99).toFixed(0)}ms`
    );
  }
}

const rows = await loadSample(process.argv[2]);
console.log(`Loaded ${rows.length} addresses. Running cold pass...`);
const cold = await runPass(rows, 'cold');
console.log('Sleeping 60s for cache settling...');
await sleep(60_000);
console.log('Running warm pass...');
const warm = await runPass(rows, 'warm');

const all = [...cold, ...warm];
console.log('\n--- Results ---');
console.log('pass  | bucket               |    n  | match  | p50     | p99');
summarise(all, 'cold');
summarise(all, 'warm');

createWriteStream(process.argv[3]).write(
  stringify(all, { header: true })
);
console.log(`\nWrote ${all.length} rows to ${process.argv[3]}`);

Run it as node benchmark.mjs stratified-sample.csv results.csv. The output gives you per-bucket cold and warm numbers; the CSV lets you re-aggregate however you want — by region, by accuracy category, by score range. For concurrency tuning of the harness itself (the CONCURRENCY constant matters), see concurrency tuning for geocoding.

Anti-patterns

Six common mistakes, in roughly the order of how much damage they do.

1. The 100-address benchmark. Statistical noise dominates below a few hundred addresses per bucket. A 100-address sample with a 92% match rate has a 95% confidence interval of roughly ±5 points; you cannot distinguish a 92% provider from an 87% one. Run at least 1,000 per bucket if you want to claim a difference is real (a quick way to check the interval is sketched after this list).

2. Single-region sampling. Already covered above: a US-only benchmark tells you nothing about how a provider performs in the EU, LATAM, or APAC. If your traffic is multi-region, your benchmark must be too.

3. Reporting only the average latency. Means lie when distributions have tails. Always report p50/p95/p99 and ideally show the histogram.

4. Conflating cold and warm runs. Re-running the same sample without a wait period measures cache speed, not geocoder speed. Always do an explicit cold pass and report it separately from warm.

5. Ignoring confidence scores. A "matched" result with accuracy_score: 0.42 is not a useful match — it is a postcode centroid the geocoder is not confident about. If your benchmark counts those as matches, you are inflating match rates by 5-15 percentage points. Apply your confidence threshold inside the benchmark, the same way you would in production.

6. Comparing list prices. Provider A at $0.50/1k gross might be cheaper than Provider B at $0.30/1k gross once match rate, confidence-pass rate, and the downstream cost of failures are factored in. Always compute effective cost per 1k. Always.
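On anti-pattern 1: the ±5-point figure is just the normal approximation to a binomial confidence interval, worth having on hand whenever someone shows you a small-sample comparison. A minimal sketch:

// match-rate-ci.mjs: 95% confidence interval (normal approximation) for a measured match rate
function matchRateCI(matched, total) {
  const p = matched / total;
  const halfWidth = 1.96 * Math.sqrt((p * (1 - p)) / total);
  return [p - halfWidth, p + halfWidth];
}
console.log(matchRateCI(92, 100));    // ≈ [0.867, 0.973], i.e. roughly ±5.3 points
console.log(matchRateCI(920, 1000));  // ≈ [0.903, 0.937], i.e. roughly ±1.7 points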

Frequently Asked Questions

How many addresses do I need to benchmark?

At least 1,000 per bucket per pass, so for a six-bucket stratified sample with cold and warm passes, that is ~12,000 calls. Smaller samples have confidence intervals wide enough to make any conclusion meaningless. If budget is tight, reduce the number of buckets before reducing the per-bucket sample size.

How long should I wait between cold and warm passes?

Sixty seconds is enough to bypass most short-TTL caches without making the test impractically long. If the provider documents a cache TTL longer than that, wait longer; if they document anything shorter, you can shorten the wait. The point is to measure cold-versus-warm separately, not to trick the cache.

Where do I get a stratified sample?

Source from open data: OpenAddresses for clean addresses across many countries, Geonames for city-level entries, national open-data portals (data.gov, data.gov.uk, data.gouv.fr, etc.) for ground-truthed national lists. For the malformed-input bucket, take a sample of real production failures with PII redacted. Avoid synthetic data — geocoders are trained on real distributions and synthetic addresses test the wrong thing.

Can I just trust the provider's published benchmarks?

No. Every provider's benchmark is run on data they curated, with parameters they chose, in conditions favourable to themselves. That is not malice; it is what marketing benchmarks always look like. Run your own. The whole point of this article is that the methodology matters more than any number, and that is true for vendor benchmarks too.

Should I include parsing time in the benchmark?

If you parse before sending (recommended — see address parsing), include it as a separate metric, not bundled into the geocoder latency. Parsing is something you control; geocoding is something the provider controls. Mixing them makes it impossible to attribute slowness correctly when something regresses.

How often should I re-run the benchmark?

Quarterly if you have one provider locked in, monthly if you are evaluating alternatives, weekly during a contract negotiation. Match rates and latency drift over time as providers update reference data and infrastructure; a benchmark from twelve months ago is a historical document, not a current decision-input.

What about reverse geocoding?

The same methodology applies, but the buckets are different (urban/suburban/rural/water for points, by region as before) and the metric of interest is distance_meters rather than match rate — see reverse geocoding accuracy. The harness above is forward-geocoding-shaped; for reverse you need to swap the endpoint and the response parsing, but the cold/warm/stratified structure stays.

I.A. / CSV2GEO Creator


Ready to geocode your addresses?

Use our batch geocoding tool to convert thousands of addresses to coordinates in minutes. Start with 100 free addresses.

Try Batch Geocoding Free →