Concurrency Tuning for Geocoding: Finding Your Sweet Spot

Find your geocoder's optimal concurrency: Little's Law, queue depth, and the curve where p99 starts diverging.

May 08, 2026

A geocoding pipeline running at concurrency 1 wastes throughput. The same pipeline running at concurrency 512 wastes money: it eats 429s, amplifies tail latency, and starves its own connection pool. The sweet spot sits between those two failure modes, and most teams either guess at it (usually too low) or ramp it until something breaks (usually too high). Neither approach is necessary. Little's Law gives you the starting point, the latency curve tells you when to stop, and a short tuner script finds the exact elbow on your traffic.

This post walks the math, the curve, the per-plan starting numbers, the connection-pool gotcha that silently serializes your "concurrent" workers, and a working Node tuner that ramps from 1 to 256 and prints the elbow. Every code sample compiles.

Little's Law for geocoding

Little's Law is one of the few results in queueing theory simple enough to fit on an index card: the average number of items in a stable system equals the average arrival rate times the average time each item spends in the system.

L = λ × W

For a geocoding client, the variables map cleanly:

L — concurrency (number of in-flight requests)
λ — throughput (requests per second leaving the client)
W — average latency per request (seconds)

That gives you the formula every capacity plan starts from:

concurrency = throughput_target × avg_latency

A worked example. You want 5 requests per second of sustained throughput against a geocoder that returns in 200 ms on average. Plug in:

concurrency = 5 rps × 0.2 s = 1.0

You need exactly one in-flight worker to hit 5 rps. That sounds wrong until you remember the math: each worker completes 5 requests per second on its own, because each request takes 200 ms.

Now scale it. You want 500 rps and the geocoder still answers in 200 ms:

concurrency = 500 rps × 0.2 s = 100

You need 100 in-flight workers to sustain 500 rps. Drop the latency to 100 ms and you only need 50 workers for the same throughput. Raise the latency to 400 ms (because you started using a slower fallback) and you need 200. The relationship is linear, not magical.

Two warnings about applying Little's Law in practice. First, W must be the latency you actually observe at the concurrency you actually run, not the latency from a single-request test. Latency rises with concurrency, so the equation is implicitly L = λ × W(L) — and W(L) is the curve described in the next section. Second, the law assumes a stable system. If your arrival rate exceeds what the upstream can serve, the queue grows without bound and the formula stops describing reality.
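The arithmetic above fits in a few lines. This is a sketch for illustration, not part of any client library: the name `requiredConcurrency` and the choice to take latency in milliseconds are assumptions of this example. Feed it the latency you observe under load, not the single-request figure.

```javascript
// Little's Law: L = λ × W.
// Latency is taken in milliseconds so the worked examples stay in integer arithmetic.
function requiredConcurrency(targetRps, avgLatencyMs) {
  if (targetRps <= 0 || avgLatencyMs <= 0) {
    throw new RangeError('targetRps and avgLatencyMs must be positive');
  }
  // Round up: a fractional worker still occupies a whole slot.
  return Math.ceil((targetRps * avgLatencyMs) / 1000);
}

console.log(requiredConcurrency(5, 200));   // 1
console.log(requiredConcurrency(500, 200)); // 100
console.log(requiredConcurrency(500, 100)); // 50
console.log(requiredConcurrency(500, 400)); // 200
```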

The latency curve

Plot p50, p95, and p99 on the y-axis against concurrency on the x-axis. Call N the concurrency at which you hit the upstream's limit. You will see three regimes.

Underutilized (concurrency 1 to ~N/4). All three curves are flat and tightly bunched. p50 and p99 are within 2x of each other. The geocoder has spare capacity, queues are empty, every request gets fresh server resources. You are leaving throughput on the table here, but your tail latency is excellent.

Sweet spot (concurrency ~N/4 to ~3N/4). p50 starts to creep up, but barely. p95 follows. p99 stays flat or rises slightly. Throughput climbs in near-linear proportion to concurrency. This is where you want to live. The system is busy but not saturated; queues are short and predictable; the upstream is using its capacity efficiently.

Past the elbow (concurrency > 3N/4). p50 keeps climbing slowly. p95 climbs faster. p99 detaches from p95 and rockets: what was a 50 ms gap becomes a 5,000 ms gap as soon as queueing kicks in. Throughput plateaus or even drops as 429s and timeouts force retries. This is the failure mode that looks like success on a dashboard that only watches average latency.

The "elbow" is the concurrency level where p99 starts diverging from p50 by more than 5x. Below the elbow you are getting more throughput for free. Above it, every additional worker costs you tail latency without buying you throughput. The elbow is the sweet spot's right boundary, and it is the number you actually want to find. For the deeper math on why p99 explodes the way it does, see why your average latency lies.

A useful mental model: the elbow is roughly where the upstream's worker pool starts queueing your requests. Below it, every request goes straight to a server worker. At the elbow, some requests start waiting in the upstream's queue. Past it, the queue grows faster than it drains and tail latency goes parabolic.
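The elbow rule can be written down directly. The sketch below assumes you already have one row of measurements per concurrency level, sorted by ascending concurrency; the function name `findElbow` and the sample numbers are made up for illustration and are not measurements from any real geocoder.

```javascript
// Each row is one concurrency level from a ramp: { concurrency, p50, p99 } in ms.
// Returns the last level before p99/p50 first exceeds the threshold,
// or null if even the first level is already past it.
function findElbow(rows, threshold = 5) {
  let elbow = null;
  for (const row of rows) {
    const ratio = row.p99 / Math.max(row.p50, 1);
    if (!(ratio <= threshold)) break; // also stops on NaN (no samples)
    elbow = row.concurrency;
  }
  return elbow;
}

// Made-up numbers shaped like the three regimes described above.
const ramp = [
  { concurrency: 8,   p50: 180, p99: 300 },
  { concurrency: 16,  p50: 185, p99: 340 },
  { concurrency: 32,  p50: 200, p99: 520 },
  { concurrency: 64,  p50: 240, p99: 1900 },
  { concurrency: 128, p50: 310, p99: 5400 },
];
console.log(findElbow(ramp)); // 32
```

Stopping at the first breach, rather than taking the highest healthy level anywhere in the ramp, keeps one lucky run at a high level from overriding an earlier warning sign.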

Per-plan starting points

Little's Law plus a 5x safety factor gives you a starting concurrency for any rate-limited geocoder. Note which way the factor points: it multiplies the bare Little's Law figure, so the worker pool has room to absorb slow requests without throughput sagging. The rate limit itself is still a hard ceiling, and at these settings it is backoff on 429s, not the worker count, that keeps you under it.

The formula:

starting_concurrency = (rate_limit_per_minute / 60) × avg_latency_seconds × safety_factor

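With a 200 ms average latency and a 5x safety factor, the formula gives roughly 1.7, 16.7, 83, and 167 for the four plans. The table snaps those to nearby powers of two, which also line up with the tuner's ramp levels: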

| Plan | Rate limit | Steady rps | Suggested starting concurrency |
|---|---:|---:|---:|
| Free | 100 / min | 1.67 | 4 |
| Starter | 1,000 / min | 16.7 | 16 |
| Growth | 5,000 / min | 83.3 | 64 |
| Pro | 10,000 / min | 167 | 128 |

Two caveats. These numbers assume you have already implemented exponential backoff on 429s — if you have not, see exponential backoff for geocoding APIs before you turn the dial up. They also assume you are using the single-shot endpoint; the batch endpoint changes the math entirely (see the dedicated section below).

Treat the table as a *starting* point, not a destination. The elbow on your specific traffic — your address mix, your network path, your time of day — will be somewhere within ±50% of these numbers. Run the tuner from the script section to pin the actual value.
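If your latency or plan differs from the table's assumptions, it is easy to evaluate the formula yourself. The helper below is a sketch; the name `startingConcurrency` is an assumption of this example. It prints the unrounded formula output, so expect values near the table's suggestions rather than identical to them.

```javascript
// starting_concurrency = (rate_limit_per_minute / 60) × avg_latency_seconds × safety_factor
function startingConcurrency(rateLimitPerMinute, avgLatencySeconds, safetyFactor = 5) {
  return (rateLimitPerMinute / 60) * avgLatencySeconds * safetyFactor;
}

const plans = [['Free', 100], ['Starter', 1000], ['Growth', 5000], ['Pro', 10000]];
for (const [plan, limitPerMinute] of plans) {
  console.log(plan, startingConcurrency(limitPerMinute, 0.2).toFixed(1));
}
// Free 1.7
// Starter 16.7
// Growth 83.3
// Pro 166.7
```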

How to find your elbow

The procedure has four steps and takes about ten minutes per environment.

Step 1: instrument p50 and p99. You cannot find the elbow without seeing the curves. Bucket request durations into a histogram, emit the percentiles every 30 seconds, and plot them. Prometheus's histogram_quantile over geocode_request_duration_seconds_bucket works. Anything that gives you both percentiles on the same chart works.

Step 2: ramp concurrency in geometric steps. Start at 1, double until you hit the rate-limit ceiling. So 1, 2, 4, 8, 16, 32, 64, 128, 256. Geometric ramping is faster than linear and finds the elbow with fewer datapoints.

Step 3: hold each level for at least 60 seconds. The system needs to reach steady state — Little's Law is about the average, not the first 10 requests. Sixty seconds is the floor; 120 is better. If your geocoder caches under load, you may need 180 to let the cache fill.

Step 4: walk the chart and find where p99/p50 > 5. The elbow is the last concurrency level where the ratio is still healthy. Drop your production concurrency to that level, leave 20% headroom, and you are done.

The honest version of this procedure: do it once during onboarding, do it again whenever traffic patterns change materially (new address regions, new use case, new fallback provider). It is not weekly maintenance.
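If you do not have Prometheus wired up yet, Step 1 needs nothing heavier than an in-process recorder. The class below is a minimal sketch (the name `LatencyWindow` is an assumption of this example): it keeps raw samples in memory and uses the same floor-index percentile as the tuner script, which is fine for a ten-minute tuning pass but not a substitute for real histograms in production.

```javascript
// Collects request durations and reports percentiles for the current window.
class LatencyWindow {
  constructor() {
    this.samples = [];
  }

  record(durationMs) {
    this.samples.push(durationMs);
  }

  // Returns n, p50 and p99 for everything recorded since the last flush, then resets.
  flush() {
    const sorted = this.samples.sort((a, b) => a - b);
    this.samples = [];
    const pick = (p) =>
      sorted.length ? sorted[Math.floor((sorted.length - 1) * p)] : NaN;
    return { n: sorted.length, p50: pick(0.5), p99: pick(0.99) };
  }
}

// Demo with synthetic durations 1..100 ms.
const latencies = new LatencyWindow();
for (let ms = 1; ms <= 100; ms++) latencies.record(ms);
console.log(latencies.flush()); // { n: 100, p50: 50, p99: 99 }
```

In a real client you would call `record()` after every request and call `flush()` from a 30-second timer, logging or plotting each result.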

Connection pool sizing

The bug that wastes the most engineering time on this topic: setting concurrency to 64, watching throughput plateau at the equivalent of concurrency 6, and not understanding why. The answer is almost always that the HTTP client's connection pool is smaller than the requested concurrency.

Since Node 0.12, http.Agent defaults to maxSockets: Infinity, on the global agent and on custom agents alike; only very old versions defaulted to 5. The trap in modern code is an agent whose maxSockets was set low somewhere else: a shared agent sized for a different workload, a wrapper library's default, or a config value nobody revisited. If that cap is 5 and your concurrency is 64, you have just serialized your "concurrent" workers across 5 connections. The other 59 requests wait in the agent's internal queue.

The rule is simple: agent.maxSockets >= concurrency. Always. Set it explicitly and document it next to the concurrency knob.

import { Agent } from 'node:https';
import fetch from 'node-fetch';

const CONCURRENCY = 64;

const agent = new Agent({
  keepAlive: true,
  maxSockets: CONCURRENCY,        // must be >= concurrency
  maxFreeSockets: CONCURRENCY,    // keep them warm
  timeout: 10_000,
});

async function geocode(addr) {
  return fetch('https://api.csv2geo.com/v1/geocode', {
    method: 'POST',
    agent,
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(addr),
  });
}

The Python equivalent is the connector pool on aiohttp.TCPConnector(limit=CONCURRENCY), or the pool_connections / pool_maxsize arguments on requests.adapters.HTTPAdapter. In Go the setting to watch is http.Transport.MaxIdleConnsPerHost, which defaults to a comically low 2: Go does not cap in-flight connections by default, but it keeps only 2 idle connections per host, so a larger worker pool keeps closing and re-dialing connections instead of reusing them. Every language has a version of this trap. Set the pool size to your concurrency, then re-run the tuner and you will see the throughput you expected.

For a deeper treatment of HTTP client tuning per language, the language-specific tutorials cover the right pool settings — Node, Python, Go, PHP, Java, Ruby — under their respective concurrency sections.

Concurrency vs batch endpoint

Concurrent single-shot requests scale up to a point. Past that point, the batch endpoint wins on every axis: lower latency per address, fewer 429s, smaller bill (typically by 30-60% on per-call pricing).

The crossover heuristic: switch to batch when you are sending more than ~50 requests per second of similar work to the same endpoint. Below that, the per-call overhead of batching (request building, JSON encoding, response parsing) is not worth it. Above it, you are paying network round-trip cost on every address when one round trip could carry 1,000.

| Volume | Recommended pattern |
|---|---|
| <1 rps | Single-shot, concurrency 1-2 |
| 1-50 rps | Concurrent single-shot, concurrency from Little's Law |
| 50-500 rps | Batch endpoint, batch size 100-1,000 |
| >500 rps | Batch endpoint + concurrent batches, 4-8 batches in flight |

The mistake to avoid: using batch when your traffic is bursty and latency-sensitive. A batch that needs to wait 200 ms for 100 addresses to accumulate before sending will hurt p99 for the first address in the batch. For real-time UX (autocomplete, form validation), stay on single-shot until you literally cannot afford the rps.

For end-to-end batch design, including queue patterns and partial-failure handling, see streaming geocoding at scale and the case study at 1M addresses in 12 minutes.
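The last row of the table, several batches in flight at once, can be sketched as follows. The batch endpoint's request and response shape is not specified in this post, so the network call is left to a caller-supplied `sendBatch` function, which is hypothetical here: it is assumed to take an array of addresses and resolve to an array of results in the same order. `chunk` and `runBatches` are likewise names invented for this example.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Send batches with at most `inFlight` outstanding at once.
// `sendBatch` is caller-supplied (hypothetical): array of addresses in,
// promise of an array of results out, same order.
async function runBatches(addresses, batchSize, inFlight, sendBatch) {
  const batches = chunk(addresses, batchSize);
  const results = new Array(batches.length);
  let next = 0;

  async function worker() {
    while (next < batches.length) {
      const i = next++; // safe: no await between the check and the increment
      results[i] = await sendBatch(batches[i]);
    }
  }

  const workers = Math.min(inFlight, batches.length);
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results.flat(); // results come back in input order
}

// Usage, with a sender of your own:
//   const results = await runBatches(addresses, 1000, 4, sendBatch);
```

This sketch rejects as soon as any batch rejects. Partial-failure handling, which a production pipeline needs, is deliberately left out.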

A working tuner script

Drop-in Node script that ramps concurrency from 1 to 256 in geometric steps, measures p50 and p99 at each level, and prints the elbow. Under 100 lines, no dependencies beyond node-fetch.

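// tuner.mjs
import fetch from 'node-fetch';
import { Agent } from 'node:https';
import { performance } from 'node:perf_hooks';

const ENDPOINT = process.env.GEOCODE_URL;
const API_KEY = process.env.GEOCODE_API_KEY;
if (!ENDPOINT || !API_KEY) {
  console.error('Set GEOCODE_URL and GEOCODE_API_KEY before running.');
  process.exit(1);
}

// One fixed address keeps the demo short. A geocoder that caches may answer it
// unrealistically fast, so swap in a sample of real addresses for real tuning.
const SAMPLE_ADDRESS = {
  country: 'US', postcode: '20500', city: 'Washington',
  street: 'Pennsylvania Ave NW', house_number: '1600',
};

const LEVELS = [1, 2, 4, 8, 16, 32, 64, 128, 256];
const HOLD_SECONDS = 60;

// Floor-index percentile over an ascending array; NaN when there are no samples.
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const idx = Math.floor((sorted.length - 1) * p);
  return sorted[idx];
}

async function runAtLevel(concurrency) {
  // Pool size matches the level under test, so the agent never serializes workers.
  const agent = new Agent({ keepAlive: true, maxSockets: concurrency });
  const durations = []; // successful (2xx) requests only, in ms
  let failures = 0;     // network errors plus non-2xx responses such as 429
  let stop = false;
  const started = performance.now();
  setTimeout(() => { stop = true; }, HOLD_SECONDS * 1000);

  async function worker() {
    while (!stop) {
      const t0 = performance.now();
      try {
        const res = await fetch(ENDPOINT, {
          method: 'POST',
          agent,
          headers: {
            'content-type': 'application/json',
            'authorization': `Bearer ${API_KEY}`,
          },
          body: JSON.stringify(SAMPLE_ADDRESS),
        });
        // Drain the body so the timing covers the whole response and the
        // keep-alive socket is released back to the pool.
        await res.arrayBuffer();
        if (res.ok) durations.push(performance.now() - t0);
        else failures++;
      } catch (_) {
        failures++; // counted, but kept out of the percentiles
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  // Workers finish their last request after the timer fires, so use measured time.
  const elapsedSeconds = (performance.now() - started) / 1000;
  agent.destroy();

  durations.sort((a, b) => a - b);
  return {
    concurrency,
    n: durations.length,
    failures,
    rps: durations.length / elapsedSeconds,
    p50: Math.round(percentile(durations, 0.50)),
    p95: Math.round(percentile(durations, 0.95)),
    p99: Math.round(percentile(durations, 0.99)),
  };
}

(async () => {
  console.log('concurrency\tn\tfail\trps\tp50\tp95\tp99\tp99/p50');
  let elbow = LEVELS[0];
  let pastElbow = false;
  for (const c of LEVELS) {
    const r = await runAtLevel(c);
    const ratio = r.p99 / Math.max(r.p50, 1);
    console.log(
      `${r.concurrency}\t${r.n}\t${r.failures}\t${r.rps.toFixed(1)}\t${r.p50}\t${r.p95}\t${r.p99}\t${ratio.toFixed(2)}`
    );
    // The elbow is the last level before the ratio first reaches 5.
    // A NaN ratio (no successful samples) also counts as past the elbow.
    if (!(ratio < 5)) pastElbow = true;
    if (!pastElbow) elbow = c;
    if (!(ratio <= 10)) break; // far past the elbow, no point ramping further
  }
  const production = Math.max(1, Math.floor(elbow * 0.8));
  console.log(`\nElbow at concurrency = ${elbow}. Run production at ~${production} for 20% headroom.`);
})();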

Run it once per environment. Save the output. The elbow it prints is the number you should plug into your worker pool. Running at 80% of the elbow (20% headroom) is a reasonable default; if your traffic is bursty, drop to 70%. If it is steady and well-behaved, 90% is fine.

For honest benchmarking methodology — including how to make sure you are measuring the geocoder and not your own client — see benchmarking geocoding APIs.

Anti-patterns

Four ways to get this wrong, in roughly the order I see them in production code reviews.

Unbounded `Promise.all`. Reading a CSV with Promise.all(rows.map(geocode)) creates as many in-flight requests as the file has rows. A 100K-row file yields 100K concurrent promises, instantly saturating the connection pool, the rate limiter, and usually the upstream. Use a bounded concurrency primitive — p-limit, a semaphore, or a worker pool. Concurrency is a knob; do not break it off.

Fixed concurrency that ignores rate-limit headers. Your tuner found 64 last quarter. Today the rate limit got cut to 1,000/min for everyone in your tier. Concurrency 64 now produces 429s on every other request. The X-RateLimit-Remaining header is the upstream telling you exactly how much headroom is left — read it, and back off when it gets thin. The pattern that combines this with bucketed backoff is in token bucket vs leaky bucket.

Ignoring connection-pool size. Covered above; flagged again because it is the single most common cause of "I set concurrency to 64 but throughput looks like 6." The pool size is not optional — set it equal to or larger than your concurrency, every time, in every language.

Batch when the right answer was concurrent, or vice versa. Batching a single autocomplete keystroke wastes the latency budget. Sending 1M individual requests when one batch call would have served them wastes the bill. The crossover heuristic in the table above is conservative; if you are within 2x of either side, run a quick A/B and pick the winner. Defaulting blindly to either pattern is how teams lock in the wrong one for a year.
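For the first anti-pattern, the fix does not require a dependency. The limiter below is a minimal semaphore-style sketch in the spirit of p-limit, not a reimplementation of it; `createLimiter` is a name chosen for this example, and `rows` and `geocode` in the usage comment stand in for your own data and request function.

```javascript
// Minimal semaphore-style limiter: at most `limit` tasks run at once.
function createLimiter(limit) {
  let active = 0;       // slots currently occupied
  const waiting = [];   // resolvers for callers waiting on a slot

  return async function run(task) {
    if (active >= limit) {
      // Wait for a slot. The finishing task hands its slot over directly,
      // so `active` is not incremented again on wake-up.
      await new Promise((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // slot passes to the next waiter, active unchanged
      else active--;
    }
  };
}

// Usage: replaces Promise.all(rows.map(geocode)).
//   const limit = createLimiter(64);
//   const results = await Promise.all(rows.map((row) => limit(() => geocode(row))));
```

This still allocates one promise per row, which is cheap, but only `limit` requests are ever in flight. For files too large to hold in memory, a streaming worker pool is the better fit.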

Frequently Asked Questions

What is the right starting concurrency for a new pipeline?

Use Little's Law: concurrency = (rate_limit_per_minute / 60) × avg_latency_seconds × 5. For a 1,000/min plan with 200 ms average latency, that is (1000/60) × 0.2 × 5 ≈ 16.7, which the per-plan table rounds to 16. Start there, ramp during a tuning pass, settle on the elbow with 20% headroom. The per-plan table above has the precomputed numbers for the common tiers.

How do I know I am past the elbow?

The clean signal is the ratio p99 / p50 exceeding 5. Below the elbow, that ratio sits between 1.5 and 3. At the elbow it climbs through 5. Past the elbow it shoots through 10 within a few additional concurrency steps and throughput stops climbing. If you see throughput plateau while p99 is growing, you are past the elbow.

Does concurrency tuning matter for the batch endpoint?

Yes, but the dial moves. With single-shot you tune in-flight requests; with batch you tune in-flight batches. A batch of 1,000 addresses might have a 4-second latency, so Little's Law for 100 rps target is (100 × 4) / 1000 = 0.4 — you need less than one batch in flight on average. In practice you run 4-8 concurrent batches to absorb variance and keep the upstream warm.
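The same arithmetic as a helper, for plugging in your own batch size and latency. The function name `batchesInFlight` is an assumption of this sketch, and the rate is counted in addresses per second rather than HTTP requests per second.

```javascript
// Little's Law applied to batches: in-flight batches = batches per second × batch latency.
function batchesInFlight(addressesPerSecond, batchSize, batchLatencySeconds) {
  const batchesPerSecond = addressesPerSecond / batchSize;
  return batchesPerSecond * batchLatencySeconds;
}

console.log(batchesInFlight(100, 1000, 4)); // 0.4
```

A result below 1 means a single batch at a time already covers the average load; the extra concurrent batches exist to absorb variance, not to raise steady-state throughput.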

What if my traffic is bursty?

Tune concurrency for the steady-state target, then put a queue in front of the worker pool to absorb bursts. The queue smooths the arrival rate; the worker pool runs at its tuned concurrency regardless. This is the pattern from designing a batch geocoding queue. Trying to tune concurrency to handle the burst itself just means you run too hot the rest of the time.

How does caching change the calculation?

A 90% cache hit rate cuts your effective request volume by 10x. Little's Law reads off the *uncached* requests — the ones that actually hit the upstream — so a higher hit rate lets you live at lower concurrency for the same user-facing throughput. Re-tune after any major caching change; the elbow on cached traffic is in a different place from the elbow on raw traffic.

Should I tune concurrency separately per region or per address type?

Usually no. The elbow is a property of the upstream and your network path to it, not of your specific addresses. Geocoding a US address and a UK address against the same endpoint will hit the same elbow. The exception is if your traffic mix changes which fallback path the geocoder takes, for example if 90% of your traffic suddenly routes through a slower secondary provider. In that case average latency rises, and by Little's Law the concurrency needed for the same throughput rises with it, while the slower path may saturate sooner. The old elbow no longer applies, so re-run the tuner.

Does concurrency tuning work for self-hosted geocoders?

Yes, and the math is the same — but the bottleneck moves from "rate limit" to "CPU and memory." On a self-hosted geocoder you usually find the elbow at the point where CPU goes above 80% sustained, or memory pressure forces evictions. The procedure (ramp, measure, find the elbow) is identical; only the failure mode at the top end changes. The connection-pool point still applies — your client still serializes if the pool is undersized.

Closing

Little's Law gives you the starting concurrency. The latency curve tells you when to stop. The connection pool is the bug that quietly caps your effective concurrency until you check it. The batch endpoint takes over once you cross 50 rps. Run the tuner once per environment, write down the elbow, run at 80% of it, and revisit when traffic patterns change. That is the entire job.

For the latency math the elbow rests on, see why your average lies. For honest performance numbers across providers, see benchmarking geocoding APIs. For the rate-limit interaction, see token bucket vs leaky bucket. For the case study where this exact procedure took a job from 8 hours to 12 minutes, see 1M addresses in 12 minutes.

I.A. / CSV2GEO Creator