p99 Latency in Geocoding: Why Your Average Lies
Why mean latency is a trap and p99 is the SLO that matters: percentile math, tail amplification, and how to fix it.
A geocoding pipeline serving 95% of requests from cache in 5ms and the remaining 5% from upstream in 4 seconds has a mean latency of about 205ms. Put that on a dashboard and it sounds healthy. It is also nearly useless. The number tells you nothing about what 1 in 20 of your users actually experiences, which in this example is roughly 20 times worse than the average suggests.
The number that does tell you something is p99: the latency below which 99% of requests complete. In the same pipeline, p99 is about 4 seconds, because the slowest 1% of requests sits entirely inside the 5% that go upstream. That is the SLO that matters, the number you put in your customer-facing contract, and the one that drives engineering decisions. This post is the practical version: why mean lies, the percentile math, how the tail amplifies under fanout, what causes the tail in real geocoding pipelines, how to compute p99 correctly, and the small set of techniques that actually shrink it.
Why mean lies
The arithmetic mean blends the fast bulk and the slow tail into one number that resembles neither. In a latency distribution where 95% of requests are clustered tightly around 5ms and 5% are spread between 1s and 10s, the mean lands at a few hundred milliseconds, a value that almost no individual request actually has. The mean describes a hypothetical request that does not exist.
Worse, the mean conflates changes that call for different responses. Doubling your tail latency from 4s to 8s on 5% of requests moves your mean from 205ms to 405ms. So does doubling the share of slow requests from 5% to 10% at 4s each. The mean cannot tell you which one happened. Meanwhile, doubling cache-hit latency from 5ms to 10ms, which touches 95% of your users, moves it from 205ms to about 210ms: the mean barely flinches. It is a blurry summary of a request that does not represent any real user.
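The arithmetic here is easy to check by hand. The sketch below treats the pipeline as a two-group mix, which is a simplification: `mixtureMean` is a hypothetical helper, and the third scenario (slower cache hits) is added purely for contrast.

```javascript
// Mean latency of a two-group mix: a fast group and a slow group.
// slowShare is the fraction of requests on the slow path (0..1).
function mixtureMean(fastMs, slowMs, slowShare) {
  return (1 - slowShare) * fastMs + slowShare * slowMs;
}

const baseline = mixtureMean(5, 4000, 0.05);     // about 205 ms
const doubledTail = mixtureMean(5, 8000, 0.05);  // about 405 ms
const slowerHits = mixtureMean(10, 4000, 0.05);  // about 210 ms

console.log({ baseline, doubledTail, slowerHits });
```

The slow 5% contributes 200ms of the 205ms baseline, so the mean mostly tracks the tail while looking nothing like it.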
The geocoding-specific version of this story is even sharper. A pipeline with a 90% cache hit rate has 90% of requests finishing in well under 10ms, while the 10% miss path goes through libpostal parsing, an upstream HTTP call, retry logic if the upstream is having a bad afternoon, and possibly a fallback to a second provider. The miss path is where every interesting failure mode lives. The mean buries it.
The percentile math
A percentile is a sorted-order statistic. p50 is the value where half of requests are faster and half are slower. p99 is the value where 99% of requests are faster. The definitions are simple; what they tell you is not.
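As a concrete reading of "sorted-order statistic", here is a minimal sketch using the nearest-rank definition. Other definitions interpolate between neighbours and give slightly different values; the synthetic data set is an assumption chosen to mirror the 95/5 example from the introduction.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// a fraction p of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

// Synthetic data: 9,500 cache hits at 5 ms and 500 misses at 4,000 ms.
const samples = [
  ...new Array(9500).fill(5),
  ...new Array(500).fill(4000),
];

console.log({
  p50: percentile(samples, 0.50),
  p90: percentile(samples, 0.90),
  p99: percentile(samples, 0.99),
});
```

With this data, p50 and p90 sit on the cache-hit path and p99 sits on the miss path, which is the cliff the mean hides.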
| Percentile | What it represents | What it catches |
|---|---|---|
| p50 (median) | A typical request | Cache-hit performance, baseline parser speed |
| p95 | The slow side of normal | Cache misses to a healthy upstream |
| p99 | Your slowest 1% | Tail amplification, retry paths, GC pauses |
| p99.9 | Your slowest 1 in 1,000 | Network reconvergence, provider failover, lock contention |
| p99.99 | Your slowest 1 in 10,000 | Once-a-day pathological events |
The right percentile to instrument depends on traffic volume. At 100 requests per second you have 8.6M requests per day, so p99.9 is 8,640 requests a day living above the line — a real cohort. At 1 request per second you have 86,400 requests a day, p99 is ~864 requests a day, and p99.9 is meaningful only on a weekly window. Pick a percentile your traffic volume actually populates.
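A quick way to sanity-check whether a percentile is populated at your volume is to count how many requests land above it per window. This is a back-of-the-envelope sketch; `requestsAbove` is a hypothetical helper and assumes a steady request rate.

```javascript
const SECONDS_PER_DAY = 86_400;

// Expected number of requests slower than percentile p over a window,
// assuming a constant request rate.
function requestsAbove(p, requestsPerSecond, windowSeconds = SECONDS_PER_DAY) {
  return Math.round((1 - p) * requestsPerSecond * windowSeconds);
}

console.log(requestsAbove(0.999, 100));                   // per day at 100 rps
console.log(requestsAbove(0.99, 1));                      // per day at 1 rps
console.log(requestsAbove(0.999, 1));                     // per day at 1 rps
console.log(requestsAbove(0.999, 1, 7 * SECONDS_PER_DAY)); // per week at 1 rps
```

If the count for your chosen percentile and window is in the tens rather than the hundreds, widen the window or pick a lower percentile.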
The other rule: do not look at one percentile in isolation. The shape p50=5ms, p95=20ms, p99=4s tells a different story than p50=200ms, p95=300ms, p99=400ms. The first has a steep cliff at the cache miss boundary. The second has a uniformly slow pipeline. They want different fixes.
Tail amplification under fanout
Here is the math nobody tells you in the dashboard onboarding. If a single user-facing request triggers n parallel sub-requests and waits for all of them, the response time is the *maximum* of the sub-request latencies, not the average. If the sub-request latencies are independent, the probability that all n of them stay below your p99 is (0.99)^n.
| Fanout n | P(all under p99) | Effective percentile of slowest |
|---:|---:|---|
| 1 | 99.0% | p99 |
| 5 | 95.1% | ~p95 |
| 10 | 90.4% | ~p90 |
| 50 | 60.5% | ~p60 |
| 100 | 36.6% | ~p37 |
In reverse: to keep the *response* at p99, every sub-request must hit roughly p99.9 if you fan out 10 ways, p99.99 if you fan out 100 ways. A geocoding endpoint that batches 100 addresses per call and waits for all of them has its user-facing p99 set by the per-address p99.99 — a percentile most teams have never looked at.
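Both directions of the fanout math fit in a few lines. The sketch assumes the sub-request latencies are independent, which real pipelines only approximate (a shared upstream having a bad minute correlates them); the function names are illustrative.

```javascript
// Probability that all n independent sub-requests finish under a
// latency that each one individually beats with probability p.
function allUnder(p, n) {
  return Math.pow(p, n);
}

// Per-sub-request probability needed so that the whole fanout
// finishes under the threshold with probability `target`.
function requiredPerCall(target, n) {
  return Math.pow(target, 1 / n);
}

for (const n of [1, 5, 10, 50, 100]) {
  const all = (allUnder(0.99, n) * 100).toFixed(1);
  const perCall = (requiredPerCall(0.99, n) * 100).toFixed(3);
  console.log(`n=${n}: P(all under p99)=${all}%, per-call percentile needed=p${perCall}`);
}
```

The second function is the one to keep handy: it turns a response-level SLO into the per-address percentile you actually have to instrument.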
This is why a pipeline that "feels slow" often has clean per-request percentiles. Each address is fine; the *batch* is dragged down by whichever address happens to land on the bad-luck side of the distribution. The fix is rarely to make the median faster — it is to clip the tail so the worst sub-request stops dominating the wall-clock.
What causes the tail
Real geocoding tails come from a small, identifiable set of causes. Each one wants a different fix.
GC pauses. A Java or Node service under load occasionally stops the world for 50-300ms. On the cache-hit fast path you would never notice; on a tail measurement it shows up as a clean spike at GC pause duration. Fix: tune the collector (G1 with MaxGCPauseMillis, ZGC for sub-millisecond pauses), pre-warm pools, avoid allocating in the hot path.
Cache misses to a healthy upstream. This is the dominant tail cause for any pipeline with a hit rate <99%. Cache hits run in microseconds; misses run in tens to hundreds of milliseconds. The cliff between the two *is* your p95-to-p99 jump. Fix: raise the hit rate. The math is in How to Cache Geocoding Results.
Network reconvergence. A BGP flap, a TCP retransmission, a DNS TTL expiry that lands on a slow resolver. These produce 1-5 second outliers that no application change can fix. They show up at p99.9 and above and are the reason your "every dependency is fine" investigation finds nothing.
Retry plus backoff. A single 429 from an upstream triggers a backoff, often 250ms or more. A second 429 doubles it. If your retry logic is not budget-aware, a transient rate-limit blip turns 1% of requests into multi-second outliers. Fix: bound retry budgets per request and per minute, not just per call. See Exponential Backoff for Geocoding APIs and Rate Limiting a Geocoding Pipeline.
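One way to bound retries per request and per minute is a small shared budget object consulted before every retry. This is a minimal sketch, not a drop-in library: `RetryBudget`, `mayRetry`, and the fixed-window reset are assumptions made for illustration, and time is passed in explicitly so the logic is easy to test.

```javascript
// Shared retry budget: at most maxRetries retries per fixed window,
// counted across all requests in the process.
class RetryBudget {
  constructor(maxRetries, windowMs) {
    this.maxRetries = maxRetries;
    this.windowMs = windowMs;
    this.windowStart = 0;
    this.used = 0;
  }

  // Returns true and consumes one unit if the window still has budget.
  tryAcquire(nowMs) {
    if (nowMs - this.windowStart >= this.windowMs) {
      this.windowStart = nowMs; // start a fresh window
      this.used = 0;
    }
    if (this.used >= this.maxRetries) return false;
    this.used++;
    return true;
  }
}

// A request may retry only if it is under its own attempt cap
// and the shared per-window budget is not exhausted.
function mayRetry(attempt, maxAttempts, budget, nowMs) {
  if (attempt >= maxAttempts) return false; // per-request cap
  return budget.tryAcquire(nowMs);          // per-window cap
}

const budget = new RetryBudget(100, 60_000); // e.g. 100 retries per minute
console.log(mayRetry(1, 3, budget, Date.now()));
```

When the budget is exhausted, failing the request immediately keeps a rate-limit blip from turning into a wave of multi-second outliers.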
Lock contention. A shared mutex around a connection pool, a slow Redis command that holds up every client queued behind it (the kind that lands in SLOWLOG), a write-heavy table that contends with reads. Lock waits are the most invisible tail cause because they do not show up in CPU charts or memory charts. Fix: profile under load, not at idle.
How to compute p99 correctly
Two patterns dominate, and one of them is wrong in ways that are easy to miss.
Sample-based. Keep the last N latency samples in a ring buffer, sort, take the 99th percentile. Simple, but N fixes both the tail sample count and the time window. A rule of thumb is that a stable percentile needs at least 100 samples above it, which means N≥10,000 for p99 and N≥100,000 for p99.9. At 10K requests per second, N=10,000 meets that count but spans only one second of traffic, so the reading swings with every burst and says nothing about the last minute. Memory and sort cost grow with N.
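For reference, the sample-based pattern looks roughly like this. It is a sketch under the nearest-rank definition; the class name and the use of a `Float64Array` are illustrative choices, not a recommendation over a histogram.

```javascript
class RingBufferLatency {
  constructor(capacity) {
    this.buf = new Float64Array(capacity);
    this.size = 0; // number of valid samples, up to capacity
    this.next = 0; // index the next sample will overwrite
  }

  observe(ms) {
    this.buf[this.next] = ms;
    this.next = (this.next + 1) % this.buf.length;
    if (this.size < this.buf.length) this.size++;
  }

  percentile(p) {
    if (this.size === 0) return NaN;
    // Copy the valid samples; typed arrays sort numerically by default.
    const sorted = this.buf.slice(0, this.size).sort();
    const rank = Math.max(1, Math.ceil(p * this.size));
    return sorted[rank - 1];
  }
}

const recent = new RingBufferLatency(10_000);
for (let i = 0; i < 9_900; i++) recent.observe(5);
for (let i = 0; i < 100; i++) recent.observe(4000);
console.log(recent.percentile(0.5), recent.percentile(0.999));
```

Every call to `percentile` copies and sorts the whole buffer, which is the cost the histogram approach avoids.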
Histogram-based. Bucket each latency into a pre-sized histogram (HDR Histogram is the canonical implementation). p99 is computed by walking the buckets until 99% of the count is covered. Constant memory regardless of throughput, log-precision accuracy, mergeable across instances. This is the right answer for production.
The trap most teams fall into is averaging p99 across instances. If nine instances report p99 = 100ms and one reports p99 = 2s, the average is 290ms, but the global p99 is *not* 290ms. It is a value derived from the union of all 10 distributions, and it depends on how much traffic each instance served. Averaging percentiles is mathematically meaningless; you have to merge the underlying histograms and recompute. Prometheus' histogram_quantile() does this correctly when you sum the bucket counts first:
histogram_quantile(0.99,
  sum by (le) (rate(geocode_request_duration_seconds_bucket[5m]))
)

The sum by (le) is the load-bearing part. Without it the query returns one p99 per instance, and any average taken over those is wrong, and quietly wrong. Get this query right once and put it in your dashboard template. For more metric patterns, see Observability for Geocoding Pipelines.
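The difference between averaging and merging is easy to see with two made-up instances. The bucket bounds and counts below are invented for illustration, and the percentile is reported as a bucket upper bound, so it is only as precise as the bucket layout.

```javascript
// Upper bounds of the buckets in ms; the last count is an overflow bucket.
const BOUNDS_MS = [10, 100, 1000, 5000];

// Upper bound of the bucket that contains percentile p.
function percentileFromCounts(counts, p) {
  const total = counts.reduce((a, b) => a + b, 0);
  if (total === 0) return NaN;
  let cum = 0;
  for (let i = 0; i < counts.length; i++) {
    cum += counts[i];
    if (cum >= total * p) return BOUNDS_MS[i] ?? Infinity;
  }
  return Infinity;
}

// Merging histograms with identical buckets is element-wise addition.
function mergeCounts(a, b) {
  return a.map((count, i) => count + b[i]);
}

// Instance A: busy and healthy. Instance B: quiet, with a slow tail.
const instanceA = [9000, 950, 50, 0, 0]; // 10,000 requests
const instanceB = [500, 300, 150, 50, 0]; // 1,000 requests

const p99A = percentileFromCounts(instanceA, 0.99);
const p99B = percentileFromCounts(instanceB, 0.99);
const averaged = (p99A + p99B) / 2;
const merged = percentileFromCounts(mergeCounts(instanceA, instanceB), 0.99);

console.log({ p99A, p99B, averaged, merged });
```

The averaged figure weights the quiet instance as heavily as the busy one; the merged figure weights every request equally, which is what a global p99 means.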
Setting an SLO
A latency SLO is a contract: for some percentile, latency stays below some threshold over some window. The shape that survives an audit is two-tiered:
- 99% of requests under 500ms — your normal-traffic SLO.
- 99.9% of requests under 2s — your tail-pathology SLO.
The two numbers catch different failure modes. A regression that doubles your cache miss latency moves the 99% line. A regression that adds a 5-second retry path moves the 99.9% line but leaves 99% untouched. You want both alarms wired.
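A two-tier check can be expressed as "what fraction of requests in the window beat each threshold". The sketch below works on raw samples for clarity; in production the same comparison would run against histogram buckets. `SLO_TIERS` and `evaluateSlo` are illustrative names, and the thresholds are the example values from the list above.

```javascript
const SLO_TIERS = [
  { name: 'normal traffic', thresholdMs: 500, target: 0.99 },
  { name: 'tail pathology', thresholdMs: 2000, target: 0.999 },
];

// For each tier, compute the fraction of requests under the threshold
// and compare it with the tier's target.
function evaluateSlo(latenciesMs, tiers = SLO_TIERS) {
  return tiers.map((tier) => {
    const within = latenciesMs.filter((ms) => ms < tier.thresholdMs).length;
    const fraction = latenciesMs.length === 0 ? 1 : within / latenciesMs.length;
    return { name: tier.name, fraction, ok: fraction >= tier.target };
  });
}

// Synthetic window: 0.5% of requests hit a 5-second retry path.
const windowSamples = [
  ...new Array(9950).fill(20),
  ...new Array(50).fill(5000),
];

console.log(evaluateSlo(windowSamples));
```

In this synthetic window the first tier still passes while the second fails, which is the case a single 99% alarm would miss.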
How do you derive the threshold? Work backwards from business needs. A user-facing form submission has a perceptual latency budget around 1 second — past that, abandonment rates climb measurably. A backend batch enrichment has no user-facing latency budget at all; you care about throughput and cost, and p99 is a debugging tool. A real-time API serving SDKs in production has the strictest budget — typically 200-500ms p99 because the SDK consumer adds it to *their* budget.
Write the SLO down. Wire the alarm. The SLO that lives only in your head will be silently violated.
A working p99 calculator
The 50-line version, in Node, using a bucketed counter that is correct enough for most production needs without a heavyweight dependency. Drop it into a service and log the percentiles on a timer, or expose them from a /metrics handler.
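// histogram.mjs
// Fixed bucket upper bounds in milliseconds, ascending.
const BUCKETS_MS = [
  1, 2, 5, 10, 20, 50, 100, 200, 500,
  1000, 2000, 5000, 10000, 30000,
];

class LatencyHistogram {
  constructor() {
    // One counter per bucket, plus a final overflow bucket.
    this.counts = new Array(BUCKETS_MS.length + 1).fill(0);
    this.total = 0;
  }

  // Count one latency in the first bucket whose upper bound covers it.
  observe(ms) {
    this.total++;
    for (let i = 0; i < BUCKETS_MS.length; i++) {
      if (ms <= BUCKETS_MS[i]) { this.counts[i]++; return; }
    }
    this.counts[BUCKETS_MS.length]++; // overflow bucket
  }

  // Upper bound of the bucket that contains percentile p (0..1).
  // Returns NaN when nothing has been observed yet.
  percentile(p) {
    if (this.total === 0) return NaN;
    const target = this.total * p;
    let cum = 0;
    for (let i = 0; i < this.counts.length; i++) {
      cum += this.counts[i];
      if (cum >= target) {
        return BUCKETS_MS[i] ?? Infinity; // overflow bucket has no bound
      }
    }
    return Infinity;
  }
}

export const geocodeLatency = new LatencyHistogram();

// Time an async call and record its latency, whether it resolves or throws.
export async function timed(fn) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    geocodeLatency.observe(performance.now() - start);
  }
}

Use it like this: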
import { timed, geocodeLatency } from './histogram.mjs';

const result = await timed(() => geocode(addr));

// Every minute, log the percentiles (cumulative since process start)
setInterval(() => {
  console.log({
    p50: geocodeLatency.percentile(0.50),
    p95: geocodeLatency.percentile(0.95),
    p99: geocodeLatency.percentile(0.99),
    p999: geocodeLatency.percentile(0.999),
  });
}, 60_000);

For production, replace the bucketed counter with hdr-histogram-js: same shape, log-linear precision, and proper merging across instances. The toy version above is fine for a single-instance service or for local profiling where dependency weight matters. Note that its counts accumulate from process start, so the logged percentiles describe the whole uptime rather than the last minute. For a production-grade Prometheus setup, use a Histogram with explicit buckets and let Prometheus compute the percentile via histogram_quantile() in your Grafana query.
Reducing the tail
Once you can see the tail, the techniques that move it are a small set.
Caching. The single biggest p99 improvement is raising your cache hit rate. Going from 90% to 99% hit rate cuts the share of requests on the miss path by ten and pushes the miss cliff from around p90 out to p99, because the cache miss path is what the tail is made of. Only above 99% does p99 itself land on the hit path. The math, key design, and TTL strategy are in How to Cache Geocoding Results.
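A deliberately crude model makes the relationship between hit rate and p99 visible: assume every hit takes one fixed latency and every miss another. Real distributions are wider, so treat this as a sketch of where the cliff sits, not a predictor; the function name and the 5ms and 1s figures are assumptions.

```javascript
// Two-point model: every hit takes hitMs, every miss takes missMs.
// Returns the latency at percentile p under that model.
function twoPointPercentile(hitRate, hitMs, missMs, p) {
  // Misses are the slowest requests, so percentile p lands on the miss
  // path whenever at least (1 - p) of traffic misses the cache. The
  // exact boundary is treated conservatively as a miss; the small
  // epsilon absorbs floating-point error in the subtraction.
  const missShare = 1 - hitRate;
  return missShare >= (1 - p) - 1e-12 ? missMs : hitMs;
}

for (const hitRate of [0.90, 0.95, 0.99, 0.995, 0.999]) {
  const missPct = ((1 - hitRate) * 100).toFixed(1);
  const p99 = twoPointPercentile(hitRate, 5, 1000, 0.99);
  console.log(`hit rate ${hitRate}: ${missPct}% on the miss path, p99 = ${p99} ms`);
}
```

In this model p99 does not shrink gradually as the hit rate improves. It stays at the miss latency until misses fall below 1% of traffic, then drops to the hit latency.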
p99-aware retry. Standard retry logic adds latency on every retry. A p99-aware retry sets a deadline at the start of the request and refuses to retry if the deadline is close. With a 500ms SLO, it is better to fail fast at 450ms than to retry and finish at 1.2s, well past the SLO. The pattern: pass a deadline through your call chain, check it before each retry.
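The deadline check can be sketched as a small wrapper. Everything here is illustrative: `withDeadline`, `canRetry`, and the option names are hypothetical, the expected attempt time is a guess you would replace with a measured value, and `fn` is assumed to accept the remaining budget in milliseconds so it can set its own timeout.

```javascript
// True only if a backoff plus one more attempt is expected to finish
// before the deadline.
function canRetry(nowMs, deadlineMs, backoffMs, expectedAttemptMs) {
  return nowMs + backoffMs + expectedAttemptMs <= deadlineMs;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn(remainingMs) until it succeeds, attempts run out, or the
// deadline leaves no room for another try. Rethrows the last error.
async function withDeadline(fn, options) {
  const { budgetMs, maxAttempts = 3, backoffMs = 250, expectedAttemptMs = 100 } = options;
  const deadline = Date.now() + budgetMs;
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(deadline - Date.now());
    } catch (err) {
      lastError = err;
      const wait = backoffMs * 2 ** (attempt - 1); // exponential backoff
      const outOfAttempts = attempt === maxAttempts;
      if (outOfAttempts || !canRetry(Date.now(), deadline, wait, expectedAttemptMs)) break;
      await sleep(wait);
    }
  }
  throw lastError;
}

// With 50 ms left, a 250 ms backoff cannot fit: fail fast instead.
console.log(canRetry(450, 500, 250, 100));
```

The deadline is computed once, at the start of the request, so time spent in earlier attempts and backoffs is charged against the same budget.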
Hedged requests. Issue a second request to a different replica or upstream after a short delay (often the p95 of the first call), take whichever finishes first, and cancel the slower one. The technique was popularized by Google's "The Tail at Scale" paper for exactly this case. Hedging at p95 costs roughly 5% extra upstream requests, since only calls that outlive the delay get a second copy, and it can cut p99 substantially; how much depends on how independent the two paths are. Use sparingly: hedging a non-idempotent operation is dangerous; see Idempotent Geocoding.
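A minimal hedging wrapper might look like the following. It is a sketch, not the paper's implementation: `hedged` and `delayed` are hypothetical names, the two callables stand in for real upstream calls that honour an `AbortSignal`, and it relies on `Promise.any` and the global `AbortController` available in recent Node versions.

```javascript
// Runs primary(signal); if it has not settled after hedgeDelayMs, also
// runs backup(signal). Resolves with the first success and aborts the
// other call. Rejects only if both calls fail.
function hedged(primary, backup, hedgeDelayMs) {
  const controllers = [new AbortController(), new AbortController()];
  let timer;
  const first = primary(controllers[0].signal)
    .then((value) => ({ value, winner: 0 }));
  const second = new Promise((resolve, reject) => {
    timer = setTimeout(() => {
      backup(controllers[1].signal)
        .then((value) => resolve({ value, winner: 1 }), reject);
    }, hedgeDelayMs);
  });
  return Promise.any([first, second]).then(({ value, winner }) => {
    clearTimeout(timer);             // skip the hedge if it never started
    controllers[1 - winner].abort(); // cancel the slower call
    return value;
  });
}

// Simulated upstream: resolves with value after ms, rejects if aborted.
function delayed(value, ms) {
  return (signal) => new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve(value), ms);
    signal.addEventListener('abort', () => {
      clearTimeout(t);
      reject(new Error('aborted'));
    });
  });
}

// Primary is slow (200 ms); the hedge fires at 20 ms and answers first.
hedged(delayed('primary', 200), delayed('backup', 10), 20)
  .then((answeredBy) => console.log('answered by', answeredBy));
```

If the primary fails before the delay expires, this version still waits for the timer before trying the backup; firing the backup immediately on failure is a reasonable variation.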
Connection pre-warming. Cold TLS handshakes add 50-200ms. Keep connection pools warm with periodic no-op pings, and you eliminate a category of tail spikes that show up at deploy boundaries and after idle periods.
Concurrency tuning. Past a certain queue depth, p99 diverges sharply while p50 stays flat. The curve is in Concurrency Tuning for Geocoding. Find the knee, do not run past it.
Honest benchmarking. None of this works if you only test the warm-cache happy path. Benchmark with cold caches, realistic address-difficulty distributions, and provider failures injected. The methodology is in Benchmarking Geocoding APIs.
Frequently Asked Questions
Why is mean latency considered misleading?
Because the mean blends the fast bulk and the slow tail into a single number that describes neither, and the tail is where every interesting failure lives. A pipeline serving 95% of requests in 5ms and 5% in 4 seconds has a mean of ~205ms, a value that no actual request has. The mean does not describe a real user; the percentiles do.
What is the right percentile to set my SLO at?
A two-tier SLO is the pattern that survives audit: 99% of requests under your normal-traffic threshold (often 500ms for an API), and 99.9% under a tail-pathology threshold (often 2-5s). The two numbers catch different failure modes — a regression in cache miss latency moves the first; a regression in retry behavior moves the second.
How is p99 computed across multiple instances?
By merging the underlying histograms and recomputing the percentile, never by averaging the per-instance p99s. Averaging percentiles is mathematically meaningless. In Prometheus, the correct query sums histogram bucket counts before applying histogram_quantile(). Most teams get this wrong on the first dashboard and never check.
What is tail amplification under fanout?
When a request triggers n parallel sub-requests and waits for all of them, the response time is the *maximum* of the sub-request latencies. The probability all stay under per-call p99 is 0.99^n, which falls off fast — at n=10, only 90% of responses stay under p99. To keep the response at p99 with fanout of 10, every sub-request must hit p99.9.
How do I tell if my tail is caused by GC, network, or retries?
Look at the shape. GC pauses produce a clustered spike at the pause duration (50-300ms). Network reconvergence produces broad outliers at p99.9 and above (1-5s). Retry plus backoff produces a stepped distribution at retry-interval multiples. If you cannot tell from the shape, capture flame graphs during slow requests and check for kernel-level waits versus user-level work.
Should I use HDR Histogram or a simple bucketed counter?
HDR Histogram for production. It uses constant memory regardless of throughput, has log-precision accuracy across the full latency range, and merges correctly across instances. A simple bucketed counter is fine for local profiling or low-throughput services where the dependency weight is not justified — the 50-line version in this post is sufficient there.
What hit rate do I need for p99 < 100ms?
Roughly: if your cache hits run in 5ms and your cache misses run in 1s, you need a hit rate above 99% for p99 to land under 100ms — because at 99% hit rate, the slowest 1% of requests *are* the misses. The math is harsh but predictable. Push the hit rate higher; that is where the lever is.
Closing
The mean lies because the mean averages over a population that does not exist. p99 tells you what the slowest 1% of users actually experience, which is what your SLO should be written against, what your dashboards should show, and what your engineering effort should be aimed at. Compute it correctly with histograms, alarm on it at two tiers, and shrink the tail with caching, deadline-aware retries, hedged requests, and warm connections.
The number on the dashboard should be the number a customer would call you about. That number is p99, not the mean.
I.A. / CSV2GEO Creator
Related Articles
- Benchmarking Geocoding APIs: Methodology, Pitfalls, and Honest Numbers
- Concurrency Tuning for Geocoding: Finding Your Sweet Spot
- Observability for Geocoding Pipelines: Metrics That Actually Matter
- Rate Limiting a Geocoding Pipeline: Token Bucket vs Leaky Bucket vs Sliding Window
- Geocoding 1 Million Addresses: From 8 Hours to 12 Minutes