Geocoding Confidence Scores Explained: When to Trust, When to Reject
What accuracy_score actually means, how providers differ, and the threshold rules that survive 1M-row datasets.
A regional delivery company once shipped 412 packages to coordinates roughly 800 metres from the actual front door. Some landed at the wrong end of a rural road, a few at a postcode centroid in the middle of a field, one in a river. The post-mortem was short: their pipeline accepted every result the geocoder returned, and the geocoder had been quietly stamping accuracy_score: 0.4 on roughly 6% of the addresses for months. Nobody had ever filtered on it.
That story is not unusual. It is the single most common failure mode in production geocoding, and it is entirely preventable. Confidence scores are a filter, not a vibe. They are the contract the geocoder offers you in exchange for not pretending that every address you send in is going to come back with a rooftop hit. Read the contract, set a threshold, and most of the bad geocodes never reach your downstream system.
This post is what every team eventually learns the hard way: what accuracy_score actually measures, how match levels map to it, where 0.7 came from, and how to tune the threshold for your specific use case without playing whack-a-mole with bad coordinates for the next six months.
What accuracy_score really measures
accuracy_score is the geocoder's self-reported certainty that the result it returned matches the address you sent. It is a number between 0 and 1 — closer to 1 means the geocoder is more sure. That is the simple version.
The less simple version is that it is a blended estimate of three things: how cleanly your input parsed (a structured address with a valid postcode scores higher than a free-form blob), how precisely the result matched (a unique rooftop with a unique house number scores higher than a postcode centroid), and how dense the underlying reference data is in that region (downtown Munich has more authoritative data than rural Mongolia, and the score reflects that).
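As a toy illustration only — the real scoring models are proprietary and certainly more complex, and the component names and weights below are invented — a blended score of this kind might look like:

```js
// Toy illustration only: invented weights, not any provider's real model.
// Blends three 0-1 component signals into a single confidence score.
function blendedScore({ parseQuality, matchPrecision, referenceDensity }) {
  // Hypothetical weighting of: how cleanly the input parsed, how precise
  // the match was, and how dense the reference data is in that region.
  return 0.4 * matchPrecision + 0.35 * parseQuality + 0.25 * referenceDensity;
}
```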
What accuracy_score is not is a distance in metres. A score of 0.95 does not mean "the result is within 5 metres of truth." It means the geocoder believes — based on its internal weighting of the three factors above — that the match is high quality. The actual physical accuracy is correlated with the score, but loosely. A 0.95 result might be 2 metres from truth in central London and 30 metres from truth on a freshly built suburban street. If you need physical accuracy in metres, that is what distance_meters is for in reverse-geocoding responses, and even then it is a self-report, not ground truth.
Understanding this distinction matters because teams routinely mistake high-confidence for high-precision. They are different things. Confidence is "did I match the right address?" Precision is "how close are these coordinates to the actual building?" A geocoder can be very confident it matched the right postcode and still drop you 400 metres from the front door — because that is what postcode centroids do.
The match levels
Most geocoders return both a numerical accuracy_score and a categorical accuracy (sometimes called match_level or match_type). The category tells you what kind of feature the geocoder hit. The score tells you how confident it is in that hit.
| accuracy | meaning | typical accuracy_score range |
|---|---|---|
| houseNumber | rooftop or parcel-level match — the specific door | 0.95-1.0 |
| street | street centroid — middle of the named road segment | 0.7-0.95 |
| place | POI or named building match (e.g. "Eiffel Tower") | 0.6-0.9 |
| postcode | postcode or postal-area centroid | 0.5-0.7 |
The implications matter. A houseNumber match is the only level where you can reasonably claim "we delivered to the building." A street match places you somewhere along the road — for a 400-metre street, that is a ±200m error before you start. A postcode match in a dense urban postcode is typically ±100m; in a rural UK postcode area or a US ZIP that covers a township, it can be several kilometres.
The other thing this table makes obvious: the accuracy_score ranges overlap. A 0.85 result might be street or place. You cannot ignore the categorical field and use the score alone — you need both. When you are filtering, check the category first, then the score within that category.
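One way to encode that two-step check is a per-category minimum score. A minimal sketch — the specific minimums below are illustrative defaults, not recommendations; tune them against your own audit:

```js
// Illustrative per-category minimum scores — tune against your own data.
// Postcode matches are rejected outright here regardless of score.
const MIN_SCORE = { houseNumber: 0.9, street: 0.8, place: 0.75, postcode: Infinity };

// Accept a result only if its category is known AND its score clears
// that category's bar.
function acceptResult(r) {
  const min = MIN_SCORE[r.accuracy] ?? Infinity;
  return r.accuracy_score >= min;
}
```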
The 0.7 rule
Across our own production traffic — about 39 countries, billions of geocoded rows, several years of audit data — 0.7 is the threshold where the curve breaks. Below 0.7, results are dominated by postcode centroids and ambiguous matches that should not be trusted for downstream geometry work. Above 0.7, the false-positive rate (geocoder is confident but wrong) is under 2%.
If you plotted the accuracy_score distribution of a typical dataset, it would look bimodal. There is a tall spike near 1.0 (the rooftop matches, which is most of the volume on clean data), a long tail dropping off through 0.85 to 0.7 (street and good place matches), then a much smaller secondary bump around 0.5-0.6 (postcode centroids and weak place matches). The dip between the two bumps sits almost exactly at 0.7. Below it, you are mostly in postcode-centroid territory; above it, you are mostly in real address territory.
That is why 0.7 is the default in most of the example code in this series. It is not a number we picked because it looked round. It is the empirical break point in the data, and it is remarkably stable across countries — even in regions where the absolute hit rate is lower, the *shape* of the distribution is similar, and 0.7 is still the dip.
The caveat: 0.7 is a starting point, not gospel. If your use case can tolerate street-level matches but not postcode centroids, 0.7 is right. If your use case demands rooftop precision, 0.7 is far too lenient — you want 0.95. We will get to per-use-case thresholds in a moment.
Why providers disagree
If you compare the accuracy_score from two different geocoders for the same address, the numbers will not line up. One provider might call a result 0.92, the other 0.78. Neither is wrong. They are using different scoring models trained on different reference data with different assumptions about what "high quality" means.
This trips people up when they are evaluating providers. They run 1,000 addresses through provider A and 1,000 through provider B, average the scores, and declare the higher average the winner. That is meaningless. Provider A might be calibrated to use the full 0-1 range; provider B might cluster all good results between 0.85 and 1.0 and reserve scores below 0.85 for genuine garbage. Comparing the *averages* is comparing different scales.
The honest comparison is distributional. Plot the histograms side by side. Look at where the curves break. Look at whether the bimodal shape is present or whether one provider has flattened it into a single hump (which usually means they are not really using the lower end of the range). Pick a threshold per provider — not a universal one — and compare the *outcome* of filtering: how many addresses pass, how many are correct, how many are wrong. That is the comparison that survives an audit.
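A minimal sketch of that outcome-based comparison, assuming you have already run the same sample through both providers and hand-labelled a correctness flag per row (the labelling step is yours; nothing here automates it):

```js
// Compare the *outcome* of filtering, not raw score averages.
// rows: [{ score: number, correct: boolean }], one array per provider.
function filterOutcome(rows, threshold) {
  const passed = rows.filter((r) => r.score >= threshold);
  const wrong = passed.filter((r) => !r.correct).length;
  return {
    passRate: passed.length / rows.length,           // how many addresses survive
    falsePositiveRate: wrong / (passed.length || 1), // confident-but-wrong share
  };
}

// Per-provider thresholds, each chosen from that provider's own histogram:
// const a = filterOutcome(providerARows, 0.7);
// const b = filterOutcome(providerBRows, 0.85);
```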
For more on running this kind of comparison without lying to yourself, see benchmarking geocoding APIs.
Reading accuracy_score in code
The code is mostly trivial. The discipline is doing it on every single result, every single time, no exceptions.
Node.js:
```js
async function geocodeFiltered(address, threshold = 0.7) {
  const url = new URL('https://csv2geo.com/api/v1/geocode');
  url.searchParams.set('q', address);
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.CSV2GEO_KEY}` },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  const r = data.results[0];
  if (!r) return { status: 'no_match' };
  if (r.accuracy_score < threshold) {
    return { status: 'low_confidence', score: r.accuracy_score, accuracy: r.accuracy };
  }
  return { status: 'ok', lat: r.location.lat, lng: r.location.lng, score: r.accuracy_score };
}
```

Python:
```python
import os

import requests

def geocode_filtered(address, threshold=0.7):
    r = requests.get(
        "https://csv2geo.com/api/v1/geocode",
        params={"q": address},
        headers={"Authorization": f"Bearer {os.environ['CSV2GEO_KEY']}"},
        timeout=10,
    )
    r.raise_for_status()
    results = r.json().get("results") or []
    if not results:
        return {"status": "no_match"}
    top = results[0]
    if top["accuracy_score"] < threshold:
        return {"status": "low_confidence", "score": top["accuracy_score"], "accuracy": top["accuracy"]}
    return {
        "status": "ok",
        "lat": top["location"]["lat"],
        "lng": top["location"]["lng"],
        "score": top["accuracy_score"],
    }
```

Both versions return a small status object rather than a raw (lat, lng). That is on purpose. Three different downstream conditions — matched, no result, low confidence — need three different code paths, and squashing them into "lat or null" loses information you want at audit time.
What "no_match" looks like
A no-match is when the geocoder cannot find anything plausible for your input. The response shape is unambiguous: data.results is an empty array. Not null, not missing — an empty [].
```json
{
  "query": "asdf 123 nowhere road",
  "results": [],
  "meta": { "response_time_ms": 41, "source": "overture" }
}
```

A low-confidence match is different. data.results has at least one item, and accuracy_score is below your threshold. The geocoder found *something* it thought was plausible, but it is not confident.
These two cases need to be handled differently. A no-match is usually an input quality problem (typo, missing postcode, made-up address). A low-confidence match is usually a real address that the reference data cannot place precisely (new construction, rural region, ambiguous street name). The fix for the first is input validation. The fix for the second is sometimes "accept it anyway with a flag" or sometimes "send to a human."
If you collapse both into a single "failed" bucket, you lose the signal that tells you which one is happening. Roughly 70/30 in favour of low-confidence is common on dirty input lists. Roughly 95/5 in favour of true no-match is common on already-validated lists. The ratio tells you where to invest cleanup effort.
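A sketch of that split, building on the geocodeFiltered function above. The queues here are plain arrays standing in for whatever sink your pipeline actually uses:

```js
// Route the three outcomes separately so the no_match / low_confidence
// ratio stays visible at audit time.
const counts = { ok: 0, no_match: 0, low_confidence: 0 };
const inputFixQueue = []; // no_match → likely an input-quality problem
const reviewQueue = [];   // low_confidence → real address, weak placement
const accepted = [];

async function route(address) {
  const result = await geocodeFiltered(address); // from the snippet above
  counts[result.status] += 1;
  if (result.status === 'no_match') inputFixQueue.push(address);
  else if (result.status === 'low_confidence') reviewQueue.push({ address, ...result });
  else accepted.push(result);
}
```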
Threshold tuning per use case
The right threshold is a function of how expensive a false positive is for your use case. Here are the thresholds we see in production across a few common workloads, with the reasoning.
| use case | threshold | reasoning |
|---|---|---|
| Marketing list dedup | 0.5 | A wrong-postcode match is fine — you are clustering, not delivering |
| Insurance risk pricing | 0.95 | Premium calculation requires rooftop; a 200m miss can put you in the wrong flood zone |
| Last-mile delivery | 0.85 | Driver can navigate the last 50m visually, but not the last 500m |
| Donor file aggregation | 0.7 | Standard threshold; outliers reviewed manually before mailshots |
| Census aggregation | 0.6 | Postcode centroids are acceptable — you are aggregating to area anyway |
The pattern: the higher the consequence of a wrong match, the higher the threshold. Insurance underwriters care about being on the right side of a flood-zone boundary; if your geocoder lands you in the wrong zone, you mispriced the policy and there is no recovery. Marketing teams care about not double-counting John Smith at 12 Main Street; a 100m drift on the coordinate is irrelevant.
Worth noting: for the highest-precision use cases, you should also gate on the categorical accuracy field, not just the score. Insurance pricing should require accuracy === 'houseNumber' regardless of accuracy_score — a 0.96 street match is worse for risk pricing than a 0.91 houseNumber match.
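In code, that dual gate is one extra condition. A sketch, using the same result shape as the examples above:

```js
// Rooftop-only gate for high-precision use cases like insurance pricing:
// require BOTH the houseNumber category and a high score. A 0.96 street
// match fails; a 0.95 houseNumber match passes.
function acceptForRiskPricing(r) {
  return r.accuracy === 'houseNumber' && r.accuracy_score >= 0.95;
}
```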
What to do with rejected rows
Filtering on confidence does not solve the problem of low-confidence rows; it just moves them. A row that fails your threshold still represents a real customer, claim, or asset somewhere. Three paths handle the long tail.
Hand back to the user. If your product has a UI, surface the rejected rows so the user can see what failed and why. "We could not confidently locate 47 of your 1,000 addresses; here they are with the closest guesses we got" is far more useful than a silent 953-row output file. Most users would rather fix 47 typos than rerun the whole thing.
Send to a pre-cleaner. A surprisingly large fraction of low-confidence rows are not bad addresses — they are well-formed addresses with cosmetic problems that confuse the parser. "Apt 3B, 12 Main St" gets a low score because the unit number is bleeding into the street field; rebuilt as house_number=12, street=Main St, unit=3B, the same address scores 0.97. This is what an address parsing pre-cleaner is for. Run it on your rejected rows and re-geocode; recovery rates of 30-50% are typical.
Log for manual review. What survives both of the above is the genuinely hard tail: missing house numbers, vanity addresses, rural routes, demolished buildings. These need a human. Log them with the original input, the low-confidence guess, and a unique id. A weekly review queue of 0.5% of your daily volume is sustainable; trying to fix them in real time is not.
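Stitched together, the pre-clean-and-retry pass looks something like the sketch below. parseAddress and rebuildAddress are hypothetical stand-ins for whatever pre-cleaner you use; they are not a documented API:

```js
// Sketch: re-geocode rejected rows after a pre-clean pass.
async function retryRejects(rejects) {
  const recovered = [];
  const stillFailing = [];
  for (const { address } of rejects) {
    // parseAddress / rebuildAddress are hypothetical: split the blob into
    // structured fields, then re-serialise them into a clean query string.
    const parts = parseAddress(address);
    const retry = await geocodeFiltered(rebuildAddress(parts));
    if (retry.status === 'ok') recovered.push({ address, ...retry });
    else stillFailing.push({ address, retry }); // → weekly manual review queue
  }
  return { recovered, stillFailing };
}
```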
A confidence-score audit script
Before you set a threshold for a new dataset, look at the distribution. The two-minute audit below reads a result CSV (with an accuracy_score column), buckets the scores into 0.05-wide bins, and prints a histogram. Run it on a sample of 1,000-10,000 already-geocoded rows and you will see exactly where your distribution breaks.
```js
// audit-scores.mjs — node audit-scores.mjs results.csv
import { createReadStream } from 'node:fs';
import { parse } from 'csv-parse';

const buckets = new Array(21).fill(0); // 0.00, 0.05, ..., 1.00
let total = 0, noMatch = 0;

const parser = createReadStream(process.argv[2]).pipe(parse({ columns: true }));
for await (const row of parser) {
  total++;
  const s = parseFloat(row.accuracy_score);
  if (!Number.isFinite(s)) { noMatch++; continue; }
  const bin = Math.min(20, Math.floor(s * 20));
  buckets[bin]++;
}

const max = Math.max(...buckets);
const width = 50;
console.log(`Total rows: ${total}, no_match: ${noMatch} (${(100 * noMatch / total).toFixed(1)}%)`);
console.log('score | count | bar');
console.log('---------+----------+' + '-'.repeat(width));
for (let i = 0; i < buckets.length; i++) {
  const lo = (i * 0.05).toFixed(2);
  const bar = '#'.repeat(Math.round((buckets[i] / max) * width));
  console.log(`${lo} | ${String(buckets[i]).padStart(8)} | ${bar}`);
}

const cutoff = 0.7;
const rejected = buckets.slice(0, Math.floor(cutoff * 20)).reduce((a, b) => a + b, 0);
console.log(`\nAt threshold ${cutoff}: ${rejected} rows rejected (${(100 * rejected / total).toFixed(1)}%)`);
```

The output gives you three things: the bimodal shape (or its absence), the exact bin where the dip sits, and the rejection rate at the default 0.7 cutoff. If your dip is at 0.6 rather than 0.7 — common on already-cleaned input — lower your threshold. If it is at 0.85 — common on poorly-formatted CSV exports — your real problem is upstream parsing, not geocoding.
Frequently Asked Questions
Is 0.7 always the right cutoff?
No. It is the right cutoff for *most* general-purpose use cases on *most* datasets, because 0.7 is empirically where the bimodal distribution dips. For high-precision use cases (insurance, last-mile delivery, emergency response) you need 0.85 or higher. For low-precision use cases (marketing, census aggregation) you can drop to 0.5 without meaningful harm. Always run the audit script on a sample before committing to a number.
What if my data is mostly rural?
Rural data shifts the distribution leftward — fewer rooftop matches because rural reference data is sparser, more street and postcode matches. The dip can move to 0.6 instead of 0.7. Your absolute hit rate at any threshold will be lower; that is not the geocoder failing, it is the underlying data being thinner. Either lower your threshold to 0.6 and accept more street-level matches, or accept a lower throughput at the standard 0.7.
Why does my hit rate drop on apartment buildings?
Because apartment buildings are the hardest case in geocoding. The street and house number match cleanly; the unit number does not, because most reference databases do not store unit-level data. The geocoder returns the building centroid with a houseNumber accuracy and a high score — but every unit in the building gets the same coordinate. This is not a confidence issue, it is a house-number problem in disguise. Confidence scores cannot save you from it; you have to handle multi-unit buildings as a separate case.
Do scores compare across countries?
Loosely. The same geocoder will produce broadly comparable scores across countries because it is using the same scoring model. But the *shape* of the distribution shifts: Germany and the Netherlands skew very high (clean reference data), India and Brazil have flatter distributions (sparser reference data, harder address formats). If you are running a single threshold across a multi-country dataset, set it conservatively (0.7-0.75) and accept that you will reject more from the harder regions. Or set per-country thresholds based on per-country audits.
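A per-country threshold lookup is a few lines. The values below are illustrative only; derive real ones from per-country runs of the audit script:

```js
// Illustrative per-country thresholds — not recommendations. Harder
// regions get a lower cutoff because their distribution dips lower.
const COUNTRY_THRESHOLDS = { DE: 0.75, NL: 0.75, GB: 0.7, US: 0.7, IN: 0.6, BR: 0.6 };
const DEFAULT_THRESHOLD = 0.7; // conservative fallback for unaudited countries

function thresholdFor(countryCode) {
  return COUNTRY_THRESHOLDS[countryCode] ?? DEFAULT_THRESHOLD;
}
```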
Should I retry failed addresses with a parser?
Yes, almost always. Pre-cleaning recovers 30-50% of low-confidence rows for the cost of one extra round trip. The other 50-70% are genuinely hard and need human review or acceptance with a flag. The economics are obvious: parsing costs milliseconds, manual review costs minutes. Run the parser on rejects before you escalate.
Does accuracy_score correlate with response time?
Slightly, in counter-intuitive ways. High-confidence rooftop matches tend to come from cache or fast-path index lookups and are quick. Low-confidence matches sometimes come from fallback providers that took longer to consult. So the very fast and the very slow tend to be confident; the middle is where the ambiguous matches live. Do not use response time as a confidence proxy — it is noisy — but do not be surprised when slow responses correlate with weird accuracy categories.
What's the difference between accuracy and accuracy_score?
accuracy is the categorical match level (houseNumber, street, place, postcode) — what *kind* of feature was matched. accuracy_score is the numerical confidence (0 to 1) — *how sure* the geocoder is in that match. They are independent dimensions. A 0.95 postcode match and a 0.65 houseNumber match are very different things, and you usually want both checks in your filter: gate on category for use cases that demand precision, gate on score for general quality control.
I.A. / CSV2GEO Creator
Related Articles
- Address Parsing Before Geocoding: Cleaning Inputs for Better Matches
- Address Validation API Patterns: Cleansing Lists Before Production
- Deduplicating Geocoded Addresses: Stable Keys and Fuzzy Matching
- Geocoding Apartments and PO Boxes: When the Building Has 200 Doors
- Benchmarking Geocoding APIs: Methodology, Pitfalls, and Honest Numbers
Use our batch geocoding tool to convert thousands of addresses to coordinates in minutes. Start with 100 free addresses.
Try Batch Geocoding Free →