Address Parsing Before Geocoding: Cleaning Inputs for Better Matches

How to parse, normalize, and clean addresses before sending them to a geocoder — the preprocessing step that turns 70% match rates into 95%.

May 11, 2026
The single biggest match-rate improvement most geocoding pipelines never make is parsing the address before sending it. The team running the pipeline assumes the geocoder will figure out "123 Main St Apt 4B Brooklyn NY 11201"; it usually does. They also assume it will figure out " Apt 4B / 123 main street Brooklyn, NEW YORK 11201-1234, USA "; this is where match rates start to slip. Every quirk in the input — a stray apartment prefix, a typo, a postcode in a non-canonical format — costs you a percentage point of match rate.

This post is the practical preprocessing playbook: what to parse out, what to normalize, what to leave alone, and the libraries that do this without you writing 2,000 lines of regex. Working code in Python (libpostal) and Node (the same, via bindings). At the end you should be able to lift a 70% match rate to 95% on dirty input data — with no change to the geocoding API itself.

Why preprocessing matters more than the geocoder

Most geocoding APIs do their own parsing internally. So why do it again upfront?

Because the geocoder's parser is optimized for one thing: figuring out which row in its address database your input refers to. It is not optimized for handling: garbage characters, foreign-language address formats it doesn't have priors for, OCR errors from scanned documents, partial addresses where the user typed only the street and forgot the city.

Cleaning inputs before they hit the geocoder gives you three concrete wins:

  1. Higher match rates. Cleaning normalizes 30+ ways to write the same address into one canonical form. The geocoder's index hits more often.
  2. Cheaper retries. If the first call returns no match, you can retry with progressively looser inputs (drop unit, drop ZIP+4, expand abbreviations). You can't retry intelligently if you don't know the structure of what you sent.
  3. Better cache hits. Two callers asking for "123 Main St" and "123 main street" should hit the same cache entry. They only do if you normalized first. The cache hit rate math is in How to Cache Geocoding Results.

The numbers are consistent across a few hundred batches we've watched: raw user input geocodes at 65–75%; the same inputs after a 30-line normalization step geocode at 90–95%. The gap is preprocessing.

What "parse" means here

Address parsing splits a free-form string into typed components:

| Component | Example |
|---|---|
| house_number | "1600" |
| road | "Pennsylvania Avenue NW" |
| unit | "Apt 4B" |
| city | "Washington" |
| state / region | "DC" |
| postcode | "20500" |
| country | "US" |

Doing this with regex is a trap. The address "100 Avenue Road, Toronto" will defeat a "house number first" regex; "50A Calle 8va, Mayagüez" will defeat any parser that assumes the street type follows the street name; "Flat 2, 14 Beaumont Street, London W1G 6DT" will confuse anything that doesn't know about UK flat-prefixed addresses.
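To make the failure concrete, here is a minimal sketch of a "house number first" pattern. The pattern and the naive_parse name are illustrative assumptions, not taken from any library; the point is only how quickly a plausible-looking regex goes wrong.

```python
import re
from typing import Optional

# Deliberately naive: a leading house number, then everything up to the
# first street-type word is treated as the road.
NAIVE_ADDRESS = re.compile(
    r"^(?P<house_number>\d+[A-Za-z]?)\s+"
    r"(?P<road>.*?\b(?:street|st|avenue|ave|road|rd)\b)",
    re.IGNORECASE,
)

def naive_parse(raw: str) -> Optional[dict]:
    """Return house_number and road, or None when the pattern does not match."""
    match = NAIVE_ADDRESS.match(raw.strip())
    return match.groupdict() if match else None

print(naive_parse("123 Main St, Brooklyn"))
# {'house_number': '123', 'road': 'Main St'}
print(naive_parse("100 Avenue Road, Toronto"))
# {'house_number': '100', 'road': 'Avenue'}   (road cut short at "Avenue")
print(naive_parse("Flat 2, 14 Beaumont Street, London W1G 6DT"))
# None   (the string does not start with a number)
```

Each failure can be patched, but every patch adds another special case, which is the maintenance burden the trained-model approach avoids.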

The right answer is a trained model. The dominant open-source one is libpostal — a C library trained on OpenStreetMap data, with bindings for Python, Node, Go, and Ruby. It handles 30+ countries with reasonable accuracy out of the box. CSV2GEO uses libpostal internally for both the geocoding API and the file-upload tool.

Installing libpostal

Build the C library once:

# Ubuntu/Debian
sudo apt install -y curl autoconf automake libtool pkg-config
git clone https://github.com/openvenues/libpostal.git
cd libpostal
./bootstrap.sh
./configure --datadir=/usr/local/share/libpostal
make -j$(nproc)
sudo make install
sudo ldconfig

The --datadir flag sets where the trained models (~2GB) are stored. Downloading them makes the build slow (15–30 minutes). After that, every binding (Python, Node, Go) loads from /usr/local/share/libpostal.

Python:

pip install postal

Node:

npm install node-postal

Parsing in Python

# parse_address.py
from postal.parser import parse_address

raw = "  Apt 4B / 123 main street Brooklyn, NEW YORK 11201-1234, USA  "
parsed = parse_address(raw)
# [('apt 4b', 'unit'), ('123', 'house_number'), ('main street', 'road'),
#  ('brooklyn', 'city'), ('new york', 'state'), ('11201-1234', 'postcode'),
#  ('usa', 'country')]

components = {label: value for value, label in parsed}
print(components)
# {'unit': 'apt 4b', 'house_number': '123', 'road': 'main street',
#  'city': 'brooklyn', 'state': 'new york', 'postcode': '11201-1234',
#  'country': 'usa'}

Two things to notice:

  1. All values are lowercased. That's libpostal's convention; downstream you decide whether to title-case for display.
  2. Whitespace is trimmed. Leading/trailing spaces in the input have no effect on parsing.
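If you do title-case for display, a rough sketch might look like the following. The for_display helper and its rules (uppercase postcodes and short state or country codes, title-case everything else) are assumptions for illustration, not part of libpostal.

```python
def for_display(components: dict) -> dict:
    """Re-case lowercased parser output for showing to a user."""
    shown = {}
    for label, value in components.items():
        short_code = label in ("state", "country") and len(value) <= 3
        if label == "postcode" or short_code:
            shown[label] = value.upper()      # 'ny' -> 'NY', 'usa' -> 'USA'
        else:
            shown[label] = value.title()      # 'main street' -> 'Main Street'
    return shown

print(for_display({"unit": "apt 4b", "road": "main street",
                   "state": "new york", "country": "usa"}))
# {'unit': 'Apt 4B', 'road': 'Main Street', 'state': 'New York', 'country': 'USA'}
```

One caveat: str.title() capitalizes any letter that follows a digit or apostrophe, so "1st avenue" becomes "1St Avenue". Treat this as display-only polish and keep the lowercased values for matching and caching.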

A practical wrapper that's safer for production:

# parser.py
from postal.parser import parse_address

def parse(raw: str) -> dict:
    if not raw or not raw.strip():
        return {}
    components = {label: value for value, label in parse_address(raw)}
    # Drop empty values libpostal sometimes returns for partial inputs
    return {k: v for k, v in components.items() if v}

Parsing in Node

// parser.mjs
import postal from 'node-postal';

export function parse(raw) {
  if (!raw || !raw.trim()) return {};
  const parts = postal.parser.parse_address(raw);
  // node-postal returns [{component, value}, ...]
  return parts.reduce((acc, p) => {
    if (p.value) acc[p.component] = p.value;
    return acc;
  }, {});
}

Normalize before sending

Parsing tells you the structure. Normalizing canonicalizes the values so equivalent addresses hash to the same cache key and hit the same geocoder index entry.
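A small illustration of the cache-key point, using a toy canonical() function with a three-entry abbreviation table. Both are stand-ins chosen for this sketch, not the normalizer described in this post.

```python
import hashlib

# Toy abbreviation table, just enough for the demonstration.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

def canonical(address: str) -> str:
    """Lowercase, drop trailing punctuation, expand known abbreviations."""
    tokens = [t.rstrip(".,") for t in address.lower().split()]
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def key(address: str) -> str:
    return hashlib.sha256(canonical(address).encode("utf-8")).hexdigest()

variants = ["123 Main St", "123 main street", "123 MAIN ST."]
print({canonical(v) for v in variants})   # {'123 main street'}
print(len({key(v) for v in variants}))    # 1
```

Hashing the raw strings instead would give three different keys for the same address.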

The five normalizations that matter:

1. Expand directional and street-type abbreviations

EXPANSIONS = {
    'st': 'street', 'st.': 'street',
    'ave': 'avenue', 'ave.': 'avenue',
    'blvd': 'boulevard', 'blvd.': 'boulevard',
    'rd': 'road', 'rd.': 'road',
    'ln': 'lane', 'ln.': 'lane',
    'ct': 'court', 'ct.': 'court',
    'dr': 'drive', 'dr.': 'drive',
    'hwy': 'highway', 'hwy.': 'highway',
    'pkwy': 'parkway',
    'n': 'north', 'n.': 'north',
    's': 'south', 's.': 'south',
    'e': 'east', 'e.': 'east',
    'w': 'west', 'w.': 'west',
    'ne': 'northeast', 'nw': 'northwest',
    'se': 'southeast', 'sw': 'southwest',
}

def expand_road(road: str) -> str:
    tokens = road.split()
    out = [EXPANSIONS.get(t.lower(), t) for t in tokens]
    return ' '.join(out)

This step alone moves match rates 5–10 points on US data. Geocoding indexes typically store the expanded form ("street") but accept both, so always expanding is the safe default.
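One caveat with token-by-token expansion: "st" can mean "saint" and "dr" can mean "doctor" when they lead the street name. A possible refinement, sketched here under the assumption that the street type sits at the end of the road (optionally followed by a directional), is to expand only that position. The function name and the trimmed-down tables are hypothetical.

```python
STREET_TYPES = {
    "st": "street", "ave": "avenue", "blvd": "boulevard", "rd": "road",
    "dr": "drive", "ln": "lane", "ct": "court", "pl": "place",
}
DIRECTIONALS = {"n", "s", "e", "w", "ne", "nw", "se", "sw"}

def expand_street_type(road: str) -> str:
    """Expand the street-type token only where a street type is expected."""
    tokens = road.lower().split()
    i = len(tokens) - 1
    # Step over a trailing directional: "pennsylvania ave nw"
    if i > 0 and tokens[i].rstrip(".") in DIRECTIONALS:
        i -= 1
    # Never touch the first token, so "st marks ..." keeps its "st"
    if i > 0:
        tokens[i] = STREET_TYPES.get(tokens[i].rstrip("."), tokens[i])
    return " ".join(tokens)

print(expand_street_type("Main St"))              # main street
print(expand_street_type("St Marks Pl"))          # st marks place
print(expand_street_type("Pennsylvania Ave NW"))  # pennsylvania avenue nw
```

This trades a little recall for fewer false expansions. It assumes US-style ordering, so it would not suit roads like "Calle Mayor" where the type comes first.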

2. Strip unit/apartment from the road component

The geocoder doesn't care about the unit; it cares about the building. Including "Apt 4B" in the road field hurts match rates.

def strip_unit_from_road(parsed: dict) -> dict:
    """libpostal usually does this, but defense-in-depth."""
    if 'unit' in parsed:
        return parsed   # already split — done
    road = parsed.get('road', '')
    # Crude but effective: split on common unit prefixes
    for prefix in [' apt ', ' suite ', ' unit ', ' #', ' fl ']:
        if prefix in road.lower():
            road, unit = road.lower().split(prefix, 1)
            parsed['road'] = road.strip()
            parsed['unit'] = (prefix.strip() + ' ' + unit).strip()
            break
    return parsed

3. Canonicalize the postcode

US ZIP+4 is fine for delivery but can hurt geocoding match rates if the +4 is wrong. Strip to base ZIP for the first attempt, retry with the +4 if you have one and the base failed:

def normalize_postcode(pc: str, country: str) -> str:
    if not pc:
        return ''
    pc = pc.strip().upper()
    compact = pc.replace(' ', '').replace('-', '')
    if country == 'US':
        return compact[:5]   # drop ZIP+4
    if country == 'CA':
        # Canadian postcodes: 'A1A1A1' canonical
        return compact
    if country == 'GB' and len(compact) >= 5:
        # UK: insert space before final 3 chars: 'SW1A1AA' → 'SW1A 1AA'
        return compact[:-3] + ' ' + compact[-3:]
    # Other countries: separators can be part of the format (e.g. '00-950'
    # in Poland), so only trim and uppercase.
    return pc

Country-specific quirks matter. The full set of postcode formats (and how csv2geo handles them) is documented at /v1/divisions/by-postcode. 30+ countries supported.
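The "base ZIP first, then ZIP+4" retry order can be sketched as a list of candidates to try in turn. The us_zip_candidates name and the regex are assumptions for illustration; whether the second candidate helps depends on how your geocoder treats ZIP+4.

```python
import re

# Five digits, optionally followed by a four-digit extension (hyphen optional).
US_ZIP = re.compile(r"^(\d{5})(?:-?(\d{4}))?$")

def us_zip_candidates(postcode: str) -> list:
    """Return postcodes to try in order: base ZIP, then ZIP+4 if present."""
    match = US_ZIP.match(postcode.strip())
    if not match:
        return []                      # not a recognizable US ZIP
    base, plus4 = match.groups()
    return [base] + ([f"{base}-{plus4}"] if plus4 else [])

print(us_zip_candidates("11201-1234"))  # ['11201', '11201-1234']
print(us_zip_candidates("11201"))       # ['11201']
```

Returning an empty list for malformed input lets the caller skip postcode-based attempts instead of sending junk.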

4. Strip country if obvious

If your data is single-country (a US-only mailing list, a UK-only delivery roster), drop the country token before sending. The geocoder will use your country= parameter and ignore an inline "USA" anyway, but a stray "USA" in the road field can confuse fuzzy matching.
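A minimal sketch of that step for a US-only dataset follows. The alias set and the drop_expected_country name are assumptions. A country that does not match is kept on purpose, since a "canada" in a US-only list is a signal worth investigating rather than discarding.

```python
US_ALIASES = {"us", "usa", "united states", "united states of america"}

def drop_expected_country(parsed: dict, aliases=US_ALIASES) -> dict:
    """Remove the country component when it is the one you already expect."""
    country = parsed.get("country", "").replace(".", "").strip().lower()
    if country in aliases:
        return {k: v for k, v in parsed.items() if k != "country"}
    return dict(parsed)   # unknown or unexpected country: leave untouched

print(drop_expected_country({"road": "main street", "country": "u.s.a."}))
# {'road': 'main street'}
print(drop_expected_country({"road": "rue de rivoli", "country": "france"}))
# {'road': 'rue de rivoli', 'country': 'france'}
```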

5. Preserve the original

Whatever you do, keep the raw input. Always. Two reasons:

  • If your normalization is wrong, you can re-process from the raw later.
  • For the support ticket "your geocoder failed on this address," you need to see the customer's exact input, not your transformed version.

def parse_and_normalize(raw: str, country: str = 'US') -> dict:
    parsed = parse(raw)
    parsed = strip_unit_from_road(parsed)
    if 'road' in parsed:
        parsed['road'] = expand_road(parsed['road'])
    if 'postcode' in parsed:
        parsed['postcode'] = normalize_postcode(parsed['postcode'], country)
    parsed['_raw'] = raw   # preserve original
    return parsed

Sending the cleaned address

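The CSV2GEO API accepts both free-form (?q=...) and structured input. After preprocessing, prefer structured, which gives the geocoder more signal. The function below takes the simpler route and rebuilds a clean ?q= string from the parsed components: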

import os
import requests

API_KEY = os.environ['CSV2GEO_API_KEY']

def geocode(parsed: dict, country: str = 'US') -> dict | None:
    parts = []
    if parsed.get('house_number'): parts.append(parsed['house_number'])
    if parsed.get('road'):         parts.append(parsed['road'])
    if parsed.get('city'):         parts.append(parsed['city'])
    if parsed.get('state'):        parts.append(parsed['state'])
    if parsed.get('postcode'):     parts.append(parsed['postcode'])
    q = ', '.join(parts)

    r = requests.get(
        'https://api.csv2geo.com/v1/geocode',
        params={'q': q, 'country': country},
        headers={'X-API-Key': API_KEY},
        timeout=10,
    )
    r.raise_for_status()
    results = r.json().get('results', [])
    return results[0] if results else None

The fallback ladder for low-confidence results

Sometimes the first attempt doesn't return a high-confidence match. Don't give up — retry with progressively less specific input. The classic ladder:

def geocode_with_fallback(raw: str, country: str = 'US') -> dict | None:
    parsed = parse_and_normalize(raw, country)

    # Attempt 1: full structured address
    result = geocode(parsed, country)
    if result and result['accuracy_score'] >= 0.8:
        return result

    # Attempt 2: drop the unit. geocode() above never puts 'unit' in the
    # query, so this rung would repeat attempt 1 and cost a second call.
    # Add it back only if your query builder sends the unit.

    # Attempt 3: house_number + road + postcode (drop city/state)
    minimal = {k: parsed[k] for k in ('house_number', 'road', 'postcode') if k in parsed}
    if 'road' in minimal and 'postcode' in minimal:
        result = geocode(minimal, country)
        if result and result['accuracy_score'] >= 0.7:
            return result

    # Attempt 4: postcode-only — last resort, gives postcode centroid
    if 'postcode' in parsed:
        result = geocode({'postcode': parsed['postcode']}, country)
        if result:
            result['_fallback_level'] = 'postcode_centroid'
            return result

    return None

The drop in confidence at each rung is the trade-off you're making. A postcode centroid is far less precise than a rooftop coordinate, so be honest about it in your downstream system. The full breakdown of confidence scores and what they mean is in Geocoding Confidence Scores Explained.
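One way to keep the ladder testable is to separate "which inputs to try" from "how to call the API". The generator below is a sketch of that idea; fallback_rungs and the two level labels are hypothetical names, and duplicate rungs are skipped so no query is sent twice.

```python
from typing import Iterator, Tuple

FULL = ("house_number", "road", "city", "state", "postcode")
STREET = ("house_number", "road", "postcode")

def fallback_rungs(parsed: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (level, components) from most to least specific."""
    def pick(labels):
        return {k: parsed[k] for k in labels if parsed.get(k)}

    candidates = [pick(FULL)]
    street = pick(STREET)
    if "road" in street and "postcode" in street:
        candidates.append(street)          # drop city/state
    candidates.append(pick(("postcode",)))  # last resort

    seen = []
    for components in candidates:
        if not components or components in seen:
            continue                       # nothing to send, or already tried
        seen.append(components)
        only_postcode = set(components) == {"postcode"}
        yield ("postcode_centroid" if only_postcode else "address"), components

parsed = {"unit": "apt 4b", "house_number": "123", "road": "main street",
          "city": "brooklyn", "state": "new york", "postcode": "11201"}
for level, components in fallback_rungs(parsed):
    print(level, sorted(components))
# address ['city', 'house_number', 'postcode', 'road', 'state']
# address ['house_number', 'postcode', 'road']
# postcode_centroid ['postcode']
```

The caller loops over the rungs, sends each to the geocoder, applies its own confidence threshold, and records the level alongside the result.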

What NOT to do

Three preprocessing mistakes I've seen kill match rates:

Don't translate addresses to English

Addresses in non-English-speaking countries are usually best left in their native form. "Calle Mayor 1, Madrid" geocodes better than "Main Street 1, Madrid"; "Champs-Élysées" geocodes better than "Elysian Fields". Most geocoders index the local-language version.

Don't strip diacritics

Same reason. "Pentélē" should stay as is for Greek addresses; "Düsseldorf" should keep the umlaut for German. Stripping accents collapses distinct streets onto each other and your match rate drops.
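What you can safely do instead is normalize the Unicode form, so that the same accented character typed two different ways compares equal. A short sketch using the standard library's unicodedata module (the normalize_unicode wrapper name is an assumption):

```python
import unicodedata

def normalize_unicode(text: str) -> str:
    """Compose characters (NFC) without removing any accents."""
    return unicodedata.normalize("NFC", text)

composed = "D\u00fcsseldorf"      # 'ü' as a single code point
decomposed = "Du\u0308sseldorf"   # 'u' followed by a combining diaeresis

print(composed == decomposed)                                        # False
print(normalize_unicode(composed) == normalize_unicode(decomposed))  # True
```

Both strings render identically, but without this step they would produce different cache keys.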

Don't split multi-line addresses naively

A common shape:

Acme Corp
123 Main St
Suite 400
Springfield, IL 62701

Naive line-splitting puts the company name in the road field and corrupts everything. The right move is to detect "first line is not a number-led address" and skip it; libpostal handles this if you join the lines with spaces and let it parse the whole thing.
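Joining the lines is a one-liner. The flatten_multiline name is an assumption, and the claim that the parser then separates the company name from the street is this post's, not something the snippet itself verifies.

```python
def flatten_multiline(raw: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces."""
    return " ".join(raw.split())

block = """Acme Corp
123 Main St
Suite 400
Springfield, IL 62701"""

print(flatten_multiline(block))
# Acme Corp 123 Main St Suite 400 Springfield, IL 62701
```

Pass the flattened string to the parser as a whole rather than mapping each line to a field.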

A complete pipeline

Putting it together, the preprocessing module for a real geocoding service:

# geocode_service.py
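from postal.parser import parse_address
import hashlib
import os
import requests

# strip_unit_from_road, expand_road, and normalize_postcode are the helpers
# defined in the sections above.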

API_KEY = os.environ['CSV2GEO_API_KEY']

def cache_key(parsed: dict, country: str) -> str:
    """Stable cache key from normalized components, no PII."""
    parts = '|'.join([
        parsed.get('house_number', ''),
        parsed.get('road', ''),
        parsed.get('city', ''),      # avoids collisions when the postcode is missing
        parsed.get('postcode', ''),
        country,                     # the caller's country, not the parsed 'usa' token
    ])
    return hashlib.sha256(parts.encode()).hexdigest()[:32]

def preprocess(raw: str, country: str = 'US') -> dict:
    components = {label: value for value, label in parse_address(raw)}
    components = strip_unit_from_road(components)
    if 'road' in components:
        components['road'] = expand_road(components['road'])
    if 'postcode' in components:
        components['postcode'] = normalize_postcode(components['postcode'], country)
    components['_raw'] = raw
    components['_cache_key'] = cache_key(components, country)
    return components

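def geocode(raw: str, cache: dict, country: str = 'US') -> dict | None:
    p = preprocess(raw, country)
    if p['_cache_key'] in cache:
        return cache[p['_cache_key']]
    # geocode_with_fallback_using_parsed is not defined in this post; assume
    # a variant of geocode_with_fallback that takes the already-parsed dict.
    result = geocode_with_fallback_using_parsed(p, country)
    if result:
        cache[p['_cache_key']] = result
    return result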

A pipeline with this structure routinely sees:

  • 95%+ match rate on US business mailing lists (started at ~78%)
  • 92%+ match rate on European multi-country B2B data (started at ~65%)
  • 60–70% cache hit rate at scale (deduplicates the 30+ ways the same address appears in dirty data)
  • Roughly 60% lower per-row spend vs the naive "send raw to geocoder" pipeline, mostly from the cache hits
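The link between cache hit rate and spend is simple arithmetic. In the sketch below, relative_spend is a hypothetical helper and the 1.15 calls-per-miss figure is an illustrative assumption (some misses trigger fallback retries), not a measured value.

```python
def relative_spend(cache_hit_rate: float, calls_per_miss: float = 1.0) -> float:
    """API calls per input row, relative to one call per row with no cache."""
    return (1.0 - cache_hit_rate) * calls_per_miss

# Assumed inputs: 65% of rows served from cache, 1.15 API calls per miss.
spend = relative_spend(0.65, calls_per_miss=1.15)
print(f"{1.0 - spend:.0%} lower spend")   # 60% lower spend
```

With no retries at all, the saving equals the hit rate; retries on misses pull it down slightly.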

The full deduplication story (stable keys, fuzzy matching for near-duplicates) is in Deduplicating Geocoded Addresses. The queue patterns that hold this kind of pipeline together are in Designing a Batch Geocoding Queue.

Frequently Asked Questions

Should I use regex or libpostal for address parsing?

Use libpostal. Regex breaks on edge cases like "1234 Main St Apt 5" vs "1234 Main St #5" vs "Apt 5, 1234 Main St" — there are dozens of permutations and you will patch them forever. libpostal is a trained statistical parser that handles ~95% of real-world addresses across 60+ languages out of the box.

How much does preprocessing actually improve geocoding match rates?

Real numbers: 78% → 95% on US business lists, 65% → 92% on European multi-country B2B data. Preprocessing also drives 60–70% cache hit rates at scale, which translates to roughly 60% lower per-row spend versus sending raw input to the geocoder.

Why normalize before hashing for cache keys?

Because "123 Main St", "123 Main Street", and "123 MAIN ST." all refer to the same address. Without normalization, each variant gets a different hash and misses the cache. After normalizing (expand abbreviations, lowercase, strip units, canonicalize postcodes), all three hash to the same key, so 30+ surface variations collapse to one cache entry.

Should I store both the raw input and the normalized version?

Yes, always keep the raw. Normalization is lossy — when you improve the normalizer later, you need the original input to re-process old rows. Storing both costs nothing and pays off the first time you upgrade your parser.

Does libpostal work for non-English addresses?

Yes — it is trained on 60+ languages including Chinese, Arabic, Russian, German, Spanish, and Portuguese. Match rates vary by country (best on US/UK/DE, weaker where training data is sparse), but it handles diacritics, postal-code-after-city ordering, and language-specific street words like "ulitsa" or "calle".

Summary

Preprocessing is not glamorous. It is the single highest-leverage step you can take to improve a geocoding pipeline. Three principles:

  1. Parse with a trained model (libpostal), not regex. The edge cases will eat you alive otherwise.
  2. Normalize for cache and index hits. Expand abbreviations, canonicalize postcodes, strip units, lowercase.
  3. Keep the original input. Always. For debugging and for re-processing when your normalization improves.

A 30-line preprocessing module turns a 70% pipeline into a 95% pipeline. It costs you nothing per call. It is the easiest 25 percentage points you will find in a geocoding system.
