Address Parsing Before Geocoding: Cleaning Inputs for Better Matches
How to parse, normalize, and clean addresses before sending them to a geocoder — the preprocessing step that turns 70% match rates into 95%.
The single biggest match-rate improvement most geocoding pipelines never make is parsing the address before sending it. The team running the pipeline assumes the geocoder will figure out "123 Main St Apt 4B Brooklyn NY 11201"; it usually does. They also assume it will figure out " Apt 4B / 123 main street Brooklyn, NEW YORK 11201-1234, USA "; this is where match rates start to slip. Every quirk in the input — a stray apartment prefix, a typo, a postcode in a non-canonical format — costs you a percentage point of match rate.
This post is the practical preprocessing playbook: what to parse out, what to normalize, what to leave alone, and the libraries that do this without you writing 2,000 lines of regex. Working code in Python (libpostal) and Node (the same, via bindings). At the end you should be able to lift a 70% match rate to 95% on dirty input data — with no change to the geocoding API itself.
Why preprocessing matters more than the geocoder
Most geocoding APIs do their own parsing internally. So why do it again upfront?
Because the geocoder's parser is optimized for one thing: figuring out which row in its address database your input refers to. It is not optimized for garbage characters, foreign-language address formats it has no priors for, OCR errors from scanned documents, or partial addresses where the user typed the street and forgot the city.
Cleaning inputs before they hit the geocoder gives you three concrete wins:
- Higher match rates. Cleaning normalizes 30+ ways to write the same address into one canonical form. The geocoder's index hits more often.
- Cheaper retries. If the first call returns no match, you can retry with progressively looser inputs (drop unit, drop ZIP+4, expand abbreviations). You can't retry intelligently if you don't know the structure of what you sent.
- Better cache hits. Two callers asking for "123 Main St" and "123 main street" should hit the same cache entry. They only do if you normalized first. The cache hit rate math is in How to Cache Geocoding Results.
The numbers are consistent across a few hundred batches we've watched: raw user input geocodes at 65–75%; the same inputs after a 30-line normalization step geocode at 90–95%. The gap is preprocessing.
What "parse" means here
Address parsing splits a free-form string into typed components:
| Component | Example |
|---|---|
| house_number | "1600" |
| road | "Pennsylvania Avenue NW" |
| unit | "Apt 4B" |
| city | "Washington" |
| state / region | "DC" |
| postcode | "20500" |
| country | "US" |
Doing this with regex is a trap. The address "100 Avenue Road, Toronto" will defeat a "house number first" regex; "50A Calle 8va, Mayagüez" will defeat any parser that assumes Latin word order; "Flat 2, 14 Beaumont Street, London W1G 6DT" will confuse anything that doesn't know about UK flat-prefixed addresses.
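To make the failure mode concrete, here is a minimal sketch of the kind of "house number first" pattern people tend to reach for. The regex and the `naive_parse` helper are illustrative assumptions, not code from any library:

```python
import re

# Hypothetical "house number first" pattern: number, road, comma, city.
NAIVE = re.compile(
    r'^\s*(?P<house_number>\d+\w?)\s+(?P<road>.+?),\s*(?P<city>[^,]+)'
)

def naive_parse(raw: str):
    """Return a dict of components, or None when the pattern does not match."""
    m = NAIVE.match(raw)
    return m.groupdict() if m else None

# Works on the happy path
print(naive_parse("123 Main St, Brooklyn, NY"))

# UK flat-prefixed address: no match at all
print(naive_parse("Flat 2, 14 Beaumont Street, London W1G 6DT"))

# Unit suffix: matches, but the unit leaks into the road
print(naive_parse("1234 Main St #5, Springfield"))
```

The first call parses cleanly, the second returns `None`, and the third reports the road as `Main St #5`. Each of those failures can be patched, but every patch tends to break a different permutation.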
The right answer is a trained model. The dominant open-source one is libpostal, a C library trained on OpenStreetMap data, with bindings for Python, Node, Go, and Ruby. It handles addresses in 60+ languages with reasonable accuracy out of the box. CSV2GEO uses libpostal internally for both the geocoding API and the file-upload tool.
Installing libpostal
Build the C library once:
# Ubuntu/Debian
sudo apt install -y curl autoconf automake libtool pkg-config
git clone https://github.com/openvenues/libpostal.git
cd libpostal
./bootstrap.sh
./configure --datadir=/usr/local/share/libpostal
make -j$(nproc)
sudo make install
sudo ldconfig
The --datadir flag tells the build where to store ~2GB of trained models, which are downloaded during make, so the build is slow (15–30 minutes). After that, every binding (Python, Node, Go) loads from /usr/local/share/libpostal.
Python:
pip install postal
Node:
npm install node-postal
Parsing in Python
# parse_address.py
from postal.parser import parse_address
raw = " Apt 4B / 123 main street Brooklyn, NEW YORK 11201-1234, USA "
parsed = parse_address(raw)
# [('apt 4b', 'unit'), ('123', 'house_number'), ('main street', 'road'),
# ('brooklyn', 'city'), ('new york', 'state'), ('11201-1234', 'postcode'),
# ('usa', 'country')]
components = {label: value for value, label in parsed}
print(components)
# {'unit': 'apt 4b', 'house_number': '123', 'road': 'main street',
# 'city': 'brooklyn', 'state': 'new york', 'postcode': '11201-1234',
# 'country': 'usa'}
Two things to notice:
- All values are lowercased. That's libpostal's convention; downstream you decide whether to title-case for display.
- Whitespace is trimmed. Leading/trailing spaces in the input have no effect on parsing.
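If you do need a display form, a small helper is usually enough. The sketch below is an assumption about what "title-case for display" might look like; `for_display` is a hypothetical name, and it uses only the standard library:

```python
import string

def for_display(components: dict) -> dict:
    """Re-case lowercased parser output for humans; not for cache keys."""
    out = {}
    for label, value in components.items():
        short_code = label in ('state', 'country') and len(value) <= 3
        if label == 'postcode' or short_code:
            out[label] = value.upper()             # 'ny' -> 'NY', 'sw1a 1aa' -> 'SW1A 1AA'
        else:
            out[label] = string.capwords(value)    # 'main street' -> 'Main Street'
    return out

print(for_display({'road': 'main street', 'city': 'brooklyn', 'state': 'ny'}))
```

Keep the lowercased form for matching and caching, and apply this only at the presentation layer. `string.capwords` capitalizes only the first letter of each whitespace-separated word, so a unit like "apt 4b" comes out as "Apt 4b".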
A practical wrapper that's safer for production:
# parser.py
from postal.parser import parse_address
def parse(raw: str) -> dict:
    if not raw or not raw.strip():
        return {}
    components = {label: value for value, label in parse_address(raw)}
    # Drop empty values libpostal sometimes returns for partial inputs
    return {k: v for k, v in components.items() if v}
Parsing in Node
// parser.mjs
import postal from 'node-postal';

export function parse(raw) {
  if (!raw || !raw.trim()) return {};
  // The parser lives under postal.parser in node-postal
  const parts = postal.parser.parse_address(raw);
  // node-postal returns [{value, component}, ...]
  return parts.reduce((acc, p) => {
    if (p.value) acc[p.component] = p.value;
    return acc;
  }, {});
}
Normalize before sending
Parsing tells you the structure. Normalizing canonicalizes the values so equivalent addresses hash to the same cache key and hit the same geocoder index entry.
The five normalizations that matter:
1. Expand directional and street-type abbreviations
EXPANSIONS = {
    'st': 'street', 'st.': 'street',
    'ave': 'avenue', 'ave.': 'avenue',
    'blvd': 'boulevard', 'blvd.': 'boulevard',
    'rd': 'road', 'rd.': 'road',
    'ln': 'lane', 'ln.': 'lane',
    'ct': 'court', 'ct.': 'court',
    'dr': 'drive', 'dr.': 'drive',
    'hwy': 'highway', 'hwy.': 'highway',
    'pkwy': 'parkway',
    'n': 'north', 'n.': 'north',
    's': 'south', 's.': 'south',
    'e': 'east', 'e.': 'east',
    'w': 'west', 'w.': 'west',
    'ne': 'northeast', 'nw': 'northwest',
    'se': 'southeast', 'sw': 'southwest',
}

def expand_road(road: str) -> str:
    # Keys are lowercase, so look up the lowercased token and
    # fall back to the token unchanged when there is no expansion.
    tokens = road.split()
    out = [EXPANSIONS.get(t.lower(), t) for t in tokens]
    return ' '.join(out)
This step alone moves match rates 5–10 points on US data. Geocoding indexes typically store the expanded form ("street") but accept both, so always expanding is the safe default.
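One caveat with blind token replacement: "st" at the start of "St Marks Avenue" means Saint, not Street. A position-aware variant avoids the most common false expansions. The sketch below is one possible approach under that assumption; `expand_road_positional` and its two small dictionaries are hypothetical and deliberately incomplete:

```python
STREET_TYPES = {'st': 'street', 'ave': 'avenue', 'blvd': 'boulevard',
                'rd': 'road', 'ln': 'lane', 'dr': 'drive'}
DIRECTIONALS = {'n': 'north', 's': 'south', 'e': 'east', 'w': 'west',
                'ne': 'northeast', 'nw': 'northwest',
                'se': 'southeast', 'sw': 'southwest'}

def expand_road_positional(road: str) -> str:
    """Expand directionals only at the ends and the street type only at the tail."""
    tokens = [t.rstrip('.') for t in road.lower().split()]
    if len(tokens) < 2:
        return ' '.join(tokens)
    # Leading directional: 'n main st' -> 'north ...'
    if tokens[0] in DIRECTIONALS:
        tokens[0] = DIRECTIONALS[tokens[0]]
    # Trailing directional: '... ave nw' -> '... northwest'
    type_idx = len(tokens) - 1
    if tokens[-1] in DIRECTIONALS:
        tokens[-1] = DIRECTIONALS[tokens[-1]]
        type_idx -= 1
    # Street type: last token, or the one just before a trailing directional
    if type_idx > 0 and tokens[type_idx] in STREET_TYPES:
        tokens[type_idx] = STREET_TYPES[tokens[type_idx]]
    return ' '.join(tokens)

print(expand_road_positional('St Marks Ave'))
print(expand_road_positional('Pennsylvania Ave NW'))
```

This still gets some names wrong (a street literally named "Avenue E" would become "avenue east"), so treat it as a starting point and check it against a sample of your own data.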
2. Strip unit/apartment from the road component
The geocoder doesn't care about the unit; it cares about the building. Including "Apt 4B" in the road field hurts match rates.
def strip_unit_from_road(parsed: dict) -> dict:
    """libpostal usually does this, but defense-in-depth."""
    if 'unit' in parsed:
        return parsed  # already split, nothing to do
    road = parsed.get('road', '').lower()
    # Crude but effective: split on common unit prefixes
    for prefix in [' apt ', ' suite ', ' unit ', ' #', ' fl ']:
        if prefix in road:
            road, unit = road.split(prefix, 1)
            parsed['road'] = road.strip()
            parsed['unit'] = (prefix.strip() + ' ' + unit).strip()
            break
    return parsed
3. Canonicalize the postcode
US ZIP+4 is fine for delivery but can hurt geocoding match rates if the +4 is wrong. Strip to base ZIP for the first attempt, retry with the +4 if you have one and the base failed:
def normalize_postcode(pc: str, country: str) -> str:
    if not pc:
        return ''
    # Uppercase and remove spaces and hyphens
    pc = pc.upper().replace(' ', '').replace('-', '')
    if country == 'US' and len(pc) > 5:
        return pc[:5]  # drop ZIP+4
    if country == 'GB' and len(pc) >= 5:
        # UK: insert a space before the final 3 chars: 'SW1A1AA' → 'SW1A 1AA'
        return pc[:-3] + ' ' + pc[-3:]
    # Canada ('A1A1A1' canonical) and everything else: already compact and uppercased
    return pc
Country-specific quirks matter. The full set of postcode formats (and how csv2geo handles them) is documented at /v1/divisions/by-postcode; 30+ countries are supported.
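The "base ZIP first, then ZIP+4" retry order can be expressed as a small pure function that returns the postcodes to try, in order. This is a sketch under the assumption that you have the raw postcode string in hand; `us_postcode_attempts` is a hypothetical helper name:

```python
import re

# 5 digits, optionally followed by a hyphen or space and 4 more digits
ZIP_RE = re.compile(r'^(\d{5})(?:[-\s]?(\d{4}))?$')

def us_postcode_attempts(raw_postcode: str) -> list[str]:
    """Return postcodes to try in order: base ZIP first, then ZIP+4 if present."""
    m = ZIP_RE.match(raw_postcode.strip())
    if not m:
        return []  # not a recognizable US ZIP; let the caller decide what to do
    base, plus4 = m.groups()
    attempts = [base]
    if plus4:
        attempts.append(f'{base}-{plus4}')
    return attempts

print(us_postcode_attempts('11201-1234'))  # ['11201', '11201-1234']
```

Returning an empty list for malformed input keeps the decision (skip the postcode, flag the row) with the caller instead of silently sending garbage.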
4. Strip country if obvious
If your data is single-country (a US-only mailing list, a UK-only delivery roster), drop the country token before sending. The geocoder will use your country= parameter and ignore an inline "USA" anyway, but a stray "USA" in the road field can confuse fuzzy matching.
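For a single-country dataset this can be a few lines on top of the parsed components. The sketch below handles only the parsed `country` component; the alias set and the `drop_known_country` name are assumptions for illustration:

```python
# Hypothetical alias list for a US-only dataset; extend for your data.
US_ALIASES = {'us', 'usa', 'u.s.', 'u.s.a.',
              'united states', 'united states of america'}

def drop_known_country(parsed: dict, aliases: set = US_ALIASES) -> dict:
    """Return a copy without the country component when it names the expected country."""
    cleaned = dict(parsed)  # do not mutate the caller's dict
    if cleaned.get('country', '').strip().lower() in aliases:
        del cleaned['country']
    return cleaned

print(drop_known_country({'road': 'main street', 'country': 'USA'}))
# {'road': 'main street'}
```

A country value that is not in the alias set is left in place on purpose: in a supposedly US-only list, "canada" is a signal worth keeping rather than discarding.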
5. Preserve the original
Whatever you do, keep the raw input. Always. Two reasons:
- If your normalization is wrong, you can re-process from the raw later.
- For the support ticket "your geocoder failed on this address," you need to see the customer's exact input, not your transformed version.
def parse_and_normalize(raw: str, country: str = 'US') -> dict:
    parsed = parse(raw)
    parsed = strip_unit_from_road(parsed)
    if 'road' in parsed:
        parsed['road'] = expand_road(parsed['road'])
    if 'postcode' in parsed:
        parsed['postcode'] = normalize_postcode(parsed['postcode'], country)
    parsed['_raw'] = raw  # preserve original
    return parsed
Sending the cleaned address
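The CSV2GEO API accepts both free-form (?q=...) and structured input. After preprocessing, prefer structured; it gives the geocoder more signal. The example here takes the simpler route and rebuilds a clean free-form query from the parsed components: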
import os
import requests

API_KEY = os.environ['CSV2GEO_API_KEY']

def geocode(parsed: dict, country: str = 'US') -> dict | None:
    # Street line is "house_number road"; the remaining parts are comma-separated
    street = ' '.join(
        p for p in (parsed.get('house_number'), parsed.get('road')) if p
    )
    parts = [street] + [parsed.get(f) for f in ('city', 'state', 'postcode')]
    q = ', '.join(p for p in parts if p)
    r = requests.get(
        'https://api.csv2geo.com/v1/geocode',
        params={'q': q, 'country': country},
        headers={'X-API-Key': API_KEY},
        timeout=10,
    )
    r.raise_for_status()
    results = r.json().get('results', [])
    return results[0] if results else None
The fallback ladder for low-confidence results
Sometimes the first attempt doesn't return a high-confidence match. Don't give up — retry with progressively less specific input. The classic ladder:
def geocode_with_fallback(raw: str, country: str = 'US') -> dict | None:
    parsed = parse_and_normalize(raw, country)

    # Attempt 1: full structured address
    result = geocode(parsed, country)
    if result and result.get('accuracy_score', 0) >= 0.8:
        return result

    # Attempt 2: drop the unit.
    # Only differs from attempt 1 if geocode() includes the unit in the query.
    if 'unit' in parsed:
        no_unit = {k: v for k, v in parsed.items() if k != 'unit'}
        result = geocode(no_unit, country)
        if result and result.get('accuracy_score', 0) >= 0.8:
            return result

    # Attempt 3: house_number + road + postcode (drop city/state)
    minimal = {k: parsed[k] for k in ('house_number', 'road', 'postcode') if k in parsed}
    if 'road' in minimal and 'postcode' in minimal:
        result = geocode(minimal, country)
        if result and result.get('accuracy_score', 0) >= 0.7:
            return result

    # Attempt 4: postcode only. Last resort; returns the postcode centroid.
    if 'postcode' in parsed:
        result = geocode({'postcode': parsed['postcode']}, country)
        if result:
            result['_fallback_level'] = 'postcode_centroid'
            return result

    return None
The drop in confidence at each rung is the trade-off you're making. A postcode centroid is not as accurate as a rooftop coordinate, so be honest about it in your downstream system. The full breakdown of confidence scores and what they mean is in Geocoding Confidence Scores Explained.
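The ladder itself can also be written as data, which keeps the rungs testable without any network calls. The following is a minimal sketch of that idea, not the author's implementation; `fallback_ladder` and the rung names are assumptions:

```python
from typing import Iterator

# (level name, fields to send, fields that must be present for the rung to apply)
RUNGS = [
    ('full', ('house_number', 'road', 'city', 'state', 'postcode'), ('road',)),
    ('street_postcode', ('house_number', 'road', 'postcode'), ('road', 'postcode')),
    ('postcode_centroid', ('postcode',), ('postcode',)),
]

def fallback_ladder(parsed: dict) -> Iterator[tuple[str, dict]]:
    """Yield (level, components) from most to least specific, skipping duplicates."""
    seen = set()
    for level, fields, required in RUNGS:
        if not all(parsed.get(f) for f in required):
            continue
        candidate = {f: parsed[f] for f in fields if parsed.get(f)}
        key = tuple(sorted(candidate.items()))
        if key not in seen:  # avoid paying for the same query twice
            seen.add(key)
            yield level, candidate

parsed = {'house_number': '123', 'road': 'main street', 'city': 'brooklyn',
          'state': 'new york', 'postcode': '11201', 'unit': 'apt 4b'}
for level, components in fallback_ladder(parsed):
    print(level, components)
```

A caller would loop over the rungs, send each candidate to the geocoder, and stop at the first result that clears its confidence threshold, recording the level name alongside the result.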
What NOT to do
Three preprocessing mistakes I've seen kill match rates:
Don't translate addresses to English
Addresses in non-English-speaking countries are usually best left in their native form. "Calle Mayor 1, Madrid" geocodes better than "Main Street 1, Madrid"; "Champs-Élysées" geocodes better than "Elysian Fields". Most geocoders index the local-language version.
Don't strip diacritics
Same reason. "Pentélē" should stay as is for Greek addresses; "Düsseldorf" should keep the umlaut for German. Stripping accents collapses distinct streets onto each other and your match rate drops.
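What is safe is Unicode normalization, which unifies the different byte-level encodings of the same accented character without removing the accent. A short sketch using only the standard library (the function names are hypothetical, and the two street names are made up for illustration):

```python
import unicodedata

def normalize_unicode(text: str) -> str:
    """Compose characters (NFC) so 'u' + combining diaeresis equals 'ü'."""
    return unicodedata.normalize('NFC', text)

def strip_diacritics(text: str) -> str:
    """Shown only to illustrate the damage; not recommended for addresses."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

composed = 'D\u00fcsseldorf'      # single code point for ü
decomposed = 'Du\u0308sseldorf'   # u followed by a combining diaeresis

print(composed == decomposed)                                        # False
print(normalize_unicode(composed) == normalize_unicode(decomposed))  # True

# Stripping collapses two different names onto one string
print(strip_diacritics('Calle Peña') == strip_diacritics('Calle Pena'))  # True
```

Applying NFC before hashing a cache key means the same visible string always produces the same key, while the diacritics the geocoder needs stay intact.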
Don't split on multiple newlines naively
A common shape:
Acme Corp
123 Main St
Suite 400
Springfield, IL 62701
Naive line-splitting puts the company name in the road field and corrupts everything. The right move is to detect "first line is not a number-led address" and skip it; libpostal handles this if you join the lines with spaces and let it parse the whole thing.
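Joining the lines is a one-liner. A minimal sketch (the `flatten_address_block` name is an assumption); the resulting string is what you would hand to the parser:

```python
def flatten_address_block(block: str) -> str:
    """Collapse a multi-line address into one space-separated line."""
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, tabs, and blank lines.
    return ' '.join(block.split())

block = """Acme Corp
123 Main St

Suite 400
Springfield, IL 62701
"""
print(flatten_address_block(block))
# Acme Corp 123 Main St Suite 400 Springfield, IL 62701
```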
A complete pipeline
Putting it together, the preprocessing module for a real geocoding service:
# geocode_service.py
# Assumes expand_road, strip_unit_from_road, normalize_postcode, and the
# fallback ladder from the earlier sections live in (or are imported into) this module.
import hashlib
import os

import requests
from postal.parser import parse_address

API_KEY = os.environ['CSV2GEO_API_KEY']

def cache_key(parsed: dict, country: str) -> str:
    """Stable cache key from normalized components, no PII."""
    parts = '|'.join([
        parsed.get('house_number', ''),
        parsed.get('road', ''),
        parsed.get('city', ''),    # city and state keep addresses without
        parsed.get('state', ''),   # a postcode from colliding across towns
        parsed.get('postcode', ''),
        country,                   # caller's country code, not the parsed 'usa' token
    ])
    return hashlib.sha256(parts.encode()).hexdigest()[:32]

def preprocess(raw: str, country: str = 'US') -> dict:
    components = {label: value for value, label in parse_address(raw)}
    components = strip_unit_from_road(components)
    if 'road' in components:
        components['road'] = expand_road(components['road'])
    if 'postcode' in components:
        components['postcode'] = normalize_postcode(components['postcode'], country)
    components['_raw'] = raw
    components['_cache_key'] = cache_key(components, country)
    return components

# Named geocode_cached so it does not shadow the lower-level geocode(parsed, country)
def geocode_cached(raw: str, cache: dict, country: str = 'US') -> dict | None:
    p = preprocess(raw, country)
    if p['_cache_key'] in cache:
        return cache[p['_cache_key']]
    # The fallback ladder, adapted to take already-parsed components (not shown)
    result = geocode_with_fallback_using_parsed(p, country)
    if result:
        cache[p['_cache_key']] = result
    return result
A pipeline with this structure routinely sees:
- 95%+ match rate on US business mailing lists (started at ~78%)
- 92%+ match rate on European multi-country B2B data (started at ~65%)
- 60–70% cache hit rate at scale (deduplicates the 30+ ways the same address appears in dirty data)
- Roughly 60% lower per-row spend vs the naive "send raw to geocoder" pipeline, mostly from the cache hits
The full deduplication story (stable keys, fuzzy matching for near-duplicates) is in Deduplicating Geocoded Addresses. The queue patterns that hold this kind of pipeline together are in Designing a Batch Geocoding Queue.
Frequently Asked Questions
Should I use regex or libpostal for address parsing?
Use libpostal. Regex breaks on edge cases like "1234 Main St Apt 5" vs "1234 Main St #5" vs "Apt 5, 1234 Main St" — there are dozens of permutations and you will patch them forever. libpostal is a trained statistical parser that handles ~95% of real-world addresses across 60+ languages out of the box.
How much does preprocessing actually improve geocoding match rates?
Real numbers: 78% → 95% on US business lists, 65% → 92% on European multi-country B2B data. Preprocessing also drives 60–70% cache hit rates at scale, which translates to roughly 60% lower per-row spend versus sending raw input to the geocoder.
Why normalize before hashing for cache keys?
Because "123 Main St", "123 Main Street", and "123 MAIN ST." all refer to the same address. Without normalization, each variant gets a different hash and misses the cache. After normalizing (expand abbreviations, lowercase, strip units, canonicalize postcodes), all three hash to the same key, so 30+ surface variations collapse to one cache entry.
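A few lines of standard-library Python show the effect. This is a deliberately tiny sketch; the `simple_key` helper and its three-entry abbreviation table are assumptions standing in for the fuller normalization described earlier:

```python
import hashlib
import re

ABBREVIATIONS = {'st': 'street', 'ave': 'avenue', 'rd': 'road'}

def simple_key(address: str) -> str:
    """Lowercase, drop punctuation, expand abbreviations, then hash."""
    tokens = re.sub(r'[^\w\s]', ' ', address.lower()).split()
    canonical = ' '.join(ABBREVIATIONS.get(t, t) for t in tokens)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()[:32]

variants = ['123 Main St', '123 Main Street', '123 MAIN ST.']

raw_keys = {hashlib.sha256(v.encode('utf-8')).hexdigest() for v in variants}
normalized_keys = {simple_key(v) for v in variants}

print(len(raw_keys))         # 3 distinct keys without normalization
print(len(normalized_keys))  # 1 key after normalization
```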
Should I store both the raw input and the normalized version?
Yes, always keep the raw. Normalization is lossy — when you improve the normalizer later, you need the original input to re-process old rows. Storing both costs nothing and pays off the first time you upgrade your parser.
Does libpostal work for non-English addresses?
Yes — it is trained on 60+ languages including Chinese, Arabic, Russian, German, Spanish, and Portuguese. Match rates vary by country (best on US/UK/DE, weaker where training data is sparse), but it handles diacritics, postal-code-after-city ordering, and language-specific street words like "ulitsa" or "calle".
Summary
Preprocessing is not glamorous. It is the single highest-leverage step you can take to improve a geocoding pipeline. Three principles:
- Parse with a trained model (libpostal), not regex. The edge cases will eat you alive otherwise.
- Normalize for cache and index hits. Expand abbreviations, canonicalize postcodes, strip units, lowercase.
- Keep the original input. Always. For debugging and for re-processing when your normalization improves.
A 30-line preprocessing module turns a 70% pipeline into a 95% pipeline. It costs you nothing per call. It is the easiest 25 percentage points you will find in a geocoding system.