Where your hotel guests come from: address catchment analysis
Batch geocode guest addresses, map catchment zones with boundaries, and find feeder markets. REST patterns in curl, Python, and Node. No SDK required.
Your property management system has recorded the home address of every guest who checked in for the last three years. That data is sitting in a reservation table, used for nothing except mailing confirmation letters. This post shows how to turn it into something useful: a geocoded, boundary-joined catchment map that tells your revenue management team exactly where demand comes from, how far guests are willing to travel by market segment, and which feeder markets are underserved by your current channel mix.
The technical work is a batch geocoding job followed by a boundary lookup per coordinate. Neither is complicated. The engineering effort is a day; the dataset you produce will change conversations in your next quarterly review.
Why hotels have this problem
Property management systems are built to manage rooms, not analyse geography. The guest address lives in a free-text field that varies in quality — some records have full postcodes, others have city names only, a few are corporate billing addresses three countries from where the guest actually lives. The CRM exports it as a CSV column labelled guest_address or billing_address_line_1 and leaves the interpretation to you.
Revenue management tools typically consume this data one of two ways: they either ignore it entirely, or they aggregate it to the country level, which tells you nothing useful. "42% of guests are domestic" is not actionable. "42% of guests are domestic, and of those, 68% come from within 250 km, concentrated in three postal districts that are currently underrepresented in our paid search spend" — that is actionable.
To get from raw addresses to that second sentence, you need three things: a geocoder that handles messy, international, free-text input; a boundary dataset that tells you which administrative zone each coordinate falls in; and a pipeline that runs at the scale of a real booking history (tens of thousands of records) without requiring a GIS engineer to babysit it.
CSV2GEO's batch geocoding endpoint and its Boundaries endpoint handle the first two. This post handles the third.
What the API gives you
Two endpoints do the work here.
`POST /api/v1/geocode/batch` accepts up to 100 addresses per request body and returns a geocoded result for each: latitude, longitude, confidence score, and normalised address components. It handles free-text input, partial addresses, and international formats across 39 countries, matching against the full 461M+ address index. You POST a list of addresses, each tagged with your own id; you get a list of results back in the same order, and the examples in this post match results to rows by id rather than by position. Confidence below a threshold (typically 0.7) signals a match you should flag for manual review rather than include in your analysis.
`GET /api/v1/boundaries` takes a latitude and longitude and returns the boundary polygons — and the names — of the administrative zones that contain that point. You choose which level: country, state/region, county, or postcode. For hotel catchment analysis you typically want county or postcode level for domestic guests and country or region level for international guests. The response includes the zone name, the zone code, and the polygon geometry if you want to render it.
Both endpoints are described at csv2geo.com/api. Both use the same API key as every other endpoint on the platform.
The pipeline in plain English
Before touching code, here is the shape of the full pipeline. Each step maps to a section below.
1. Export the guest address CSV from your PMS or CRM.
2. Batch geocode every address in groups of 100. Write lat/lng and confidence back to the same table.
3. Drop rows below your confidence threshold; these go to a separate review file.
4. For each geocoded coordinate, call the Boundaries endpoint to assign the guest to a zone.
5. Aggregate: count guests per zone, compute zone-level metrics (average lead time, average rate, segment split).
6. Export the final table to your BI tool or plot it directly as a choropleth.
Steps 2 and 4 are the ones that touch the API. The rest is SQL or Pandas. The full pipeline for a 20,000-reservation dataset runs in under thirty minutes on a laptop, most of which is spent waiting on API calls and on the pauses between them that keep you inside rate limits.
Step 1: Export and clean the raw address data
Pull the address CSV from your PMS. The minimum columns you need are reservation_id, guest_address, and whatever segment fields you want to analyse against geography — rate category, booking channel, length of stay, lead time, nationality declared at check-in.
Clean the address column before it touches the geocoder. Two transformations are worth doing in Python before the API call:

    import csv
    import re

    # strings that get typed into the field when there is no real address
    PLACEHOLDERS = {'n/a', 'na', 'unknown', 'tba', '-', '.'}

    def clean_address(raw):
        if not raw:
            return None
        # collapse whitespace
        addr = re.sub(r'\s+', ' ', raw.strip())
        # drop known placeholder strings
        if addr.lower() in PLACEHOLDERS:
            return None
        return addr

    with open("reservations_raw.csv", newline="") as fin, \
         open("reservations_clean.csv", "w", newline="") as fout, \
         open("ungeocoded.csv", "w", newline="") as fskip:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        skipped = csv.DictWriter(fskip, fieldnames=reader.fieldnames)
        writer.writeheader()
        skipped.writeheader()
        for row in reader:
            cleaned = clean_address(row.get("guest_address", ""))
            if cleaned:
                row["guest_address"] = cleaned
                writer.writerow(row)
            else:
                # keep the raw row so the reservations team can see what was captured
                skipped.writerow(row)

Rows where clean_address returns nothing usable go to a separate ungeocoded.csv; these are the records where your PMS captured nothing useful. In a typical PMS export you will find 5–15% of records in this category, higher for OTA bookings where the channel does not pass guest address through.
Step 2: Batch geocode in groups of 100
The batch endpoint takes up to 100 addresses per POST. The pattern is straightforward — chunk the clean CSV, POST each chunk, write the results back.

    import csv
    import os
    import time
    import requests

    API = "https://csv2geo.com/api/v1/geocode/batch"
    KEY = os.environ["CSV2GEO_API_KEY"]
    BATCH_SIZE = 100
    CONFIDENCE_THRESHOLD = 0.7

    def chunks(seq, n):
        for i in range(0, len(seq), n):
            yield seq[i:i + n]

    with open("reservations_clean.csv", newline="") as fin:
        rows_in = list(csv.DictReader(fin))

    # keep the input column order, then append the geocoding columns
    fieldnames_out = list(rows_in[0].keys()) + ["lat", "lng", "confidence", "geocode_status"]

    with open("reservations_geocoded.csv", "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames_out)
        writer.writeheader()
        for batch in chunks(rows_in, BATCH_SIZE):
            payload = [{"id": r["reservation_id"], "address": r["guest_address"]} for r in batch]
            resp = requests.post(API, json={"addresses": payload, "api_key": KEY}, timeout=60)
            resp.raise_for_status()
            # index results by id so a missing result cannot shift rows
            results = {item["id"]: item for item in resp.json()["results"]}
            for row in batch:
                res = results.get(row["reservation_id"], {})
                conf = res.get("confidence") or 0  # treat a missing or null score as 0
                row["lat"] = res.get("lat")
                row["lng"] = res.get("lng")
                row["confidence"] = conf
                row["geocode_status"] = "ok" if conf >= CONFIDENCE_THRESHOLD else "low_confidence"
                writer.writerow(row)
            time.sleep(0.5)  # polite pacing; adjust to your plan's rate limit

The same logic in Node if that is your preferred pipeline language:

    import { readFileSync, writeFileSync } from 'node:fs';
    import { parse } from 'csv-parse/sync';
    import { stringify } from 'csv-stringify/sync';

    const API = 'https://csv2geo.com/api/v1/geocode/batch';
    const KEY = process.env.CSV2GEO_API_KEY;
    const BATCH_SIZE = 100;
    const CONFIDENCE_THRESHOLD = 0.7;

    function chunks(arr, n) {
      const out = [];
      for (let i = 0; i < arr.length; i += n) out.push(arr.slice(i, i + n));
      return out;
    }

    // the sync parser takes a string or Buffer, not a stream
    const raw = parse(readFileSync('reservations_clean.csv'), { columns: true });
    const allRows = [];

    for (const batch of chunks(raw, BATCH_SIZE)) {
      const body = {
        addresses: batch.map(r => ({ id: r.reservation_id, address: r.guest_address })),
        api_key: KEY,
      };
      const resp = await fetch(API, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(body),
      });
      if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
      const { results } = await resp.json();
      const byId = Object.fromEntries(results.map(r => [r.id, r]));
      for (const row of batch) {
        const res = byId[row.reservation_id] ?? {};
        const conf = res.confidence ?? 0;
        allRows.push({
          ...row,
          lat: res.lat ?? '',
          lng: res.lng ?? '',
          confidence: conf,
          geocode_status: conf >= CONFIDENCE_THRESHOLD ? 'ok' : 'low_confidence',
        });
      }
      await new Promise(r => setTimeout(r, 500)); // polite pacing between batches
    }

    writeFileSync('reservations_geocoded.csv', stringify(allRows, { header: true }));

The time.sleep(0.5) / setTimeout(r, 500) pacing caps throughput at 120 batches per minute, and less once request latency is counted. A pilot of a few thousand rows needs only a few dozen batch calls, well within the free tier's 3,000 calls/day budget, and the pace is far inside paid plan limits for production. If your reservation history runs to 100,000 rows, the full geocoding job is 1,000 batches; at a conservative pace it completes in under two hours unattended.
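Fixed pacing does not help when a single batch is rejected for exceeding the rate limit. The following is a minimal sketch of retry with exponential backoff, not the platform's documented behaviour: it assumes throttling is signalled with HTTP 429, and `post_with_backoff` and its parameters are hypothetical names. The sender is passed in as a callable so the retry logic can be exercised without a network.

```python
import time

def post_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or retries run out.

    send: zero-argument callable returning an object with a .status_code
          attribute (for example a requests.Response).
    Assumption: the API signals throttling with HTTP 429.
    """
    resp = None
    for attempt in range(max_retries + 1):
        resp = send()
        if resp.status_code != 429:
            return resp
        if attempt < max_retries:
            # wait base_delay, then 2x, 4x, ... before trying again
            sleep(base_delay * (2 ** attempt))
    return resp  # still throttled after the last attempt; the caller decides what to do
```

In the Python loop above this would wrap the POST, for example `resp = post_with_backoff(lambda: requests.post(API, json=body, timeout=60))`, followed by the usual `raise_for_status()`.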
Step 3: Handle low-confidence rows honestly
After geocoding, split the output into two files:

    import csv

    with open("reservations_geocoded.csv", newline="") as fin, \
         open("catchment_ready.csv", "w", newline="") as fgood, \
         open("catchment_review.csv", "w", newline="") as fbad:
        reader = csv.DictReader(fin)
        good_writer = csv.DictWriter(fgood, fieldnames=reader.fieldnames)
        bad_writer = csv.DictWriter(fbad, fieldnames=reader.fieldnames)
        good_writer.writeheader()
        bad_writer.writeheader()
        for row in reader:
            if row["geocode_status"] == "ok":
                good_writer.writerow(row)
            else:
                bad_writer.writerow(row)

The review file is for your reservations team, not the geocoder. A corporate billing address with a confidence of 0.55 usually means the address field has a company name in it rather than a street address. Fixing the upstream CRM data is worth the effort if the low-confidence rate is above 10%; anything below that is expected noise for a live hospitality dataset.
Do not discard low-confidence rows silently. Dropping them from your analysis without logging them overstates data quality and will produce a catchment map that systematically under-represents markets where your PMS integration is weakest — often corporate accounts, which are exactly the accounts your sales team most wants to understand.
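To see where the low-confidence rows concentrate, and whether any one source crosses the 10% mark, it helps to break the rate down by a segment field. A small sketch, assuming the export carries a `booking_channel` column (the column name is an assumption; use whatever your PMS calls it):

```python
from collections import Counter

def low_confidence_rate_by(rows, field="booking_channel"):
    """Return {field value: share of rows whose geocode_status is not 'ok'}."""
    totals, low = Counter(), Counter()
    for row in rows:
        key = row.get(field) or "unknown"
        totals[key] += 1
        if row.get("geocode_status") != "ok":
            low[key] += 1
    return {key: low[key] / totals[key] for key in totals}

# tiny illustrative sample, not real data
sample = [
    {"booking_channel": "direct", "geocode_status": "ok"},
    {"booking_channel": "direct", "geocode_status": "ok"},
    {"booking_channel": "ota", "geocode_status": "low_confidence"},
    {"booking_channel": "ota", "geocode_status": "ok"},
]
rates = low_confidence_rate_by(sample)
```

Feeding it the rows of reservations_geocoded.csv (read with `csv.DictReader`) gives one rate per channel, which is usually enough to tell a data-capture problem from ordinary noise.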
For a full treatment of what confidence scores mean and how to calibrate the threshold for your use case, see Geocoding confidence scores explained.
Step 4: Assign each guest to a boundary zone
With clean coordinates in hand, the Boundaries endpoint assigns each guest to an administrative zone. A single curl call to illustrate the shape of the request and response:

    curl -G "https://csv2geo.com/api/v1/boundaries" \
      --data-urlencode "lat=51.5074" \
      --data-urlencode "lng=-0.1278" \
      --data-urlencode "levels=country,region,county" \
      --data-urlencode "api_key=$CSV2GEO_API_KEY"

Returns something like:

    {
      "results": {
        "country": { "name": "United Kingdom", "code": "GB" },
        "region": { "name": "England", "code": "GB-ENG" },
        "county": { "name": "Greater London", "code": "UKI" }
      }
    }

For a catchment analysis you usually want one level per call: pick the level appropriate to your property's market. A city-centre hotel analysing domestic guests wants county or postcode. An airport resort doing international analysis wants country or region.
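If one property serves both domestic and international guests, the level can be chosen per guest rather than per run. A minimal sketch, assuming each geocoded row carries a two-letter country code for the guest; `pick_level` and its defaults are hypothetical and only encode the rule of thumb from this section:

```python
def pick_level(guest_country, property_country,
               domestic_level="county", international_level="country"):
    """Choose a boundary level for one guest.

    Domestic guests get the finer level; everyone else, including rows with
    no country at all, gets the coarser one.
    """
    if guest_country and guest_country.upper() == property_country.upper():
        return domestic_level
    return international_level
```

A UK property would call `pick_level(row_country, "GB")` and pass the result as the `levels` parameter of the lookup.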
In Python, batch the boundary lookups by reading the geocoded file row by row and making one call per coordinate. There is no multi-point batch endpoint for boundaries (unlike elevation), so you parallelise with a thread pool:

    import csv
    import os
    import requests
    from concurrent.futures import ThreadPoolExecutor, as_completed

    API = "https://csv2geo.com/api/v1/boundaries"
    KEY = os.environ["CSV2GEO_API_KEY"]
    LEVEL = "county"  # or "country", "region", "postcode"

    def get_zone(row):
        try:
            r = requests.get(
                API,
                params={"lat": row["lat"], "lng": row["lng"],
                        "levels": LEVEL, "api_key": KEY},
                timeout=15,
            )
            r.raise_for_status()
            zone = r.json()["results"].get(LEVEL) or {}
            return row["reservation_id"], zone.get("name"), zone.get("code")
        except Exception as exc:
            # report the failure instead of hiding it; the row keeps an empty zone
            print(f"boundary lookup failed for {row['reservation_id']}: {exc}")
            return row["reservation_id"], None, None

    with open("catchment_ready.csv", newline="") as fin:
        rows = list(csv.DictReader(fin))

    zone_map = {}
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {pool.submit(get_zone, r): r for r in rows}
        for fut in as_completed(futures):
            rid, name, code = fut.result()
            zone_map[rid] = {"zone_name": name, "zone_code": code}

Ten concurrent workers is conservative. Adjust based on your plan's concurrency allowance and the guidance in Concurrency tuning for geocoding pipelines.
Step 5: Aggregate into a catchment table
With every reservation now carrying a zone_name and zone_code, the aggregation is plain SQL or Pandas. Here is the Pandas version:

    import pandas as pd

    # read reservation_id as text so it matches the string keys in zone_map
    df = pd.read_csv("catchment_ready.csv", dtype={"reservation_id": str})
    zones_df = pd.DataFrame.from_dict(zone_map, orient="index").reset_index()
    zones_df.columns = ["reservation_id", "zone_name", "zone_code"]
    merged = df.merge(zones_df, on="reservation_id", how="left")

    merged["lead_time_days"] = pd.to_numeric(merged.get("lead_time_days", 0), errors="coerce")
    merged["adr"] = pd.to_numeric(merged.get("adr", 0), errors="coerce")  # average daily rate

    # groupby skips rows whose zone is missing, so failed lookups drop out here
    catchment = (
        merged
        .groupby(["zone_name", "zone_code"])
        .agg(
            guest_count=("reservation_id", "count"),
            avg_lead_days=("lead_time_days", "mean"),
            avg_adr=("adr", "mean"),
        )
        .sort_values("guest_count", ascending=False)
        .reset_index()
    )
    catchment.to_csv("catchment_summary.csv", index=False)
    print(catchment.head(20))

The output is a table with one row per zone: how many guests came from that zone, how far in advance they booked, and what average rate they paid. The top ten rows of that table are almost certainly the most useful single artifact your revenue management team has seen this year.
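The zone table says where guests come from, not how far they travelled. A sketch of the distance step, using the haversine great-circle formula with a mean Earth radius of 6,371 km. The band edges are illustrative assumptions, and straight-line distance understates road distance, so treat the bands as approximate:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_band(km, edges=(50, 250, 1000)):
    """Label a distance with the first band edge it falls under."""
    for edge in edges:
        if km < edge:
            return f"<{edge} km"
    return f">={edges[-1]} km"
```

Applying `distance_band(haversine_km(hotel_lat, hotel_lng, guest_lat, guest_lng))` to each row, then grouping by band and segment, gives the "how far guests travel by market segment" cut.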
What the data typically shows
Having run this analysis on a range of hospitality datasets, the patterns that emerge consistently are worth naming so you know what to look for.
The 80/20 catchment. In most city hotels, 80% of domestic guests come from fewer than 12 zones. The remaining zones together contribute 20% of domestic volume spread across dozens of areas. The 12 core zones are the ones that justify dedicated sales resources, direct mail, and targeted paid search. The long tail gets brand-level marketing, not one-to-one relationship management.
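Whether your own property follows this pattern is a one-function check on the catchment table. A sketch, where `zones_to_reach` is a hypothetical helper taking a mapping of zone to guest count:

```python
def zones_to_reach(counts, share=0.8):
    """Smallest number of top zones whose guests cover `share` of all guests."""
    total = sum(counts.values())
    if total == 0:
        return 0
    running = 0
    # walk zones from largest to smallest, accumulating guests
    for n, count in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += count
        if running >= share * total:
            return n
    return len(counts)

# illustrative counts, not real data
example = {"zone_a": 60, "zone_b": 25, "zone_c": 10, "zone_d": 5}
core_zone_count = zones_to_reach(example)  # two zones cover 85 of 100 guests
```

With the Pandas table from Step 5, the input would be `dict(zip(catchment["zone_name"], catchment["guest_count"]))`.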
Rate versus volume inversion. The highest-volume feeder zone is often not the highest-rate feeder zone. Guests from the nearest major city drive volume; guests from secondary markets two to three hours away often pay higher rates because they are travelling for a purpose (event, healthcare, family) rather than for leisure. Splitting the catchment table by segment — leisure versus corporate versus group — usually reveals that your sales team's territory map does not match the actual revenue map.
The missing zone. Nearly every hotel that runs this analysis for the first time finds one zone that sends meaningful volume but receives no attention from the sales or marketing function. It is usually a place no one on the team personally lives in, which is exactly why it has been overlooked. The geocoded catchment makes it visible.
Elevation as a surrogate for urban density. Cities at higher elevation tend to be smaller regional centres with strong drive-to-leisure markets. A property in Denver (at 1,597 m) draws differently from a property in Miami (at 1 m) even when both are full-service hotels of comparable size. If you are building a multi-property analysis, elevation per feeder-market centroid is a cheap additional variable that correlates with the urban-density and drive-time story.
Caching the boundary lookups
Boundary results are stable. A coordinate inside Greater London today is inside Greater London next year. Cache them aggressively — at the coordinate level, rounded to four decimal places (approximately 11 m precision, far tighter than any boundary).
A simple Redis-backed cache wrapper:

    import json
    import os
    import redis
    import requests

    API = "https://csv2geo.com/api/v1/boundaries"
    KEY = os.environ["CSV2GEO_API_KEY"]
    _redis = redis.Redis(host="localhost", decode_responses=True)
    TTL = 86400 * 90  # 90 days

    def get_zone_cached(lat, lng, level="county"):
        # values read from CSV are strings, so convert before rounding
        lat, lng = float(lat), float(lng)
        key = f"boundary:{level}:{round(lat, 4)}:{round(lng, 4)}"
        cached = _redis.get(key)
        if cached is not None:
            return json.loads(cached)
        r = requests.get(
            API,
            params={"lat": lat, "lng": lng, "levels": level, "api_key": KEY},
            timeout=15,
        )
        r.raise_for_status()
        zone = r.json()["results"].get(level) or {}
        _redis.setex(key, TTL, json.dumps(zone))
        return zone

On a 20,000-reservation dataset, repeat guests and several guests from the same building are common. Because the cache key is the rounded coordinate, the second time you see the same address you pay nothing; a different address in the same postal district still costs one lookup. In a full-year reservation history with typical repeat-guest rates, caching typically reduces boundary API calls by 40–60%. See Caching geocoding results — 90% cost reduction for the broader caching architecture this pattern fits into.
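You can estimate the saving before spending any boundary calls, because the hit rate depends only on how many coordinates collapse to the same rounded key. A sketch that works offline on the geocoded file; `expected_cache_hit_rate` is a hypothetical helper and assumes the same four-decimal rounding as the cache key:

```python
def expected_cache_hit_rate(coords, places=4):
    """Share of lookups a cache keyed on rounded coordinates would serve.

    coords: iterable of (lat, lng) pairs; strings from a CSV are accepted.
    """
    keys = [(round(float(lat), places), round(float(lng), places)) for lat, lng in coords]
    if not keys:
        return 0.0
    # every key after its first occurrence is a cache hit
    return 1 - len(set(keys)) / len(keys)

# illustrative coordinates: three near-identical London points, one in Paris
coords = [
    ("51.50740", "-0.12780"),
    ("51.50741", "-0.12782"),
    ("48.8566", "2.3522"),
    ("51.5074", "-0.1278"),
]
hit_rate = expected_cache_hit_rate(coords)
```

If the estimate comes back far below what you expected, a coarser key is an option, at the cost of misassigning more points that sit close to a boundary edge.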
Cost arithmetic
A concrete example. A mid-size hotel has 18,000 reservations per year. Cleaning removes 10% of them as placeholder rows, leaving 16,200 geocodable records.
- Geocoding: 162 batch calls of 100 = 162 credits
- Boundary lookups: 16,200 calls (uncached) = 16,200 credits
- With a 50% cache hit rate on repeat coordinates: ~8,100 boundary credits
- Total first year: approximately 8,262 credits
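The same arithmetic as a function you can rerun with your own volumes. This is a sketch that assumes one credit per batch call and one per uncached boundary call, as the example does; check the pricing page for how credits are actually metered:

```python
import math

def estimate_credits(records, batch_size=100, cache_hit_rate=0.0):
    """Return (geocode_calls, boundary_calls, total) for one enrichment run.

    Assumption: each batch call and each uncached boundary call costs one credit.
    """
    geocode_calls = math.ceil(records / batch_size)
    boundary_calls = round(records * (1 - cache_hit_rate))
    return geocode_calls, boundary_calls, geocode_calls + boundary_calls

print(estimate_credits(16_200, cache_hit_rate=0.5))  # → (162, 8100, 8262)
```

Boundary lookups dominate the total, so the cache hit rate is the input worth measuring rather than guessing.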
The free tier covers 3,000 calls per day, so a pilot on 3,000 reservations costs nothing. The paid tier starts at $54/month for 100,000 calls — a full year's analysis for a single property comfortably fits within one month's allocation at that level. Add a second property's reservation history and you are still within the same bracket. See the live brackets at csv2geo.com/pricing/api.
The number to take to finance is not "API cost per year." It is "API cost per actionable insight" — if the catchment map surfaces one under-invested feeder zone that your sales team then converts into a corporate account worth £40k in annual room revenue, the ROI calculation does not require a spreadsheet.
Operationalising: from pilot to production
The pipeline above is designed as a one-shot enrichment. In production, you want it to run incrementally so that new reservations are geocoded and zone-assigned within 24 hours of check-out.
The simplest pattern: a nightly job pulls all reservations where geocode_status IS NULL, geocodes them in batches, then resolves their boundary zones. Write the results back to the reservation table. Your BI tool queries the table directly; no manual export-and-rerun cycle.
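A sketch of the two database steps in that nightly job, using SQLite from the standard library so it runs anywhere. The table name `reservations` and its columns are assumptions; adapt the SQL to your own schema and database driver:

```python
import sqlite3

def pending_reservations(conn, limit=100):
    """Return up to `limit` (reservation_id, guest_address) rows not yet geocoded."""
    cur = conn.execute(
        "SELECT reservation_id, guest_address FROM reservations "
        "WHERE geocode_status IS NULL ORDER BY reservation_id LIMIT ?",
        (limit,),
    )
    return cur.fetchall()

def mark_geocoded(conn, reservation_id, lat, lng, confidence, threshold=0.7):
    """Write one geocoding result back and set the status used by later steps."""
    status = "ok" if confidence >= threshold else "low_confidence"
    conn.execute(
        "UPDATE reservations SET lat = ?, lng = ?, confidence = ?, geocode_status = ? "
        "WHERE reservation_id = ?",
        (lat, lng, confidence, status, reservation_id),
    )
    conn.commit()
```

Pulling 100 rows at a time matches the batch size, so each loop iteration is one API call followed by up to 100 write-backs. Because the status is set on write, a job that crashes midway simply resumes from the remaining NULL rows.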
The observability layer matters here. Log the number of records geocoded per run, the proportion landing in low_confidence, the cache-hit rate on boundary lookups, and the proportion coming back no_results (which flags addresses the geocoder could not resolve at all, typically international addresses outside its coverage). A chart of those four metrics over time tells you whether your PMS data quality is improving or degrading, which is a useful proxy for whether your front-desk team is capturing address information reliably at check-in.
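A sketch of the per-run summary, kept deliberately generic: it reports the share of every status value it finds rather than hard-coding status names, and takes the cache counters as plain integers. `run_metrics` and the shape of its output are assumptions, not part of any platform API:

```python
from collections import Counter

def run_metrics(rows, cache_hits, cache_lookups):
    """Summarise one nightly run as a dict suitable for logging as JSON."""
    total = len(rows)
    statuses = Counter(r.get("geocode_status") or "missing" for r in rows)
    return {
        "geocoded": total,
        # share of rows per status, e.g. {"ok": 0.75, "low_confidence": 0.25}
        "status_share": {s: n / total for s, n in statuses.items()},
        "cache_hit_rate": cache_hits / cache_lookups if cache_lookups else 0.0,
    }

# illustrative run: four rows, 30 cache hits out of 40 boundary lookups
metrics = run_metrics(
    [{"geocode_status": "ok"}] * 3 + [{"geocode_status": "low_confidence"}],
    cache_hits=30,
    cache_lookups=40,
)
```

Writing this dict to your log once per run gives the time series the chart needs without any extra infrastructure.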
See Observability for geocoding pipelines for the full metric set and the alerting rules that catch data-quality regressions before they corrupt your catchment map.
Frequently Asked Questions
How many addresses can I geocode in one call? The batch geocoding endpoint accepts up to 100 addresses per POST request. For larger datasets, chunk your input and make sequential calls, pacing between batches to stay within your plan's rate limit. A 20,000-row reservation history requires 200 batch calls.
What confidence score threshold should I use for hospitality address data? 0.7 is a reasonable starting point for typical PMS data. Hospitality address quality varies: OTA bookings often pass truncated or corporate billing addresses, which lower confidence. Run a sample of 100 rows manually against the geocoded output to calibrate whether 0.7 is too aggressive or too lenient for your specific PMS export. See Geocoding confidence scores explained for a full treatment.
Does the Boundaries endpoint return polygon geometry I can use in a map? Yes. Append &geometry=true to the request and each boundary result includes the polygon in GeoJSON format. You can use this to render a choropleth in any mapping library that accepts GeoJSON — the colour gradient of guest counts per zone is the standard visualisation for catchment analysis.
How do I handle guests who gave a corporate billing address rather than their home address? Flag them using the geocode_status field and route them to a separate analysis. Corporate addresses are useful for a different kind of analysis — mapping corporate account concentration — but mixing them with residential addresses in a catchment map distorts the leisure-demand picture. Keep them in a separate segment cut.
Can I run this analysis across multiple properties to compare catchment overlap? Yes. The pipeline is property-agnostic. Run it for each property, tag each output row with property_id, then aggregate across properties to see which feeder zones send guests to multiple properties and which zones are exclusively served by one. This is the input your revenue management team needs for multi-property channel allocation.
Is the free tier sufficient to run a pilot on a real reservation dataset? The free tier allows 3,000 calls per day without a credit card. At 100 addresses per batch call, that is up to 300,000 address lookups per day, more than enough to geocode a full year's reservations for most hotels in a single day. Boundary lookups consume one call each, so a 3,000-reservation pilot needs 30 batch calls plus up to 3,000 boundary calls: 3,030 in total, just over one day's quota. Cache repeat coordinates, trim the sample to about 2,900 rows, or let the boundary step finish the next day, and the pilot stays free.
What happens for international guest addresses in countries outside the 39 covered? The geocoder returns a result for the 39 supported countries only. Addresses from unsupported countries return a no_results status. In practice, for most hotels operating primarily in supported markets, unsupported-country addresses represent a small tail of exotic bookings — route them to a simple country = [declared nationality] fallback using the guest's nationality field from the PMS, which is usually more reliable than the billing address for international guests anyway.
Related Articles
- Benchmarking geocoding APIs — honest numbers — what to measure when evaluating the accuracy of your geocoded reservation data
- Caching geocoding results — 90% cost reduction — why boundary lookups are the easiest thing to cache in any geocoding pipeline
- Reverse geocoding accuracy in meters — understanding how positional error affects zone assignment at boundary edges
- Geocoding confidence scores explained — how to calibrate the confidence threshold for your PMS address data quality
- Observability for geocoding pipelines — metrics and alerting to keep your nightly enrichment job honest
---
*I.A. / CSV2GEO Creator*