Cleaning hotel inventory addresses from many suppliers

Normalise, geocode, and deduplicate hotel inventory addresses from dozens of suppliers in one pipeline. REST patterns, failure modes covered.

June 24, 2026

Hotel inventory aggregators run on a dirty secret: the same property appears in their database dozens of times. Supplier A calls it "Radcliffe Hotel & Conference Ctr, 14 Broad St". Supplier B sends "Hotel Radcliff, 14 Broad Street, Floor Lobby". Supplier C sends coordinates that are 80 metres off — close enough to be the same building, far enough that a naïve proximity check splits them into two records. The same room is available on all three feeds, all three records show up in a customer search, and your customer books twice.

At small scale, a data team hand-merges. At medium scale, someone builds a fuzzy name-match job. At enterprise scale (50 suppliers, 300,000 properties, feeds refreshed daily), the only approach that does not require a dedicated team of data engineers is a geocode-then-dedup pipeline: normalise every supplier address through a geocoding API, collapse on normalised coordinates, and keep one canonical record per physical building.

This post builds that pipeline end to end. Patterns, code, failure modes, and the cost maths to defend it.

The problem with name-based deduplication

Name matching is the instinctive starting point and it does not survive contact with real supplier data.

Hotels change brand flags. The "Marriott Waterfront" that Supplier A indexed three years ago is now the "Delta Hotels by Marriott Waterfront" in Supplier B's feed. Name similarity drops below any reasonable threshold; your dedup treats them as different properties.

Transliteration varies. An East Asian property with a romanised name will appear as four or five plausible spellings across international suppliers, none of which a fuzzy string matcher will confidently collapse. Meanwhile an entirely different hotel with a similar English name in a nearby city will produce false positives.

Street address formatting varies by country and by data team culture. "14 Broad St" and "14 Broad Street" match trivially. "14 Broad St, Suite 100" and "14 Broad St" might or might not refer to the same front door. "Avenida das Nações Unidas 14401" and "14401 Av. Nações Unidas" almost certainly do but require country-specific parsing logic to confirm.

The solution is to abandon string matching as the primary key and replace it with physical location. Geocode every address to a lat/lng, snap each lat/lng to the nearest 50-metre grid cell, and collapse records that share a grid cell and a plausible name cluster. This strategy is robust to brand changes, transliteration variation, and format differences, because the physical building does not move when the brand flag changes.

Pipeline architecture

The full pipeline has four stages.

  1. Ingest — receive supplier feeds (CSV, JSON, XML, SFTP drop, webhook — all of them, in whatever format they arrive).
  2. Geocode — send each record's address through the geocoding API, store lat/lng and confidence score alongside the raw address.
  3. Snap and cluster — group geocoded records by proximity (50 m grid cell or Haversine distance threshold), then apply a lightweight name-similarity check within each geographic cluster to catch the rare case where two genuinely different hotels occupy the same city block.
  4. Canonicalise — elect one record per cluster as the canonical master (typically the one with the highest geocoding confidence score), merge supplier-specific fields (content, photos, room types, rates), and write the canonical record to the public-facing inventory.

Stages 3 and 4 are application logic. Stage 2 is where the API does the heavy lifting, and it is what this post focuses on.

Geocoding the pipeline — REST patterns

CSV2GEO's geocoding endpoint accepts a free-text address and returns a normalised lat/lng, a canonical address string, and a confidence score between 0 and 1. The canonical address string alone is valuable — it normalises "14 Broad St" and "14 Broad Street" to the same output, so a downstream string comparison on the canonical form eliminates a class of false duplicates before the coordinate-clustering step.

A single call — curl

curl -G "https://csv2geo.com/api/v1/geocode" \
  --data-urlencode "q=14 Broad Street London EC2N 1HQ" \
  --data-urlencode "api_key=$CSV2GEO_API_KEY"

Response shape (simplified):

{
  "meta": { "count": 1 },
  "results": [
    {
      "lat": 51.5138,
      "lng": -0.0877,
      "confidence": 0.95,
      "formatted": "14 Broad Street, London, EC2N 1HQ, England",
      "country_code": "GB"
    }
  ]
}

The confidence field is your first-pass dedup signal. A score above 0.85 means the geocoder matched the full address to a point in our 461M-address dataset. A score below 0.6 means the match was at city or postal-code level — the lat/lng is a centroid, not a door, and you should flag the record for manual review before publishing it in live inventory. See Geocoding Confidence Scores Explained for the full breakdown of what each confidence band means in practice.
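The bands above can be turned into a small routing function. This is a minimal sketch: the name `route_by_confidence` and the label strings are illustrative, and treating the middle band (0.6 to 0.85) as "publish but monitor" is an assumption, since the thresholds above only define the two ends.

```python
def route_by_confidence(confidence, publish_at=0.85, review_below=0.6):
    """Map a geocoder confidence score to a handling decision.

    The 0.85 and 0.6 cut-offs come from the bands described above; the
    label names and the middle-band behaviour are assumptions.
    """
    if confidence is None or confidence < review_below:
        return "needs_review"       # centroid-level match, or no result at all
    if confidence > publish_at:
        return "auto_publish"       # full-address match
    return "publish_and_monitor"    # assumed handling for the middle band
```

Keeping the two cut-offs as parameters makes it easy to tighten them per supplier once you have seen real data.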

Batch geocoding — Python

A real supplier feed arrives as thousands of records. Looping with a single call per record works but wastes latency budget. A better pattern is to run concurrent requests bounded by a semaphore — geocoding is I/O-bound and embarrassingly parallel up to the rate limit on your plan.

import asyncio
import os
import csv
import aiohttp

API = "https://csv2geo.com/api/v1/geocode"
KEY = os.environ["CSV2GEO_API_KEY"]
CONCURRENCY = 20  # tune to your plan's rate limit

async def geocode_one(session, sem, row):
    async with sem:
        params = {"q": row["address"], "api_key": KEY}
        async with session.get(API, params=params, timeout=aiohttp.ClientTimeout(total=30)) as r:
            if r.status == 429:
                # Back off and let the caller retry — see idempotent-geocoding post.
                return {**row, "lat": None, "lng": None, "confidence": None, "error": "rate_limited"}
            r.raise_for_status()
            data = await r.json()
            if not data["results"]:
                return {**row, "lat": None, "lng": None, "confidence": 0.0, "error": "no_result"}
            top = data["results"][0]
            return {
                **row,
                "lat": top["lat"],
                "lng": top["lng"],
                "confidence": top["confidence"],
                "formatted": top.get("formatted"),
                "error": None,
            }

async def geocode_feed(input_path, output_path):
    with open(input_path) as fin:
        rows = list(csv.DictReader(fin))

    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [geocode_one(session, sem, row) for row in rows]
        results = await asyncio.gather(*tasks)

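    # Keep the supplier's column order, then append the geocoding columns.
    # (A set union here would shuffle the CSV header between runs.)
    added = ["lat", "lng", "confidence", "formatted", "error"]
    out_fields = list(rows[0].keys()) + [f for f in added if f not in rows[0]]
    with open(output_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=out_fields)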
        writer.writeheader()
        writer.writerows(results)

asyncio.run(geocode_feed("supplier_feed.csv", "geocoded_feed.csv"))

A 10,000-record supplier feed at CONCURRENCY=20 typically finishes in under 60 seconds on a standard server. The semaphore keeps you politely inside the rate limit; if a 429 slips through, the caller retries with exponential backoff — the pattern in Exponential Backoff — When to Retry, When to Stop applies verbatim here.
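One way the retry could look is sketched below. It assumes the result shape returned by geocode_one above (an "error" key set to "rate_limited" on HTTP 429); the helper name `with_backoff` and its default delays are illustrative choices, not values the API prescribes.

```python
import asyncio
import random

async def with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry an async geocoding call while it reports rate limiting.

    `call` is any zero-argument coroutine function that returns a dict with
    an "error" key, as geocode_one does. Defaults are assumptions.
    """
    result = None
    for attempt in range(max_attempts):
        result = await call()
        if result.get("error") != "rate_limited":
            return result
        if attempt == max_attempts - 1:
            break
        # Exponential delay with full jitter, capped at max_delay.
        delay = min(max_delay, base_delay * (2 ** attempt))
        await asyncio.sleep(random.uniform(0, delay))
    return result  # still rate-limited after max_attempts; the caller decides
```

A call site might look like `await with_backoff(lambda: geocode_one(session, sem, row))`. Because geocode_one releases the semaphore before returning, the sleep does not hold a concurrency slot.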

Batch geocoding — Node

import { createReadStream, createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';

const API = 'https://csv2geo.com/api/v1/geocode';
const KEY = process.env.CSV2GEO_API_KEY;
const CONCURRENCY = 20;

async function geocodeOne(address) {
  const url = `${API}?q=${encodeURIComponent(address)}&api_key=${KEY}`;
  const r = await fetch(url, { signal: AbortSignal.timeout(30_000) });
  if (r.status === 429) return { lat: null, lng: null, confidence: null, error: 'rate_limited' };
  if (!r.ok) throw new Error(`http ${r.status}`);
  const data = await r.json();
  if (!data.results.length) return { lat: null, lng: null, confidence: 0, error: 'no_result' };
  const top = data.results[0];
  return { lat: top.lat, lng: top.lng, confidence: top.confidence, formatted: top.formatted, error: null };
}

async function geocodeBatch(addresses) {
  const results = [];
  for (let i = 0; i < addresses.length; i += CONCURRENCY) {
    const chunk = addresses.slice(i, i + CONCURRENCY);
    const settled = await Promise.allSettled(chunk.map(a => geocodeOne(a)));
    for (const s of settled) {
      results.push(s.status === 'fulfilled' ? s.value : { lat: null, lng: null, error: s.reason?.message });
    }
  }
  return results;
}

The Promise.allSettled pattern is deliberate: a single bad address in the chunk should not abort the whole batch. Individual failures are collected with their error message, so they can be logged and replayed in the next run rather than crashing the pipeline.

Step-by-step: the full dedup workflow

Step 1: Normalise and geocode every incoming record

On every supplier feed ingestion, geocode each address and write (lat, lng, confidence, formatted_address) back to the staging table alongside the raw supplier fields. Do not attempt deduplication at this stage — you want the full geocoded dataset before you cluster.

For records where confidence < 0.6, write them to a needs_review queue rather than the main staging table. A hotel address that geocodes to city-centroid level is almost certainly a data-entry error or a country-format problem. Publishing it to live inventory with a centroid coordinate will place the map pin in the middle of a river.

Treat geocoding as idempotent — if the pipeline crashes mid-feed, re-running it should produce the same output for records already processed. The simplest implementation is a geocoded_at timestamp column: skip records where it is non-null. This also means re-geocoding is free — you only pay for records you have not seen before.
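The skip logic can be sketched in a few lines. This assumes staging rows are plain dicts with a geocoded_at field; the helper names `pending_rows` and `mark_geocoded` are hypothetical, and in a real pipeline the same filter would more likely be a WHERE clause on the staging table.

```python
from datetime import datetime, timezone

def pending_rows(rows):
    """Return only the staging rows that have not been geocoded yet."""
    return [r for r in rows if not r.get("geocoded_at")]

def mark_geocoded(row, result):
    """Return a copy of the row with the geocoding result and a UTC timestamp."""
    stamp = datetime.now(timezone.utc).isoformat()
    return {**row, **result, "geocoded_at": stamp}
```

Writing the timestamp in the same transaction as the coordinates is what makes a restart safe: a row is either fully geocoded or not at all.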

Step 2: Snap each geocoded point to a 50-metre grid cell

Convert lat/lng to a grid cell key. The simplest implementation that works for most urban hotel densities:

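import math

def grid_cell(lat, lng, cell_m=50):
    """
    Returns a string key for the ~cell_m grid cell containing (lat, lng).
    One degree of latitude is ~111,320 m everywhere; a degree of longitude
    shrinks by cos(latitude), so the longitude step is widened to compensate.
    """
    lat_step = cell_m / 111_320
    row = math.floor(lat / lat_step)
    # Use the row's centre latitude so every point in a row shares one longitude step.
    row_lat = math.radians((row + 0.5) * lat_step)
    lng_step = cell_m / (111_320 * max(math.cos(row_lat), 0.01))
    return f"{row},{math.floor(lng / lng_step)}"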

This is deliberately approximate. The goal is to group records that are almost certainly the same building, not to achieve cadastral precision. Records that share a grid cell key are candidates for merging; records in adjacent cells that fail a secondary name-similarity check are left as separate properties (the rare genuine case of two hotels on the same city block).

For city blocks in dense urban centres — Tokyo, Manhattan, central London — you may need to tighten the cell to 25 m. For rural resorts where a single property spans 10 hectares, you may need to loosen to 200 m. Make the cell_m parameter a configuration value, not a constant.
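Grid snapping has an edge effect: two records a few metres apart can fall on either side of a cell boundary. The Haversine distance threshold mentioned in the architecture section is one way to compare candidates in adjacent cells. The sketch below is illustrative; the helper names and the 50 m default are assumptions to tune alongside the cell size.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def within(a, b, max_m=50):
    """True if two geocoded records lie within max_m metres of each other."""
    return haversine_m(a["lat"], a["lng"], b["lat"], b["lng"]) <= max_m
```

Running this only on records in the same or neighbouring cells keeps the comparison count small while removing the boundary artefact.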

Step 3: Apply a name-similarity check within each geographic cluster

Most clusters at this point contain records that are unambiguously the same property — different spellings, different supplier formatting, same building. A small fraction contain genuinely different properties (two hotels on the same block, a hotel and a serviced apartment complex sharing an address block).

A simple safeguard: within each geographic cluster, compute a pairwise similarity score on the stripped property name (the example below uses difflib's SequenceMatcher ratio rather than a true Levenshtein distance). If any pair of records in the cluster scores below a threshold (0.4 in this example, equivalent to a normalised distance above 0.6), split the cluster.

from difflib import SequenceMatcher

def name_similarity(a, b):
    a_norm = "".join(c.lower() for c in a if c.isalnum())
    b_norm = "".join(c.lower() for c in b if c.isalnum())
    return SequenceMatcher(None, a_norm, b_norm).ratio()

def should_merge(records, threshold=0.4):
    """Return True if all records in the cluster are plausibly the same property."""
    names = [r["property_name"] for r in records]
    for i, a in enumerate(names):
        for b in names[i+1:]:
            if name_similarity(a, b) < threshold:
                return False
    return True

The threshold is conservative by design: 0.4 means "Radcliffe Hotel" and "Hotel Radcliff" merge (similarity ~0.59) while "Radcliffe Hotel" and "Budget Inn" do not (similarity ~0.2). Tune it against a sample of your actual supplier data before shipping.
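When the check fails, the cluster still has to be split into something. One possible approach, sketched here as an assumption rather than the only option, is a greedy partition: each record joins the first group whose members it all resembles, otherwise it starts a new group. The function name `split_by_name` is hypothetical.

```python
from difflib import SequenceMatcher

def _norm(name):
    """Lower-case and strip everything except letters and digits."""
    return "".join(c.lower() for c in name if c.isalnum())

def split_by_name(records, threshold=0.4):
    """Greedily partition a geographic cluster into name-consistent groups."""
    groups = []
    for rec in records:
        name = _norm(rec["property_name"])
        for group in groups:
            # Join only if similar enough to every existing member.
            if all(
                SequenceMatcher(None, name, _norm(g["property_name"])).ratio() >= threshold
                for g in group
            ):
                group.append(rec)
                break
        else:
            groups.append([rec])  # no compatible group found
    return groups
```

The result depends on input order, so sorting records by confidence first gives the most trusted record the chance to seed each group.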

Step 4: Elect a canonical record and merge supplier fields

Within each confirmed cluster, elect the canonical record. The simplest heuristic: the record with the highest geocoding confidence score wins the canonical lat/lng and formatted_address. Supplier-specific fields (room types, rates, photos, policies) are kept from all suppliers and stored in a supplier array on the canonical record — you want all supplier content, you just want one physical-location identity.

def canonicalise_cluster(cluster):
    primary = max(cluster, key=lambda r: r["confidence"] or 0)
    return {
        "canonical_id": primary["id"],
        "lat": primary["lat"],
        "lng": primary["lng"],
        "formatted_address": primary["formatted"],
        "confidence": primary["confidence"],
        "supplier_records": [r["supplier_id"] for r in cluster],
        "supplier_count": len(cluster),
    }

The supplier_count field earns its place in the schema. A canonical record backed by eight supplier records is more reliable than one backed by one. Surface this in your inventory dashboard — it is a useful proxy for "how confident are we that this property exists and is in the right location."

Step 5: Publish and cache, then re-geocode only what changes

The canonical record goes to your public-facing inventory. Geocoded coordinates do not change unless the address changes, and hotels rarely relocate. Cache the geocoded result per (supplier_id, address) pair and skip re-geocoding on subsequent feed refreshes where the address field is unchanged.

This is where the 90% cost reduction materialises. A daily feed refresh for 300,000 properties where 1% of addresses change per day means you are geocoding 3,000 records, not 300,000. At the entry paid tier, that is well within the monthly call budget. See Caching Geocoding Results — 90% Cost Reduction for the full cache-key strategy.
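A minimal sketch of the cache lookup follows. The (supplier_id, address) key comes from the paragraph above; hashing it with SHA-256 and normalising case and whitespace before hashing are assumptions, as are the helper names. The cache here is any dict-like store.

```python
import hashlib

def cache_key(supplier_id, address):
    """Stable key for a (supplier_id, address) pair.

    Case and whitespace are normalised so purely cosmetic changes do not
    trigger a paid re-geocode (an assumption; drop it if you need exactness).
    """
    canonical = " ".join(address.split()).lower()
    return hashlib.sha256(f"{supplier_id}|{canonical}".encode("utf-8")).hexdigest()

def rows_to_geocode(rows, cache):
    """Return the rows whose (supplier_id, address) key is not cached yet."""
    return [r for r in rows if cache_key(r["supplier_id"], r["address"]) not in cache]
```

On a daily refresh, only the rows returned by rows_to_geocode need to hit the API; everything else reuses the stored coordinates.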

Failure modes to plan for

Three failure modes that always appear in production and never appear in the prototype.

The moved hotel. A property physically relocates: a boutique hotel closes and reopens two blocks away under the same brand name. Supplier A updates their address; Supplier B has a stale record. Your pipeline geocodes both, gets two different grid cells, and (correctly) treats them as two separate canonical records. As the remaining suppliers update, the old canonical record loses its supplier coverage and eventually falls out of inventory through your TTL logic. This is correct behaviour; document it so the operations team knows it is not a bug.

The multi-building property. A resort with five buildings spread across 400 metres. Supplier A geocodes to the main reception; Supplier B geocodes to the spa building; Supplier C sends coordinates for the conference centre. All three are on the same property but land in different grid cells. The fix is a property-boundary merge step: after the grid-cell cluster, run a second pass that groups clusters whose centroids are within a configurable radius (300 m by default) and whose names are highly similar. This is rarer than it sounds — it affects resorts and casino complexes, not urban hotels.
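The second pass could be sketched as follows, assuming each cluster has already been reduced to a dict with a centroid (lat, lng) and a representative name. The 300 m radius comes from the paragraph above; the 0.8 name threshold for "highly similar", the field names, and the function name are assumptions. The pairwise loop is quadratic, so in practice it would run per city or per neighbouring-cell group.

```python
import math
from difflib import SequenceMatcher

def _dist_m(a, b):
    """Haversine distance in metres between two dicts with lat/lng keys."""
    p1, p2 = math.radians(a["lat"]), math.radians(b["lat"])
    dl = math.radians(b["lng"] - a["lng"])
    h = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6_371_000 * math.asin(min(1.0, math.sqrt(h)))

def _sim(a, b):
    """Similarity ratio on names stripped to lower-case alphanumerics."""
    def norm(s):
        return "".join(c.lower() for c in s if c.isalnum())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def merge_nearby_clusters(clusters, radius_m=300, name_threshold=0.8):
    """Group clusters whose centroids are close and whose names nearly match."""
    parent = list(range(len(clusters)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            close = _dist_m(clusters[i], clusters[j]) <= radius_m
            if close and _sim(clusters[i]["name"], clusters[j]["name"]) >= name_threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, cluster in enumerate(clusters):
        groups.setdefault(find(i), []).append(cluster)
    return list(groups.values())
```

Union-find lets a chain of buildings merge even when the two ends are further apart than the radius, which suits a resort strung along a beach.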

Geocoding confidence drift. An address that geocoded at confidence 0.92 three months ago re-geocodes at 0.55 today because the supplier changed their address format and the new format matches less cleanly. If your pipeline only geocodes new or changed addresses, this drift goes undetected. A lightweight nightly job that spot-checks 1% of the canonical inventory for confidence score changes catches drift before it pollutes the public-facing coordinates.
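A spot-check job along those lines might look like the sketch below. The 1% sample comes from the paragraph above; the 0.2 drop that counts as drift, the record fields, and the injected `regeocode` callable (which would wrap the geocoding call and return a fresh confidence score) are assumptions.

```python
import random

def sample_for_spot_check(records, fraction=0.01, seed=None):
    """Pick a random fraction of canonical records (at least one, if any exist)."""
    if not records:
        return []
    k = max(1, round(len(records) * fraction))
    return random.Random(seed).sample(records, k)

def detect_drift(records, regeocode, max_drop=0.2):
    """Return records whose fresh confidence fell by more than max_drop.

    `regeocode` takes an address string and returns a confidence score.
    """
    drifted = []
    for rec in records:
        fresh = regeocode(rec["address"])
        if rec["confidence"] - fresh > max_drop:
            drifted.append({**rec, "fresh_confidence": fresh})
    return drifted
```

Feeding the drifted records into the same needs_review queue used for low-confidence matches keeps the operational surface small.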

Cost maths

A realistic enterprise scenario: 300,000 unique hotel properties across 39 countries, fed by 50 suppliers, refreshed daily.

  • Initial load: 300,000 geocoding calls (one per unique address on first ingest). At the entry paid tier ($54/month for 100,000 calls), this is a one-time spend of roughly three months of the entry tier — or a single higher-tier month. See csv2geo.com/pricing/api for the current bracket that fits your volume.
  • Daily incremental: 1% address-change rate = 3,000 geocoding calls per day = 90,000 calls per month. That fits comfortably within a 100,000-call plan.
  • Re-geocoding on format change: accounted for in the 1% rate — in practice, supplier address-format changes cause more re-geocoding than property moves. Log the address_changed flag per record so you can measure this accurately.
  • Free tier development: 3,000 calls per day at no cost. Enough to geocode a 3,000-record supplier sample per day — a month of free development covers a solid pilot.

The total ongoing marginal cost per canonical property per month is well under a cent. The value of clean, non-duplicated inventory — correct availability, no double-bookings, customer trust — is not something to quantify in a blog post, but you already know it.
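The arithmetic behind those figures is easy to re-run with your own numbers. The values below are the ones quoted in this section; the 30-day month is an assumption, and the price should be checked against the current pricing page.

```python
properties = 300_000
daily_change_rate = 0.01      # 1% of addresses change per day
plan_calls = 100_000          # entry paid tier, calls per month
plan_price_usd = 54           # entry paid tier, per month
days_per_month = 30           # assumption

initial_load_months = properties / plan_calls            # months of entry tier for first ingest
daily_calls = round(properties * daily_change_rate)      # incremental calls per day
monthly_calls = daily_calls * days_per_month             # incremental calls per month
cost_per_property = plan_price_usd / properties          # USD per canonical property per month

print(initial_load_months, daily_calls, monthly_calls, cost_per_property)
```

At a 1% change rate the monthly volume uses 90% of the plan, so a modest rise in the change rate is enough to push you into the next bracket.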

Observability

Two metrics to watch from day one.

Cluster merge rate. The ratio of geocoded records to canonical records. If this is 1.0, your dedup is not working — every record is its own canonical. If it is 5.0, your dedup is too aggressive — you are merging genuinely different properties. For a healthy aggregator with 50 suppliers, expect a merge rate of 2.5 to 4.0: each canonical property is covered by two to four supplier records on average.

Low-confidence rate. The fraction of geocoded records with confidence < 0.6. Above 5%, something changed in a supplier's address formatting — investigate. A sudden spike in low-confidence records from a single supplier is usually an upstream data-quality event (a format migration, a new market they entered with unfamiliar address conventions) that you want to catch before it degrades the canonical inventory.
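Both metrics can be computed directly from the staging data. This is a sketch that assumes records are dicts with supplier_id and confidence fields; breaking the low-confidence rate down per supplier is a design choice that makes the single-supplier spike described above visible.

```python
from collections import Counter

def merge_rate(geocoded_count, canonical_count):
    """Ratio of geocoded supplier records to canonical records."""
    return geocoded_count / canonical_count if canonical_count else 0.0

def low_confidence_rate_by_supplier(records, threshold=0.6):
    """Fraction of each supplier's records with confidence below the threshold.

    Records with no confidence at all (failed geocodes) count as low.
    """
    total, low = Counter(), Counter()
    for rec in records:
        total[rec["supplier_id"]] += 1
        if rec["confidence"] is None or rec["confidence"] < threshold:
            low[rec["supplier_id"]] += 1
    return {supplier: low[supplier] / total[supplier] for supplier in total}
```

Alerting when any supplier's rate crosses 5% mirrors the investigation threshold suggested above.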

For the full observability pattern, see Observability for Geocoding Pipelines.

Frequently Asked Questions

What confidence threshold should I use to flag records for manual review? Start at 0.6. Records below that threshold geocoded to a postal district or city centroid rather than a building. For hospitality inventory, a centroid puts the map pin in the wrong place — potentially in the middle of a park or a river. Above 0.85 is building-level confidence and safe to publish automatically.

How do I handle addresses in countries where the street-number-after-street-name convention differs? Pass the full address string as-is. The geocoding API covers 39 countries and handles local address conventions server-side — you do not need to pre-parse or reorder address components. If your supplier strips the country name, append the two-letter ISO country code as a suffix before geocoding; it materially improves match rates for short or ambiguous addresses.
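Appending the country code can be a one-line pre-processing step. In this sketch the helper name and the ", CC" separator are assumptions; the only requirement stated above is that the two-letter ISO code ends up as a suffix.

```python
def with_country_suffix(address, country_code):
    """Append an ISO 3166-1 alpha-2 code unless the address already ends with it."""
    code = country_code.strip().upper()
    tokens = address.replace(",", " ").split()
    if tokens and tokens[-1].upper() == code:
        return address  # supplier already included the code
    return f"{address.rstrip(' ,')}, {code}"
```

Comparing only the last token avoids false matches on words that merely end in the same two letters.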

What if two genuinely different hotels share the same address? This happens — separate towers with different brands in the same mixed-use development sometimes share a street address. The name-similarity check in Step 3 is your safety net. If name_similarity("Marriott Tower", "Element Hotel") < 0.4, the cluster splits even though the coordinates match. You will occasionally need to manually curate edge cases; log them and use them to tune your threshold.

Is geocoding safe to retry if the pipeline crashes mid-run? Yes, if you implement idempotency correctly. Store a geocoded_at timestamp per record and skip records where it is set. A crash mid-feed then a restart re-processes only the unfinished records, not the whole feed. The idempotent geocoding pattern covers the full implementation.

Does the geocoding API cover all 39 countries equally well? Coverage depth varies. Western Europe, North America, Australia, and Japan have the densest address coverage — approaching building-level precision for the vast majority of addresses. Some markets in South-East Asia, parts of Africa, and rural areas globally may match at street or postal-district level. The confidence score tells you per-record which situation you are in; do not assume uniform coverage depth across your entire inventory.

Can I use the API key across multiple supplier ingestion workers running in parallel? Yes. The API key is not tied to a single connection — run as many parallel workers as your concurrency budget allows. Use the semaphore pattern shown in the Python example above to stay within your plan's rate limit. If you need higher throughput, contact the team about a higher-rate plan before hitting the limit in production.

What happens if a supplier sends coordinates (lat/lng) rather than a text address? Use the reverse geocoding endpoint (/api/v1/reverse) to produce a canonical formatted address, then proceed normally. Reverse geocoding accuracy at building level is covered in detail in Reverse Geocoding Accuracy in Meters. Note that some suppliers send intentionally imprecise coordinates — a hotel that does not want to appear in competitor price scrapers — so always sanity-check that the confidence score on the returned address is reasonable before using the coordinate as canonical.

---

*I.A. / CSV2GEO Creator*
