Mapping health data to boundaries without storing PII
Aggregate patient or case addresses up to admin boundaries for public health reporting without retaining raw PII. REST patterns, code, failure modes.
Epidemiologists need case counts at the postcode, census tract, or health district level. They do not need the patient's home address sitting in a reporting database three years after the case was closed. The problem is that most pipelines are built backwards: they store the full address first, attach the boundary code as a derived field, and rely on a scheduled deletion job that either never gets written or runs months late.
The safer pattern inverts that order. You geocode in-process, call the reverse-geocoding endpoint to attach the administrative code, and drop the precise coordinates and address string before the row ever touches persistent storage. The reporting database sees boundary codes and case metadata. It never sees street addresses.
This post shows exactly how to build that pipeline — the API calls, the in-memory processing pattern, the failure modes, and the observability you need to trust the aggregation counts in production.
Why the forward-then-drop pattern matters
There is a regulatory dimension and a practical dimension, and they point in the same direction.
The regulatory dimension. Patient residential addresses are a classic quasi-identifier. In most jurisdictions a home address, combined with age, diagnosis, and date, is sufficient to re-identify an individual — which means storing it in a reporting table, even without a name or record number attached, creates a data-minimisation problem. Data protection legislation in most markets has explicit requirements around keeping identifiable data only as long as necessary for the stated purpose. For aggregate disease surveillance, the stated purpose does not require the raw address; it requires the boundary code derived from it. If you are storing the raw address as well, you are storing data beyond what the purpose requires.
The practical dimension. Aggregate reporting tables get queried by more people than raw case tables. Analysts pull them into notebooks. They get attached to funding reports. Someone pastes a query result into a spreadsheet and emails it. When that table contains boundary codes and counts rather than street addresses, the blast radius of any future data leak or inadvertent disclosure shrinks sharply. A count of 14 cases in postcode SW1A does not point at any individual. Only very small cells carry residual risk, which is the small-number suppression problem covered later in this post.
The pipeline described in this post is not a compliance product — the developer bears responsibility for their own data handling obligations. What it does is make the correct pattern the easy pattern, using a geocoding API whose /api/v1/reverse and /api/v1/boundaries endpoints give you an administrative code in a single round-trip.
What the two endpoints give you
Two surfaces do the work here. Both are part of the same 56-endpoint platform and use the same API key.
`GET /api/v1/reverse` — takes a lat and lng and returns the administrative context for that point: country, region, state or county, local authority or district, and postcode. This is the call you make when you have already geocoded the patient address (or received GPS coordinates from a mobile case-investigation app) and need the administrative hierarchy. The important field for aggregation purposes is whichever level you are rolling up to — typically the district code, census-equivalent code, or postcode code depending on your reporting framework.
`GET /api/v1/boundaries` — exposes the boundary geometry and ancestor/descendant relationships for an administrative area. You would call this when you need to materialise the polygon for mapping, validate that a code falls within an expected parent boundary, or enumerate all child boundaries within a health district for a denominator calculation. For the no-PII pipeline described here, the reverse-geocode call is the primary tool; the boundaries endpoint becomes useful when you are building the map tiles or computing denominators from population data.
The platform covers 39 countries. For a public-health reporting pipeline that spans borders — refugee health programmes, cross-border disease surveillance, travel-associated illness — the same API key handles all of them without switching provider.
The pipeline design
The sequence is simple and the important detail is what happens in memory versus what gets written to disk.
```text
raw input (address string OR lat/lng)
        |
        v
[geocode if needed]
/api/v1/geocode or /api/v1/reverse
        |
        v
[attach boundary code]     <--- this is the only field you keep
district_code, postcode, region
        |
        v
[DROP address, lat, lng]   <--- before any write to storage
        |
        v
[increment count in output]
{ "boundary_code": "...",
  "case_count": N,
  "week": "...",
  "condition_category": "..." }
```

Nothing between the first step and the third step is allowed to touch a database, a file, a message queue, or a log line. It all happens in the RAM of your processing function. That is the architectural guarantee that makes "we do not store PII" a true statement rather than a policy aspiration.
Python: the in-process aggregation pattern
Below is a working Python example that reads a stream of case records (each carrying a lat/lng from the case-investigation app or a freshly geocoded address), calls /api/v1/reverse for the boundary code, accumulates counts in memory, and writes only aggregate rows to the output.
```python
import csv
import os
from collections import defaultdict

import requests

API = "https://csv2geo.com/api/v1"
KEY = os.environ["CSV2GEO_API_KEY"]

# Adjust to whatever administrative level your report needs.
BOUNDARY_FIELD = "postcode"


def resolve_boundary(lat, lng):
    """Return a single boundary code for a coordinate. Returns None on failure."""
    try:
        r = requests.get(
            f"{API}/reverse",
            params={"lat": lat, "lng": lng, "api_key": KEY},
            timeout=15,
        )
    except requests.RequestException:
        # Timeouts and connection errors count as failures. Do not log the
        # exception text: it can include the request URL, coordinates and all.
        return None
    if r.status_code == 200:
        result = r.json().get("result") or {}
        return result.get(BOUNDARY_FIELD)
    return None


def aggregate_to_boundaries(input_path, output_path, reporting_week):
    counts = defaultdict(int)
    unresolved = 0

    with open(input_path, newline="") as fin:
        reader = csv.DictReader(fin)
        for row in reader:
            lat = row.get("lat")
            lng = row.get("lng")
            if not lat or not lng:
                unresolved += 1
                continue

            # All PII lives only in these two local variables.
            # They never get written to disk.
            try:
                code = resolve_boundary(float(lat), float(lng))
            except ValueError:
                # Non-numeric coordinate: count it, do not log it.
                code = None
            if code is None:
                unresolved += 1
                continue

            # At this point we only carry the boundary code.
            # lat, lng, and any address string are discarded.
            counts[code] += 1

    # Write aggregate output: no addresses, no coordinates.
    with open(output_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=["boundary_code", "case_count", "week"])
        writer.writeheader()
        for code, count in sorted(counts.items()):
            writer.writerow({"boundary_code": code, "case_count": count, "week": reporting_week})

    print(
        f"Resolved {sum(counts.values())} cases across {len(counts)} boundaries. "
        f"Unresolved: {unresolved}"
    )
```

Two things to notice.
First, lat and lng only ever exist as local variables inside the processing function, and neither is ever handed to the writer. The only structure that reaches the output is counts, which is keyed by boundary code. This is not a security guarantee. It is a code-review property. When a future colleague reads this file, there is no question of "where could the coordinates have leaked?" The function contains the answer.
Second, the unresolved count is logged at the end but not attached to any input row identifier. If you need to investigate unresolved records, do it in a separate secure environment with the original case data — do not embed row identifiers in the aggregate output as a debugging convenience, because that creates a linkage.
Node.js: the same pattern for a serverless function
Many public-health reporting pipelines land in serverless environments because they run on a schedule rather than continuously. Here is the equivalent pattern in Node, written as a function that would work inside an AWS Lambda, a Cloud Function, or a Cloudflare Worker:
```javascript
const API = 'https://csv2geo.com/api/v1';
const KEY = process.env.CSV2GEO_API_KEY;
const BOUNDARY_FIELD = 'postcode';

async function resolveBoundary(lat, lng) {
  const url = `${API}/reverse?lat=${lat}&lng=${lng}&api_key=${KEY}`;
  const r = await fetch(url, { signal: AbortSignal.timeout(15_000) });
  if (!r.ok) return null;
  const data = await r.json();
  return data?.result?.[BOUNDARY_FIELD] ?? null;
}

export async function aggregateCases(records, reportingWeek) {
  // records: [{ lat, lng, ...caseMetadata }]
  // We destructure only what we need and never write lat/lng anywhere.
  const counts = new Map();
  let unresolved = 0;

  for (const { lat, lng } of records) {
    if (lat == null || lng == null) { unresolved++; continue; }
    const code = await resolveBoundary(lat, lng);
    if (!code) { unresolved++; continue; }
    counts.set(code, (counts.get(code) ?? 0) + 1);
  }

  // Aggregate output only: no lat/lng, no address strings.
  return {
    week: reportingWeek,
    unresolved,
    boundaries: Array.from(counts.entries()).map(([code, count]) => ({
      boundary_code: code,
      case_count: count,
    })),
  };
}
```

The destructuring { lat, lng } in the for loop is deliberate. The full caseMetadata object, which might include a case reference number, age band, or condition category, never enters the geocoding call, and lat and lng never enter the downstream aggregation map. The two halves of the data never meet in a form that could be written to a log.
Concurrency and rate budgets
A sequential loop over tens of thousands of records calling /api/v1/reverse one at a time will take a long time. The right pattern is bounded concurrency — process 10-20 cases in parallel, respect the platform's rate limit for your tier, and back off exponentially when you see a 429.
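A minimal sketch of that pattern, assuming a thread pool and a caller-supplied fetch function that returns an HTTP status and a boundary code. The names with_backoff and resolve_all, the worker count, and the retry limits are illustrative choices, not part of the platform's API.

```python
import random
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10     # bounded concurrency; keep it under your tier's rate limit
MAX_ATTEMPTS = 5     # give up after this many tries per record
BASE_DELAY_S = 0.5   # first backoff delay; doubles on each retry


def with_backoff(fetch, lat, lng, sleep=time.sleep):
    """Call fetch(lat, lng) -> (status, code), retrying on 429 and 5xx."""
    for attempt in range(MAX_ATTEMPTS):
        status, code = fetch(lat, lng)
        if status == 200:
            return code
        if status != 429 and status < 500:
            return None  # permanent failure (for example a 400): do not retry
        if attempt < MAX_ATTEMPTS - 1:
            # Exponential backoff with jitter so workers do not retry in lockstep.
            sleep(BASE_DELAY_S * (2 ** attempt) * (1 + random.random()))
    return None


def resolve_all(coords, fetch, sleep=time.sleep):
    """Resolve an iterable of (lat, lng) pairs; return (counts, unresolved)."""
    counts = Counter()
    unresolved = 0
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # map() yields only boundary codes, so no coordinate is ever
        # paired with its code in the results.
        for code in pool.map(lambda c: with_backoff(fetch, c[0], c[1], sleep), coords):
            if code is None:
                unresolved += 1
            else:
                counts[code] += 1
    return counts, unresolved
```

Passing fetch and sleep in as arguments keeps the retry logic testable without network access. In production, fetch would wrap the /api/v1/reverse call and return the status code alongside the extracted boundary field.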
The free tier allows 3,000 calls per day. A weekly disease-surveillance report covering a medium-sized city might involve 500–2,000 new case records per week, which fits comfortably within the free tier for piloting. Production deployments at county or national scale will need a paid tier; plans start at $54/month for 100,000 calls, and the pricing is published at csv2geo.com/pricing/api.
For the rate-limiting mechanics — token bucket vs leaky bucket, when to back off, how to detect a 429 before you hit it — see Rate Limiting: Token Bucket vs Leaky Bucket. For the retry policy that handles transient 5xx errors without losing case records, see Exponential Backoff — When to Retry, When to Stop.
Using the boundaries endpoint for denominator work
Incidence rates require a denominator — the population at risk in each boundary. That calculation needs the list of all boundary codes that fall within your health district, so you can compute cases per 10,000 for each one, including boundaries with zero cases in the current period.
The /api/v1/boundaries endpoint gives you the child boundaries of any parent. A call to get the postcodes within a health district looks like this:
```shell
curl -G "https://csv2geo.com/api/v1/boundaries" \
  --data-urlencode "code=E38000010" \
  --data-urlencode "relation=children" \
  --data-urlencode "level=postcode" \
  --data-urlencode "api_key=$CSV2GEO_API_KEY"
```

The response gives you an array of child boundary codes and their geometries. Join that array to your case-count output on boundary_code and you have a complete enumeration: every boundary in the district, with zero-case boundaries explicitly shown as 0, ready for a choropleth renderer.
The same call with relation=ancestors resolves upward — given a postcode, return the district, county, and region codes. That is useful when a case record arrives with only a postcode and you need to roll it up to a higher administrative level for a national surveillance return.
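The zero-filling join described above can be sketched in a few lines. This assumes you have already pulled a plain list of child boundary codes out of the response. The response schema is not shown in this post, so the parsing step is left out, and zero_fill is a hypothetical helper name.

```python
def zero_fill(child_codes, counts):
    """Return one row per boundary in the district, including zero-case ones.

    child_codes: boundary codes enumerated for the district (assumed to be
                 already parsed out of the /api/v1/boundaries response).
    counts:      mapping of boundary_code -> case_count from the pipeline.
    """
    return [
        {"boundary_code": code, "case_count": counts.get(code, 0)}
        for code in sorted(set(child_codes))
    ]


rows = zero_fill(["SW1H", "SW1A", "SW1E"], {"SW1A": 14})
# rows now holds three entries; SW1E and SW1H carry an explicit case_count of 0.
```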
A quick curl reference for the reverse geocode call
For integration testing, smoke tests in CI, and debugging individual records in a secure environment:
```shell
curl -G "https://csv2geo.com/api/v1/reverse" \
  --data-urlencode "lat=51.5074" \
  --data-urlencode "lng=-0.1278" \
  --data-urlencode "api_key=$CSV2GEO_API_KEY"
```

A typical response fragment for a UK coordinate:

```json
{
  "result": {
    "country_code": "GB",
    "region": "England",
    "county": "Greater London",
    "district": "City of Westminster",
    "postcode": "SW1A"
  }
}
```

The postcode field here is the outward code, the first half of a UK postcode. Full postcode resolution depends on coordinate precision. If you need the full postcode (outward + inward), check the /api/v1/reverse documentation for the country you are processing, as granularity varies by nation.
The five-step production pipeline
Step 1: Validate and normalise input coordinates
Before any API call, validate that lat and lng are plausible numbers. Reject out-of-range values (lat outside −90 to +90, lng outside −180 to +180) and log a count to your metrics system — not the coordinates themselves, just the count of invalid rows. A sudden spike in invalid coordinates is a data-quality signal worth alerting on.
```python
def is_valid_coord(lat, lng):
    try:
        return -90 <= float(lat) <= 90 and -180 <= float(lng) <= 180
    except (TypeError, ValueError):
        return False
```

This is cheaper than letting the API return a 400 for each invalid row, and it means your error counts are clean by the time they reach your observability stack. See Observability for Geocoding Pipelines for what metrics to emit and where.
Step 2: Call /api/v1/reverse with a hard timeout
Set a per-request timeout and treat a timeout as a transient failure to be retried, not as an unresolvable record. The typical cause is a momentary network hiccup, not a bad coordinate. A 15-second timeout is generous for a geocoding API; most responses arrive far faster. Keep the timeout in a constant you can tune in configuration, not a magic number buried in the call site.
```python
REVERSE_TIMEOUT_S = 15  # seconds; load from configuration and tune to your SLA budget

r = requests.get(
    f"{API}/reverse",
    params={"lat": lat, "lng": lng, "api_key": KEY},
    timeout=REVERSE_TIMEOUT_S,
)
```

Log the HTTP status code and the time taken for each call to your metrics sink, not the input coordinates. Aggregate latency percentiles at the pipeline level, not per-record. If your p99 starts drifting, you want to know before it affects a weekly reporting deadline.
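One way to keep those metrics free of coordinates is to make the collector's interface incapable of receiving them. The sketch below is an assumption about how you might structure it. CallMetrics and timed_call are hypothetical names, and fetch stands in for whatever wrapper you use around the reverse call.

```python
import math
import time
from collections import Counter


class CallMetrics:
    """Collects per-call status codes and latencies. Never sees coordinates."""

    def __init__(self):
        self.status_counts = Counter()
        self.latencies_s = []

    def record(self, status, elapsed_s):
        self.status_counts[status] += 1
        self.latencies_s.append(elapsed_s)

    def percentile(self, p):
        """Nearest-rank percentile of the recorded latencies, for 0 < p <= 100."""
        if not self.latencies_s:
            return None
        ordered = sorted(self.latencies_s)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


def timed_call(metrics, fetch, lat, lng, clock=time.perf_counter):
    """Run fetch(lat, lng) -> (status, code); record only status and duration."""
    start = clock()
    status, code = fetch(lat, lng)
    metrics.record(status, clock() - start)
    return code
```

Because record() accepts only a status and a duration, a reviewer can confirm at a glance that nothing location-bearing reaches the metrics sink.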
Step 3: Extract the boundary code and immediately discard the coordinate
This step is the crux of the no-PII pattern. The moment you have the boundary code from the API response, the coordinate has served its purpose. Do not accumulate a list of (lat, lng, code) tuples for later processing — that list would be a PII-bearing intermediate artefact. Extract the code, increment the counter for that code, and move on.
```python
code = r.json().get("result", {}).get(BOUNDARY_FIELD)
if code:
    counts[code] += 1
# lat and lng are not used past this point.
```

If you are processing in batches rather than record-by-record, the same principle applies: the batch should produce a boundary-code array, not a coordinate-plus-code array, before anything is written anywhere.
Step 4: Handle unresolvable records without logging the input
Some records will not resolve to a boundary code — coordinates for a point in international waters, a malformed input, a territory not yet in coverage for the 39-country dataset. These need to be counted, not silently dropped, but the count should not carry the input data.
Maintain a simple integer counter for unresolved. At the end of the pipeline run, write the count to your audit log alongside the total resolved count and the reporting week. If the unresolved rate exceeds a threshold (say, 5% of input), raise an alert and hold the report for manual review. Do not append unresolvable rows to an "exceptions" table that carries their coordinates — that table would become a PII store by the back door.
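The end-of-run check can be as small as the sketch below. The 5% threshold is the example figure from the paragraph above, and audit_run is a hypothetical helper, so adjust both to your programme.

```python
UNRESOLVED_ALERT_RATE = 0.05  # 5% of input; tune per programme


def audit_run(week, resolved, unresolved, threshold=UNRESOLVED_ALERT_RATE):
    """Build a PII-free audit record and decide whether to hold the report."""
    total = resolved + unresolved
    rate = unresolved / total if total else 0.0
    return {
        "week": week,
        "resolved": resolved,
        "unresolved": unresolved,
        "unresolved_rate": round(rate, 4),
        "hold_for_review": rate > threshold,
    }
```

The record carries only integers, a rate, and the reporting week, so it can go into the audit log as-is.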
Step 5: Write aggregate output and run the reconciliation check
The final write is aggregate rows only: boundary_code, case_count, week, and any non-identifying case metadata you need for the report, such as condition category, age band, or sex. Those extra fields are acceptable only if, combined with the boundary code, they cannot single out an individual in a small-population area. That is a suppression decision for your statistical disclosure team, not for this pipeline.
Before the pipeline exits, emit a reconciliation log line:
```text
week=2026-W27 input_rows=1847 resolved=1821 unresolved=26 boundaries=94 top_boundary=E01003001 top_count=47
```

This log line contains no PII. It is the audit trail that lets you answer "did the pipeline run correctly?" without re-running it with the original data. Keep it for as long as the aggregate report is retained. See Observability for Geocoding Pipelines for how to wire this into a structured log and alert on anomalies.
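A sketch of how that line might be produced, with the reconciliation check built in so a run whose totals do not add up fails loudly instead of publishing. The function name and the tie-breaking rule for top_boundary are assumptions made for illustration.

```python
def reconciliation_line(week, input_rows, counts, unresolved):
    """Format the end-of-run log line; raise if the totals do not add up."""
    resolved = sum(counts.values())
    if resolved + unresolved != input_rows:
        raise ValueError(
            f"reconciliation failed: {resolved} resolved + {unresolved} "
            f"unresolved != {input_rows} input rows"
        )
    # Highest count wins; ties break on boundary code so the line is reproducible.
    top_boundary, top_count = min(
        counts.items(), key=lambda kv: (-kv[1], kv[0]), default=("", 0)
    )
    return (
        f"week={week} input_rows={input_rows} resolved={resolved} "
        f"unresolved={unresolved} boundaries={len(counts)} "
        f"top_boundary={top_boundary} top_count={top_count}"
    )
```

Every row that entered the pipeline must end up either resolved or unresolved. If the sum differs from input_rows, a record was dropped or double-counted somewhere in between.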
The small-number suppression problem
Aggregate data becomes re-identifiable when the count per cell is very small. A boundary with one case of a rare condition in a small population is, in practice, identifiable. Public health agencies typically apply a suppression rule: any cell with fewer than five cases (the threshold varies by jurisdiction and data sensitivity) is either suppressed entirely or merged with a neighbouring boundary before publication.
This pipeline does not implement suppression — it is a data-engineering concern, not an API-integration concern. What the pipeline does correctly is make suppression easier: because the output is a clean aggregate with one row per boundary, the suppression logic is a simple post-processing filter:
```python
MIN_COUNT = 5  # adjust per your disclosure-control policy

published = {k: v for k, v in counts.items() if v >= MIN_COUNT}
suppressed_count = len(counts) - len(published)
```

The suppressed boundaries get logged as a count, not enumerated with their codes, if the codes themselves are sensitive.
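If your output is zero-filled, a plain filter would also drop the true zeros. A variant that masks only the small non-zero cells is sketched below. It assumes zero-case cells are publishable, which is a policy decision for your disclosure team, and it does not attempt secondary suppression, where a masked cell could be recovered from a published total. The helper name suppress_small_cells is hypothetical.

```python
MIN_COUNT = 5  # adjust per your disclosure-control policy


def suppress_small_cells(rows, min_count=MIN_COUNT):
    """Mask counts from 1 to min_count - 1; keep zeros and larger counts."""
    out = []
    for row in rows:
        n = row["case_count"]
        masked = 0 < n < min_count
        out.append({
            "boundary_code": row["boundary_code"],
            # None marks a masked cell so it cannot be mistaken for a zero.
            "case_count": None if masked else n,
            "suppressed": masked,
        })
    return out
```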
What this pipeline does not do
Honest scope.
It does not make your overall data handling HIPAA-compliant or GDPR-compliant. Those are organisational obligations that cover every system that touches case data, not just the geocoding step. The no-PII pattern described here is one component of a defensible data-minimisation posture. If you need a broader discussion of geocoding in privacy-sensitive pipelines, the post HIPAA-Safe Geocoding — `no_record` and BAA covers the broader picture.
It does not handle forward geocoding from free-text addresses. If your input is an address string rather than a coordinate, you need a geocoding step before the reverse step. That step — turning an address into a lat/lng — does briefly process the address in memory. The same forward-then-drop discipline applies: geocode in-process, extract the coordinate, extract the boundary code from the reverse call, and discard everything else before any write.
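A sketch of that chain, with the two API calls passed in as plain functions so the example makes no assumption about the request or response shape of the forward geocoding endpoint, which this post does not document. Both geocode and reverse are hypothetical wrappers you would write yourself.

```python
def boundary_for_address(address, geocode, reverse):
    """Address string in, boundary code (or None) out. Nothing else escapes.

    geocode(address)  -> (lat, lng) or None    # hypothetical forward-call wrapper
    reverse(lat, lng) -> boundary code or None # hypothetical reverse-call wrapper
    """
    point = geocode(address)
    if point is None:
        return None
    lat, lng = point
    return reverse(lat, lng)


def count_addresses(addresses, geocode, reverse):
    """Aggregate address strings into boundary counts without keeping them."""
    counts = {}
    unresolved = 0
    for address in addresses:
        code = boundary_for_address(address, geocode, reverse)
        if code is None:
            unresolved += 1
        else:
            counts[code] = counts.get(code, 0) + 1
    return counts, unresolved
```

Here the address and the coordinate really are confined to boundary_for_address: the only value that leaves the function is the boundary code.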
It does not validate the administrative boundary hierarchy. If a postcode returned by the reverse geocode falls outside the expected health district — perhaps due to a boundary revision or a coordinate at the edge of two districts — the pipeline will not detect the mismatch. For high-stakes reporting, a post-processing join against the boundaries endpoint with relation=ancestors lets you verify that the returned postcode's parent district matches the expected district for each reporting period.
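That verification step could look like the sketch below. It assumes you have already built a mapping from each boundary code to its parent district from relation=ancestors lookups. The response parsing is not shown because this post does not document the schema, and the function name is illustrative.

```python
def boundaries_outside_district(counts, district_of, expected_district):
    """Return boundary codes whose parent district is not the expected one.

    district_of: mapping of boundary_code -> district code, assumed to have
                 been built from relation=ancestors lookups.
    Codes missing from district_of are reported too, since they cannot be
    verified.
    """
    return sorted(
        code for code in counts
        if district_of.get(code) != expected_district
    )
```

The result is a list of boundary codes only, so it can be logged or reviewed without touching case-level data.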
Caching to control cost at scale
The reverse geocode call costs 1 credit per call. For a pipeline that processes 50,000 case records per week, that is 50,000 credits per week if called naively, or a little over 200,000 a month, which is more than twice the 100,000 calls quoted above for the entry paid plan. Much of that is avoidable if many records share the same postcode or district and you can cache the coordinate-to-code mapping.
The catch: caching a coordinate-to-boundary-code mapping still implies storing a coordinate, which brings you back into PII territory if the coordinate is precise enough to identify an individual. The resolution to this is to cache at a coarser granularity — round the coordinate to three decimal places (roughly 100 m) before using it as a cache key, store only the boundary code as the cached value, and set a cache TTL of 30 days. Boundary codes do not change on a timescale that matters for weekly surveillance. See Caching Geocoding Results — 90% Cost Reduction for the full caching pattern.
At three decimal places, two addresses on the same street resolve to the same cache key. For postcode-level aggregation this is well within the acceptable error margin. For census-tract-level aggregation in dense urban areas, you may need to be more conservative — test the rounding error against your boundary granularity before enabling the cache.
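A minimal in-memory sketch of that cache, assuming the three-decimal rounding and 30-day TTL discussed above. BoundaryCache and resolve_cached are hypothetical names, and reverse stands in for your wrapper around the reverse call. A production version would more likely sit in a shared store, but the key and value shapes would be the same.

```python
import time

CACHE_DECIMALS = 3             # cells of roughly 100 m
CACHE_TTL_S = 30 * 24 * 3600   # 30 days


class BoundaryCache:
    """Maps a rounded coordinate cell to a boundary code. Stores no raw coordinates."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # (lat_cell, lng_cell) -> (code, expires_at)

    @staticmethod
    def _key(lat, lng):
        return (round(lat, CACHE_DECIMALS), round(lng, CACHE_DECIMALS))

    def get(self, lat, lng):
        entry = self._entries.get(self._key(lat, lng))
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def put(self, lat, lng, code):
        self._entries[self._key(lat, lng)] = (code, self._clock() + CACHE_TTL_S)


def resolve_cached(cache, lat, lng, reverse):
    """Return the cached code for this cell, calling reverse() only on a miss."""
    code = cache.get(lat, lng)
    if code is None:
        code = reverse(lat, lng)
        if code is not None:
            cache.put(lat, lng, code)
    return code
```

Failed lookups are deliberately not cached, so a transient error does not pin a whole cell to "unresolved" for 30 days.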
Frequently Asked Questions
Does CSV2GEO store the coordinates I pass to `/api/v1/reverse`? The platform's no_record pattern is documented separately at HIPAA-Safe Geocoding — `no_record` and BAA. The responsibility for your overall data-handling posture — including any regulatory obligations — rests with your organisation. Review the API terms of service and data processing documentation before processing patient data.
What administrative levels does the reverse geocode return? The response includes country, region, state or county equivalent, local district or authority, and postcode. Available levels vary by country — some countries have three administrative tiers, others have five. The 39-country coverage means all tiers are available for those nations; check the API documentation for the specific country before building your reporting schema.
How do I handle records that straddle a boundary edge? Point-in-polygon lookups for addresses at the edge of two boundaries can return either code depending on floating-point precision. For surveillance purposes, pick one consistently — use the code returned by the API and treat it as the authoritative assignment. Do not attempt to split the case or assign it to both boundaries; that introduces double-counting. Log the count of edge-case records if you want visibility into this.
Can I use this pipeline for addresses in all 39 covered countries simultaneously? Yes. The same /api/v1/reverse endpoint handles all 39 countries with the same API key and the same response schema. The country_code field in the response lets you apply country-specific boundary-level logic (UK postcodes vs US ZIP codes vs French communes) in a single downstream switch statement.
How should I handle a week where no cases resolve to a particular boundary? Use the /api/v1/boundaries endpoint with relation=children to enumerate all boundaries in your reporting area at the start of the pipeline run. Build your counts map with all boundary codes initialised to zero. After processing, any boundary that received no cases shows as zero — making it clear that the zero is a real zero, not a missing boundary. This matters for trend analysis and for the denominator in incidence-rate calculations.
What if the reverse geocode returns a different postcode than the one on the case record? Trust the coordinate over the self-reported postcode. Patients report their postcode from memory, and postcodes are periodically reassigned. The coordinate-derived boundary code is derived from the actual geographic location — it is more accurate for spatial analysis than the self-reported string. If the delta between the two is large (more than two postcode zones), log it as a data-quality flag without logging the actual postcode.
Is there a minimum call volume before a paid tier makes sense? The free tier gives you 3,000 calls per day, which is enough for a pilot covering several thousand cases per week. Once your weekly volume exceeds roughly 20,000 cases (3,000 a day over seven days is 21,000), the free allowance no longer covers it and you need a paid tier, which starts at $54/month for 100,000 calls. Current pricing brackets are at csv2geo.com/pricing/api.
Related Articles
- HIPAA-safe geocoding — `no_record` and BAA — the broader privacy-safe geocoding pattern this post builds on
- Reverse geocoding accuracy in metres — understanding what the boundary assignment actually means spatially
- Geocoding confidence scores explained — how to decide which records need manual review before aggregation
- Caching geocoding results — 90% cost reduction — how to cache at coarse resolution without reintroducing PII
- Observability for geocoding pipelines — what to instrument, what to alert on, and what your audit log should contain
---
*I.A. / CSV2GEO Creator*