Designing a Batch Geocoding Queue: SQS, BullMQ, or Custom

Production patterns for batch geocoding pipelines: queue choice, retry semantics, dead-letter handling, and when to pick SQS, BullMQ, or custom.

May 10, 2026

A geocoding pipeline that handles a million addresses on a Tuesday morning is not a script with a for loop. It is a queue, a pool of workers, a retry policy, a dead-letter sink, and a status table. Get the shape right and you handle ten million the same way you handle ten thousand. Get it wrong and you spend the next six months patching edge cases that all share the same root cause: the queue was an afterthought.

This post is the practical version. It compares the three queue choices that cover ~95% of real geocoding pipelines — Amazon SQS, BullMQ on Redis, and a custom in-process worker pool — with the cost math, the failure modes, and a working code sample for each. By the end you should be able to point at the right answer for your scale and write the first version yourself.

The job description

Forget queues for a moment. The job a batch geocoding pipeline does is unglamorous:

1. Read addresses from somewhere (CSV, S3, database, an HTTP request body).
2. For each, call a geocoding provider.
3. Write the result back somewhere.
4. Track which addresses are done, which failed, and why.
5. Handle the fact that the provider will rate-limit you, time out, return 503s, and occasionally return wrong answers, all without losing track of state.

A queue is the shape that makes step 5 cheap. Without a queue, every retry is your application code's problem. With a queue, retries become the queue's problem and your application code stays small.

The decision tree for which queue is honestly two questions:

How many messages per day? Below ~100K, your laptop is the queue. Above ~100M, only managed services keep up.
Where does the rest of your stack live? If you are already on AWS, SQS is free with what you pay AWS. If you are on a Node monolith with Redis, BullMQ is free with what you pay Redis. If you have neither and don't want to add infrastructure, an in-process worker pool is fine.

Everything below is a refinement of those two answers.

Option 1 — In-process worker pool (Node, Python, Go)

This is the right answer for batches up to roughly 100,000 addresses per run, where the run is bounded (CSV upload, scheduled report, one-off backfill) and you don't need cross-process coordination.

The shape:

// pool.mjs — Node 20+
import pLimit from 'p-limit';
import fs from 'node:fs';
import { parse } from 'csv-parse/sync';

const CONCURRENCY = 20;       // tune to provider rate-limit
const RETRY_LIMIT = 3;
const limit = pLimit(CONCURRENCY);

const rows = parse(fs.readFileSync('addresses.csv'), { columns: true });

async function geocodeOne(addr, attempt = 1) {
  try {
    const r = await fetch(
      `https://api.csv2geo.com/v1/geocode?q=${encodeURIComponent(addr)}`,
      { headers: { 'X-API-Key': process.env.API_KEY } }
    );
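    if (r.status === 429) {
      // Retry-After may be missing or an HTTP date; fall back to 5 seconds.
      const parsed = Number.parseInt(r.headers.get('retry-after') ?? '', 10);
      const retryAfter = Number.isNaN(parsed) ? 5 : parsed;
      await new Promise(s => setTimeout(s, retryAfter * 1000));
      throw new Error('rate_limited');
    }
    if (!r.ok) throw new Error(`http_${r.status}`);
    // Keep the input address next to the result so every output line is traceable.
    return { address: addr, result: (await r.json()).results?.[0] ?? null };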
  } catch (e) {
    if (attempt >= RETRY_LIMIT) {
      return { error: e.message, address: addr };
    }
    await new Promise(s => setTimeout(s, 2 ** attempt * 1000));
    return geocodeOne(addr, attempt + 1);
  }
}

const results = await Promise.all(
  rows.map(row => limit(() => geocodeOne(row.address)))
);

fs.writeFileSync(
  'results.jsonl',
  results.map(r => JSON.stringify(r)).join('\n')
);

Why this works for the small-to-medium case:

Backpressure is automatic. p-limit(20) means at most 20 in-flight requests; the rest wait. No memory blow-up.
Retries are local. An exponential-backoff retry inside geocodeOne covers the 90% case (transient timeout, brief 5xx, 429).
Final state is on disk. When the run completes, results.jsonl shows exactly which rows succeeded and which failed, so you can re-run on the failures only.
Zero infrastructure. No Redis, no SQS, no Docker compose.
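The "retries are local" point can be expressed as a small reusable wrapper. The following is a minimal Python sketch rather than the code the pool actually runs: `retry_with_backoff` and its `sleep` parameter are hypothetical names, and the delays (2 s, then 4 s) simply mirror the `2 ** attempt` schedule in pool.mjs.

```python
import time
from typing import Any, Callable


def retry_with_backoff(
    fn: Callable[[], Any],
    retry_limit: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn up to retry_limit times, sleeping 2**attempt seconds between tries.

    Returns fn's result, or {"error": "..."} once the attempts are exhausted,
    which mirrors the shape geocodeOne returns on failure.
    """
    for attempt in range(1, retry_limit + 1):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: any failure is retried
            if attempt >= retry_limit:
                return {"error": str(exc)}
            sleep(2 ** attempt)  # 2 s after the first failure, 4 s after the second
```

Passing `sleep` as a parameter is only there so the schedule can be checked without waiting; production code would leave the default.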

Where it stops working:

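The process gets killed mid-run (server reboot, OOM, OS update). You lose all in-flight state and have to re-process everything that didn't make it to disk. With the sample above, which writes results.jsonl only after every row has finished, that is the whole batch.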
You want to scale across machines. You can't.
Multiple users / teams want to submit batches concurrently. You'd be hand-rolling a queue.
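The first of these failure modes can be softened without adding a queue, by appending each result to disk as it completes and skipping finished rows on restart. This is a sketch under assumptions, not what the sample above does: the helper names are hypothetical, and it assumes every record carries an `address` field and that failures carry an `error` field.

```python
import json
import os
from typing import Iterable


def append_result(path: str, record: dict) -> None:
    """Append one JSON record per line and force it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def load_done(path: str) -> set[str]:
    """Return addresses that already have a successful (non-error) record."""
    done: set[str] = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # a line truncated by a crash is treated as not done
            if "error" not in record:
                done.add(record["address"])
    return done


def pending_rows(addresses: Iterable[str], path: str) -> list[str]:
    """Addresses still needing work: never attempted, or failed last time."""
    done = load_done(path)
    return [a for a in addresses if a not in done]
```

On restart, the run would feed `pending_rows(all_addresses, "results.jsonl")` to the pool instead of the full list.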

For most CSV-upload geocoding tools, this is the answer. The CSV2GEO web upload product itself uses a worker pool for batches under 50,000 rows.

Option 2 — BullMQ on Redis

The next step up is BullMQ, a Redis-backed queue that gives you persistence, distributed workers, scheduled retries, and a dashboard for free. It's the right answer when:

You need batches to survive a process restart.
Multiple workers (potentially on multiple boxes) should pull from the same queue.
You want a UI to see what's stuck.
You're already running Redis for caching or session storage.

Producer:

// producer.mjs
import { Queue } from 'bullmq';
import fs from 'node:fs';
import { parse } from 'csv-parse/sync';

const queue = new Queue('geocode', {
  connection: { host: 'localhost', port: 6379 }
});

const rows = parse(fs.readFileSync('addresses.csv'), { columns: true });

const batchId = 'b-' + Date.now();  // one ID for the whole batch, computed once

await queue.addBulk(
  rows.map((row, i) => ({
    name: 'geocode-one',
    data: { address: row.address, batchId, rowIndex: i },
    opts: {
      attempts: 5,
      backoff: { type: 'exponential', delay: 2000 },
      removeOnComplete: 1000,
      removeOnFail: false,
    },
  }))
);

console.log(`Enqueued ${rows.length} jobs`);

Worker:

// worker.mjs
import { Worker } from 'bullmq';

new Worker('geocode', async (job) => {
  const r = await fetch(
    `https://api.csv2geo.com/v1/geocode?q=${encodeURIComponent(job.data.address)}`,
    { headers: { 'X-API-Key': process.env.API_KEY } }
  );
  if (r.status === 429) throw new Error('rate_limited');  // BullMQ will retry
  if (!r.ok) throw new Error(`http_${r.status}`);

  const result = (await r.json()).results[0];

  // saveResult is your own persistence function (not shown): Postgres, S3, anywhere.
  await saveResult(job.data.batchId, job.data.rowIndex, result);
}, {
  connection: { host: 'localhost', port: 6379 },
  concurrency: 20,
});

Run multiple worker processes (pm2, systemd, Kubernetes) and BullMQ load-balances jobs across them. Concurrency × workers gives you total parallelism. Stay under the provider's per-minute rate limit by tuning either knob.
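To turn "concurrency × workers" into a number you can compare with a rate limit, you also need the average time one geocoding call takes. The sketch below is back-of-the-envelope arithmetic only; the function names are hypothetical and the latency is something you would measure yourself.

```python
import math


def requests_per_minute(workers: int, concurrency: int, avg_latency_s: float) -> float:
    """Upper bound on throughput: each slot completes 60 / latency requests per minute."""
    return workers * concurrency * 60.0 / avg_latency_s


def max_workers(rate_limit_per_min: int, concurrency: int, avg_latency_s: float) -> int:
    """Largest worker count whose combined throughput stays within the rate limit."""
    per_worker = requests_per_minute(1, concurrency, avg_latency_s)
    return max(0, math.floor(rate_limit_per_min / per_worker))
```

For example, with concurrency 20 and an assumed 0.5 s per call, one worker can issue up to 2,400 requests per minute, so a hypothetical limit of 6,000 per minute leaves room for two such workers.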

What you get for free:

Persistence. A worker crash doesn't lose the job; it goes back to the queue.
Exponential backoff. Configured per job via opts.backoff.
Dead-letter queue. Jobs that fail all retries land in a "failed" state, queryable separately.
Bull Board. A free dashboard, shipped as a separate package, that you mount at a path such as /admin/queues to see live job counts.

What you pay for it:

One more service (Redis) in your infra. Most teams already have this.
~10–30ms of overhead per job (Redis round trips). Negligible vs the geocoding call itself.

Option 3 — Amazon SQS

For large pipelines (≥1M jobs/day) or anywhere you're already on AWS, SQS is hard to beat. It scales horizontally without any operator effort, has built-in dead-letter queues, and at the volumes geocoding pipelines hit, the cost is lunch money.

Producer:

# producer.py
import boto3
import csv
import json

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/geocode-jobs'

with open('addresses.csv') as f:
    rows = list(csv.DictReader(f))

# Batch sends in groups of 10 (SQS limit)
for i in range(0, len(rows), 10):
    chunk = rows[i:i + 10]
    sqs.send_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {
                'Id': str(i + j),
                'MessageBody': json.dumps({
                    'address': r['address'],
                    'batch_id': 'b-2026-05-10',
                    'row_index': i + j,
                }),
            }
            for j, r in enumerate(chunk)
        ],
    )

Worker:

# worker.py
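import json
import os

import boto3
import requests

# save_result(batch_id, row_index, result) is your own persistence function (not shown).
sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/geocode-jobs'
API_KEY = os.environ['API_KEY']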

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,        # long-poll
        VisibilityTimeout=60,      # 60s to process or message reappears
    )
    for msg in resp.get('Messages', []):
        job = json.loads(msg['Body'])
        try:
            r = requests.get(
                'https://api.csv2geo.com/v1/geocode',
                params={'q': job['address']},
                headers={'X-API-Key': API_KEY},
                timeout=10,
            )
            r.raise_for_status()
            save_result(job['batch_id'], job['row_index'], r.json()['results'][0])
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])
        except Exception as e:
            # Don't delete — message becomes visible again after VisibilityTimeout
            # SQS retries automatically; after `maxReceiveCount` it goes to DLQ.
            print(f"failed: {e}")

What you pay attention to:

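Visibility timeout. Set to 2–3× the p99 processing time of one job. Too short and SQS assumes the job is stuck and re-delivers it while the first attempt is still running; too long and a real crash delays the retry. If a worker receives 10 messages and processes them one after another, as in the loop above, budget for the whole batch, because the last message waits behind the other nine.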
Dead-letter queue. Configure maxReceiveCount (e.g. 5) so jobs that fail repeatedly land in a separate DLQ. Inspect manually.
Long polling. WaitTimeSeconds=20 cuts SQS request volume by ~95% vs short polling.
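The first two settings can be computed and built in a few lines. This is a sketch under assumptions: the helper names and the DLQ ARN are hypothetical, the `batch_size` multiplier is an addition for workers that process a received batch sequentially, and the 43,200-second cap is the SQS maximum visibility timeout of 12 hours.

```python
import json
import math


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict[str, str]:
    """Build the Attributes dict that attaches a dead-letter queue to a source queue."""
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receive_count)}
        )
    }


def visibility_timeout_s(p99_job_s: float, batch_size: int = 1, factor: float = 3.0) -> int:
    """factor x p99 x batch size, in whole seconds, capped at 12 hours."""
    return min(43_200, math.ceil(p99_job_s * batch_size * factor))
```

Applying the policy would look like `sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=redrive_attributes(dlq_arn))`, where `dlq_arn` is the ARN of a dead-letter queue you have already created.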

Cost math at 1M jobs/day:

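1M messages sent with send_message_batch (10 per call) = 100K API calls = $0.04
1M messages received with receive_message (long-polled, up to 10 per call) = at least 100K calls = $0.04
1M messages deleted in batches of 10 with delete_message_batch = 100K calls = $0.04. The worker above calls delete_message once per message, which would be 1M calls, or $0.40.
Total: ~$0.12/day, or ~$3.60/month, for the queue itself. The geocoding API bill will be 1000× this.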
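The same arithmetic as a function, so you can plug in your own volume. It is a rough sketch: the function name is hypothetical, the $0.40 per million requests is the rate implied by the $0.04 per 100K calls figure above, and it assumes every send, receive, and delete is batched 10 at a time and ignores any free tier or data-transfer charges.

```python
import math


def monthly_sqs_cost(
    messages_per_day: int,
    batch_size: int = 10,
    price_per_million: float = 0.40,
    days: int = 30,
    operations: int = 3,  # send, receive, delete
) -> float:
    """Estimated monthly request cost in dollars for a fully batched pipeline."""
    calls_per_operation = math.ceil(messages_per_day / batch_size)
    daily_calls = calls_per_operation * operations
    return daily_calls / 1_000_000 * price_per_million * days
```

Setting `batch_size=1` shows what unbatched calls would cost at the same volume.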

Decision matrix

| Scale | Stack you already have | Go with |
|---|---|---|
| <50K jobs / batch, single machine | anything | In-process worker pool |
| 50K–1M jobs / day, persistence required | Redis (or willing to add) | BullMQ |
| 50K–1M jobs / day, AWS shop | AWS | SQS |
| >1M jobs / day | anything | SQS |
| Multi-tenant SaaS pulling from one queue | anything except in-process | BullMQ or SQS |
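Read as code, the matrix is a handful of ordered checks. The function below is one possible reading, not a rule: `pick_queue` and its parameters are hypothetical, and where the matrix says "BullMQ or SQS" it breaks the tie by whether you are already on AWS.

```python
def pick_queue(
    jobs_per_day: int,
    on_aws: bool,
    needs_persistence: bool,
    multi_tenant: bool = False,
) -> str:
    """Map the decision matrix onto a single recommendation."""
    if jobs_per_day > 1_000_000:
        return "SQS"  # above 1M jobs/day the matrix says SQS regardless of stack
    if multi_tenant:
        return "SQS" if on_aws else "BullMQ"  # never in-process for shared queues
    if jobs_per_day < 50_000 and not needs_persistence:
        return "in-process worker pool"
    return "SQS" if on_aws else "BullMQ"
```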

What I would not recommend, despite seeing it suggested in other articles:

RabbitMQ for new pipelines. It's a fine queue but BullMQ on Redis covers the same use cases with less operator overhead, and SQS scales further without any.
Kafka. Geocoding is a request/reply workload, not an event stream. Kafka is overkill, and the consumer-group model fights against per-job retry semantics.
PostgreSQL `LISTEN/NOTIFY` as a queue. It works at small scale, but the moment you need persistence, retries, and DLQ, you're rebuilding BullMQ on top of Postgres. Just use BullMQ.

Idempotency: the part that bites everyone

Whichever queue you pick, the geocoding worker must be idempotent — calling it twice with the same input should produce the same result and not double-charge you. Two reasons:

SQS is at-least-once delivery. The same message can arrive twice if the worker crashes after the API call but before deleting the message.
BullMQ retries on failure, and a failure that's actually a successful API call followed by a network blip will re-execute.

The fix is to make the cache key the natural deduplicator: hash the input address, look up the cache before calling the provider, return cached result if hit. The cost story for that pattern is in How to Cache Geocoding Results. The detailed version of this idea — making every geocoding call safe to retry — is in Idempotent Geocoding.
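A minimal sketch of that lookup-before-call pattern follows. The names are hypothetical, the in-memory dict stands in for whatever cache you actually use, and the normalization (lowercase, collapsed whitespace) is an assumption; real address normalization is usually more involved.

```python
import hashlib
from typing import Callable


def cache_key(address: str) -> str:
    """Lowercase and collapse whitespace, then hash, so trivial variants share a key."""
    normalized = " ".join(address.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def geocode_cached(address: str, cache: dict, provider: Callable[[str], dict]) -> dict:
    """Return the cached result if present; otherwise call the provider exactly once."""
    key = cache_key(address)
    if key in cache:
        return cache[key]  # a re-delivered or retried job costs nothing
    result = provider(address)
    cache[key] = result
    return result
```

With this in front of the provider call, a message delivered twice results in one billable request and the same stored answer.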

Dead-letter sinks: don't skip this

Every queue ends up with jobs that won't succeed no matter how many times you retry. Common causes:

The address is genuinely unparseable ("bring me a sandwich, the office").
The provider returned a 4xx that won't change with another attempt.
A bug in your worker code.

You want these in a separate place where they don't pollute the metrics for the main queue. SQS gives you a DLQ for free; BullMQ has a "failed" state that serves the same purpose; an in-process pool needs you to write the failed records to a failures.jsonl so you can inspect them.

What to do with DLQ contents weekly:

Sample 20 random failures.
Categorize: bad input, transient, real bug.
Bad input → fix upstream data.
Transient → check why it didn't recover; tune retry config if needed.
Real bug → fix code, replay DLQ.
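The sampling and first-pass categorization can be automated so the weekly review starts from counts rather than raw messages. This is a sketch under assumptions: the function names are hypothetical, and the rules key off error strings shaped like the ones the workers above produce (`rate_limited`, `http_<status>`); anything unrecognized is flagged as a possible bug for a human to look at.

```python
import random
from collections import Counter
from typing import Optional


def categorize(error: str) -> str:
    """First-pass bucket for one failure, based on its error string."""
    lowered = error.lower()
    if error in ("rate_limited", "http_429") or "timeout" in lowered:
        return "transient"
    if error.startswith("http_5"):
        return "transient"
    if error.startswith("http_4"):
        return "bad input"
    return "real bug"


def triage(failures: list[dict], sample_size: int = 20, seed: Optional[int] = None) -> Counter:
    """Sample up to sample_size failures at random and count them per category."""
    rng = random.Random(seed)
    sample = rng.sample(failures, min(sample_size, len(failures)))
    return Counter(categorize(f["error"]) for f in sample)
```

The counts are a starting point only; the categories still need a person to confirm them before anything is replayed.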

Skipping this step is how silent quality regressions happen. A geocoding pipeline with 99% success rate that quietly drifts to 96% over six months is your dead-letter queue telling you it noticed.

Status tracking is the queue's other half

A queue handles "do this work." It doesn't tell the user submitting a batch how many of their rows are done. For that you need a separate status table:

CREATE TABLE geocode_batch (
  batch_id    text PRIMARY KEY,
  user_id     bigint NOT NULL,
  total_rows  int NOT NULL,
  done_rows   int NOT NULL DEFAULT 0,
  failed_rows int NOT NULL DEFAULT 0,
  created_at  timestamptz NOT NULL DEFAULT now(),
  finished_at timestamptz
);

CREATE TABLE geocode_row (
  batch_id    text NOT NULL REFERENCES geocode_batch(batch_id),
  row_index   int NOT NULL,
  status      text NOT NULL CHECK (status IN ('pending','done','failed')),
  result      jsonb,
  error       text,
  PRIMARY KEY (batch_id, row_index)
);

Each worker updates geocode_row after every job. A small reconciliation cron updates geocode_batch.done_rows from a count. Front-end polls geocode_batch for progress. Done.
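The reconciliation job can be as small as two UPDATE statements. The sketch below uses Python's built-in sqlite3 module so it runs anywhere; the table and column names come from the schema above, while the `reconcile` function and the choice to set finished_at once every row is done or failed are assumptions. Against Postgres you would run the same SQL through your own driver.

```python
import sqlite3


def reconcile(conn: sqlite3.Connection) -> None:
    """Recompute per-batch counters from geocode_row and stamp finished batches."""
    conn.execute(
        """
        UPDATE geocode_batch
        SET done_rows = (
                SELECT COUNT(*) FROM geocode_row r
                WHERE r.batch_id = geocode_batch.batch_id AND r.status = 'done'),
            failed_rows = (
                SELECT COUNT(*) FROM geocode_row r
                WHERE r.batch_id = geocode_batch.batch_id AND r.status = 'failed')
        """
    )
    # A batch is finished once every row is either done or failed.
    conn.execute(
        """
        UPDATE geocode_batch
        SET finished_at = CURRENT_TIMESTAMP
        WHERE finished_at IS NULL AND done_rows + failed_rows = total_rows
        """
    )
    conn.commit()
```

Because the counters are recomputed from scratch on each run, the job is safe to run repeatedly and corrects itself after a missed run.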

Without this, "where is my batch" support tickets become an unsolvable problem; with it, you get a progress bar you can publish.

Putting it together

A complete production geocoding pipeline at most companies looks like this:

[CSV upload]
      │
      ▼
[Producer]                     [Status table]
      │  enqueues N jobs              ▲
      ▼                                │
[BullMQ / SQS]                          │
      │                                 │
   ┌──┴──┐  ┌──┴──┐  ┌──┴──┐            │
   │  W  │  │  W  │  │  W  │  …         │ updates per job
   └──┬──┘  └──┬──┘  └──┬──┘            │
      │       │       │                 │
      ▼       ▼       ▼                 │
[csv2geo API]                           │
      │                                 │
      ▼                                 │
[Result store + status update]──────────┘
                                        ▲
[DLQ for retry-exhausted jobs]──────────┘ inspect weekly

Five components, each with a single responsibility. The shape works at 10,000 jobs and at 100,000,000 jobs. What changes is which queue runs the middle and how many workers pull from it.

If you're starting today and don't already have Redis or AWS, start with the in-process pool. When you outgrow it (and you will know — symptoms are "the script crashed and we restarted from scratch" or "the second user can't submit a batch"), graduate to BullMQ. When you outgrow that, graduate to SQS. Each step is a refactor, not a rewrite, because the worker code (geocodeOne above) is identical at every scale.

The infrastructure is the easy part. The hard part is the discipline to track state, name failures, and inspect the dead-letter queue every week. Do that and the queue almost runs itself.