Why use BullMQ with Redis instead of a simple async call or unawaited promise?

An unawaited promise lives and dies with the Node.js process—a deploy, crash, or scale-down silently drops the work. BullMQ persists each job in Redis the moment it is enqueued, so a worker restart simply picks the job back up. You also get retry policies, concurrency control, and per-job state (waiting, active, completed, failed) that a floating promise cannot provide.

When should you choose exponential backoff over fixed backoff?

Exponential backoff is the right default for anything that calls an external system. If a marketplace API is down for a minute, retrying every 2 seconds burns your rate limit, while exponential spacing—doubling the wait per attempt using the formula 2^(attempts_made - 1) × base_delay—rides out the outage. Fixed backoff is reserved for purely internal work like image resizing, where the failure cause is usually a transient resource spike rather than a downstream service outage.

What is a dead-letter queue and why does it matter in production?

A dead-letter queue (DLQ) is a separate queue that receives jobs which have exhausted all retry attempts. Without it, terminal failures sit in the failed set forever or get retried in an infinite loop that hides real errors. Moving exhausted jobs to a DLQ with full context lets you triage each entry—replay after a fix, discard as obsolete, or convert it into a bug ticket—and alerts a human immediately instead of waiting for a customer complaint.

Why must every BullMQ job processor be idempotent?

Because BullMQ gives you at-least-once execution: a failing job is retried until it succeeds or runs out of attempts, and a job interrupted by a worker restart is picked up again, so your processor can run more than once for the same logical work. If the handler is not idempotent (for example, inserting instead of upserting, or sending an email without checking whether it was already sent), each retry can cause duplicate side effects. Making processors idempotent is the only way to make retries harmless.

What three metrics should you monitor per queue and why?

The post recommends exporting waiting count and its trend, failed count delta, and a worker heartbeat gauge to Prometheus. A waiting count that grows for 15 consecutive minutes signals that consumers are under-provisioned or stuck. More than N new failures in 10 minutes indicates a downstream API change or incident. A stale heartbeat while the waiting count rises means the worker process is dead or blocked on a poisoned job, which determines whether you restart or investigate further.

Reliable Background Jobs with BullMQ and NestJS

Every backend eventually grows work that should not happen inside an HTTP request: rendering a 40-page invoice PDF, sending a batch of emails, syncing stock to a marketplace API that throttles you. The first version is always an await inside the controller, and it always ends the same way: a user staring at a spinner for 20 seconds, then a gateway timeout, then a duplicate submission. Background jobs are not an optimization. Past a certain job duration they are the only correct architecture.

My stack for this in Node.js land is BullMQ on Redis, wired into NestJS. BullMQ is a mature queue library built on Redis with Lua-scripted atomic operations, and it covers the four things production actually demands: retries with backoff, failure parking, concurrency control, and observability. This post is the setup I deploy for ERP workloads, with the operational lessons that the quickstart guides skip.

What a Queue Buys You Over a Naive Async Call

Firing an unawaited promise inside your request handler feels like a background job, but it lives and dies with the process. Deploy, crash, or scale down and the work silently evaporates. A Redis-backed queue gives the job a life of its own:

Durability: the job is persisted in Redis the moment it is enqueued, so a worker restart or deploy does not lose it; an unfinished job is picked up again.
Retry policy as configuration: attempts and backoff are declared per queue or per job, not hand-rolled with setTimeout in seventeen places.
Backpressure and concurrency: eight PDF renders at a time instead of three hundred, regardless of how fast users click, so the box stays alive.
Observability: every job has a state you can inspect, count, and graph; waiting, active, completed, failed, delayed. A floating promise has none.

The NestJS Setup I Actually Deploy

The official NestJS integration exposes BullModule from the @nestjs/bullmq package. Connection config lives at the root; each feature module registers its own queue with default job options so the retry policy is owned by the queue, not scattered across producers:

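// app.module.ts: Redis connection declared once at the root
import { Module } from '@nestjs/common'
import { BullModule } from '@nestjs/bullmq'

@Module({
  imports: [
    BullModule.forRoot({
      connection: { host: process.env.REDIS_HOST, port: 6379 },
    }),
  ],
})
export class AppModule {}

// invoices.module.ts: each feature module registers its own queue
@Module({
  imports: [
    BullModule.registerQueue({
      name: 'invoices',
      defaultJobOptions: {
        attempts: 5,
        backoff: { type: 'exponential', delay: 2000 },
        removeOnComplete: { age: 86400, count: 5000 }, // age in seconds (24 h), at most 5,000 jobs
        removeOnFail: false, // keep failures for inspection
      },
    }),
  ],
})
export class InvoicesModule {}

// invoices.processor.ts
import { Processor, WorkerHost } from '@nestjs/bullmq'
import { Job } from 'bullmq'

@Processor('invoices', { concurrency: 8 })
export class InvoicesProcessor extends WorkerHost {
  // PdfService and MailerService are placeholder names for the app's own providers
  constructor(private readonly pdf: PdfService, private readonly mailer: MailerService) {
    super()
  }

  async process(job: Job<GenerateInvoiceDto>) {
    switch (job.name) {
      case 'generate-pdf':
        return this.pdf.render(job.data.invoiceId)
      case 'send-email':
        return this.mailer.sendInvoice(job.data.invoiceId)
      default:
        // fail loudly rather than completing an unknown job as a no-op
        throw new Error('Unknown job name: ' + job.name)
    }
  }
}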

Three defaults in that snippet do disproportionate work. The attempts and exponential backoff pair handles transient failure automatically. The removeOnComplete policy with an age limit keeps Redis memory flat instead of growing forever, which is the most common way teams accidentally fill a Redis instance. And removeOnFail set to false preserves terminal failures, because a failure you cannot inspect is a failure you cannot fix.

Retries and Backoff: Make Failure Boring

BullMQ retries a failing job when attempts is greater than one, and the backoff setting controls spacing. The built-in strategies are fixed, which waits the same delay every time, and exponential, which doubles the wait per attempt using the documented formula of two to the power of attempts-made minus one, multiplied by the base delay. With a 2-second base delay and five attempts, the schedules diverge like this:

Attempt	Fixed backoff (2s)	Exponential backoff (2s base)
1	immediate (first run)	immediate (first run)
2	after 2s wait	after 2s wait
3	after 2s wait	after 4s wait
4	after 2s wait	after 8s wait
5	after 2s wait (total ~8s of waiting)	after 16s wait (total ~30s of waiting)
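To make the arithmetic behind that table checkable, here is a minimal sketch that recomputes both schedules from the formula quoted above. The function names are mine and nothing here calls BullMQ; it only reproduces the delay math for a 2-second base and five attempts.

```typescript
type BackoffType = 'fixed' | 'exponential';

// Delay before the retry that follows `attemptsMade` failed runs.
function retryDelayMs(type: BackoffType, attemptsMade: number, baseDelayMs: number): number {
  if (type === 'fixed') return baseDelayMs;
  // 2^(attemptsMade - 1) * base delay
  return Math.pow(2, attemptsMade - 1) * baseDelayMs;
}

// Waits between runs for a job that is allowed `attempts` runs in total.
// The first run is immediate, so there are attempts - 1 waits.
function retrySchedule(type: BackoffType, attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(retryDelayMs(type, attemptsMade, baseDelayMs));
  }
  return delays;
}

function totalMs(delays: number[]): number {
  return delays.reduce((sum, d) => sum + d, 0);
}

const fixedSchedule = retrySchedule('fixed', 5, 2000);             // [2000, 2000, 2000, 2000]
const exponentialSchedule = retrySchedule('exponential', 5, 2000); // [2000, 4000, 8000, 16000]
```

The totals come out to 8 seconds of waiting for fixed and 30 seconds for exponential, which matches the last row of the table.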

Exponential is the right default for anything that calls an external system: if a marketplace API is down for a minute, hammering it every 2 seconds just burns your rate limit, while exponential spacing rides out the outage. I reserve fixed backoff for purely internal work like image resizing, where the failure cause is usually a transient resource spike. You can also register a custom backoff strategy for special cases, like reading a retry-after header from a 429 response and honoring it exactly.
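As a rough illustration of the retry-after case, the sketch below is a plain function shaped like a BullMQ custom backoff strategy (attempts made, backoff type, error). The retryAfterSeconds property is an assumption: it stands for whatever your own API client attaches to the error it throws on a 429. The 5-minute cap and the exponential fallback are also my choices, not something BullMQ prescribes.

```typescript
const BASE_DELAY_MS = 2000;
const MAX_DELAY_MS = 5 * 60 * 1000; // assumed upper bound so one bad header cannot park a job for hours

// Hypothetical helper: builds the kind of error an API client might throw on HTTP 429.
function rateLimitedError(retryAfterSeconds: number): Error {
  return Object.assign(new Error('429 Too Many Requests'), { retryAfterSeconds });
}

// Reads the assumed retryAfterSeconds property, if the error carries one.
function retryAfterSecondsOf(err: unknown): number | undefined {
  if (typeof err !== 'object' || err === null) return undefined;
  const value = (err as { retryAfterSeconds?: unknown }).retryAfterSeconds;
  return typeof value === 'number' && value >= 0 ? value : undefined;
}

// Returns the wait in milliseconds before the next attempt.
function retryAfterBackoff(attemptsMade: number, _type?: string, err?: Error): number {
  const retryAfter = retryAfterSecondsOf(err);
  if (retryAfter !== undefined) {
    // The server told us when to come back: honor it, within the cap.
    return Math.min(retryAfter * 1000, MAX_DELAY_MS);
  }
  // Anything else falls back to capped exponential spacing.
  return Math.min(Math.pow(2, attemptsMade - 1) * BASE_DELAY_MS, MAX_DELAY_MS);
}
```

As I understand the BullMQ docs, a function like this is passed as settings.backoffStrategy in the worker options, and jobs opt in with backoff: { type: 'custom' }. Check the exact signature against the version you run.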

Retries mean your processor will run more than once for the same logical work; that is the contract you signed. Every handler must be idempotent: generate the PDF to a deterministic path, upsert instead of insert, check whether the email send was already recorded before sending. The queue only promises at-least-once execution, rerunning the job until it succeeds or runs out of attempts; only your handler can make repeats harmless.
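A minimal sketch of what that looks like for the email and PDF cases. The Set and the array are in-memory stand-ins for a database table and a mail provider, and every name here is hypothetical; a real handler would be async and would lean on a unique constraint rather than a check followed by a write.

```typescript
// In-memory stand-ins (hypothetical): a "sent" ledger and a mail provider outbox.
const sentLedger = new Set<string>();
const outbox: string[] = [];

// Sends the invoice email at most once per invoice, however often the job is retried.
function sendInvoiceEmailOnce(invoiceId: string): 'sent' | 'skipped' {
  const key = 'invoice-email:' + invoiceId;
  if (sentLedger.has(key)) return 'skipped'; // an earlier attempt already did the work
  outbox.push(invoiceId);                    // the side effect
  sentLedger.add(key);                       // record it so the next attempt becomes a no-op
  return 'sent';
}

// Deterministic output path: a retried render overwrites the same file
// instead of producing invoice-1.pdf, invoice-1 (2).pdf, and so on.
function invoicePdfPath(invoiceId: string): string {
  return 'invoices/' + invoiceId + '.pdf';
}
```

One caveat: if the process dies between the send and the ledger write, a duplicate is still possible. Closing that window needs an idempotency key on the provider side or a transactional outbox, which is beyond this sketch.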

Dead-Letter Queues: Where Jobs Go to Be Debugged

Some failures are not transient. A malformed payload, a deleted record, a bug in the handler: these will fail attempt five exactly like attempt one. Without a plan, they sit in the failed set forever or, worse, get retried in an infinite loop that hides real errors in the noise. The pattern that works is a dead-letter queue: when a job exhausts its attempts, a worker event handler moves it to a separate queue with full context and pings a human:

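// Listen for jobs that exhausted all attempts
@OnWorkerEvent('failed')
async onFailed(job: Job | undefined, err: Error) {
  if (!job) return // the failed event can fire without a job reference
  if (job.attemptsMade >= (job.opts.attempts ?? 1)) {
    try {
      // terminal failure → park it in a DLQ for humans
      // awaited, so the hand-off is not itself a floating promise
      await this.dlqQueue.add('dead-letter', {
        source: 'invoices',
        jobName: job.name,
        data: job.data,
        error: err.message,
        failedAt: new Date().toISOString(),
      })
      this.alerts.notify('Invoice job dead-lettered: ' + job.id)
    } catch (dlqErr) {
      // never let a failed DLQ write disappear silently
      this.alerts.notify('Could not dead-letter invoice job ' + job.id + ': ' + String(dlqErr))
    }
  }
}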

// Replay later, after the bug is fixed:
// pull from DLQ → re-add to the original queue with attempts reset

The DLQ becomes a triage inbox with three outcomes per entry: replay after a fix, discard as obsolete, or convert into a bug ticket. On one ERP deployment this single pattern cut our mean time to noticing broken integrations from days, when a customer complained, to minutes, when the alert fired. The replay path matters as much as the parking path: keep job payloads small and self-contained, IDs rather than whole objects, so a week-old dead letter still replays correctly against current data.
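The replay path can be sketched in a few lines. The DeadLetter shape mirrors the payload the failed handler writes above; QueueLike is a minimal structural interface I made up so the example stands alone, intended to match the add(name, data, opts) call you would make on a BullMQ queue. The registry of queues keyed by source name is likewise an assumption about how you might wire it.

```typescript
// Payload shape written by the dead-letter handler.
interface DeadLetter {
  source: string;
  jobName: string;
  data: unknown;
  error: string;
  failedAt: string;
}

// Minimal structural view of a queue (assumption, not a BullMQ type).
interface QueueLike {
  add(name: string, data: unknown, opts?: { attempts?: number }): Promise<unknown>;
}

// Re-adds one dead letter to its original queue as a brand-new job,
// so its attempt counter starts again from zero.
async function replayDeadLetter(
  entry: DeadLetter,
  queues: Map<string, QueueLike>,
  attempts: number = 5,
): Promise<void> {
  const target = queues.get(entry.source);
  if (!target) {
    throw new Error('No queue registered for source "' + entry.source + '"');
  }
  await target.add(entry.jobName, entry.data, { attempts });
}
```

Because the payload carries only IDs, the replayed job fetches current state when it runs, which is what lets a week-old entry replay safely.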

Four Patterns That Pay Rent in Production

IDs in payloads, not objects

Enqueue the invoice ID, not the invoice. The worker fetches fresh state at execution time, so a job delayed by retries does not act on a stale snapshot. It also keeps Redis memory predictable.

Deduplication by job ID

BullMQ treats a custom jobId as unique while the job exists: adding the same ID again is ignored. Use deterministic IDs like sync-stock-sku123 to collapse duplicate triggers into one execution.
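A small sketch of the idea. The ID builder is the part you would actually keep; the Map is only an in-memory stand-in that imitates the behaviour described above (same ID while the job exists means the second add is ignored), so the example runs without Redis. All names are illustrative.

```typescript
// Deterministic ID: the same SKU always maps to the same job ID.
function stockSyncJobId(sku: string): string {
  return 'sync-stock-' + sku;
}

// In-memory stand-in for the queue's "job with this ID already exists" check.
const pendingJobs = new Map<string, { sku: string }>();

function enqueueStockSync(sku: string): 'added' | 'ignored' {
  const jobId = stockSyncJobId(sku);
  if (pendingJobs.has(jobId)) return 'ignored'; // duplicate trigger collapses into the existing job
  pendingJobs.set(jobId, { sku });
  return 'added';
}

// Once the job is finished and removed, the same ID can be enqueued again.
function completeStockSync(sku: string): void {
  pendingJobs.delete(stockSyncJobId(sku));
}
```

With a real queue the equivalent call is queue.add('sync-stock', { sku }, { jobId: stockSyncJobId(sku) }). Note that the ID stays taken for as long as the finished job is retained, so the removeOnComplete policy affects how soon the same ID is accepted again.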

Delayed and repeatable jobs

Payment reminders three days before due date are a delayed job. Nightly stock reconciliation is a repeatable job with a cron pattern. Both replace fragile external cron plus HTTP setups with one consistent system.
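For the reminder case, the only real logic is turning a due date into a delay. The sketch below does that and shows the option objects I would pass along; the helper names and the 02:00 schedule are assumptions for illustration. The repeat option has changed shape across BullMQ versions (newer releases steer you toward job schedulers), so treat the second object as a sketch and check your version.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Milliseconds from `now` until `daysBefore` days ahead of the due date.
// Never negative: a reminder that is already overdue should fire immediately.
function reminderDelayMs(dueDate: Date, now: Date, daysBefore: number = 3): number {
  const fireAtMs = dueDate.getTime() - daysBefore * DAY_MS;
  return Math.max(0, fireAtMs - now.getTime());
}

// Job options for a one-off delayed reminder.
function reminderJobOptions(dueDate: Date, now: Date): { delay: number } {
  return { delay: reminderDelayMs(dueDate, now) };
}

// Job options for nightly stock reconciliation: cron pattern for 02:00 every day.
const nightlyReconciliationOptions = { repeat: { pattern: '0 2 * * *' } };
```

Both option objects go in the third argument of queue.add, next to the job name and an ID-only payload.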

Separate queues per blast radius

Emails, PDFs, and external syncs get their own queues with their own concurrency. One stuck integration then cannot starve invoice generation, and you can pause a single queue during an incident.

Monitoring: the Queue Is Lying Until You Graph It

A queue that looks fine in development has two production failure modes you must see coming: backlog growth, when producers outpace consumers, and silent worker death, when no one is consuming at all. I export three numbers per queue into Prometheus and alert on them in Grafana:

Waiting count and its trend: a backlog that grows for 15 straight minutes means consumers are under-provisioned or stuck; alert before users notice latency.
Failed count delta: more than N new failures in 10 minutes is an incident signal, usually a downstream API change, not N independent coincidences.
Worker heartbeat: a gauge each worker updates every 30 seconds. If it goes stale while waiting count rises, the process is dead or stuck on a poisoned job, and that distinction decides whether you restart or dig.
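In practice I express these as alert rules in Grafana, but the logic is easier to see as plain predicates. The sketch below is illustrative only: the sample shape, the function names, and the "three missed beats" staleness threshold are my assumptions, and samples are expected oldest first.

```typescript
interface WaitingSample {
  atMs: number;    // when the sample was taken
  waiting: number; // waiting count at that moment
}

// True when the waiting count rose at every sample across the last `windowMs`.
function backlogGrowing(samples: WaitingSample[], windowMs: number = 15 * 60 * 1000): boolean {
  let oldest = Infinity;
  let newest = -Infinity;
  for (const s of samples) {
    oldest = Math.min(oldest, s.atMs);
    newest = Math.max(newest, s.atMs);
  }
  if (samples.length < 2 || newest - oldest < windowMs) return false; // not enough history yet
  const windowStart = newest - windowMs;
  let previous: number | undefined;
  let seen = 0;
  for (const s of samples) {
    if (s.atMs < windowStart) continue;
    if (previous !== undefined && s.waiting <= previous) return false; // flat or shrinking
    previous = s.waiting;
    seen++;
  }
  return seen >= 2;
}

// True when more than `threshold` new failures appeared between two readings.
function failureSpike(failedBefore: number, failedNow: number, threshold: number): boolean {
  return failedNow - failedBefore > threshold;
}

// True when the worker has missed several consecutive 30-second beats.
function heartbeatStale(lastBeatMs: number, nowMs: number, intervalMs: number = 30 * 1000, missedBeats: number = 3): boolean {
  return nowMs - lastBeatMs > intervalMs * missedBeats;
}
```

A stale heartbeat combined with a growing backlog is the "dead or stuck" case; a growing backlog with a fresh heartbeat points at under-provisioned consumers instead.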

Run your BullMQ workers as separate processes from your API, even on the same VPS. A worker chewing a CPU-heavy PDF render should never compete with request latency, and you gain independent deploys and restarts. In Docker Compose or Swarm that is just a second service sharing the same image with a different command.

Production Checklist

Defaults set per queue: attempts of 3 to 5, exponential backoff, removeOnComplete with an age cap, removeOnFail kept for inspection.
Every processor idempotent and payloads carrying IDs, not snapshots.
Dead-letter handler on the failed event with alerting and a documented replay procedure.
Queues split by failure domain with explicit concurrency per worker.
Waiting, failed, and heartbeat metrics graphed and alerted; Redis configured with persistence and maxmemory-policy noeviction so the queue store never silently drops jobs.

Queues Are an Investment in Boring Operations

The difference between a background job system you trust and one you fear is not the happy path; any tutorial gets a job from producer to processor in 20 lines. It is what happens on the bad day: the marketplace API goes down for an hour, a deploy lands mid-job, a poisoned payload arrives at 2 AM. Attempts and backoff absorb the transient failures, dead-letter queues catch the permanent ones, idempotent handlers make retries harmless, and metrics tell you about all of it before your users do. Set those four things up on day one and BullMQ will repay you with years of jobs you simply never think about.
