Reliable Background Jobs with BullMQ and NestJS

Photo by Tiger Lily

Every backend eventually grows work that should not happen inside an HTTP request: rendering a 40-page invoice PDF, sending a batch of emails, syncing stock to a marketplace API that throttles you. The first version is always an await inside the controller, and it always ends the same way: a user staring at a spinner for 20 seconds, then a gateway timeout, then a duplicate submission. Background jobs are not an optimization. Past a certain job duration they are the only correct architecture.
My stack for this in Node.js land is BullMQ on Redis, wired into NestJS. BullMQ is a mature queue library built on Redis with Lua-scripted atomic operations, and it covers the four things production actually demands: retries with backoff, failure parking, concurrency control, and observability. This post is the setup I deploy for ERP workloads, with the operational lessons that the quickstart guides skip.
Firing an unawaited promise inside your request handler feels like a background job, but it lives and dies with the process. Deploy, crash, or scale down and the work silently evaporates. A Redis-backed queue gives the job a life of its own: it is stored outside the process, so it survives restarts and any worker can pick it up.
The official NestJS integration is the @nestjs/bullmq package, whose BullModule wraps BullMQ (the older @nestjs/bull package wraps the original Bull library instead). Connection config lives at the root; each feature module registers its own queue with default job options so the retry policy is owned by the queue, not scattered across producers:
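
```typescript
// app.module.ts: connection once, queues per feature
import { BullModule } from '@nestjs/bullmq'

// Both calls belong in the `imports` array of the relevant @Module()
BullModule.forRoot({
  connection: { host: process.env.REDIS_HOST, port: 6379 },
})

BullModule.registerQueue({
  name: 'invoices',
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: { age: 86400, count: 5000 }, // age is in seconds (24 hours)
    removeOnFail: false, // keep failures for inspection
  },
})
```

```typescript
// invoices.processor.ts
import { Processor, WorkerHost } from '@nestjs/bullmq'
import { Job } from 'bullmq'

@Processor('invoices', { concurrency: 8 })
export class InvoicesProcessor extends WorkerHost {
  // PdfService and MailerService stand in for the application's own injected services
  constructor(
    private readonly pdf: PdfService,
    private readonly mailer: MailerService,
  ) {
    super()
  }

  async process(job: Job<GenerateInvoiceDto>) {
    switch (job.name) {
      case 'generate-pdf':
        return this.pdf.render(job.data.invoiceId)
      case 'send-email':
        return this.mailer.sendInvoice(job.data.invoiceId)
      default:
        // fail loudly instead of silently completing an unknown job type
        throw new Error(`Unknown job name: ${job.name}`)
    }
  }
}
```

Three defaults in that snippet do disproportionate work. The attempts and exponential backoff pair handles transient failure automatically. The removeOnComplete policy with an age limit keeps Redis memory flat instead of growing forever, which is the most common way teams accidentally fill a Redis instance. And removeOnFail set to false preserves terminal failures, because a failure you cannot inspect is a failure you cannot fix.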
BullMQ retries a failing job when attempts is greater than one, and the backoff setting controls the spacing. There are two built-in strategies: fixed, which waits the same delay every time, and exponential, which doubles the wait on each retry using the documented formula 2^(attemptsMade - 1) × delay. With a 2-second base delay and five attempts, the schedules diverge like this:
| Attempt | Fixed backoff (2s) | Exponential backoff (2s base) |
|---|---|---|
| 1 | immediate (first run) | immediate (first run) |
| 2 | after 2s wait | after 2s wait |
| 3 | after 2s wait | after 4s wait |
| 4 | after 2s wait | after 8s wait |
| 5 | after 2s wait (total ~8s of waiting) | after 16s wait (total ~30s of waiting) |
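
The numbers in the table can be reproduced with a few lines of arithmetic. This is a standalone sketch of the two formulas as described above, not BullMQ's internal code; the helper name `retryDelays` is made up for illustration.

```typescript
// Illustrative helper, not part of BullMQ: lists the wait before each retry.
function retryDelays(
  kind: 'fixed' | 'exponential',
  baseMs: number,
  attempts: number,
): number[] {
  const delays: number[] = []
  // `attempts` counts the first run too, so there are attempts - 1 waits.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(
      kind === 'fixed' ? baseMs : 2 ** (attemptsMade - 1) * baseMs,
    )
  }
  return delays
}

function sumMs(delays: number[]): number {
  return delays.reduce((acc, ms) => acc + ms, 0)
}

const fixedSchedule = retryDelays('fixed', 2000, 5) // [2000, 2000, 2000, 2000]
const exponentialSchedule = retryDelays('exponential', 2000, 5) // [2000, 4000, 8000, 16000]
const fixedTotalMs = sumMs(fixedSchedule) // 8000
const exponentialTotalMs = sumMs(exponentialSchedule) // 30000
```

Running the same helper with your own base delay and attempt count is a quick way to check how long a job keeps retrying before it is declared dead.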
Exponential is the right default for anything that calls an external system: if a marketplace API is down for a minute, hammering it every 2 seconds just burns your rate limit, while exponential spacing backs off further the longer the outage lasts. Keep in mind that the five-attempt schedule above spans only about 30 seconds in total, so for outages measured in minutes you need a larger base delay or more attempts. I reserve fixed backoff for purely internal work like image resizing, where the failure cause is usually a transient resource spike. You can also register a custom backoff strategy for special cases, like reading a Retry-After header from a 429 response and honoring it exactly.
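
As a rough sketch of the Retry-After case: the function below computes the next wait from the error that caused the failure. The `retryAfterSeconds` property is an assumption about how your HTTP client surfaces the header, and the five-minute cap is a defensive choice of this example rather than something the header requires. In recent BullMQ versions a function of this shape can be passed as the worker's `settings.backoffStrategy`, with jobs opting in via `backoff: { type: 'custom' }`; check the documentation for the version you run.

```typescript
// Assumption: the HTTP client attaches the Retry-After value (in seconds)
// to the error it throws on a 429. The property name is hypothetical.
interface RetryAfterCarrier {
  retryAfterSeconds?: number
}

const BASE_DELAY_MS = 2000
const MAX_DELAY_MS = 5 * 60 * 1000 // cap so a bad header cannot park a job for hours

/** Returns the wait in milliseconds before the next attempt. */
function retryAfterBackoff(
  attemptsMade: number,
  _type?: string,
  err?: Error & RetryAfterCarrier,
): number {
  const hinted = err?.retryAfterSeconds
  if (typeof hinted === 'number' && hinted >= 0) {
    // The server told us when to come back: honor it, up to the cap.
    return Math.min(hinted * 1000, MAX_DELAY_MS)
  }
  // No hint: fall back to the same curve as the built-in exponential strategy.
  return Math.min(2 ** (attemptsMade - 1) * BASE_DELAY_MS, MAX_DELAY_MS)
}
```

Drop the cap if you want to follow the header exactly, at the cost of trusting whatever value the remote API sends.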
Retries mean your processor will run more than once for the same logical work; that is the contract you signed. Every handler must be idempotent: generate the PDF to a deterministic path, upsert instead of insert, check whether the email send was already recorded before sending. The queue gives you at-least-once execution: it keeps running the job until it succeeds or runs out of attempts. Only your handler can make repeats harmless.
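
The "check whether the send was already recorded" idea can be sketched like this. `SendLedger` is an in-memory stand-in invented for the example; a real deployment would use a database table with a unique key so the record survives restarts.

```typescript
// In-memory stand-in for a persistent "already sent" table (hypothetical).
class SendLedger {
  private readonly sent = new Set<string>()

  has(key: string): boolean {
    return this.sent.has(key)
  }

  record(key: string): void {
    this.sent.add(key)
  }
}

type SendEmail = (invoiceId: string) => Promise<void>

/** Sends the invoice email unless an earlier attempt already recorded it. */
async function sendInvoiceEmailOnce(
  invoiceId: string,
  ledger: SendLedger,
  sendEmail: SendEmail,
): Promise<'sent' | 'skipped'> {
  const key = `invoice-email:${invoiceId}`
  if (ledger.has(key)) return 'skipped' // a previous attempt got this far
  await sendEmail(invoiceId)
  ledger.record(key) // recorded only after the send succeeded
  return 'sent'
}
```

A crash between the send and the record can still produce one duplicate, so this narrows the window rather than closing it. If the mail provider accepts an idempotency key, passing the same key on every attempt closes the remaining gap.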
Some failures are not transient. A malformed payload, a deleted record, a bug in the handler: these will fail attempt five exactly like attempt one. Without a plan, they sit in the failed set forever or, worse, get retried in an infinite loop that hides real errors in the noise. The pattern that works is a dead-letter queue: when a job exhausts its attempts, a worker event handler moves it to a separate queue with full context and pings a human:

```typescript
// invoices.processor.ts (method on InvoicesProcessor; dlqQueue and alerts are injected)
// Listen for jobs that exhausted all attempts
@OnWorkerEvent('failed')
async onFailed(job: Job | undefined, err: Error) {
  if (!job) return // the event can fire without a job reference, e.g. for stalled jobs
  if (job.attemptsMade >= (job.opts.attempts ?? 1)) {
    try {
      // terminal failure → park it in a DLQ for humans
      await this.dlqQueue.add('dead-letter', {
        source: 'invoices',
        jobName: job.name,
        data: job.data,
        error: err.message,
        failedAt: new Date().toISOString(),
      })
      this.alerts.notify('Invoice job dead-lettered: ' + job.id)
    } catch (dlqErr) {
      // never let a DLQ write failure become an unhandled rejection
      this.alerts.notify('Could not dead-letter invoice job: ' + job.id)
    }
  }
}
// Replay later, after the bug is fixed:
// pull from DLQ → re-add to the original queue with attempts reset
```

The DLQ becomes a triage inbox with three outcomes per entry: replay after a fix, discard as obsolete, or convert into a bug ticket. On one ERP deployment this single pattern cut our mean time to noticing broken integrations from days, when a customer complained, to minutes, when the alert fired. The replay path matters as much as the parking path: keep job payloads small and self-contained, IDs rather than whole objects, so a week-old dead letter still replays correctly against current data.
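
The replay step might look roughly like the following. It is written against a minimal `QueueLike` interface, defined here for the example, whose `add(name, data, opts)` shape is intended to match BullMQ's `Queue#add`; the `DeadLetter` type mirrors the payload written by the failed-event handler above, and the default of five attempts is an assumption taken from the queue defaults earlier in the post.

```typescript
// Minimal structural type so the sketch runs without Redis.
interface QueueLike {
  add(
    jobName: string,
    data: unknown,
    opts?: { attempts?: number },
  ): Promise<unknown>
}

// Same fields the failed-event handler stores in the DLQ.
interface DeadLetter {
  source: string
  jobName: string
  data: unknown
  error: string
  failedAt: string
}

/** Re-enqueues a dead letter on its source queue as a brand-new job. */
async function replayDeadLetter(
  letter: DeadLetter,
  queues: Record<string, QueueLike>,
  attempts = 5,
): Promise<void> {
  const target = queues[letter.source]
  if (!target) {
    throw new Error(`No queue registered for source "${letter.source}"`)
  }
  // A new job starts with zero attempts made, so the retry budget is reset.
  await target.add(letter.jobName, letter.data, { attempts })
}
```

Because the payload carries only IDs, the replayed job reads current state when it runs instead of whatever was true on the day it failed.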
IDs in payloads, not objects
Enqueue the invoice ID, not the invoice. The worker fetches fresh state at execution time, so a job delayed by retries does not act on a stale snapshot. It also keeps Redis memory predictable.
Deduplication by job ID
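BullMQ treats a custom jobId as unique while the job exists: adding the same ID again is ignored. Use deterministic IDs like sync-stock-sku123 to collapse duplicate triggers into one execution. "While the job exists" includes completed and failed jobs that have not been removed yet, so your removeOnComplete and removeOnFail settings decide how long an ID stays blocked.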
Delayed and repeatable jobs
Payment reminders three days before due date are a delayed job. Nightly stock reconciliation is a repeatable job with a cron pattern. Both replace fragile external cron plus HTTP setups with one consistent system.
Separate queues per blast radius
Emails, PDFs, and external syncs get their own queues with their own concurrency. One stuck integration then cannot starve invoice generation, and you can pause a single queue during an incident.
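
Two of the patterns above reduce to small pure helpers, sketched here. The function names, queue names, and the 02:00 cron pattern are illustrative assumptions; the commented calls show the shape of the matching `Queue#add` options (`jobId`, `delay`, `repeat.pattern`). Newer BullMQ releases also offer a job scheduler API for repeating work, so check which one your version recommends.

```typescript
/** Deterministic job ID so repeated triggers for one SKU collapse into one job. */
function stockSyncJobId(sku: string): string {
  // Normalize so " SKU123 " and "sku123" map to the same job,
  // and keep the ID to a conservative character set.
  const normalized = sku.trim().toLowerCase().replace(/[^a-z0-9_-]/g, '-')
  return `sync-stock-${normalized}`
}

const DAY_MS = 24 * 60 * 60 * 1000

/** Milliseconds to wait so a reminder fires `daysBefore` days ahead of the due date. */
function reminderDelayMs(dueDate: Date, now: Date, daysBefore = 3): number {
  const fireAt = dueDate.getTime() - daysBefore * DAY_MS
  // If that moment has already passed, 0 means "run as soon as possible".
  return Math.max(fireAt - now.getTime(), 0)
}

// Shape of the calls, assuming stockQueue and remindersQueue are BullMQ queues:
//   await stockQueue.add('sync-stock', { sku }, { jobId: stockSyncJobId(sku) })
//   await remindersQueue.add('payment-reminder', { invoiceId },
//     { delay: reminderDelayMs(dueDate, new Date()) })
//   await stockQueue.add('reconcile-stock', {},
//     { repeat: { pattern: '0 2 * * *' } }) // nightly at 02:00
```

Note that every payload in those calls carries an ID or nothing at all, in line with the first pattern.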
A queue that looks fine in development has two production failure modes you must see coming: backlog growth, when producers outpace consumers, and silent worker death, when no one is consuming at all. I export three numbers per queue into Prometheus and alert on them in Grafana.
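
Which counts to export is a judgment call. The sketch below assumes waiting, active, and failed, and renders them in the Prometheus text exposition format. The metric name `bullmq_jobs` and the `CountsSource` interface are invented for the example; the interface is intended to match the shape of BullMQ's `Queue#getJobCounts(...types)`.

```typescript
// Minimal structural type so the sketch runs without Redis.
interface CountsSource {
  getJobCounts(...types: string[]): Promise<Record<string, number>>
}

// Assumption: these three states are the ones worth alerting on.
const TRACKED_STATES = ['waiting', 'active', 'failed']

/** Renders one gauge line per tracked state for a single queue. */
async function renderQueueMetrics(
  queueName: string,
  queue: CountsSource,
): Promise<string> {
  const counts = await queue.getJobCounts(...TRACKED_STATES)
  return TRACKED_STATES.map(
    (state) =>
      `bullmq_jobs{queue="${queueName}",state="${state}"} ${counts[state] ?? 0}`,
  ).join('\n')
}
```

With these gauges, a waiting count that keeps climbing points at backlog growth, and a climbing waiting count while active sits at zero is one signal that no worker is consuming.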
Run your BullMQ workers as separate processes from your API, even on the same VPS. A worker chewing a CPU-heavy PDF render should never compete with request latency, and you gain independent deploys and restarts. In Docker Compose or Swarm that is just a second service sharing the same image with a different command.
The difference between a background job system you trust and one you fear is not the happy path; any tutorial gets a job from producer to processor in 20 lines. It is what happens on the bad day: the marketplace API goes down for an hour, a deploy lands mid-job, a poisoned payload arrives at 2 AM. Attempts and backoff absorb the transient failures, dead-letter queues catch the permanent ones, idempotent handlers make retries harmless, and metrics tell you about all of it before your users do. Set those four things up on day one and BullMQ will repay you with years of jobs you simply never think about.