How to Shield Your App from a Rogue Sidekiq Job
Sidekiq is great for background processing—but a single misbehaving job (infinite loops, massive data scans, N+1 storms…) can choke your entire pipeline. Restarting Sidekiq only brings the offender right back.
How do you quarantine a bad job so that it can’t block all your critical work?
A Real-World Incident
A user submitted a report with a huge date range—no sanitization—and our ReportJob
tried to load terabytes of data. CPU spiked, threads stalled, and everything ground to a halt. Killing and restarting Sidekiq pods simply requeued the same job, and the outage persisted. We needed a way to isolate that single problematic worker.
Introducing a “Quarantine” Queue
The idea is simple: route suspect jobs into a low-concurrency queue serviced by a dedicated Sidekiq process. They’re still retried, but can’t block your high-priority queues.
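Serving that queue means running a second Sidekiq process pinned to it with minimal concurrency. One way to wire this up (the file name and settings below are a suggestion, not part of the original setup) is a dedicated config file, started with bundle exec sidekiq -C config/sidekiq_quarantine.yml:

# config/sidekiq_quarantine.yml
:concurrency: 1
:queues:
  - quarantine

Your regular Sidekiq processes keep serving default and critical, and simply never list quarantine among their queues.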
1. Moving jobs to the quarantine queue manually
We begin by intercepting jobs as they are enqueued and swapping their queue name if they match an ENV-driven “quarantine” list:
class QuarantineMiddleware
  # Client middleware: runs on every enqueue and reroutes blacklisted job classes.
  def call(worker_class, job, queue, redis_pool)
    if ENV.fetch('QUARANTINED_JOBS', '').split(';').include?(job['class'])
      job['queue'] = 'quarantine'
    end
    yield
  end
end
Sidekiq.configure_client do |config|
  config.client_middleware { |chain| chain.add QuarantineMiddleware }
end
Set QUARANTINED_JOBS=ReportJob;MegaLoopJob to start shunting those jobs into quarantine.
Inspired by this deep-dive on isolating memory-leak jobs.
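If copies of a bad job are already sitting in a queue, you can shift them as a one-off from a Rails console with Sidekiq's queue API (assuming here that they live in the default queue):

# Move every enqueued ReportJob from default into quarantine.
Sidekiq::Queue.new('default').each do |entry|
  next unless entry.klass == 'ReportJob'

  Sidekiq::Client.push(entry.item.merge('queue' => 'quarantine'))
  entry.delete
end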
2. Handling Infinite Loops on Restart
That client middleware only works on new enqueues; it can’t catch jobs Sidekiq auto-requeues on shutdown. To trap long-running jobs after a pod restart, we use a server middleware:
class LongRunningJobInterceptor
  QUARANTINE_QUEUE = 'quarantine'
  MAX_RUNTIME = 1.hour

  def call(worker, job, queue)
    jid = job['jid']
    if queue != QUARANTINE_QUEUE && should_quarantine?(jid)
      # This job was already running too long before the restart:
      # requeue it into quarantine instead of executing it here.
      worker.class.set(queue: QUARANTINE_QUEUE).perform_async(*job['args'])
      untrack(jid)
      return
    end

    track_start(jid)
    begin
      yield
      untrack(jid)
    rescue Sidekiq::Shutdown
      # Keep the started_at key: Sidekiq requeues the job with the same jid,
      # so we can recognize it as a long-runner after the restart.
      raise
    rescue StandardError
      untrack(jid)
      raise
    end
  end

  private

  def should_quarantine?(jid)
    started_at = Sidekiq.redis { |r| r.get("job:#{jid}:started_at") }
    return false unless started_at

    (Time.current - started_at.to_time) > MAX_RUNTIME
  end

  def track_start(jid)
    # Expire after a day so keys from hard-killed processes don't pile up in Redis.
    Sidekiq.redis { |r| r.set("job:#{jid}:started_at", Time.current.iso8601, ex: 1.day.to_i) }
  end

  def untrack(jid)
    Sidekiq.redis { |r| r.del("job:#{jid}:started_at") }
  end
end
Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add LongRunningJobInterceptor }
end
Now, whenever you restart Sidekiq, any job that had been running “too long” automatically moves into your quarantine queue instead of clogging default or critical queues.
3. Automating Detection & Restart
Maintaining a list of known bad actors is helpful, but it still leaves you manually updating ENV variables and restarting processes when things go awry. Let’s automate both detection and remediation: watch for runaway jobs and trigger a graceful process restart.
3.a Why Not Simple Timeouts?
Some teams rely on hard timeouts (for example, wrapping perform in Ruby's Timeout.timeout) or even OS-level kills. Unfortunately, these can:
- Abort mid-transaction, leaving partial writes or queued messages.
- Leak resources (DB connections, file handles), causing pool exhaustion.
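For instance, a hard timeout can fire between two related side effects and leave the system half-updated. A minimal sketch (Report and Notifier are illustrative stand-ins, not part of this app):

require 'timeout'

class HardTimeoutReportJob
  include Sidekiq::Job

  def perform(report_id)
    Timeout.timeout(30) do                        # hard cap on the whole job
      report = Report.find(report_id)
      Notifier.announce(report)                   # external side effect already sent...
      report.update!(delivered_at: Time.current)  # ...but Timeout::Error may fire before this line
    end
  end
end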
3.b Soft Timeout Monitoring
Instead of killing the job mid-flight, let’s signal our orchestrator to restart the entire Sidekiq process, then quarantine repeat offenders on the next run.
- Track job start times in a shared registry.
- Periodically scan for jobs exceeding a configurable threshold (e.g. 2 minutes).
- Flag the Sidekiq process as unhealthy (e.g. by touching a file).
- Let Kubernetes or Systemd detect the unhealthy marker and recycle the pod or service.
require 'fileutils'

class SidekiqMonitor
  HEALTH_FILE = Rails.root.join('tmp/sidekiq_unhealthy').freeze

  def self.start
    thread = Thread.new do
      loop do
        RunningJobs.list.each do |job|
          if Time.current - job[:started_at] > 2.minutes
            # Mark this process unhealthy so the orchestrator recycles it.
            FileUtils.touch(HEALTH_FILE)
            break
          end
        end
        sleep 1
      end
    end
    thread.name = 'sidekiq-monitor' # Thread.new has no name: keyword; set it explicitly
    thread
  end
end
class RunningJobs
  # Shared, thread-safe registry of the jobs currently executing in this process.
  REGISTRY = Concurrent::Map.new

  # Server middleware: record when each job starts so SidekiqMonitor can spot long-runners.
  def call(worker, job, queue)
    REGISTRY[Thread.current.object_id] = { jid: job['jid'], started_at: Time.current }
    yield
  ensure
    REGISTRY.delete(Thread.current.object_id)
  end

  def self.list
    REGISTRY.values
  end
end
# In config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add RunningJobs }
  SidekiqMonitor.start
end
Kubernetes (or Systemd) watches tmp/sidekiq_unhealthy and gracefully restarts the process. On restart, our LongRunningJobInterceptor (from step 2) reroutes any job that previously ran too long into the quarantine queue; just make sure the interceptor’s runtime threshold is no larger than the monitor’s, or nothing will ever be flagged for rerouting.
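If Sidekiq runs in Kubernetes, the check can be an exec liveness probe that fails once the marker file appears. A minimal sketch, assuming the app lives under /app (paths and timings are placeholders to tune):

# Excerpt from the Sidekiq container spec
livenessProbe:
  exec:
    command: ["/bin/sh", "-c", "test ! -f /app/tmp/sidekiq_unhealthy"]
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 2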
Conclusion
By combining:
- Client-side queue rerouting (for future enqueues),
- Server-side interception (for jobs requeued on shutdown), and
- Automated health monitoring with liveness probes,
you can ensure that one rogue Sidekiq job never brings down your entire background pipeline.