Scaling Sidekiq
Sidekiq’s architecture makes it easy to scale up to thousands of jobs per second and millions of jobs per day. Scaling Sidekiq can simply be a matter of “adding more servers”, but how do you optimize each server, how “big” do the servers need to be, and how do you know when to add more? Those are the questions this guide will answer.
Let’s start with an overview of Sidekiq’s architecture and the various “levers” we have available to us. We’ll also define some terms we’ll use throughout this guide.
- Concurrency - The Sidekiq setting that controls the number of threads available to a single Sidekiq process.
- Swarm - A feature of Sidekiq Enterprise that supports running multiple Sidekiq processes on a single container.
- Container - A container instance running one or more Sidekiq processes. You might call this a server, service, dyno, pod, etc. We’ll just call them containers.
- Total concurrency - The total number of Sidekiq threads across all containers and processes.
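As a rough illustration of how these terms combine, total concurrency is just the product of containers, processes per container, and threads per process. The helper name and the example numbers below are hypothetical and only meant to make the arithmetic concrete.

```ruby
# Hypothetical helper: total Sidekiq threads across a whole deployment.
# Defaults assume one process per container and five threads per process.
def total_concurrency(containers:, processes_per_container: 1, concurrency: 5)
  containers * processes_per_container * concurrency
end

# 4 containers x 1 process x 5 threads
total_concurrency(containers: 4) # => 20

# 4 containers x 2 processes (e.g. via Swarm) x 10 threads
total_concurrency(containers: 4, processes_per_container: 2, concurrency: 10) # => 80
```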
Here’s a diagram that shows the relationship between these concepts:
Sidekiq is of course all about queues, so let’s clarify some terms here.
- Queues - You put your jobs into queues (which live in Redis), and Sidekiq processes the jobs in the queue, oldest first (FIFO). When starting a Sidekiq process, you tell it which queues to monitor and how to prioritize them.
- Queue assignment - You can assign queues (or groups of queues) to specific Sidekiq processes, or you can have a single queue assignment used by all Sidekiq processes.
- Queue priority - When assigning multiple queues to a process, Sidekiq has a couple of fetch algorithms that dictate how it pulls jobs from those queues: strict and weighted. We’ll call those the queue priority.
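To get a feel for the difference between strict and weighted priority, here is a simplified model in plain Ruby. It is a sketch for intuition only and is not Sidekiq’s actual fetch code; the method names and the weights are assumptions.

```ruby
# Strict: queues are always checked in the order they were listed.
def strict_order(queues)
  queues.dup
end

# Weighted (simplified model): repeat each queue name by its weight,
# shuffle, then keep the first occurrence of each name. Heavier queues
# tend to land earlier in the order, but every queue still gets checked.
def weighted_order(weights, random: Random.new)
  weights.flat_map { |name, weight| [name] * weight }
         .shuffle(random: random)
         .uniq
end

strict_order(%w[urgent default low]) # => ["urgent", "default", "low"]

# The order varies per call; "urgent" comes first most often.
weighted_order({ "urgent" => 3, "default" => 2, "low" => 1 })
```

With strict priority, a busy first queue can starve the queues behind it. With weighted priority, lower-weight queues are still checked first some of the time.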
And finally we have our connection pools. Yes, multiple connection pools.
- Database connection pool - A pool of database connections shared by all Sidekiq threads within a process. This is managed by Rails and configured in database.yml.
- Redis connection pool - A pool of Redis connections shared by all Sidekiq threads and Sidekiq internals within a process. This is managed by redis-client and is configured automatically by Sidekiq based on your concurrency.
In total this is a lot of concepts and configurations. The good news is most of them are handled for us or are straightforward to configure ourselves.
These are some general recommendations that will help things run smoothly in the beginning of an app and prepare you to scale later.
The fewer queues the better. Don’t make your life harder than it needs to be. Two or three queues are plenty for a new app. We’ll talk later about when it makes sense to add more queues, but scaling will generally be more challenging the more queues you have.
Name your queues based on priority or urgency. Some teams name their queues using domain-specific terms that are no help at all when it comes to planning queue priority or latency requirements. “Urgent”, “default”, and “low” are much easier to work with. You might take it a step further and embrace Gusto’s approach of latency-based queue names such as “within_30_seconds”, “within_5_minutes”, etc. This approach makes it very clear which queues have priority and when queue latency is unacceptable.
Keep your jobs as small as possible! Fan out large jobs into many small jobs. Smaller jobs are much easier to scale, but we’ll talk later about strategies to use when this isn’t possible.
Run a single Sidekiq process per container. You can add Sidekiq Swarm later, but don’t assume you’ll need it. This is one less variable to juggle when scaling. Keep it simple.
Choose a container size based on memory. If you’re working with a lot of large files, such as generating PDFs or importing large CSV files, you’ll need more memory. If you’re not doing that, you can probably get away with 1GB or less.
Start with five threads per process (concurrency). This is just a starting point—you will need to tweak it. Many teams get too ambitious with their concurrency, saturating their CPU and slowing down all jobs. The good news is five is the Sidekiq default, so if you don’t do anything, you’ll have a good starting point.
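The “fan out large jobs” recommendation above can be sketched in plain Ruby. The `fan_out` helper and the batch size are hypothetical. In a Sidekiq app the block would typically enqueue a small job, for example through a job class’s `perform_async`, rather than append to an array.

```ruby
# Hypothetical fan-out helper: split one large unit of work into small
# batches and hand each batch to the caller's block for enqueueing.
def fan_out(record_ids, batch_size:)
  record_ids.each_slice(batch_size) { |batch| yield batch }
end

# Stand-in for a real queue so the sketch runs without Sidekiq.
enqueued = []
fan_out((1..10).to_a, batch_size: 4) { |batch| enqueued << batch }

enqueued # => [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```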
These guidelines will get you started, but what about optimizing your configuration and scaling beyond the basics? That’s what we’ll tackle in the following sections.
Depending on your container CPU and the type of work your jobs are doing (mainly the percentage of time spent in I/O), you’ll probably need to tweak your concurrency setting. As a very simple rule, you want CPU usage to be high but not 100% when all threads are in use.
If CPU is hitting 100%, you need to reduce your concurrency. If your CPU usage never goes above 50% at max throughput, you probably want to increase your concurrency.
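The rule of thumb above can be written down as a tiny decision helper. This is only a sketch: the method name is hypothetical, and the input is assumed to be the CPU percentage you observe while all threads are busy.

```ruby
# Hypothetical helper encoding the rule of thumb:
#   CPU pinned at 100%        -> lower concurrency
#   CPU never above 50%       -> raise concurrency
#   anything in between       -> leave it alone
def concurrency_advice(cpu_percent_at_max_throughput)
  if cpu_percent_at_max_throughput >= 100
    :decrease
  elsif cpu_percent_at_max_throughput <= 50
    :increase
  else
    :keep
  end
end

concurrency_advice(100) # => :decrease
concurrency_advice(35)  # => :increase
concurrency_advice(80)  # => :keep
```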
Use RAILS_MAX_THREADS to tweak concurrency. When you decide to tweak your concurrency, you could configure it with the -c CLI flag, but Sidekiq will also respect the RAILS_MAX_THREADS environment variable. This is what Rails uses by default to configure your database pool in database.yml, so by embracing this convention, your database pool will always be correctly sized for your Sidekiq process.
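To see why the convention keeps the pool in sync, here is a small sketch that renders a trimmed-down database.yml through ERB and reads back the pool size. The template is a stand-in for your real file, and the adapter and the helper names are assumptions made for this example.

```ruby
require "erb"
require "yaml"

# A trimmed-down stand-in for database.yml. The pool line mirrors the
# RAILS_MAX_THREADS convention; the adapter is a placeholder.
def database_yml_template
  <<~YAML
    production:
      adapter: postgresql
      pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  YAML
end

# Evaluate the ERB first, then parse the YAML and read the pool size.
def pool_size(template, environment: "production")
  rendered = ERB.new(template).result
  YAML.safe_load(rendered).fetch(environment).fetch("pool")
end

ENV["RAILS_MAX_THREADS"] = "8"
pool_size(database_yml_template) # => 8
```

Because the same variable drives both Sidekiq’s thread count and the pool size, changing it in one place changes both.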
Don’t waste your energy calculating how many containers you need to run. Sidekiq loads are highly variable by nature, and you don’t want to pay for a cluster of 10 containers when no jobs are enqueued. Autoscaling solves this problem by automatically scaling your containers up and down, but what metric should you use for autoscaling?
Sidekiq workloads are more often I/O-bound than CPU-bound. In other words, you can easily encounter a queue backlog even when CPU utilization is low. This makes CPU an inappropriate and frustrating metric to use for autoscaling, even though it’s the metric most commonly used by tools like AWS CloudWatch.
Instead, you should autoscale your Sidekiq containers using queue latency. Your business requirements will have an implicit (or hopefully explicit) expectation of how long each job can reasonably wait before being processed. This expectation makes queue latency the perfect metric for autoscaling. (And if you’re using latency-based queue names, you’ve already identified those latency expectations!)
Several services exist for autoscaling Sidekiq based on queue latency:
(*) You’ll need to measure queue latency yourself and report it to CloudWatch or HPA.
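If you do measure latency yourself, the idea can be sketched as follows. Queue latency is the age of the oldest job still waiting. Inside a Sidekiq app, Sidekiq::Queue#latency reports this for you, so the first helper only illustrates the definition. The scaling policy, its thresholds, and both helper names are hypothetical.

```ruby
# Queue latency: how long the oldest waiting job has been enqueued, in seconds.
def queue_latency(oldest_enqueued_at, now: Time.now)
  return 0.0 if oldest_enqueued_at.nil? # empty queue

  [now - oldest_enqueued_at, 0.0].max
end

# Hypothetical policy: add a container while latency exceeds the target,
# remove one when the queue is empty, otherwise hold steady.
def desired_containers(current:, latency:, target_latency:, min: 1, max: 10)
  desired =
    if latency > target_latency
      current + 1
    elsif latency.zero?
      current - 1
    else
      current
    end
  desired.clamp(min, max)
end

latency = queue_latency(Time.at(100), now: Time.at(145)) # => 45.0
desired_containers(current: 2, latency: latency, target_latency: 30) # => 3
```

The target latency maps naturally onto latency-based queue names: a “within_30_seconds” queue would use a target of 30.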
Sometimes it makes sense to add a queue for a specific job or a particular “shape” of job. Some examples:
- If you’re unable to break down large jobs into smaller jobs, you might not want those long-running jobs to become a bottleneck in your queue.
- If you have some jobs that use lots of memory, you might need a larger container for those jobs.
- If you have jobs that can’t be processed in parallel, you might need those jobs on a dedicated queue that runs single-threaded.
These aren’t ideal scenarios, but they’re real-world scenarios that many apps will encounter. It’s best to treat these queues as the anomalies they are and dedicate them to their own Sidekiq process. This way your long-running jobs will only block other long-running jobs, and your memory-hungry jobs won’t require all of your jobs to run on larger, higher-priced containers.
This isolation makes scaling easier because you’re scaling your “special” queues separately from your “normal” queues. Here’s what it might look like in a Procfile, using RAILS_MAX_THREADS to force the memory-hungry jobs to be processed single-threaded (reducing memory bloat):
web: bundle exec rails s
worker: bundle exec sidekiq -q within_30_sec -q within_5_min -q within_5_hours
worker_high_mem: RAILS_MAX_THREADS=1 bundle exec sidekiq -q high_mem
The best way to make scaling easy is by keeping it simple: a few queues with small jobs. But of course keeping it simple isn’t always easy, especially in a legacy codebase or a large team. Here are some of the problems or anti-patterns you’ll generally want to avoid:
- Not enough connections in your database pool. If you’re seeing the dreaded ActiveRecord::ConnectionTimeoutError in your Sidekiq jobs, chances are you’ve misconfigured your database connection pool. Make sure your database.yml is using RAILS_MAX_THREADS as the pool size, and use RAILS_MAX_THREADS instead of -c to configure your concurrency.
- ERR max number of clients reached. Unlike the error above, this error is coming from Redis, and it usually means you’re using a Redis service with an extremely limited number of connections available. You can either upgrade your Redis service or reduce your concurrency setting.
- Slow job performance / saturated CPU. These go hand-in-hand when you’ve set your concurrency too high. Reduce your concurrency or use a more powerful container.
- Sporadic queue backlogs. Most apps have extremely variable load patterns for background jobs. If you don’t have autoscaling in place, you’ll need to run more containers to avoid these backlogs.
- Unreliable autoscaling. If you’re not scaling up and down when expected, you’re probably autoscaling based on CPU. Autoscale based on queue latency instead.
- Memory bloat. If your worker containers are using way more memory than you expect, you can either fix the memory bloat, or isolate those jobs to their own queue and process, potentially processing them single-threaded.
- Upstream (database) slow-down. It’s easy to scale Sidekiq to the point that you’re overloading your database. There’s no Sidekiq fix here—you either need to reduce total concurrency to alleviate DB pressure, upgrade your database, or make your queries more efficient.
The short answer here is that Redis is almost never the problem when scaling Sidekiq. But for very high-scale apps, you might hit the limits of what’s possible with a single Redis server. The sharding wiki article walks you through some options here, and now Dragonfly might be an even better option.
Just remember that most apps don’t need this! Make sure you’ve worked through the earlier suggestions and confirmed that Redis is your bottleneck before proceeding down these paths.
Nate Berkopec dives deep into many of the ideas discussed above in his excellent book Sidekiq in Practice. He also has an in-depth article that explains the relationship between processes, threads, and the GVL. For more on latency-based queue names, check out Scaling Sidekiq at Gusto.