Running DeepCell on Google Batch with node pools

David Haley
8 min read · Jun 15, 2024

Coauthors: David Haley, Lynn Langit, Weihao Ge

We’re excited to share our first benchmarking results using an early access Google Cloud Batch feature called node pools. Node pools let Batch jobs reuse compute nodes (GCE VMs) rather than acquire and initialize a new node per job. Reusing nodes saves significant setup time between job runs of the same type.

Skipping setup matters most when setup and run times are comparable. In one of our use cases, job setup takes ~3 min whereas the job's task (predicting a 140 million pixel image) takes ~4 min, so setup is nearly half the job time. Our target benchmark uses 260k pixel images, which process in less than 30 seconds: setup takes 6x as long as the work itself!

This post starts with DeepCell itself (the problem we're trying to solve), then gives an overview of serverless DeepCell on GCP Batch with node pools, and closes with our results.

Overview: DeepCell

DeepCell is an open source cellular image segmentation tool by the Van Valen Lab at Caltech. Cellular segmentation means identifying (roughly, drawing outlines around) the cells in a human tissue sample image. DeepCell uses a TensorFlow model to predict the segmentation, which is used downstream in spatial omics pipelines for cancer research. For example, if we detect proteins at certain pixels, the segmentation maps those pixels to cells, telling us which proteins are present in which tissue cells.

DeepCell receives a tissue sample microscope image, and predicts a mapping from pixel to cell number.

A diagram showing an image entering DeepCell as a numpy array. It goes through preprocessing on the CPU, inference on the CPU and GPU with TensorFlow, then postprocessing with CPU. Then the output numpy array is returned as an annotated output image.
DeepCell process from input tissue image to segmented tissue image

DeepCell inputs come from wet lab tissue samples (like a cancer biopsy). Samples are treated with stains like DAPI that “dye” or react with target compounds such as DNA. The stain enhances contrast around the compounds of interest. Fluorescence microscopes and other instruments capture an image with an intensity channel per stain.

DeepCell prediction works by merging the stain intensities into membrane and nuclear channels, depending on which kind of compound each stain reacts with. At this point the image looks like a 2D array of pixels with two channels. In other words, the image is a numpy array of floats with shape (num_rows, num_columns, 2).
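
For concreteness, here is a minimal numpy sketch of that merge. The stain names and values are illustrative placeholders; a real pipeline may combine several stains per channel.

import numpy as np

num_rows, num_columns = 512, 512

# Illustrative per-stain intensity images from the microscope
# (random placeholders standing in for real measurements).
nuclear_stain = np.random.rand(num_rows, num_columns)   # e.g. DAPI, reacts with DNA
membrane_stain = np.random.rand(num_rows, num_columns)  # a membrane-targeting stain

# Merge into the two channels DeepCell expects: nuclear, then membrane.
image = np.stack([nuclear_stain, membrane_stain], axis=-1).astype(np.float32)
print(image.shape)  # (512, 512, 2)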

The DeepCell logic runs in three phases:

  1. Pre-processing the input:
    1a. Resize input image to match training model micron per pixel density.
    1b. Normalize input image intensity histogram. (Aka, reduce outliers.)
  2. AI inference:
    2a. Divide the preprocessed image into training-size tiles of 256x256 pixels (see the tiling sketch after this list).
    2b. Run tile batches through a TensorFlow keras model.
    2c. Recombine predicted tile results into overall predictions.
  3. Post-processing:
    3a. Translate the model result into local maxima for each cell.
    3b. Expand local maxima with “flood fill” to find cell boundaries.
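
To illustrate steps 2a and 2c, here is a simplified numpy-only sketch of tiling and recombining a 512x512 image. The real DeepCell tiler pads the image and overlaps tiles to avoid edge artifacts, which this sketch skips.

import numpy as np

TILE = 256  # the model was trained on 256x256 tiles
image = np.random.rand(512, 512, 2).astype(np.float32)
rows, cols, channels = image.shape

# 2a. Divide the image into non-overlapping 256x256 tiles.
tiles = (image
         .reshape(rows // TILE, TILE, cols // TILE, TILE, channels)
         .swapaxes(1, 2)
         .reshape(-1, TILE, TILE, channels))
print(tiles.shape)  # (4, 256, 256, 2): a batch ready for the TensorFlow model

# 2c. Recombine (predicted) tiles back into the full-size image.
restored = (tiles
            .reshape(rows // TILE, cols // TILE, TILE, TILE, channels)
            .swapaxes(1, 2)
            .reshape(rows, cols, channels))
assert np.array_equal(restored, image)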

Pre-processing is relatively fast. Prediction and post-processing take about the same amount of time with GPU acceleration; sometimes prediction is actually faster. Post-processing is surprisingly slow due to algorithmic inefficiencies, and we are tracking ongoing efforts to improve it on GitHub.

We initially benchmarked DeepCell on GCP Vertex AI. This was a quick way to get our feet wet with GPU-accelerated Python notebooks on GCP. Here’s how DeepCell timing breaks down by phase for GPU-accelerated prediction on Vertex AI for a 668 million pixel prostate cancer image:

A stacked bar chart showing the time spent in preprocessing, inference, and postprocessing.
Breakdown of time spent per phase

Of note, only the inference step uses the GPU even though it’s allocated for all three steps. The optimization here is another story for another day. (Spoiler alert: it will involve node pools.)

Once we established performance baselines using single VMs in Vertex AI, we moved on to exploring how best to implement this workload on Google Batch. We did this to support the full use case: guiding our researchers in running multiple samples efficiently as batch workloads on GCP.

Overview: Google Batch and node pools

A bioinformatician using DeepCell is primarily concerned with segmenting a set of tissue sample images as efficiently as possible, not with infrastructure. The computational challenge comes from the combination of complex analysis (a custom TensorFlow neural network) AND large image sizes, which can exceed 32 GB per image.

Batch is a good fit for the use case, in our view better than Kubernetes. Why? In a word: complexity. Deploying and maintaining Kubernetes clusters is challenging (maybe impossible?) for the typical bioinformatics team, which often doesn't have full- or part-time DevOps and cloud infrastructure practitioners. We believe that accessibility at scale will come from serverless deployments on Google Batch.

Using Batch and Node Pools

Using Google Batch, a researcher submits one or more jobs to the Batch API, which manages the VM-based infrastructure to run a set of configured, scalable jobs. Generally researchers submit one job per image file. Because each job is managed separately, the infrastructure is spun up, initialized, utilized, and shut down once per job. This is inefficient for workloads made up of many jobs of a similar type.
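
For illustration, here is roughly what submitting one such job looks like with the google-cloud-batch Python client. The project, region, container image, and input path are placeholders, and our actual job specification differs in its details.

from google.cloud import batch_v1

def submit_deepcell_job(project_id: str, region: str, job_id: str, image_uri: str) -> batch_v1.Job:
    client = batch_v1.BatchServiceClient()

    # One runnable: our DeepCell container, predicting a single image.
    runnable = batch_v1.Runnable()
    runnable.container = batch_v1.Runnable.Container()
    runnable.container.image_uri = image_uri
    runnable.container.commands = ["predict", "gs://my-bucket/samples/sample-01.npz"]

    task = batch_v1.TaskSpec()
    task.runnables = [runnable]

    # One task per job: the whole VM is ours for the duration of the job.
    group = batch_v1.TaskGroup()
    group.task_count = 1
    group.task_spec = task

    # Request the machine shape we benchmark with: n1-standard-8 + 1 T4 GPU.
    accelerator = batch_v1.AllocationPolicy.Accelerator()
    accelerator.type_ = "nvidia-tesla-t4"
    accelerator.count = 1
    policy = batch_v1.AllocationPolicy.InstancePolicy()
    policy.machine_type = "n1-standard-8"
    policy.accelerators = [accelerator]

    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate()
    instances.policy = policy
    instances.install_gpu_drivers = True

    allocation_policy = batch_v1.AllocationPolicy()
    allocation_policy.instances = [instances]

    job = batch_v1.Job()
    job.task_groups = [group]
    job.allocation_policy = allocation_policy
    job.logs_policy = batch_v1.LogsPolicy()
    job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

    request = batch_v1.CreateJobRequest()
    request.parent = f"projects/{project_id}/locations/{region}"
    request.job_id = job_id
    request.job = job
    return client.create_job(request)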

Node pools (currently in private preview) solve this problem by reusing GCE nodes created by Batch. Once a node finishes its task, it becomes available for more tasks for a configurable idle time. This means subsequent jobs assigned to the node skip allocation and software download & installation. Unless GCE spot nodes get preempted, node pools remove all setup time (excluding that required for the initial job).

To understand the impact, let’s look at running 15 DeepCell jobs on 512x512 pixel images (the DeepCell benchmark size). Assuming that:

  1. job infrastructure allocation is instantaneous (in reality it’s fast but not that fast)
  2. job nodes are never preempted
  3. each job takes 3 minutes with 60% on setup and 40% on workload
  4. max 4 jobs can be run in parallel due to GPU quota

Then the whole set of 15 jobs finishes in 12 minutes as shown here:

Timeline of job executions without node pool

Now let’s consider the same scenario but without any setup cost after the first set of jobs.

Timeline of job executions with node pool

Since node setup is 60% of the job’s 3 minutes, we’re saving 108 seconds per run after node pool initialization. That leaves just the workload’s 40% or 72 seconds. Net result: all 15 jobs finish in 6.5 minutes vs running in 12 minutes without a node pool.
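
For the curious, the back-of-the-envelope arithmetic looks like this in Python (same assumptions as the list above):

import math

jobs, parallel = 15, 4
setup_s, work_s = 0.6 * 180, 0.4 * 180       # 108 s setup + 72 s workload per 3 min job
waves = math.ceil(jobs / parallel)           # 4 waves of up to 4 parallel jobs

without_pool = waves * (setup_s + work_s)               # every wave pays setup
with_pool = (setup_s + work_s) + (waves - 1) * work_s   # only the first wave pays setup

print(without_pool / 60, with_pool / 60)  # 12.0 vs 6.6 minutes (the ~6.5 quoted above)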

We modeled the node pool savings for our scenario in this spreadsheet (feel free to copy!). This chart visualizes the savings as the number of jobs increases.

A line chart showing node pool savings versus job count. The line is flat, and steps up every 4 jobs.
Node pool savings as a function of job count

The formula for the savings is:

setup_time * (ceiling(job_count / parallel_factor) - 1)

In other words, researchers will save on setup time for each iteration, except the first. Less parallelism means more iterations, therefore more savings. In practice, parallelism is bounded by GPU quota and availability.
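
Here is that formula as a few lines of Python, reproducing the staircase in the chart above (the 108 s setup time and parallel factor of 4 come from our scenario):

import math

def node_pool_savings_s(setup_s, job_count, parallel_factor):
    # Setup is paid once per wave of parallel jobs; a node pool removes it
    # for every wave after the first.
    return setup_s * (math.ceil(job_count / parallel_factor) - 1)

# Savings step up every 4 jobs, matching the chart.
for job_count in range(1, 16):
    saved_min = node_pool_savings_s(108, job_count, 4) / 60
    print(f"{job_count:>2} jobs: {saved_min:.1f} min saved")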

In the real world, the savings can be even more dramatic because node allocation is indeed not instantaneous. Allocation is usually fast, but sometimes it takes minutes or more because GPUs are in high demand.

Speaking of the real world… Now that we’ve looked at these idealized figures, let’s look at some real-world benchmarks.

Benchmarking

Our benchmark was to process all 15 Mesmer sample images through DeepCell as individual Batch jobs, each running a single task. We ran on n1-standard-8 machines, each with 1 NVIDIA Tesla T4 GPU.

We measured these data points:

  • Total end-to-end time: the time from the earliest job creation, to the latest job completion (as reported by Batch). This is how long a researcher will wait.
  • Total job runtime: the total job duration (as reported by Batch). Excludes setup time. This is the time a researcher will pay for.

We tested 3 pool configurations: (1) no node pool, (2) a node pool starting cold with zero nodes, and (3) a node pool warmed up with 4 nodes. Results are shown below.

As expected, node pools don’t impact the total runtime. But using a node pool dramatically reduces the end-to-end time. Note that the total runtime ignores parallelism: if 2 jobs run in parallel, their duration is summed.

Processing the 15 Mesmer samples without a node pool took ~32.5 minutes. Using a node pool brought the time down to ~7.5 minutes. Using a warmed up node pool reduced the time by an order of magnitude: to just ~2.5 min (92% faster). This is a remarkable speedup.

Observations and takeaways

We’re very pleased with node pools! 🎉 The vast majority of our controllable setup time (i.e., time not spent waiting for a machine) goes to fetching & extracting a ~3 GB container (2–3 minutes). Repeating that setup for every job would add a lot of overhead when running multiple jobs. Node pools also reduce the pressure to optimize container size (see our previous work on container size).

This chart (source sheet) visualizes how much of the total job time is spent in setup. For a 30 min job, we’d spend ~10% in setup; for a 1 min job, it’s ~75%. Past 100 minutes of work, setup hardly matters.
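
The underlying math is just the setup fraction, assuming the ~3 minutes of setup described above:

setup_min = 3
for work_min in (1, 3, 10, 30, 100):
    setup_fraction = setup_min / (setup_min + work_min)
    print(f"{work_min:>3} min of work -> {setup_fraction:.0%} of job time is setup")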

Our researchers want us to optimize a solution for ease of implementation & deployment (given their team’s skill set). They also want their analysis to run on GCP at a reasonable cost and speed for their workloads. They are used to feedback cycles of hours to days: we need to worry more about minutes than seconds. Node pools sped up the workload by an order of magnitude.

That said, we see a few ways to improve our Batch implementation:

  • Consolidate jobs into multiple tasks. This would let Batch use infrastructure more efficiently. We chose simplicity for now: one image processed by one task of one job.
  • Predict several images at once. This is the software layer of consolidating into tasks, and it saves time in at least two categories (a sketch follows below):
    1. Model load. It takes ~10s to load the model from storage into memory, and each prediction needs the model. If a single process predicted multiple images, the model load time would be amortized across them.
    2. GPU “warmup”. For whatever reason, the first call to TensorFlow predict takes ~2.5s and the second takes 0.2s. 🤔 We wonder if it has to do with loading the model into GPU memory.
    INFO 2024-06-14T06:00:50.895170759Z Running inference
    INFO 2024-06-14T06:00:50.895173929Z Ran inference in 2.73 s
    INFO 2024-06-14T06:00:50.895176894Z Running inference again
    INFO 2024-06-14T06:00:50.895180222Z Ran inference again in 0.21 s

If we saved ~13s on model loading & GPU “warmup” on all jobs but the first four, then in theory that would save 11 * 13s = ~2.4 min of total runtime, or ~0.6 min accounting for 4 GPUs running in parallel. That’s ~20% off our current ~2.5 min end-to-end time.
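
Here is a minimal sketch of that consolidation using DeepCell’s Mesmer application, assuming the images are already loaded as (rows, columns, 2) numpy arrays; the image_mpp value and file names are illustrative:

import numpy as np
from deepcell.applications import Mesmer

# Load the TensorFlow model once (~10 s) instead of once per image.
app = Mesmer()

# Placeholder list of pre-loaded nuclear+membrane images.
images = [np.random.rand(512, 512, 2).astype(np.float32) for _ in range(3)]

# The first predict() call also absorbs the one-time GPU "warmup" cost.
for i, image in enumerate(images):
    batch = np.expand_dims(image, axis=0)       # Mesmer expects a batch dimension
    labels = app.predict(batch, image_mpp=0.5)  # (1, rows, cols, 1) integer cell labels
    np.save(f"labels_{i}.npy", labels)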

As a closing remark, if we didn’t have node pools, we’d use similar techniques to save on setup with Batch. But those techniques optimize within a job, which requires knowing the job’s tasks ahead of time and prevents us from optimizing infrastructure across jobs. Node pools deliver on serverless simplicity: we can focus more on submitting jobs and less on gathering & managing users’ jobs ourselves.

We are excited to continue this work to bring DeepCell to even more cancer researchers. Here’s our issue tracker — feel free to jump in! 😎
