I asked this question on Stack Overflow but haven't gotten any responses, so I was hoping to get some help here.
I am working on upgrading our Dask infrastructure. Currently we create a LocalCluster for each Job (a Job is 1 of ~10 computationally intensive tasks kicked off when one of our users clicks a button in our web app). We are hoping to make use of adaptive clusters, probably a KubeCluster in production, such that when no users are using the app the cluster scales down to zero workers, and then scales up adaptively as more and more Jobs get kicked off.
I have identified three issues with this that I would like some help/clarification with.
Issue 1: I am currently trying a proof of concept on my Mac using a SpecCluster running in adaptive mode. So far this is not yielding promising results: I am getting lots of tornado.iostream.StreamClosedError: Stream is closed and distributed.comm.core.CommClosedError: in <TCP (closed)> errors from the workers, then the cluster seems to hold on to workers and the Job I am trying to run never finishes. My theory is that the SpecCluster is simply not scaling workers up quickly enough to handle the demand and then crashes out. (I can run the test Job very easily on a LocalCluster with 4 workers, so it's definitely not too big a job to run locally.) Am I doing something fundamentally wrong here? The SpecCluster looks like this:
from distributed import Nanny, Scheduler, SpecCluster

cluster = SpecCluster(
    scheduler={"cls": Scheduler, "options": {"host": "localhost", "port": 8786}},
    worker={"cls": Nanny, "options": {"nthreads": 1, "memory_limit": "4GB"}},
)
cluster.adapt(minimum=0, maximum_cores=4, wait_count=10, target_duration="2.5s")
Issue 2: We want to run the adaptive scheduler in a completely separate process; however, all the examples I have seen look like
cluster = AdaptiveCluster()
do_work()
Is there a recommended way to run an adaptive scheduler in its own process?
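For reference, here is the pattern I have been experimenting with (a minimal sketch, not an official Dask recipe; the function names and the use of LocalCluster rather than KubeCluster are just for illustration). The adaptive loop lives wherever the Cluster object lives, so one process owns the cluster and stays alive, while every other process only needs the scheduler address:

```python
from distributed import Client, LocalCluster

def run_cluster_process(port=0, processes=True):
    """Own the cluster, and its adaptive loop, in a dedicated process.

    port=0 lets the OS pick a free scheduler port; in production you would
    pin this to a known address. Keeping this process alive keeps the
    adaptive scaling running.
    """
    cluster = LocalCluster(scheduler_port=port, n_workers=0, processes=processes)
    cluster.adapt(minimum=0, maximum=4)  # scale 0..4 workers on demand
    return cluster

def submit_job(scheduler_address):
    """Any other process (e.g. the web app) connects by address only."""
    with Client(scheduler_address) as client:
        return client.submit(sum, [1, 2, 3]).result()
```

The web app never holds the Cluster object itself; it just creates a Client against the long-lived scheduler address, which is the part that generalizes to a KubeCluster running elsewhere.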
Issue 3: To mitigate issue 1, we can probably know roughly ahead of time how many workers a given Job is going to need, so we can manually scale the cluster out by n workers and let adaptivity scale the rest of the Job up and down as needed. However, going back to issue 2, how would we communicate with the adaptive scheduler to tell it to scale up by n?
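One workaround I have considered (an assumption on my part, not a documented Dask pattern; prewarm/relax are hypothetical helper names) is to re-call adapt() with a raised minimum when a Job arrives, since calling adapt() again replaces the previous adaptive policy:

```python
from distributed import LocalCluster

def prewarm(cluster, expected_workers, maximum=10):
    """Pin the floor at expected_workers before a Job lands.

    Adaptivity still handles everything above the floor; only the
    minimum changes.
    """
    return cluster.adapt(minimum=expected_workers, maximum=maximum)

def relax(cluster, maximum=10):
    """Once the Job is done, allow scale-down to zero again."""
    return cluster.adapt(minimum=0, maximum=maximum)
```

Since only the process that owns the Cluster object can call adapt(), the web app would still need some small channel (an HTTP endpoint, a queue message, whatever fits the stack) to ask that process to prewarm/relax; that plumbing is omitted here.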