I asked this question on Stack Overflow but haven't gotten any responses, so I was hoping to get some help here.
I am working on upgrading our Dask infrastructure. Currently we create a LocalCluster for each Job (a Job is 1 of ~10 computationally intensive tasks kicked off when one of our users clicks a button in our web app). We are hoping to make use of adaptive clusters, probably a KubeCluster in production, such that when no users are using the app the cluster scales down to zero workers, and then scales up adaptively as more and more Jobs get kicked off.
I have identified three issues with this that I would like some help/clarification with.
Issue 1: I am currently trying a proof of concept on my Mac using a SpecCluster running in adaptive mode. So far this is not yielding promising results: I am getting lots of tornado.iostream.StreamClosedError: Stream is closed and distributed.comm.core.CommClosedError: in <TCP (closed)> errors from the workers, then the cluster seems to hold on to workers and the Job I am trying to run never finishes. My theory is that the SpecCluster is simply not scaling workers up quickly enough to handle the demand and then crashes out. (I can run the test Job very easily on a LocalCluster with 4 workers, so it's definitely not too big a job to run locally.) Am I doing something fundamentally wrong here? The SpecCluster looks like this:
from distributed import Nanny, Scheduler, SpecCluster

cluster = SpecCluster(
    scheduler={"cls": Scheduler, "options": {"host": "localhost", "port": 8786}},
    worker={"cls": Nanny, "options": {"nthreads": 1, "memory_limit": "4GB"}},
)
cluster.adapt(minimum=0, maximum_cores=4, wait_count=10, target_duration="2.5s")
Issue 2: We want to run the adaptive scheduler in a completely separate process; however, all the examples I have seen look like
cluster = AdaptiveCluster()
do_work()
Is there a recommended way to run an adaptive scheduler in its own process?
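For reference, here is the pattern I have been experimenting with (a minimal sketch, not an official Dask recipe; the function names and the use of LocalCluster rather than KubeCluster are just for illustration). The adaptive loop lives wherever the Cluster object lives, so one process owns the cluster and stays alive, while every other process only needs the scheduler address:

```python
from distributed import Client, LocalCluster

def run_cluster_process(port=0, processes=True):
    """Own the cluster, and its adaptive loop, in a dedicated process.

    port=0 lets the OS pick a free scheduler port; in production you would
    pin this to a known address. Keeping this process alive keeps the
    adaptive scaling running.
    """
    cluster = LocalCluster(scheduler_port=port, n_workers=0, processes=processes)
    cluster.adapt(minimum=0, maximum=4)  # scale 0..4 workers on demand
    return cluster

def submit_job(scheduler_address):
    """Any other process (e.g. the web app) connects by address only."""
    with Client(scheduler_address) as client:
        return client.submit(sum, [1, 2, 3]).result()
```

The web app never holds the Cluster object itself; it just creates a Client against the long-lived scheduler address, which is the part that generalizes to a KubeCluster running elsewhere.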
Issue 3: To mitigate issue 1, we can probably know roughly ahead of time how many workers a given Job is going to need, so we can manually scale the cluster out by n workers and let adaptivity scale the rest of the Job up and down as needed. However, going back to issue 2, how would we communicate with the adaptive scheduler to tell it to scale up by n?
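One workaround I have considered (an assumption on my part, not a documented Dask pattern; prewarm/relax are hypothetical helper names) is to re-call adapt() with a raised minimum when a Job arrives, since calling adapt() again replaces the previous adaptive policy:

```python
from distributed import LocalCluster

def prewarm(cluster, expected_workers, maximum=10):
    """Pin the floor at expected_workers before a Job lands.

    Adaptivity still handles everything above the floor; only the
    minimum changes.
    """
    return cluster.adapt(minimum=expected_workers, maximum=maximum)

def relax(cluster, maximum=10):
    """Once the Job is done, allow scale-down to zero again."""
    return cluster.adapt(minimum=0, maximum=maximum)
```

Since only the process that owns the Cluster object can call adapt(), the web app would still need some small channel (an HTTP endpoint, a queue message, whatever fits the stack) to ask that process to prewarm/relax; that plumbing is omitted here.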