DRMAA? #2205

Open
hjoliver opened this issue Mar 12, 2017 · 8 comments

@hjoliver
Member

Could we replace our custom support for each batch scheduler (and batch scheduler job submission via the shell: qsub etc.) with generic use of DRMAA?

From wikipedia:

DRMAA or Distributed Resource Management Application API is a high-level Open Grid Forum API specification for the submission and control of jobs to a Distributed Resource Management (DRM) system, such as a Cluster or Grid computing infrastructure. The scope of the API covers all the high level functionality required for applications to submit, control, and monitor jobs on execution resources in the DRM system.
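As a rough illustration of the "submit" part of that scope, here is a minimal sketch that assumes the `drmaa-python` binding and its `Session` methods `createJobTemplate()`, `runJob()` and `deleteJobTemplate()`. The helper `submit_job` and the `FakeSession` stand-in are hypothetical names, added only so the sketch runs without a DRM installed; this is not how cylc submits jobs.

```python
from types import SimpleNamespace


def submit_job(session, command, args=(), native_spec=""):
    """Submit one job through a DRMAA-style session and return its job ID.

    `session` is assumed to behave like a drmaa-python Session.
    """
    template = session.createJobTemplate()
    try:
        template.remoteCommand = command
        template.args = list(args)
        # Scheduler-specific options are passed through as an opaque string,
        # so this part is not portable between batch systems.
        template.nativeSpecification = native_spec
        return session.runJob(template)
    finally:
        # Always release the template, even if submission fails.
        session.deleteJobTemplate(template)


class FakeSession:
    """Hypothetical stand-in for drmaa.Session; records submissions in memory."""

    def __init__(self):
        self.submitted = []

    def createJobTemplate(self):
        return SimpleNamespace()

    def runJob(self, template):
        self.submitted.append(
            (template.remoteCommand, template.args, template.nativeSpecification)
        )
        return str(len(self.submitted))

    def deleteJobTemplate(self, template):
        pass
```

With a real installation the same helper would presumably be given a `drmaa.Session()` that has been initialized, which in turn requires the scheduler's DRMAA C library to be present on the host.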

Also:

Possible problems?

  • do all common batch schedulers support the API?
  • does the API support, or provide a way to pass through, all the low level resource directives often needed when running large scientific models?
  • no use for background and at jobs (but maybe we could write an API back end for them).
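
The last point could be approached with a small back-end interface of our own. The sketch below is only an assumption about what that might look like: `JobBackend`, `BackgroundBackend` and the status strings are hypothetical names, not part of DRMAA or cylc. It runs background jobs with `subprocess` behind the same submit/poll/kill calls that a DRMAA-backed back end would offer.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class JobBackend(ABC):
    """Hypothetical common interface for all job submission back ends."""

    @abstractmethod
    def submit(self, argv):
        """Start a job and return its ID as a string."""

    @abstractmethod
    def poll(self, job_id):
        """Return 'running', 'done' or 'failed'."""

    @abstractmethod
    def kill(self, job_id):
        """Stop a job."""


class BackgroundBackend(JobBackend):
    """Runs jobs as local background processes."""

    def __init__(self):
        self._procs = {}

    def submit(self, argv):
        proc = subprocess.Popen(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        job_id = str(proc.pid)
        self._procs[job_id] = proc
        return job_id

    def poll(self, job_id):
        code = self._procs[job_id].poll()
        if code is None:
            return "running"
        return "done" if code == 0 else "failed"

    def kill(self, job_id):
        proc = self._procs[job_id]
        proc.terminate()
        proc.wait()  # reap the process so poll() reports its final state
```

For example, `BackgroundBackend().submit([sys.executable, "-c", "pass"])` starts a trivial job and returns its process ID as the job ID.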
@hjoliver hjoliver added this to the some-day milestone Mar 12, 2017
@matthewrmshin
Contributor

matthewrmshin commented Mar 14, 2017

Although it sounds like a good idea, I have many questions.

There is a pure Python implementation of the Open Grid Forum API called SAGA, which does look like it can do most of what we want, but it looks seriously heavyweight at almost 40K lines of code (cf. cylc, which has about 55K lines of internal code). The PBS adaptor alone has >1300 lines of code, with lots of logic to deal with output from qsub, qstat, etc. from different versions of PBS. It is very impressive, but it is also quite scary. (We only have 78 lines of code in cylc's PBS handler!)

The alternative is to use Python bindings to the C API. The problem is that this route looks very fragmented and carries a lot of dependencies. Are we guaranteed that sites using cylc will all have the DRMAA C libraries installed for their batch schedulers? (Or do they come with the batch schedulers as standard these days?)

I think the best strategy for cylc is to have a good look at the Open Grid Forum documentation and see if we can align our configuration, naming convention and terminology with theirs. If and when the Open Grid Forum technology and APIs really become out-of-the-box standards for batch schedulers, we will then have no problem migrating cylc to use those APIs.

@hjoliver
Member Author

Yeah, our current batch system support is light-weight and flexible. A downside is that we can't express batch system directives generically, which would make suite portability easier. I presume this API allows that, but if it is (I've heard rumours) "lowest common denominator" support, that might be a problem for some of our use cases.
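
To make "expressing directives generically" concrete, here is a minimal sketch under stated assumptions. The generic keys (`job_name`, `queue`, `wall_time`), the `SCHEDULERS` table and `render_directives` are all hypothetical and not cylc configuration; only the PBS and Slurm option spellings are real. The `extra` argument passes scheduler-specific directives through untouched, which is one way around a "lowest common denominator" mapping.

```python
# Hypothetical table: scheduler name -> (directive prefix, generic key -> option).
SCHEDULERS = {
    "pbs": (
        "#PBS",
        {
            "job_name": "-N {}",
            "queue": "-q {}",
            "wall_time": "-l walltime={}",
        },
    ),
    "slurm": (
        "#SBATCH",
        {
            "job_name": "--job-name={}",
            "queue": "--partition={}",
            "wall_time": "--time={}",
        },
    ),
}


def render_directives(scheduler, resources, extra=()):
    """Translate generic resource requests into job script directive lines."""
    prefix, templates = SCHEDULERS[scheduler]
    lines = []
    for key, value in resources.items():
        if key not in templates:
            raise KeyError(f"no generic mapping for {key!r} on {scheduler}")
        lines.append(f"{prefix} {templates[key].format(value)}")
    # Anything without a generic equivalent is passed through verbatim.
    lines.extend(f"{prefix} {raw}" for raw in extra)
    return lines
```

The same `resources` dictionary can then be rendered for either scheduler, so only the `extra` entries would need changing when a suite moves between sites.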

@matthewrmshin
Contributor

Ideally, batch schedulers will move their CLI options and job directives in the direction of the DRMAA API; suite portability will then be solved without us having to do anything!

@kinow
Member

kinow commented Jun 12, 2018

Some time ago I used PBS Torque, and wrote a small Java program. It got a few more users who asked for DRMAA support. In the end I never finished the implementation, but I remember using Galaxy's API as reference.

Galaxy is a well-established bioinformatics middleware (I like it for visualisation, but you can use it for workflow orchestration, data management, etc., as well).

It is written in Python and supports DRMAA: https://docs.galaxyproject.org/en/latest/admin/cluster.html

It has been ages since I looked at Galaxy's code, but I remember that I enjoyed reading it and quickly found what I was looking for. So perhaps this file could be useful for whoever implements it: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/runners/drmaa.py

Cheers

@kinow
Member

kinow commented Jun 13, 2019

Was reading my news feed when I saw this blog post by Dask: Dask on HPC.

There is one researcher in Auckland who uses Dask to parse and process large climate datasets in his own notebook, as Dask helps him break the work down into smaller chunks that can be processed in parallel.

I haven't used it, but noticed there are a few Python tools coming out with Dask support, e.g. Apache Airflow.

The blog post has a paragraph that could be interesting for this issue:

Dask interacts natively with our existing job schedulers (SLURM/SGE/LSF/PBS/…) so there is no additional system to set up and manage between users and IT. All the infrastructure that we need is already in place.

Planning to spend some training time until the end of the year to learn more about Dask, and understand how it can be used by tools and libraries.

@hjoliver
Member Author

I believe Met Office post processing people already use Dask inside Cylc suites (presumably to distribute single Python tasks across multiple cores). We should talk about this next week in Exeter.

@matthewrmshin
Contributor

E.g. Iris is configured to use Dask at MO. However, my belief is that Dask's primary focus is on workflow and parallelisation at the application or sub-application level, whereas Cylc's focus is on the level above.

#1962 A Python API for configuring suite tasks will also help us integrate with Dask.

@TomekTrzeciak
Contributor

Another recent project aiming at this space is Portable Submission Interface for Jobs (PSI/J) - Python Library.
