main loop: break the lock-step - assimilate the main loop into coroutines #3498

Open
oliver-sanders opened this issue Feb 7, 2020 · 1 comment
Labels
efficiency For notable efficiency improvements
Milestone

Comments

@oliver-sanders
Member

Supersedes #3495 (which was the wrong way of looking at the problem - my bad)
Closely related to #3304
Related to #3497, #2123

The Current Situation:

At the moment the Scheduler main loop is a monolithic function:

  • The time the main loop takes to complete defines the responsiveness of the workflow to any action.
  • The implementation is imperative rather than event driven which means it requires iteration for event detection.

Some things aren't in lock-step with the main loop:

  • The SuiteRuntimeServer (Server) for instance.
  • The Publisher publishes in lock-step, but the ZMQ implementation means that the PUB-SUB system itself is not in lock-step with the main loop.
  • The ZMQ Curve authenticator runs in its own thread.
  • The SubProcPool is in lock-step but its processes, obviously, aren't.

We bridge this gap using queues. At present we have the following:

  • commands - For commands (e.g. user commands) received by the server.
  • ext_triggers - For old-style "ext" triggers.
  • message - For task-triggers and message-triggers.
  • SubProcPool.queuings/runnings - For queued/running subprocesses.
  • (Probably some others)

Why Not #3495:

Making individual components of the main-loop asynchronous isn't actually going to be a meaningful improvement. It would allow us to run certain "chunks" of the main loop asynchronously (e.g. health checking) but not the main-loop as a whole. The IO component of these functions is very small so there is little to gain.

The real benefit to us of making the main loop asynchronous lies elsewhere.

Grandiose Long-Term Vision:

The real benefit to be gained from asynchronicity is breaking the main-loop lock-step, which allows the Scheduler to become event driven and lets us remove the sleep() call.

  • Breaks the "global" lock-step of the main-loop.
  • The Scheduler becomes responsive on the time-scale of individual coroutines.
  • Allows code to become event-driven.

We assimilate the main-loop into event-driven coroutines. There is no "while true: do something, sleep" loop. There is no sleep statement at all.

  • The scheduler never sleeps when there is work to be done.
  • The scheduler always sleeps unless there is work to be done.
  • Monolithic updates (e.g. task pool iteration) become fast event-oriented operations which only care about a single update.
  • The pathway between cause and effect is minimised providing a massive boost to responsiveness.
  • The main-loop functionality is broken down into small, easy to write, easy to read coroutines which do one thing and do it well.
  • We can unit-test the hell out of coroutines and get coverage way up.
  • We can simulate impossible-to-reproduce bugs in unit-tests.
  • We can use integration tests with small groups of coroutines to eliminate most of the functional test battery.

How It Works:

We write a collection of coroutines to be run in the place of the imperative main loop.

A coroutine can be a producer and/or consumer of events. For example the task pool logic is a consumer of messages (from the server) and a producer of task events.

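import asyncio

async def task_pool(messages: asyncio.Queue, task_events: asyncio.Queue):
    # (sketch: assumes asyncio.Queue-like objects for both arguments)
    while True:
        # yield control to other coroutines until a message arrives
        message = await messages.get()
        # do something (placeholder: turn the message into a task event)
        task_event = message

        # yield control to other coroutines to process this new event (and other stuff)
        await task_events.put(task_event)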

Until a coroutine hits an await statement it is synchronous (i.e. blocking) so operations where consistency is important remain safe.
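To make the producer/consumer idea a bit more concrete, here is a minimal, runnable sketch of three coroutines chained by queues. It is only an illustration: the `server` and `event_handler` coroutines, the message strings and the `None` sentinel are all hypothetical, and plain `asyncio.Queue` objects are used as a stand-in for whatever queueing interface we end up with.

```python
import asyncio


async def server(messages: asyncio.Queue) -> None:
    """Hypothetical producer standing in for the server receiving messages."""
    for text in ("foo.1 started", "foo.1 succeeded"):
        await messages.put(text)
    await messages.put(None)  # illustrative sentinel: no more messages


async def task_pool(messages: asyncio.Queue, task_events: asyncio.Queue) -> None:
    """Consumer of messages and producer of task events."""
    while True:
        message = await messages.get()  # suspended here until work arrives
        if message is None:
            await task_events.put(None)  # pass the shutdown signal downstream
            return
        # Code between two awaits runs without other coroutines interleaving.
        await task_events.put(f"event: {message}")


async def event_handler(task_events: asyncio.Queue, handled: list) -> None:
    """Hypothetical consumer of task events."""
    while True:
        event = await task_events.get()
        if event is None:
            return
        handled.append(event)


async def run_pipeline() -> list:
    messages, task_events = asyncio.Queue(), asyncio.Queue()
    handled: list = []
    await asyncio.gather(
        server(messages),
        task_pool(messages, task_events),
        event_handler(task_events, handled),
    )
    return handled


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

Note there is no sleep anywhere: each coroutine is suspended at its `await` until there is something for it to do.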

At a very high-level the main coroutines might look something like this:

(Note: Trigger is an abstraction of task-triggers, message-triggers, ext-triggers and x-triggers)

[Diagram: the main coroutines and the queues that connect them]

Queues Processes And Threads:

The observant may have noticed that in the above diagram two coroutines are both consuming items from the "task events" queue, which wouldn't work. This is because I'm using "queue" in the loosest-possible sense. What coroutines really want is a publisher-subscriber interface, if only we had a system in Cylc for this pattern ...oh wait, ZMQ.

ZMQ can be used with in-proc communication to serve as a PUB-SUB queueing system for our coroutines.
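The difference between a plain queue and PUB-SUB can be sketched without ZMQ. The `Broadcast` class below is a hypothetical, stdlib-only stand-in (it is not ZMQ and not an existing Cylc class): every subscriber gets its own `asyncio.Queue`, so each published item reaches all consumers rather than just whichever one calls `get()` first.

```python
import asyncio


class Broadcast:
    """Hypothetical in-process PUB-SUB channel (a stand-in for a PUB socket)."""

    def __init__(self) -> None:
        self._subscribers = []  # one asyncio.Queue per subscriber

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers.append(queue)
        return queue

    async def publish(self, item) -> None:
        # Fan out: every subscriber receives its own copy of the reference.
        for queue in self._subscribers:
            await queue.put(item)


async def consume(inbox: asyncio.Queue, seen: list) -> None:
    while True:
        item = await inbox.get()
        if item is None:  # illustrative sentinel: stop consuming
            return
        seen.append(item)


async def demo_broadcast() -> tuple:
    task_events = Broadcast()
    seen_a: list = []
    seen_b: list = []
    consumers = [
        asyncio.create_task(consume(task_events.subscribe(), seen_a)),
        asyncio.create_task(consume(task_events.subscribe(), seen_b)),
    ]
    for event in ("submitted", "started", "succeeded", None):
        await task_events.publish(event)
    await asyncio.gather(*consumers)
    return seen_a, seen_b


if __name__ == "__main__":
    print(asyncio.run(demo_broadcast()))
```

With a single shared queue the two consumers would split the events between them; here both see all three.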

A very interesting side-effect is that once a coroutine is implemented in this way it can be run as an asynchronous coroutine using asyncio, but it can just as easily be run in its own thread, its own process, or even a remote process. This is a free benefit of the implementation with interesting potential, e.g. multi-processing speedup for large busy workflows, remote execution of xtriggers, etc.

How We Get There:

I can't tell yet how difficult this will be. It could turn out to be quite simple, as many of the queues are already in place.

It doesn't all have to be done at once. We can move code into "top-level" coroutines which run out of lock-step with the main loop (but in the same thread) one function at a time. Here is a suggested pathway:

  1. Proof of concept

    • Migrate main loop plugins to a top-level coroutine running out of lock-step with the main loop.
    • Should be quick and simple.
  2. XTriggers - xtriggers: re-implement as async functions #3497

    • More advanced
    • Involves queues
  3. Workflow Commands

    • This involves interactions with the server which runs in its own thread.
    • Note: At present the command queue is processed multiple times per main-loop to make workflows more responsive to user commands.
  4. ZMQ

    • POC for replacing queues with ZMQ
    • Should be fairly straightforward as queue items will already be serialisable.
    • But will involve string conversions.
  5. Task Pool

    • Time to start on the hard stuff.
    • This is where we start to actually need ZMQ/PUB-SUB
  6. The Rest

I would tentatively suggest that we should aim to get 1, 2 and 3 into Cylc8 as they are already on the pathway for other reasons.

Beyond that the rest relates to event-driven scheduling which facilitates an efficient and responsive spawn-on-demand solution.
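As a rough illustration of step 1 of the pathway, the sketch below runs a plugin-style coroutine alongside a stand-in main loop in the same thread, each on its own interval. Everything here is hypothetical (the `health_check` plugin, the intervals, the `log` list); it assumes the existing main loop body stays synchronous and only its sleep is swapped for `asyncio.sleep`.

```python
import asyncio
import contextlib


async def main_loop(iterations: int, log: list) -> None:
    """Stand-in for the existing imperative main loop."""
    for i in range(iterations):
        log.append(f"main {i}")    # ... existing synchronous loop body ...
        await asyncio.sleep(0.01)  # the loop's sleep now yields to other coroutines


async def health_check(interval: float, log: list) -> None:
    """Hypothetical plugin running on its own interval, out of lock-step."""
    while True:
        log.append("health check")
        await asyncio.sleep(interval)


async def run_scheduler(iterations: int = 3) -> list:
    log: list = []
    plugin = asyncio.create_task(health_check(0.005, log))
    await main_loop(iterations, log)
    # Shut the plugin down once the main loop has finished.
    plugin.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await plugin
    return log


if __name__ == "__main__":
    print(asyncio.run(run_scheduler()))
```

The plugin no longer waits for a main-loop iteration to come round; it only needs the main loop to reach an `await`.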

Hurdles:

  • This will remove the dependency on the order in which the main-loop proceeds:
    • This should be harmless but there may be edge cases.
    • This will likely break functional tests (potentially for invalid reasons).
  • We can prioritise items in queues, but I don't think we can prioritise coroutines themselves.
    • At the moment we process server commands multiple times per main-loop for responsiveness.
    • This effectively raises their prioritisation.
    • I'm not sure we would have a mechanism for this with coroutines.
  • The guaranteed window of consistency will change:
    • E.g. at the moment the main loop CANNOT proceed to the next iteration until DB writes have been performed.
    • This narrows down the window of inconsistency to a very short period.
    • With coroutines, unless we can prioritise DB writes, the window could potentially be longer, especially in busy workflows.
    • The window might still be sufficiently short not to be an issue. There will always be a window, after all.
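On the point about prioritising items in queues: a minimal sketch using the stdlib `asyncio.PriorityQueue`, assuming commands and task messages shared a single queue. The priority constants and the item strings are hypothetical. This only orders items within one queue; it does not prioritise one coroutine over another.

```python
import asyncio

# Hypothetical priority levels: lower numbers are handled first.
COMMAND, TASK_MESSAGE = 0, 1


async def drain(queue: asyncio.PriorityQueue) -> list:
    """Pull everything currently queued, highest priority first."""
    handled = []
    while not queue.empty():
        _priority, _seq, item = await queue.get()
        handled.append(item)
    return handled


async def demo_priority() -> list:
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    entries = [
        (TASK_MESSAGE, "foo.1 started"),
        (TASK_MESSAGE, "bar.1 started"),
        (COMMAND, "hold"),
    ]
    # The sequence number keeps arrival order within a priority level
    # and means the payloads themselves never need to be compared.
    for seq, (priority, item) in enumerate(entries):
        queue.put_nowait((priority, seq, item))
    return await drain(queue)


if __name__ == "__main__":
    print(asyncio.run(demo_priority()))
```

Here the command jumps ahead of the two task messages even though it arrived last.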
@oliver-sanders oliver-sanders added the efficiency For notable efficiency improvements label Feb 7, 2020
@oliver-sanders oliver-sanders added this to the cylc-9 milestone Feb 7, 2020
@dwsutherland
Member

Useful/Vital info/reference for this development:
https://docs.python.org/3/library/asyncio-dev.html

In particular, the IO blocking bits:
https://docs.python.org/3/library/asyncio-dev.html#running-blocking-code
and referenced:
https://docs.python.org/3/library/asyncio-eventloop.html#executing-code-in-thread-or-process-pools

With this in mind, it's not necessary to use PUB/SUB (although it would be sufficient): blocking bits like socket.recv can just wait in a separate thread/process for the send (using REQ/REP or ROUTER/DEALER), or loop with the async recv no-wait option (depending on the pattern).
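A minimal sketch of the "wait in a separate thread" option, using `loop.run_in_executor` from the asyncio docs linked above. The `blocking_recv` function is a hypothetical stand-in for a blocking call such as socket.recv (it just sleeps), and `heartbeat` is only there to show that other coroutines keep running in the meantime.

```python
import asyncio
import time


def blocking_recv() -> str:
    """Hypothetical stand-in for a blocking call such as socket.recv."""
    time.sleep(0.2)
    return "message"


async def heartbeat(ticks: list) -> None:
    """Keeps running on the event loop while the blocking call waits."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def demo_executor() -> tuple:
    loop = asyncio.get_running_loop()
    ticks: list = []
    # Passing None selects the loop's default thread pool executor.
    received, _ = await asyncio.gather(
        loop.run_in_executor(None, blocking_recv),
        heartbeat(ticks),
    )
    return received, ticks


if __name__ == "__main__":
    print(asyncio.run(demo_executor()))
```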
