main loop: break the lock-step - assimilate the main loop into coroutines #3498

oliver-sanders · 2020-02-07T00:21:23Z

Supersedes #3495 (which was the wrong way of looking at the problem - my bad)
Closely related to #3304
Related to #3497, #2123

The Current Situation:

At the moment the Scheduler main loop is a monolithic function:

The time the main loop takes to complete defines the responsiveness of the workflow to any action.
The implementation is imperative rather than event driven which means it requires iteration for event detection.

Some things aren't in lock-step with the main loop:

The SuiteRuntimeServer (Server) for instance.
The Publisher publishes in lock-step, but the ZMQ implementation means that the PUB-SUB system itself is not in lock-step with the main loop.
The ZMQ Curve authenticator runs in its own thread.
The SubProcPool is in lock-step but its processes, obviously, aren't.

We bridge this gap using queues. At present we have the following queues:

commands - For commands (e.g. user commands) received by the server.
ext_triggers - For old-style "ext" triggers.
message - For task-triggers and message-triggers.
SubProcPool.queuings/runnings - For queued/running subprocesses.
(Probably some others)
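To make the lock-step concrete, here is a minimal sketch (not the actual Scheduler code) of a loop that bridges to a separate thread through a queue. The `server` and `main_loop` functions, the command names and the sleep interval are all made up for illustration. The point is that anything arriving during the `time.sleep` has to wait for the next iteration.

```python
import queue
import threading
import time


def server(commands: queue.Queue) -> None:
    """Hypothetical stand-in for a component running in its own thread."""
    for name in ("hold", "release", "stop"):
        commands.put(name)
        time.sleep(0.01)


def main_loop(commands: queue.Queue, interval: float = 0.05) -> list:
    """Lock-step loop: the queue is only inspected once per iteration."""
    processed = []
    while True:
        # Drain whatever has arrived since the last iteration.
        while True:
            try:
                command = commands.get_nowait()
            except queue.Empty:
                break
            processed.append(command)
            if command == "stop":
                return processed
        # ... task pool iteration, health checks, DB writes would go here ...
        time.sleep(interval)  # nothing reacts to new events during this sleep


commands = queue.Queue()
thread = threading.Thread(target=server, args=(commands,))
thread.start()
result = main_loop(commands)
thread.join()
print(result)
```

The worst-case latency for any command here is roughly one full iteration plus the sleep interval, which is the responsiveness limit described above.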

Why Not #3495:

Making individual components of the main-loop asynchronous isn't actually going to be a meaningful improvement. It would allow us to run certain "chunks" of the main loop asynchronously (e.g. health checking) but not the main-loop as a whole. The IO component of these functions is very small so there is little to gain.

The real benefit to us of making the main loop asynchronous lies elsewhere.

Grandiose Long-Term Vision:

The real benefit to be gained from asynchronicity is breaking the main-loop lock-step, which allows the Scheduler to become event driven and removes the sleep() call.

Breaks the "global" lock-step of the main-loop.
The Scheduler becomes responsive on the time-scale of individual coroutines.
Allows code to become event-driven.

We assimilate the main-loop into event-driven coroutines. There is no "while true: do something, sleep" loop. There is no sleep statement at all.

The scheduler never sleeps when there is work to be done.
The scheduler always sleeps unless there is work to be done.
Monolithic updates (e.g. task pool iteration) become fast event-oriented operations which only care about a single update.
The pathway between cause and effect is minimised providing a massive boost to responsiveness.
The main-loop functionality is broken down into small, easy to write, easy to read coroutines which do one thing and do it well.
We can unit-test the hell out of coroutines and get coverage way up.
We can simulate impossible-to-reproduce bugs in unit-tests.
We can use integration tests with small groups of coroutines to eliminate most of the functional test battery.
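As a rough illustration of the testability point, a small coroutine that only talks to queues can be driven entirely from a test by feeding it inputs and reading its outputs. This is a sketch using `asyncio.Queue`; `count_messages` is a hypothetical coroutine invented for the example, not something in Cylc.

```python
import asyncio


async def count_messages(messages: asyncio.Queue, totals: asyncio.Queue) -> None:
    """Hypothetical single-purpose coroutine: emit a running total per message."""
    total = 0
    while True:
        await messages.get()
        total += 1
        await totals.put(total)


async def check_count_messages() -> list:
    messages, totals = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(count_messages(messages, totals))
    # The test fully controls the order and content of incoming events.
    for item in ("a", "b", "c"):
        await messages.put(item)
    results = [await asyncio.wait_for(totals.get(), timeout=1) for _ in range(3)]
    task.cancel()  # the coroutine loops forever, so the test stops it explicitly
    return results


results = asyncio.run(check_count_messages())
print(results)
```

Because the test decides exactly which events arrive and in what order, an awkward sequence of events can be replayed deterministically without running a whole workflow.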

How It Works:

We write a collection of coroutines to be run in the place of the imperative main loop.

A coroutine can be a producer and/or consumer of events. For example the task pool logic is a consumer of messages (from the server) and a producer of task events.

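async def task_pool(messages, task_events):
    # assumes both arguments behave like asyncio.Queue (get/put are awaitable)
    while True:
        # yield control to other coroutines until a message arrives
        message = await messages.get()
        # do something with the message (placeholder: pass it through as the event)
        task_event = message

        # yield control to other coroutines to process this new event (and other stuff)
        await task_events.put(task_event)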

Until a coroutine hits an await statement it is synchronous (i.e. blocking) so operations where consistency is important remain safe.
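A small sketch of that consistency guarantee, using plain asyncio and a made-up shared counter: a read-modify-write with no `await` inside it cannot be interleaved with other coroutines, whereas yielding in the middle of the update loses increments. The function names and the counter are illustrative only.

```python
import asyncio


async def safe_increment(state: dict, n: int) -> None:
    for _ in range(n):
        # Read-modify-write with no await in between: no other coroutine
        # can run here, so no update is lost.
        state["count"] = state["count"] + 1
        await asyncio.sleep(0)  # explicit yield point, after the update


async def unsafe_increment(state: dict, n: int) -> None:
    for _ in range(n):
        current = state["count"]
        await asyncio.sleep(0)  # yielding mid-update lets others interleave
        state["count"] = current + 1


async def run(worker, n_workers: int = 3, n: int = 100) -> int:
    state = {"count": 0}
    await asyncio.gather(*(worker(state, n) for _ in range(n_workers)))
    return state["count"]


safe_total = asyncio.run(run(safe_increment))
unsafe_total = asyncio.run(run(unsafe_increment))
print(safe_total, unsafe_total)
```

The practical rule this suggests is to keep each consistency-critical update between two `await` points rather than across one.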

At a very high-level the main coroutines might look something like this:

(Note: Trigger is an abstraction of task-triggers, message-triggers, ext-triggers and x-triggers)

Queues Processes And Threads:

The observant may have noticed in the above diagram two coroutines are both consuming items from the "task events" queue - which wouldn't work. This is because I'm using queue in the loosest-possible sense. Really coroutines desire a publisher-subscriber interface, if only we had a system in Cylc for this pattern ...oh wait ZMQ.

ZMQ can be used with in-proc communication to serve as a PUB-SUB queueing system for our coroutines.
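To show the pattern itself without pulling in ZMQ, here is a stdlib-only stand-in: a tiny broker that gives each subscriber its own `asyncio.Queue`, so every subscriber sees every event. The `Broker` class and the "db" / "publisher" subscriber names are hypothetical; an in-proc ZMQ PUB-SUB socket pair would play the same role in the real design.

```python
import asyncio


class Broker:
    """Hypothetical in-process PUB-SUB: every subscriber gets every event."""

    def __init__(self) -> None:
        self._subscribers = []

    def subscribe(self) -> asyncio.Queue:
        inbox = asyncio.Queue()
        self._subscribers.append(inbox)
        return inbox

    def publish(self, event) -> None:
        # Fan out: one copy of the event per subscriber inbox.
        for inbox in self._subscribers:
            inbox.put_nowait(event)


async def consumer(inbox: asyncio.Queue, n: int) -> list:
    return [await inbox.get() for _ in range(n)]


async def main() -> list:
    task_events = Broker()
    db_inbox = task_events.subscribe()
    publisher_inbox = task_events.subscribe()
    for event in ("foo.1 submitted", "foo.1 succeeded"):
        task_events.publish(event)
    return await asyncio.gather(consumer(db_inbox, 2), consumer(publisher_inbox, 2))


db_seen, publisher_seen = asyncio.run(main())
print(db_seen, publisher_seen)
```

With a single shared queue each event would be delivered to only one of the two consumers, which is the problem noted above.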

A very interesting side-effect is that once a coroutine is implemented in this way it could be run as an asynchronous coroutine using asyncio, but it could just as easily be run in its own thread, its own process, or even a remote process. This is a free benefit of the implementation with interesting potential, e.g. multi-processing speedup for large busy workflows, remote execution of xtriggers, etc.

How We Get There:

I can't tell yet how difficult this will be. It could turn out to be quite simple, as many of the queues are already in place.

It doesn't all have to be done at once. We can move code into "top-level" coroutines which run out of lock-step with the main loop (but in the same thread) one function at a time. Here is a suggested pathway:

1. Proof of concept
   - Migrate main loop plugins to a top-level coroutine running out of lock-step with the main loop.
   - Should be quick and simple.
2. XTriggers - xtriggers: re-implement as async functions #3497
   - More advanced
   - Involves queues
3. Workflow Commands
   - This involves interactions with the server which runs in its own thread.
   - Note: At present the command queue is processed multiple times per main-loop to make workflows more responsive to user commands.
4. ZMQ
   - POC for replacing queues with ZMQ
   - Should be fairly straight-forward as queue items will already be serialise-able.
   - But will involve string conversions.
5. Task Pool
   - Time to start on the hard stuff.
   - This is where we start to actually need ZMQ/PUB-SUB
6. The Rest

I would tentatively suggest that we should aim to get 1, 2 and 3 into Cylc8 as all three are on the pathway in other ways.

Beyond that the rest relates to event-driven scheduling which facilitates an efficient and responsive spawn-on-demand solution.
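For the proof-of-concept step (main loop plugins in a top-level coroutine), something along these lines might be enough. This is only a sketch: `main_loop`, `plugin_runner`, the intervals and the "health check" plugin are placeholders, and it assumes the existing loop can swap its blocking sleep for `asyncio.sleep`.

```python
import asyncio


async def main_loop(log: list, iterations: int = 3, interval: float = 0.05) -> None:
    """Stand-in for the existing lock-step loop, made awaitable."""
    for i in range(iterations):
        log.append(f"main loop {i}")
        # asyncio.sleep instead of time.sleep: other coroutines run meanwhile.
        await asyncio.sleep(interval)


async def plugin_runner(log: list, interval: float = 0.02) -> None:
    """Hypothetical top-level coroutine running plugins on its own schedule."""
    while True:
        log.append("health check")
        await asyncio.sleep(interval)


async def scheduler() -> list:
    log = []
    plugins = asyncio.create_task(plugin_runner(log))
    await main_loop(log)
    plugins.cancel()  # stop the plugin coroutine once the main loop exits
    return log


log = asyncio.run(scheduler())
print(log)
```

Both coroutines share one thread and one event loop, so the plugin interval is no longer tied to how long a main loop iteration takes.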

Hurdles:

This will remove the dependency on the order in which the main-loop proceeds:
- This should be harmless but there may be edge cases.
- This will likely break functional tests (potentially for invalid reasons).
We can prioritise items in queues, but I don't think we can prioritise coroutines themselves.
- At the moment we process server commands multiple times per main-loop for responsiveness.
- This effectively raises their prioritisation.
- I'm not sure we would have a mechanism for this with coroutines.
The guaranteed window of consistency will change:
- E.g. at the moment the main loop CANNOT proceed to the next iteration until DB writes have been performed.
- This narrows down the window of inconsistency to a very short period.
- With coroutines, unless we can prioritise DB writes, the window could potentially be longer, especially in busy workflows.
- The window might still be sufficiently short not to be an issue. There will always be a window after all.
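On prioritising items in queues: `asyncio.PriorityQueue` does this out of the box. A minimal sketch, with hypothetical priority levels where commands outrank task messages (the same idea could apply to DB write requests):

```python
import asyncio

COMMAND, MESSAGE = 0, 1  # hypothetical priorities; lower numbers are served first


async def main() -> list:
    events = asyncio.PriorityQueue()
    incoming = [
        (MESSAGE, "foo.1 started"),
        (MESSAGE, "foo.1 succeeded"),
        (COMMAND, "hold"),
    ]
    # Entries are (priority, sequence number, payload). The sequence number
    # keeps FIFO order within a priority level and avoids comparing payloads.
    for seq, (priority, payload) in enumerate(incoming):
        events.put_nowait((priority, seq, payload))
    order = []
    while not events.empty():
        _, _, payload = await events.get()
        order.append(payload)
    return order


order = asyncio.run(main())
print(order)  # ['hold', 'foo.1 started', 'foo.1 succeeded']
```

This only reorders work waiting in a queue. It does not let one coroutine pre-empt another, so the open question about prioritising coroutines themselves still stands.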


dwsutherland · 2020-02-07T04:10:48Z

Useful/Vital info/reference for this development:
https://docs.python.org/3/library/asyncio-dev.html

In particular, the IO blocking bits:
https://docs.python.org/3/library/asyncio-dev.html#running-blocking-code
and referenced:
https://docs.python.org/3/library/asyncio-eventloop.html#executing-code-in-thread-or-process-pools

With this in mind, it's not necessary to use PUB/SUB (although it is sufficient). Blocking bits like socket.recv can just wait in a separate thread/process for the send (using REQ/REP, or ROUTER/DEALER), or loop with the async recv no-wait option (depending on the pattern).
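A quick sketch of the "wait in a separate thread" option from the asyncio docs linked above, using `loop.run_in_executor`. Here `blocking_recv` is a made-up stand-in for a blocking call such as socket.recv, and `heartbeat` is only there to show the event loop staying responsive while the call blocks.

```python
import asyncio
import time


def blocking_recv() -> str:
    """Hypothetical stand-in for a blocking call such as socket.recv."""
    time.sleep(0.1)
    return "message"


async def heartbeat(ticks: list) -> None:
    """Runs on the event loop to show it is not blocked by the receive."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main() -> tuple:
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    message = await loop.run_in_executor(None, blocking_recv)
    beat.cancel()
    return message, len(ticks)


message, n_ticks = asyncio.run(main())
print(message, n_ticks)
```

The heartbeat keeps ticking during the 0.1 s receive, which would not happen if `blocking_recv` were called directly inside a coroutine.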

oliver-sanders added the efficiency label (for notable efficiency improvements) Feb 7, 2020

oliver-sanders added this to the cylc-9 milestone Feb 7, 2020

oliver-sanders mentioned this issue Feb 7, 2020

make main loop more asynchronous #3495

Closed

oliver-sanders mentioned this issue Jul 15, 2020

zmq: push inferface for suite status #3330

Open

oliver-sanders mentioned this issue Jul 26, 2022

subprocpool: convert to async #5017

Open
