
RFC: Asynchronous Comm Messaging #433

Open
twavv opened this issue Apr 16, 2019 · 42 comments

@twavv (Contributor) commented Apr 16, 2019

Proposal

I'd like to add an optional flag to comm messages that would allow them to be processed asynchronously.

It would change comm data messages to look like

{
  'comm_id' : 'u-u-i-d',
  'data' : {},
  'async': false
}

where 'async': false would be optional/the default. I don't think the open/close messages need to be async.

This would allow comms, when using kernels that support this attribute, to send messages that are processed even while code is executing in the kernel. It would be up to the comm to play nice and avoid race conditions.
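To make the dispatch concrete, here is a minimal sketch in Python (since the reference kernel, ipykernel, is Python) of how a kernel honoring the opt-in flag might route messages. The names `dispatch`, `serial_queue`, and `handler` are hypothetical, not existing kernel API:

```python
import asyncio

def is_async(msg):
    # The flag is optional; a missing flag means False, i.e. today's behavior.
    return msg.get("content", {}).get("async", False)

async def dispatch(msg, serial_queue, handler):
    """Route one shell message: opted-in comm messages jump the queue."""
    if is_async(msg):
        # Opt-in path: handle immediately, even mid-execution.
        await handler(msg)
    else:
        # Default path: queue behind pending execute_requests, as today.
        await serial_queue.put(msg)
```

A kernel that does not understand the flag simply ignores it and queues everything, which is why the proposal would be non-breaking.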

Use Case

I'm trying to implement a way to get data back from the browser synchronously when using WebIO.jl. Consider this code.

s = Scope()
display(s(node(:p, "Hello, world!")))
fetch(evaljs(s, js"navigator.platform"))

Using frontends other than IJulia/Jupyter Notebook/Jupyter Lab, you get a result:

Dict{String,Any} with 4 entries:
  "requestId" => "10997170149360157050"
  "request"   => "eval"
  "type"      => "response"
  "result"    => "Linux x86_64"

Using IJulia, it hangs forever. This is roughly what happens.

  • The frontend sends the chunk of code to execute to IJulia
  • IJulia begins executing (kernel becomes busy)
  • IJulia sends a comm message (resulting from evaljs)
  • We fetch the future returned by evaljs
  • WebIO waits for a response on the comm
  • The browser executes the JavaScript code and sends the response to the comm

but then! IJulia can't process the message from the comm because it's still busy, waiting for the evaljs future to resolve, and the Jupyter protocol dictates that messages should be handled serially. So the response from the browser sits in the queue forever, is never processed, and so we end up waiting forever for the fetch to complete (or until the user interrupts the kernel).
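The deadlock can be shown with a toy model of a serial shell queue (pure Python, nothing IJulia-specific): the execute handler waits for a flag that only the next queued handler would set, but that handler cannot run until the first one returns.

```python
from collections import deque

state = {"comm_reply_seen": False}

def execute_request():
    # Models the cell containing fetch(evaljs(...)): it cannot finish
    # until the browser's comm reply has been processed.
    for _ in range(1000):
        if state["comm_reply_seen"]:
            return "result"
    return "hung"

def comm_reply():
    # Models the browser's response arriving on the comm.
    state["comm_reply_seen"] = True

queue = deque([execute_request, comm_reply])

# Serial processing: a handler must return before the next one runs,
# so comm_reply never executes and the cell hangs.
outcome = queue.popleft()()
```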

Semantics

To be non-breaking, respecting the async keyword must be completely optional. It would be up to the comm to know whether or not its messages will be executed asynchronously (e.g. by checking the version of the kernel they're running under) and working either way. Importantly, this could be implemented right now and all kernels would still be compliant with the latest version of the protocol.

@twavv (Contributor, Author) commented Apr 16, 2019

(ping @minrk - don't mean to bug but would appreciate a cursory glance)

@twavv (Contributor, Author) commented Jul 25, 2019

Ping @minrk @ellisonbg @Carreau @gnestor @jasongrout

Would just love to get some eyes on this. :^)

@Carreau (Member) commented Jul 26, 2019

I'm not super aware of how comms are working, so I'll defer to those with more experience. @SylvainCorlay, @martinRenou, @maartenbreddels.

@SylvainCorlay (Member)

Unfortunately, it is not really possible to add this async flag, because it would break the contract that shell channel messages are processed in order by the kernel.

A good way to wait for user interaction with comms may be to do the same as what is done for Jupyter interactive widgets. See e.g. https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Asynchronous.html#Waiting-for-user-interaction
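For reference, the recipe in that ipywidgets example looks roughly like this. This is a self-contained sketch: `FakeWidget` is a stand-in for a real ipywidget so the snippet runs without a kernel, while `wait_for_change` follows the shape shown in the linked docs.

```python
import asyncio

class FakeWidget:
    """Stand-in for an ipywidget with observe/unobserve semantics."""
    def __init__(self):
        self._observers = []
        self.value = None

    def observe(self, callback, name):
        self._observers.append(callback)

    def unobserve(self, callback, name):
        self._observers.remove(callback)

    def set_value(self, v):
        # Simulates a comm message from the frontend updating the widget.
        self.value = v
        for cb in list(self._observers):
            cb(type("Change", (), {"new": v})())

def wait_for_change(widget, name):
    # Resolve a future on the widget's next change, then stop observing.
    future = asyncio.get_running_loop().create_future()
    def getvalue(change):
        future.set_result(change.new)
        widget.unobserve(getvalue, name)
    widget.observe(getvalue, name)
    return future

async def demo():
    w = FakeWidget()
    fut = wait_for_change(w, "value")
    # Awaiting frees the event loop, so the "comm message" can run:
    asyncio.get_running_loop().call_soon(w.set_value, 42)
    return await fut
```

The key point is that the coroutine awaiting the future releases the event loop, which is what lets the kernel process the comm message that eventually resolves it.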

@twavv (Contributor, Author) commented Jul 26, 2019

Is that an absolutely hard requirement? This would be an opt-in kind of thing, so if the client doesn't support it, it would work exactly the same.

I'm also not clear how the ipywidgets example works - does the sequence of events look like

  • Frontend sends code request, starts executing
  • Code awaits future
  • Frontend sends update widget request
  • Task is resumed
  • Code result is sent?

Because if so, isn't that violating the rule that the shell handles messages in order? (It handles the comm data message out of order, at least with respect to when the requests finish.)

@SylvainCorlay (Member)

Is that an absolutely hard requirement? This would be an opt-in kind of thing, so if the client doesn't support it, it would work exactly the same.

Yes, it is a requirement. Comm messages are queued with execution requests. In your example, you need to release the current execution request to process the comm message.

@SylvainCorlay (Member)

Because if so isn't that violating that the shell handles messages in order (it handles the comm data message out of order, at least wrt when the requests finish).

We consider that the first request has "finished processing" the execution request as soon as it awaits the future.

@twavv (Contributor, Author) commented Aug 1, 2019

Ah - I didn't look at the ipywidgets example closely. I understand.

Yes, it is a requirement. Comm messages are queued with execution requests.

Isn't this just an implementation detail of ipykernel? I don't see why it needs to be baked into the kernel protocol itself and allowing it could enable the kind of functionality in that example in a significantly less hacky manner (i.e. dropping the use of ensure_future to do something after the request finishes - sort of akin to using async/await rather than callbacks in JS land).

The current status quo means that any code would have to essentially be wrapped in an async def and then ensure_future'd which I think is way less than ideal (people writing the code in the notebook shouldn't have to worry about that kind of thing).
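The status-quo workaround being described looks roughly like this (a sketch of the pattern only; none of these names come from ipykernel):

```python
import asyncio

async def long_running_work(updates):
    # Pretend long-running job, e.g. training a model.
    for step in range(3):
        updates.append(step)
        await asyncio.sleep(0)  # yield so other handlers could run

async def cell():
    updates = []
    # The notebook cell schedules the work and returns immediately, so
    # Jupyter marks execution complete while the work is still pending.
    task = asyncio.ensure_future(long_running_work(updates))
    assert updates == []   # the "cell" finished before any work ran
    await task             # the event loop completes it afterwards
    return updates
```

The cost is exactly the one described above: execution-state reporting and run-cells-in-order semantics no longer reflect the real work.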

@SylvainCorlay (Member)

It is specified in the protocol in that both execution requests and comm messages go through the shell channel, and are therefore processed in order. Hence, any kernel properly implementing the protocol will have that queuing behavior even when they don't have the same concurrency model as ipykernel.

@SylvainCorlay (Member)

Note: There are other channels in the protocol, such as the control channel, which is used for e.g. shutdown and interrupt messages, and done in a way that the shutdown request is not queued behind execution requests. We now use the control channel for debug messages so that we can e.g. add a breakpoint to a loop while code is running and have the debugger interrupt the execution at the next iteration. (Note that for that, we had to re-write a kernel which uses threading instead of an event loop.)

But I really don't think it is appropriate at all for user messages.

@SylvainCorlay (Member)

Your pain point here is due to the different paradigms between code running as the result of an execution request and WebIO code, which seems to be in the main execution flow.

@twavv (Contributor, Author) commented Aug 1, 2019

It is specified in the protocol in that both execution requests and comm messages go through the shell channel, and are therefore processed in order. Hence, any kernel properly implementing the protocol will have that queuing behavior even when they don't have the same concurrency model as ipykernel.

Fair enough but what I'm proposing is opt-in (via some attribute on the message itself) which means that kernels that don't understand this feature would continue to be compliant (that is - the spec would say that honoring the async flag is optional and dependent on the kernel) and the existing behavior would be preserved for all existing code that doesn't explicitly set the async flag.

Your pain point here is due to the different paradigmes between code running as the result of an execution request and webio code which seem to be in the main execution flow.

I'm not sure what you mean by this. The particular use case would be for something like training a machine learning model where you could switch between viewing the error and accuracy plots or update some other parameter while a process is ongoing.

I think the widgets example that you linked to is a prime example of this use case where you want to wait for input before doing something which should be considered part of the currently executing code-execution request.

@jasongrout (Member)

It sounds like you may want some sort of new introspection messages - probably on the control channel, like the new debugging messages.

@SylvainCorlay (Member)

It is specified in the protocol in that both execution requests and comm messages go through the shell channel, and are therefore processed in order. Hence, any kernel properly implementing the protocol will have that queuing behavior even when they don't have the same concurrency model as ipykernel.

Fair enough but what I'm proposing is opt-in (via some attribute on the message itself) which means that kernels that don't understand this feature would continue to be compliant (that is - the spec would say that honoring the async flag is optional and dependent on the kernel) and the existing behavior would be preserved for all existing code that doesn't explicitly set the async flag.

Messages on a socket are processed in order. An "async" attribute in the content cannot really change that!

Your pain point here is due to the different paradigms between code running as the result of an execution request and webio code which seem to be in the main execution flow.

I'm not sure what you mean by this.

I mean that with the Jupyter protocol, if you want a long-running process to regularly send updates to (or get content from) the front-end, it needs to adopt some concurrency strategy to avoid blocking, either by using the kernel event loop or running in a thread.

@SylvainCorlay (Member) commented Aug 1, 2019

new introspection messages - probably on the control channel, like the new debugging messages.

You can't really do that, because you would need to start imposing a concurrency model on the control channel with respect to the shell channel, while it is not constrained at the moment.

Kernels based on event loops (such as ipykernel, or the slicer3d xeus-based kernel) would not be compliant anymore, in that you would still need the currently processed message to complete before processing the next one...

@jasongrout (Member)

You could ask on the control channel "Give me the value of this variable as of now" where "now" means whenever the kernel can process that message. No guarantees about when that is, but the kernel does its best effort as soon as possible. Just like shutdown and interrupt messages.

@SylvainCorlay (Member)

You could ask on the control channel "Give me the value of this variable as of now" where "now" means whenever the kernel can process that message. No guarantees about when that is, but the kernel does its best effort as soon as possible. Just like shutdown and interrupt messages.

OK, that makes sense. Although it would still not solve @travigd's problem.

@jasongrout (Member)

You could also imagine a debugging-type message that would stop the main thread, change a value, and keep running, right?

@SylvainCorlay (Member)

You could also imagine a debugging-type message that would stop the main thread, change a value, and keep running, right?

Absolutely, although debugging kernels can only interrupt / break into running code if they have a threading concurrency model (like xeus-python), not an event loop model.

And I don't think it would be a sensible way for people to build widgets-based UIs, they will get crazy race conditions.

@twavv (Contributor, Author) commented Aug 2, 2019

It sounds like you may want some sort of new introspection messages - probably on the control channel, like the new debugging messages.

It seems to me like this should be solved on the shell level, and I'm really just trying to use comms (I don't really wanna reach into the control socket too and WebIO itself doesn't really have a concept of separate channels - it models communication as just being a single bidirectional pipe).

Messages on a socket are processed in order. An "async" attribute in the content cannot really change that!

My impression was that the shell socket can handle messages out of order because it's a router ZMQ socket, though I might be wrong.

@SylvainCorlay (Member) commented Aug 2, 2019

My impression was that the shell socket can handle messages out of order because it's a router ZMQ socket, though I might be wrong.

I don't think you can handle them out of order, but regardless, it only processes one message at a time.

Even if the response from the front-end is first in the queue, in your example it is queued until after the execution request has completed, as you described in the initial message.

@twavv (Contributor, Author) commented Aug 2, 2019

it only processes one message at a time

I'm not 100% clear what "it" is referring to (the ZMQ socket, the kernel, something else?), but I don't think this is a technical limitation as far as the socket is concerned. The ZMQ socket can have multiple requests in flight and respond to them in any order. This is a property of the dealer/router setup (which is fully asynchronous and doesn't inherently impose any request-reply semantics).

The kernel (as far as I understand) is currently set up so that requests are handled serially (i.e., the current request must finish before the next request can begin), but this doesn't seem to be a Jupyter protocol limitation.

@twavv (Contributor, Author) commented Aug 2, 2019

Which is to say that, given a router socket "shell" and a dealer socket "frontend", it's valid (from a ZMQ POV and subject to the "as far as I understand" and "it worked for me when I tried it in a REPL" qualifications) for this to occur:

  • frontend sends request 100 to the shell
  • frontend sends request 101 to the shell
  • shell replies with response to 101
  • shell replies with response to 100

(though in the specific case I'm advocating for, the shell wouldn't actually reply to 101 because it would be a comm message which doesn't generate shell replies).
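The claim about router sockets can be modeled without ZMQ at all (pure-Python sketch; a real experiment would use pyzmq's `zmq.ROUTER`/`zmq.DEALER` sockets): replies are matched to peers by identity envelope, not by arrival order.

```python
# A ROUTER socket's view of in-flight requests: each message arrives
# with an identity envelope, and a reply is addressed by that envelope.
in_flight = {}

def recv(envelope, msg_id, body):
    in_flight[msg_id] = (envelope, body)

def reply(msg_id, result):
    envelope, _ = in_flight.pop(msg_id)
    return (envelope, msg_id, result)  # routed back to that peer

recv(b"frontend", 100, "execute_request")
recv(b"frontend", 101, "comm_msg")

# Nothing at the socket layer forces 100 to be answered first:
first = reply(101, "handled")
second = reply(100, "ok")
```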

@twavv (Contributor, Author) commented May 12, 2020

Bump

This is still causing lots of issues. :')

@MSeal (Contributor) commented May 20, 2020

Sorry, I wasn't really tied into this thread originally. Can I ask why the front-end doesn't buffer sending request 101 to the shell until request 100 has completed? In Jupyter, the clients typically need to control the flow of requests to match the kernel's state responses, rather than the other way around. I also might be misunderstanding the issue, so forgive me if that's off the mark.

@jasongrout (Member)

Often a frontend will batch a list of requests to the server. For example, executing all cells in a notebook immediately sends all execution requests to the server. You can then close your notebook, go home, etc., and come back later and have all executions done.

@twavv (Contributor, Author) commented May 20, 2020

Sorry, I wasn't really tied into this thread originally. Can I ask why the front-end doesn't buffer sending request 101 to the shell until request 100 has completed? In Jupyter, the clients typically need to control the flow of requests to match the kernel's state responses, rather than the other way around. I also might be misunderstanding the issue, so forgive me if that's off the mark.

Jason is right, and in general, a kernel can have a large queue of messages waiting to be processed.

In my case, that's exactly what I don't want to happen. I want to be able to have a comm message be handled while something else is occupying the "main" control flow (e.g., update a parameter for a plot while a network is training and have the plot reflect that change).

@jasongrout (Member)

Note that comm messages can have side effects, can generate output on the iopub channel, can spawn return comm messages on the shell channel, etc. They are essentially restricted execute request messages.

@twavv (Contributor, Author) commented May 20, 2020

Sure, but why is that a concern here? It would be up to the implementing kernel to make sure the resulting messages have the correct parentId set in the IOPub message headers to indicate whether or not the message resulted from the comm.

@jasongrout (Member) commented May 20, 2020

Side effects means that it can affect kernel state, which affects later execution requests. If you have request A, comm message, request B, the execution of request B can change depending on what happens in the comm message processing.

They are essentially restricted execute request messages.

This is essentially why we order them with execute messages - they are essentially execution messages, just to a specific object in the kernel rather than the kernel itself.
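A tiny example makes the ordering point concrete (toy namespace; the handler names are hypothetical): reordering a state-mutating comm message around an execute request changes the request's result.

```python
namespace = {"epochs": 10}

def comm_msg():
    # A widget interaction that mutates kernel state.
    namespace["epochs"] = 5

def request_b():
    # An execute request whose result depends on that state.
    return namespace["epochs"] * 2

# In order: the comm message is processed before request B.
comm_msg()
in_order = request_b()       # sees epochs == 5

# Reordered: request B runs first and sees the old value.
namespace["epochs"] = 10
reordered = request_b()      # sees epochs == 10
comm_msg()
```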

@jasongrout (Member)

Part of the issue here is that comm messages can invoke arbitrary code execution.

What is the underlying thing you are trying to accomplish? Is there a new message that is side-effect free that could be implemented, which could be reordered with execution messages?

The debug messages on the control channel give some examples of messages that can 'jump the queue' and are executed immediately, and could involve requests for variable state, which would be side-effect free.

@twavv (Contributor, Author) commented May 20, 2020

I don't want to skip the side effects, I want to execute arbitrary code.

I don't fully understand why side effects are a bad thing here.

the execution of request B can change depending on what happens in the comm message processing.

Why is that bad? One thing that comes to mind is that it enables notebooks that might not be reproducible by others, but that's not always a goal of people that are using notebooks (and there are lots of ways to make your notebooks non-reproducible besides this).

@jasongrout (Member)

Why is that bad? One thing that comes to mind is that it enables notebooks that might not be reproducible by others, but that's not always a goal of people that are using notebooks (and there are lots of ways to make your notebooks non-reproducible besides this).

It's making notebooks not reproducible by you in consecutive runs either. It's deliberately introducing a race condition where there wasn't one before. That doesn't make it "evil", but it does fundamentally change the contract we've had between frontends and backends that things are processed in order, which is a pretty fundamental assumption in the Jupyter protocol. If we are rolling that assumption back, why not make the async field available on execute requests as well? That's effectively what we would be doing.

@MSeal (Contributor) commented May 20, 2020

It's making notebooks not reproducible by you in consecutive runs either. It's deliberately introducing a race condition where there wasn't one before.

👍 to that statement. If you're trying to execute independent snippets that are completely unrelated, which would be the only time this is safe to violate, maybe they shouldn't be in the same notebook / kernel execution queue.

@twavv (Contributor, Author) commented May 20, 2020

fundamentally change the contract we've had between frontends and backends

The actual change I'm proposing makes this functionality opt-in. Both the frontend and backend would have to opt-in. The frontend by setting an async flag and the backend by honoring it.

It's making notebooks not reproducible by you in consecutive runs either

Just because something could be misused, I don't think that's a terribly strong argument against it. I don't think it should be used to direct a computation in any meaningful way, for example.

Really, the kinds of things I'm envisioning are:

  • Being able to switch "views" of a network's training while it's in progress (whereas this isn't currently possible since the comm msg to change the view won't be handled until the network is done training)
  • Being able to inspect things in real-time. I don't mean code debugging, but rather... in a course I help run, we do the pretty standard "train a network on the MNIST data set" and at the end, the students can draw their own digits and see that they get classified correctly. It'd be cool™ to enable that while the thing is in progress too.

As an aside, thank both of you for engaging in this discussion. Obviously I'm here to champion what I think would be a pretty neat™ addition to the Jupyter protocol, but I understand that my motivations are different than yours. :^)

@MSeal (Contributor) commented May 20, 2020

Being able to switch "views" of a network's training while it's in progress (whereas this isn't currently possible since the comm msg to change the view won't be handled until the network is done training)

I do think that you'll also run into obscure race conditions within the kernel's object state if the objects you're manipulating are not built to be async / thread safe. There are a number of common libraries in most languages that would fail here if you manipulate them from two contexts at once.

Just because something could be misused

I think what Jason was pointing out was that this will often result in notebooks that cannot be rerun and Jupyter already gets a lot of mistrust for allowing out of order cell execution where it also leads to such issues, further reducing the reproducibility of the tooling. It might be a risk factor we don't want to take on in the open source solution even as an opt-in pattern.

Being able to inspect things in real-time...

That is a cool use-case. I do feel support for real-time read-only inspection of objects could be a lot better in Jupyter. I'd need to think more about how to achieve this without disrupting existing contracts if possible.

As an aside, thank both of you for engaging in this discussion. Obviously I'm here to champion what I think would be a pretty neat™ addition to the Jupyter protocol, but I understand that my motivations are different than yours. :^)

We really appreciate that you have that attitude about proposing a change. Thank you.

@twavv (Contributor, Author) commented May 20, 2020

I do think that you'll also run into obscure race conditions within the kernel's object state if the objects you're manipulating are not built to be async / thread safe. There's a number of common libraries in most languages that would fail this test in this instance if you manipulate them from two contexts at once.

I think this should be left up to the code that is trying to be async. For example, if ipywidgets wanted to implement this, they would have to be the ones to add the async flag in the JS side and make sure that all the updates that happen are concurrent-safe.

I think what Jason was pointing out was that this will often result in notebooks that cannot be rerun and Jupyter already gets a lot of mistrust for allowing out of order cell execution where it also leads to such issues, further reducing the reproducibility of the tooling. It might be a risk factor we don't want to take on in the open source solution even as an opt-in pattern.

I really don't envision this being used to do weird things like change a computation mid-execution by executing some code and people would actually have to build out the kernel extensions that do such a thing anyway.

@jasongrout (Member)

Really, the kinds of things I'm envisioning are:

For both of these usecases - can you do the network training in a separate thread and leave the main thread available to handle comm messages?
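A sketch of that suggestion, using only the standard library (the "training" loop and the command queue are stand-ins for real comm plumbing): the work runs on a background thread and polls a queue that a comm handler on the main thread could feed.

```python
import threading
import queue

def train(updates, commands):
    """Background 'training' loop that polls for comm-style commands."""
    view = "loss"
    for step in range(5):
        try:
            view = commands.get_nowait()  # e.g. switch the plot view
        except queue.Empty:
            pass
        updates.append((step, view))

updates, commands = [], queue.Queue()
worker = threading.Thread(target=train, args=(updates, commands))
commands.put("accuracy")   # a "comm message" switching the view
worker.start()
worker.join()
```

Because the main thread never blocks, the kernel's serial shell processing is untouched; the trade-off is that shared state must now be thread-safe, which is the race-condition concern raised above.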

@jasongrout (Member)

I do feel support for real-time read-only inspection of objects could be a lot better in Jupyter. I'd need to think more about how to achieve this without disrupting existing contracts if possible.

I've wanted read-only retrieving of object values for years and years as well. I experimented at one point briefly years ago with doing this on a thread. I'm hoping that the new debugger work that is planned (supporting a variable viewer) will be able to enable some of this, at least at a low level.

@twavv (Contributor, Author) commented May 20, 2020

For both of these usecases - can you do the network training in a separate thread and leave the main thread available to handle comm messages?

This is very much what I'd like to do, but that's incompatible with the Jupyter spec, since code execution requests have to finish before other things can be handled. Or did you mean I could launch the training in an async task so that Jupyter doesn't consider the code to be executing? That's less than ideal, since you don't get notifications that the code has completed, and it doesn't work if you're trying to run cells in order.

without disrupting existing contracts

I want to emphasize that this specific proposal is fully backwards compatible and does not break existing contracts. Both the frontend and backend would have to opt in to this behavior, and specific notebook extensions would have to add the async: true flag to be handled asynchronously.

@jasongrout (Member)

Or did you mean I could launch the training in an async task so that Jupyter doesn't consider the code to be executing?

Yes.

That's less than ideal since you don't get notifications that the code has completed and doesn't work if you're trying to run cells in order.

Yes. You'd have to make the rest of the cells async or aware of the out-of-order execution.

I want to emphasize that this specific proposal is fully backwards compatible and does not break existing contracts. Both the frontend and backend would have to opt in to this behavior, and specific notebook extensions would have to add the async: true flag to be handled asynchronously.

Yes, in theory. Of course, in practice, there would need to be at least one implementation (presumably in the reference ipykernel) and some extensions using it, and working out the details and ramifications of how this interplays with the current strong assumptions in the code and user expectations about messages being processed in order. This also introduces the maintenance burden of educating users and responding to surely many questions about why things are executing differently now. I would be careful to not underestimate this work and ongoing maintenance/education/support.

I think the next step here would be a much more detailed proposal of exactly what the execution paradigm would look like with this new capability in various scenarios and a writeup of the pros/cons of the ramifications of the change and a broad class of capabilities it enables. Ideally with working code changes to, say, ipykernel and a comm client, to test these assumptions. Since this would be changing fundamental core assumptions about how the protocol works, we want to move very, very carefully.

@jasongrout (Member)

One of the things not clear to me in the above discussion is the concurrency model of a kernel allowing this (going back to Sylvain's points). Exactly what does that look like to enable this async processing, and what are the ramifications of having that concurrency model?
