
RFC: Asynchronous Comm Messaging #433

Open
twavv opened this issue Apr 16, 2019 · 42 comments

@twavv (Contributor) commented Apr 16, 2019

Proposal

I'd like to add an optional flag to comm messages that would allow them to be processed asynchronously.

It would change comm data messages to look like

{
  'comm_id' : 'u-u-i-d',
  'data' : {},
  'async': false
}

where 'async': false would be optional/the default. I don't think the open/close messages need to be async.

This would allow comms, when using kernels that support this attribute, to send messages that are processed even while code is executing in the kernel. It would be up to the comm to play nice and avoid race conditions.
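To make the dispatch concrete, here is a minimal sketch in Python (since the reference kernel, ipykernel, is Python) of how a kernel honoring the opt-in flag might route messages. The names `dispatch`, `serial_queue`, and `handler` are hypothetical, not existing kernel API:

```python
import asyncio

def is_async(msg):
    # The flag is optional; a missing flag means False, i.e. today's behavior.
    return msg.get("content", {}).get("async", False)

async def dispatch(msg, serial_queue, handler):
    """Route one shell message: opted-in comm messages jump the queue."""
    if is_async(msg):
        # Opt-in path: handle immediately, even mid-execution.
        await handler(msg)
    else:
        # Default path: queue behind pending execute_requests, as today.
        await serial_queue.put(msg)
```

A kernel that does not understand the flag simply ignores it and queues everything, which is why the proposal would be non-breaking.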

Use Case

I'm trying to implement a way to get data back from the browser synchronously when using WebIO.jl. Consider this code.

s = Scope()
display(s(node(:p, "Hello, world!")))
fetch(evaljs(s, js"navigator.platform"))

Using frontends other than IJulia/Jupyter Notebook/Jupyter Lab, you get a result:

Dict{String,Any} with 4 entries:
  "requestId" => "10997170149360157050"
  "request"   => "eval"
  "type"      => "response"
  "result"    => "Linux x86_64"

Using IJulia, it hangs forever. This is roughly what happens.

  • The frontend sends the chunk of code to execute to IJulia
  • IJulia begins executing (kernel becomes busy)
  • IJulia sends a comm message (resulting from evaljs)
  • We fetch the future returned by evaljs
  • WebIO waits for a response on the comm
  • The browser executes the JavaScript code and sends the response to the comm

but then! IJulia can't process the message from the comm because it's still busy, waiting for the evaljs future to resolve, and the Jupyter protocol dictates that messages should be handled serially. So the response from the browser sits in the queue forever, is never processed, and so we end up waiting forever for the fetch to complete (or until the user interrupts the kernel).
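The deadlock can be shown with a toy model of a serial shell queue (pure Python, nothing IJulia-specific): the execute handler waits for a flag that only the next queued handler would set, but that handler cannot run until the first one returns.

```python
from collections import deque

state = {"comm_reply_seen": False}

def execute_request():
    # Models the cell containing fetch(evaljs(...)): it cannot finish
    # until the browser's comm reply has been processed.
    for _ in range(1000):
        if state["comm_reply_seen"]:
            return "result"
    return "hung"

def comm_reply():
    # Models the browser's response arriving on the comm.
    state["comm_reply_seen"] = True

queue = deque([execute_request, comm_reply])

# Serial processing: a handler must return before the next one runs,
# so comm_reply never executes and the cell hangs.
outcome = queue.popleft()()
```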

Semantics

To be non-breaking, respecting the async keyword must be completely optional. It would be up to the comm to know whether or not its messages will be executed asynchronously (e.g. by checking the version of the kernel they're running under) and working either way. Importantly, this could be implemented right now and all kernels would still be compliant with the latest version of the protocol.

@twavv (Contributor, Author) commented Apr 16, 2019

(ping @minrk - don't mean to bug but would appreciate a cursory glance)

@twavv (Contributor, Author) commented Jul 25, 2019

Ping @minrk @ellisonbg @Carreau @gnestor @jasongrout

Would just love to get some eyes on this. :^)

@Carreau (Member) commented Jul 26, 2019

I'm not super aware of how comms are working, so I'll defer to those with more experience. @SylvainCorlay, @martinRenou, @maartenbreddels.

@SylvainCorlay (Member)

Unfortunately, it is not really possible to add this async flag, because it would break the contract that shell channel messages are processed in order by the kernel.

A good way to wait for user interaction with comms may be to do the same as what is done for Jupyter interactive widgets. See e.g. https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Asynchronous.html#Waiting-for-user-interaction
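For reference, the recipe in that ipywidgets example looks roughly like this. This is a self-contained sketch: `FakeWidget` is a stand-in for a real ipywidget so the snippet runs without a kernel, while `wait_for_change` follows the shape shown in the linked docs.

```python
import asyncio

class FakeWidget:
    """Stand-in for an ipywidget with observe/unobserve semantics."""
    def __init__(self):
        self._observers = []
        self.value = None

    def observe(self, callback, name):
        self._observers.append(callback)

    def unobserve(self, callback, name):
        self._observers.remove(callback)

    def set_value(self, v):
        # Simulates a comm message from the frontend updating the widget.
        self.value = v
        for cb in list(self._observers):
            cb(type("Change", (), {"new": v})())

def wait_for_change(widget, name):
    # Resolve a future on the widget's next change, then stop observing.
    future = asyncio.get_running_loop().create_future()
    def getvalue(change):
        future.set_result(change.new)
        widget.unobserve(getvalue, name)
    widget.observe(getvalue, name)
    return future

async def demo():
    w = FakeWidget()
    fut = wait_for_change(w, "value")
    # Awaiting frees the event loop, so the "comm message" can run:
    asyncio.get_running_loop().call_soon(w.set_value, 42)
    return await fut
```

The key point is that the coroutine awaiting the future releases the event loop, which is what lets the kernel process the comm message that eventually resolves it.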

@twavv (Contributor, Author) commented Jul 26, 2019

Is that an absolutely hard requirement? This would be an opt-in kind of thing, so if the client doesn't support it, it would work exactly the same.

I'm also not clear how the ipywidgets example works - does the sequence of events look like

  • Frontend sends code request, starts executing
  • Code awaits future
  • Frontend sends update widget request
  • Task is resumed
  • Code result is sent?

Because if so, isn't that violating the rule that the shell handles messages in order? (It handles the comm data message out of order, at least with respect to when the requests finish.)

@SylvainCorlay (Member)

Is that an absolutely hard requirement? This would be an opt-in kind of thing, so if the client doesn't support it, it would work exactly the same.

Yes, it is a requirement. Comm messages are queued with execution requests. In your example, you need to release the current execution request to process the comm message.

@SylvainCorlay (Member)

Because if so isn't that violating that the shell handles messages in order (it handles the comm data message out of order, at least wrt when the requests finish).

We consider that the first request has "finished processing" the execution request as soon as it awaits the future.

@twavv (Contributor, Author) commented Aug 1, 2019

Ah - I didn't look at the ipywidgets example closely. I understand.

Yes, it is a requirement. Comm messages are queued with execution requests.

Isn't this just an implementation detail of ipykernel? I don't see why it needs to be baked into the kernel protocol itself and allowing it could enable the kind of functionality in that example in a significantly less hacky manner (i.e. dropping the use of ensure_future to do something after the request finishes - sort of akin to using async/await rather than callbacks in JS land).

The current status quo means that any code would have to essentially be wrapped in an async def and then ensure_future'd which I think is way less than ideal (people writing the code in the notebook shouldn't have to worry about that kind of thing).
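The status-quo workaround being described looks roughly like this (a sketch of the pattern only; none of these names come from ipykernel):

```python
import asyncio

async def long_running_work(updates):
    # Pretend long-running job, e.g. training a model.
    for step in range(3):
        updates.append(step)
        await asyncio.sleep(0)  # yield so other handlers could run

async def cell():
    updates = []
    # The notebook cell schedules the work and returns immediately, so
    # Jupyter marks execution complete while the work is still pending.
    task = asyncio.ensure_future(long_running_work(updates))
    assert updates == []   # the "cell" finished before any work ran
    await task             # the event loop completes it afterwards
    return updates
```

The cost is exactly the one described above: execution-state reporting and run-cells-in-order semantics no longer reflect the real work.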

@SylvainCorlay (Member)

It is specified in the protocol in that both execution requests and comm messages go through the shell channel, and are therefore processed in order. Hence, any kernel properly implementing the protocol will have that queuing behavior even when they don't have the same concurrency model as ipykernel.

@SylvainCorlay (Member)

Note: There are other channels in the protocol, such as the control channel, which is used for e.g. shutdown and interrupt messages, and done in a way that the shutdown request is not queued behind execution requests. We now use the control channel for debug messages so that we can e.g. add a breakpoint to a loop while code is running and have the debugger interrupt the execution at the next iteration. (Note that for that, we had to re-write a kernel which uses threading instead of an event loop.)

But I really don't think it is appropriate at all for user messages.

@SylvainCorlay (Member)

Your pain point here is due to the different paradigms between code running as the result of an execution request and WebIO code, which seems to be in the main execution flow.

@twavv (Contributor, Author) commented Aug 1, 2019

It is specified in the protocol in that both execution requests and comm messages go through the shell channel, and are therefore processed in order. Hence, any kernel properly implementing the protocol will have that queuing behavior even when they don't have the same concurrency model as ipykernel.

Fair enough but what I'm proposing is opt-in (via some attribute on the message itself) which means that kernels that don't understand this feature would continue to be compliant (that is - the spec would say that honoring the async flag is optional and dependent on the kernel) and the existing behavior would be preserved for all existing code that doesn't explicitly set the async flag.

Your pain point here is due to the different paradigmes between code running as the result of an execution request and webio code which seem to be in the main execution flow.

I'm not sure what you mean by this. The particular use case would be for something like training a machine learning model where you could switch between viewing the error and accuracy plots or update some other parameter while a process is ongoing.

I think the widgets example that you linked to is a prime example of this use case where you want to wait for input before doing something which should be considered part of the currently executing code-execution request.

@jasongrout (Member)

It sounds like you may want some sort of new introspection messages - probably on the control channel, like the new debugging messages.

@SylvainCorlay (Member)

It is specified in the protocol in that both execution requests and comm messages go through the shell channel, and are therefore processed in order. Hence, any kernel properly implementing the protocol will have that queuing behavior even when they don't have the same concurrency model as ipykernel.

Fair enough but what I'm proposing is opt-in (via some attribute on the message itself) which means that kernels that don't understand this feature would continue to be compliant (that is - the spec would say that honoring the async flag is optional and dependent on the kernel) and the existing behavior would be preserved for all existing code that doesn't explicitly set the async flag.

Messages on a socket are processed in order. An "async" attribute in the content cannot really change that!

Your pain point here is due to the different paradigms between code running as the result of an execution request and webio code which seem to be in the main execution flow.

I'm not sure what you mean by this.

I mean that with the Jupyter protocol, if you want a long-running process to regularly send updates to (or get content from) the front-end, it needs to adopt some concurrency strategy to avoid blocking, either by using the kernel event loop or running in a thread.

@SylvainCorlay (Member) commented Aug 1, 2019

new introspection messages - probably on the control channel, like the new debugging messages.

You can't really do that, because you would need to start imposing a concurrency model on the control channel with respect to the shell channel, while it is not constrained at the moment.

Kernels based on event loops (such as ipykernel, or the slicer3d xeus-based kernel) would not be compliant anymore, in that you would still need the currently processed message to complete before processing the next one...

@jasongrout (Member)

You could ask on the control channel "Give me the value of this variable as of now" where "now" means whenever the kernel can process that message. No guarantees about when that is, but the kernel does its best effort as soon as possible. Just like shutdown and interrupt messages.

@SylvainCorlay (Member)

You could ask on the control channel "Give me the value of this variable as of now" where "now" means whenever the kernel can process that message. No guarantees about when that is, but the kernel does its best effort as soon as possible. Just like shutdown and interrupt messages.

OK, that makes sense. Although it would still not solve @travigd's problem.

@jasongrout (Member)

You could also imagine a debugging-type message that would stop the main thread, change a value, and keep running, right?

@SylvainCorlay (Member)

You could also imagine a debugging-type message that would stop the main thread, change a value, and keep running, right?

Absolutely, although debugging kernels can only interrupt / break into running code if they have a threading concurrency model (like xeus-python), not an event loop model.

And I don't think it would be a sensible way for people to build widgets-based UIs, they will get crazy race conditions.

@twavv (Contributor, Author) commented Aug 2, 2019

It sounds like you may want some sort of new introspection messages - probably on the control channel, like the new debugging messages.

It seems to me like this should be solved on the shell level, and I'm really just trying to use comms (I don't really wanna reach into the control socket too and WebIO itself doesn't really have a concept of separate channels - it models communication as just being a single bidirectional pipe).

Messages on a socket are processed in order. An "async" attribute in the content cannot really change that!

My impression was that the shell socket can handle messages out of order because it's a router ZMQ socket, though I might be wrong.

@SylvainCorlay (Member) commented Aug 2, 2019

My impression was that the shell socket can handle messages out of order because it's a router ZMQ socket, though I might be wrong.

I don't think you can handle them out of order, but regardless, it only processes one message at a time.

Even if the response from the front-end is first in the queue, in your example it is queued until after the execution request has completed, as you described in the initial message.

@twavv (Contributor, Author) commented Aug 2, 2019

it only processes one message at a time

I'm not 100% clear what "it" is referring to (the ZMQ socket, the kernel, something else?), but I don't think this is a technical limitation as far as the socket is concerned. The ZMQ socket can have multiple requests in flight and respond to them in any order. This is a property of the dealer/router setup (which is fully asynchronous and doesn't inherently impose any request-reply semantics).

The kernel (as far as I understand) is currently set up so that requests are handled serially (i.e., the current request must finish before the next request can begin), but this doesn't seem to be a Jupyter protocol limitation.

@twavv (Contributor, Author) commented Aug 2, 2019

Which is to say that, given a router socket "shell" and a dealer socket "frontend", it's valid (from a ZMQ POV and subject to the "as far as I understand" and "it worked for me when I tried it in a REPL" qualifications) for this to occur:

  • frontend sends request 100 to the shell
  • frontend sends request 101 to the shell
  • shell replies with response to 101
  • shell replies with response to 100

(though in the specific case I'm advocating for, the shell wouldn't actually reply to 101 because it would be a comm message which doesn't generate shell replies).
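The claim about router sockets can be modeled without ZMQ at all (pure-Python sketch; a real experiment would use pyzmq's `zmq.ROUTER`/`zmq.DEALER` sockets): replies are matched to peers by identity envelope, not by arrival order.

```python
# A ROUTER socket's view of in-flight requests: each message arrives
# with an identity envelope, and a reply is addressed by that envelope.
in_flight = {}

def recv(envelope, msg_id, body):
    in_flight[msg_id] = (envelope, body)

def reply(msg_id, result):
    envelope, _ = in_flight.pop(msg_id)
    return (envelope, msg_id, result)  # routed back to that peer

recv(b"frontend", 100, "execute_request")
recv(b"frontend", 101, "comm_msg")

# Nothing at the socket layer forces 100 to be answered first:
first = reply(101, "handled")
second = reply(100, "ok")
```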

@twavv (Contributor, Author) commented May 12, 2020

Bump

This is still causing lots of issues. :')

@MSeal (Contributor) commented May 20, 2020

Sorry, I wasn't really tied into this thread originally. Can I ask why the front-end doesn't buffer sending request 101 to the shell until request 100 has completed? In Jupyter, the clients typically need to control the flow of requests to match the kernel's state responses, rather than the other way around. I also might be misunderstanding the issue, so forgive me if that's off the mark.

@jasongrout (Member)

Often a frontend will batch a list of requests to the server. For example, executing all cells in a notebook immediately sends all execution requests to the server. You can then close your notebook, go home, etc., and come back later and have all executions done.

@twavv (Contributor, Author) commented May 20, 2020

Sorry, I wasn't really tied into this thread originally. Can I ask why the front-end doesn't buffer sending request 101 to the shell until request 100 has completed? In Jupyter, the clients typically need to control the flow of requests to match the kernel's state responses, rather than the other way around. I also might be misunderstanding the issue, so forgive me if that's off the mark.

Jason is right, and in general, a kernel can have a large queue of messages waiting to be processed.

In my case, that's exactly what I don't want to happen. I want to be able to have a comm message be handled while something else is occupying the "main" control flow (e.g., update a parameter for a plot while a network is training and have the plot reflect that change).

@jasongrout (Member)

Note that comm messages can have side effects, can generate output on the iopub channel, can spawn return comm messages on the shell channel, etc. They are essentially restricted execute request messages.

@twavv (Contributor, Author) commented May 20, 2020

Sure, but why is that a concern here? It would be up to the implementing kernel to make sure the resulting messages have the correct parentId set in the IOPub message headers to indicate whether or not the message resulted from the comm.

@jasongrout (Member) commented May 20, 2020

Side effects means that it can affect kernel state, which affects later execution requests. If you have request A, comm message, request B, the execution of request B can change depending on what happens in the comm message processing.

They are essentially restricted execute request messages.

This is essentially why we order them with execute messages - they are essentially execution messages, just to a specific object in the kernel rather than the kernel itself.
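A tiny example makes the ordering point concrete (toy namespace; the handler names are hypothetical): reordering a state-mutating comm message around an execute request changes the request's result.

```python
namespace = {"epochs": 10}

def comm_msg():
    # A widget interaction that mutates kernel state.
    namespace["epochs"] = 5

def request_b():
    # An execute request whose result depends on that state.
    return namespace["epochs"] * 2

# In order: the comm message is processed before request B.
comm_msg()
in_order = request_b()       # sees epochs == 5

# Reordered: request B runs first and sees the old value.
namespace["epochs"] = 10
reordered = request_b()      # sees epochs == 10
comm_msg()
```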

@jasongrout (Member)

Part of the issue here is that comm messages can invoke arbitrary code execution.

What is the underlying thing you are trying to accomplish? Is there a new message that is side-effect free that could be implemented, which could be reordered with execution messages?

The debug messages on the control channel give some examples of messages that can 'jump the queue' and are executed immediately, and could involve requests for variable state, which would be side-effect free.

@twavv (Contributor, Author) commented May 20, 2020

I don't want to skip the side effects, I want to execute arbitrary code.

I don't fully understand why side effects are a bad thing here.

the execution of request B can change depending on what happens in the comm message processing.

Why is that bad? One thing that comes to mind is that it enables notebooks that might not be reproducible by others, but that's not always a goal of people that are using notebooks (and there are lots of ways to make your notebooks non-reproducible besides this).

@jasongrout (Member)

Why is that bad? One thing that comes to mind is that it enables notebooks that might not be reproducible by others, but that's not always a goal of people that are using notebooks (and there are lots of ways to make your notebooks non-reproducible besides this).

It's making notebooks not reproducible by you in consecutive runs either. It's deliberately introducing a race condition where there wasn't one before. That doesn't make it "evil", but it does fundamentally change the contract we've had between frontends and backends that things are processed in order, which is a pretty fundamental assumption in the Jupyter protocol. If we are rolling that assumption back, why not make the async field available on execute requests as well? That's effectively what we would be doing.

@MSeal (Contributor) commented May 20, 2020

It's making notebooks not reproducible by you in consecutive runs either. It's deliberately introducing a race condition where there wasn't one before.

👍 to that statement. If you're trying to execute independent snippets that are completely unrelated, which would be the only time this is safe to violate, maybe they shouldn't be in the same notebook / kernel execution queue.

@twavv (Contributor, Author) commented May 20, 2020

fundamentally change the contract we've had between frontends and backends

The actual change I'm proposing makes this functionality opt-in. Both the frontend and backend would have to opt-in. The frontend by setting an async flag and the backend by honoring it.

It's making notebooks not reproducible by you in consecutive runs either

Just because something could be misused, I don't think that's a terribly strong argument against it. I don't think it should be used to direct a computation in any meaningful way, for example.

Really, the kinds of things I'm envisioning are:

  • Being able to switch "views" of a network's training while it's in progress (whereas this isn't currently possible since the comm msg to change the view won't be handled until the network is done training)
  • Being able to inspect things in real-time. I don't mean code debugging, but rather... in a course I help run, we do the pretty standard "train a network on the MNIST data set" and at the end, the students can draw their own digits and see that they get classified correctly. It'd be cool™ to enable that while the thing is in progress too.

As an aside, thank both of you for engaging in this discussion. Obviously I'm here to champion what I think would be a pretty neat™ addition to the Jupyter protocol, but I understand that my motivations are different than yours. :^)

@MSeal (Contributor) commented May 20, 2020

Being able to switch "views" of a network's training while it's in progress (whereas this isn't currently possible since the comm msg to change the view won't be handled until the network is done training)

I do think that you'll also run into obscure race conditions within the kernel's object state if the objects you're manipulating are not built to be async / thread safe. There are a number of common libraries in most languages that would fail here if you manipulate them from two contexts at once.

Just because something could be misused

I think what Jason was pointing out was that this will often result in notebooks that cannot be rerun and Jupyter already gets a lot of mistrust for allowing out of order cell execution where it also leads to such issues, further reducing the reproducibility of the tooling. It might be a risk factor we don't want to take on in the open source solution even as an opt-in pattern.

Being able to inspect things in real-time...

That is a cool use-case. I do feel support for real-time read-only inspection of objects could be a lot better in Jupyter. I'd need to think more about how to achieve this without disrupting existing contracts if possible.

As an aside, thank both of you for engaging in this discussion. Obviously I'm here to champion what I think would be a pretty neat™ addition to the Jupyter protocol, but I understand that my motivations are different than yours. :^)

We really appreciate that you have that attitude about proposing a change. Thank you.

@twavv (Contributor, Author) commented May 20, 2020

I do think that you'll also run into obscure race conditions within the kernel's object state if the objects you're manipulating are not built to be async / thread safe. There's a number of common libraries in most languages that would fail this test in this instance if you manipulate them from two contexts at once.

I think this should be left up to the code that is trying to be async. For example, if ipywidgets wanted to implement this, they would have to be the ones to add the async flag in the JS side and make sure that all the updates that happen are concurrent-safe.

I think what Jason was pointing out was that this will often result in notebooks that cannot be rerun and Jupyter already gets a lot of mistrust for allowing out of order cell execution where it also leads to such issues, further reducing the reproducibility of the tooling. It might be a risk factor we don't want to take on in the open source solution even as an opt-in pattern.

I really don't envision this being used to do weird things like change a computation mid-execution by executing some code and people would actually have to build out the kernel extensions that do such a thing anyway.

@jasongrout (Member)

Really, the kinds of things I'm envisioning are:

For both of these usecases - can you do the network training in a separate thread and leave the main thread available to handle comm messages?
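A sketch of that suggestion, using only the standard library (the "training" loop and the command queue are stand-ins for real comm plumbing): the work runs on a background thread and polls a queue that a comm handler on the main thread could feed.

```python
import threading
import queue

def train(updates, commands):
    """Background 'training' loop that polls for comm-style commands."""
    view = "loss"
    for step in range(5):
        try:
            view = commands.get_nowait()  # e.g. switch the plot view
        except queue.Empty:
            pass
        updates.append((step, view))

updates, commands = [], queue.Queue()
worker = threading.Thread(target=train, args=(updates, commands))
commands.put("accuracy")   # a "comm message" switching the view
worker.start()
worker.join()
```

Because the main thread never blocks, the kernel's serial shell processing is untouched; the trade-off is that shared state must now be thread-safe, which is the race-condition concern raised above.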

@jasongrout (Member)

I do feel support for real-time read-only inspection of objects could be a lot better in Jupyter. I'd need to think more about how to achieve this without disrupting existing contracts if possible.

I've wanted read-only retrieving of object values for years and years as well. I experimented at one point briefly years ago with doing this on a thread. I'm hoping that the new debugger work that is planned (supporting a variable viewer) will be able to enable some of this, at least at a low level.

@twavv (Contributor, Author) commented May 20, 2020

For both of these usecases - can you do the network training in a separate thread and leave the main thread available to handle comm messages?

This is very much what I'd like to do, but that's incompatible with the Jupyter spec, since code execution requests have to finish before other things can be handled. Or did you mean I could launch the training in an async task so that Jupyter doesn't consider the code to be executing? That's less than ideal, since you don't get notifications that the code has completed, and it doesn't work if you're trying to run cells in order.

without disrupting existing contracts

I want to emphasize that this specific proposal is fully backwards compatible and does not break existing contracts. Both the frontend and backend would have to opt in to this behavior, and specific notebook extensions would have to add the async: true flag to be handled asynchronously.

@jasongrout (Member)

Or did you mean I could launch the training in an async task so that Jupyter doesn't consider the code to be executing?

Yes.

That's less than ideal since you don't get notifications that the code has completed and doesn't work if you're trying to run cells in order.

Yes. You'd have to make the rest of the cells async or aware of the out-of-order execution.

I want to emphasize that this specific proposal is fully backwards compatible and does not break existing contracts. Both the frontend and backend would have to opt in to this behavior, and specific notebook extensions would have to add the async: true flag to be handled asynchronously.

Yes, in theory. Of course, in practice, there would need to be at least one implementation (presumably in the reference ipykernel) and some extensions using it, and working out the details and ramifications of how this interplays with the current strong assumptions in the code and user expectations about messages being processed in order. This also introduces the maintenance burden of educating users and responding to surely many questions about why things are executing differently now. I would be careful to not underestimate this work and ongoing maintenance/education/support.

I think the next step here would be a much more detailed proposal of exactly what the execution paradigm would look like with this new capability in various scenarios and a writeup of the pros/cons of the ramifications of the change and a broad class of capabilities it enables. Ideally with working code changes to, say, ipykernel and a comm client, to test these assumptions. Since this would be changing fundamental core assumptions about how the protocol works, we want to move very, very carefully.

@jasongrout (Member)

One of the things not clear to me in the above discussion is the concurrency model of a kernel allowing this (going back to Sylvain's points). Exactly what does that look like to enable this async processing, and what are the ramifications of having that concurrency model?
