Dynamic client endpoints #676

Open · antlai-temporal wants to merge 2 commits into master
Conversation

antlai-temporal (Contributor):

What was changed

Clients can now dynamically update certain option fields, such as the target URL, the TLS configuration, the origin, or the keep-alive settings, without creating a new client.
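For illustration, an update might look roughly like this (a hypothetical sketch: the builder comes from the derive_builder derive on ClientOptionsUpdate shown in the diff below, but the field names, setter shapes, and update method name are illustrative, not the PR's exact API):

let update = ClientOptionsUpdateBuilder::default()
    .target_url(Some(Url::parse("https://new-frontend.example:7233")?))
    .build()?;
client.update_options(update).await?;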

Why?

This will allow, for example, updating a TLS certificate without restarting workers or losing state associated with the client, such as eager workflow start tracking or metrics.

Checklist

  1. Closes [Feature Request] Dynamic client for workers #477
  2. How was this tested: integration tests and unit tests

antlai-temporal requested a review from a team as a code owner on January 25, 2024.
/// set field to a tombstone specified in the field description.
#[derive(Clone, Debug, Default, derive_builder::Builder)]
#[non_exhaustive]
pub struct ClientOptionsUpdate {
cretz (Member), Jan 25, 2024:

Can the entire ClientOptions struct be reused instead? Not every field has to be optional. You can require that callers provide the same data they did for the original client options if they don't want to change it. The lang side doesn't need help here.

(It would also be neat if some of the runtime options like headers were set/updated via this common mechanism and didn't require a reconnect, but that doesn't have to be now.)

Member:

I think this probably makes decent sense too. I imagine the lang-side API will typically involve the user copying some existing options and changing them, in which case this all works out fine.

antlai-temporal (Contributor, Author):

These are the only fields that I can change dynamically, because they are associated with the endpoint, and I think a separate struct makes it clearer what can be changed. Also, fields like target_url are not optional in ClientOptions but need to be optional in an update, or we would have to remember the original values, which is a pain with concurrent updates...
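Concretely, the delta pattern described here looks something like this (a sketch only; the field and type names are illustrative, and None means "leave this field unchanged"):

// Sketch; the real struct also derives a builder (see the diff above).
pub struct ClientOptionsUpdate {
    /// New target URL, or None to keep the current one.
    pub target_url: Option<Url>,
    /// New TLS configuration, or None to keep the current one.
    pub tls_cfg: Option<TlsConfig>,
}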

Member:

If we go with swapping out the whole options behind a mutex which I mentioned in another comment, then the other fields should be changeable too I think, which could work out nicely.

antlai-temporal (Contributor, Author):

Things like retry_config are used by the RetryClient wrapper, which I cannot change. client_name, client_version, or identity may be used in metrics; we probably don't want to change those, and I cannot see a use case for changing them dynamically... We could fail the update if they don't match.
I still think it is easier in the API to do deltas with only the changeable attributes, but internally we could use ClientOptions instead of a changeset, if we are OK with a mutex on the options. We would still need some state for the updates that does not belong in the options, such as the channel, the version id,... but we could probably lock that independently from the options...

Member:

Why can't they be changed? I agree that if they can't be changed, it makes sense to accept something that only allows changing what can be. But I'm not sure I follow why we couldn't change those. They are accessed each time a request is made, so it seems like changing them would affect subsequent requests and work properly?

antlai-temporal (Contributor, Author):

For retry_config, we cache it inside:

impl<SG> RetryClient<SG> {
    /// Use the provided retry config with the provided client
    pub fn new(client: SG, retry_config: RetryConfig) -> Self {
        Self {
            client,
            retry_config: Arc::new(retry_config),
        }
    }
}
We could change that if needed, but I'm not sure it is that important to dynamically change retry_config...

The others we can change, but the question is whether we should: if they are used in metrics or server-side, changing them may be confusing, and I don't see much value in changing them dynamically...
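(If dynamically changing retry_config ever mattered, one option would be to make the cached copy swappable. A sketch, assuming the arc_swap crate discussed later in this thread; this is not the current code:)

use std::sync::Arc;
use arc_swap::ArcSwap;

pub struct RetryClient<SG> {
    client: SG,
    /// Swappable at runtime; each request loads a consistent snapshot.
    retry_config: ArcSwap<RetryConfig>,
}

impl<SG> RetryClient<SG> {
    /// Replace the retry config for all subsequent calls.
    pub fn set_retry_config(&self, cfg: RetryConfig) {
        self.retry_config.store(Arc::new(cfg));
    }
}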

Member:

We don't have to actually expose those to end users, but I take your point. Especially since the RetryClient is a level higher. I'm fine with having just the diff struct.

Comment on lines 427 to 428
/// If the update fails, the client does not roll back to the previous configuration, and future
/// connections will fail.
Member:

This is rough. Why can't we only apply the update after the new endpoint is connected and works? Can't you do an atomic swap of the endpoint where there is no downtime and it only contains a successfully connected endpoint?

Member:

Agreed. I would call this try_update_options and only swap if the connection succeeds. Also, I find leaving the old options in place scary, since then things aren't in sync.

It'd be fine to put a mutex around the options as well, so that they can be updated too. This also fits better with the idea of using ClientOptions directly rather than ClientOptionsUpdate.

antlai-temporal (Contributor, Author):

I've been looking into the atomic swap reusing the id for the endpoint, but the problem is that it is anything but atomic.
When you reuse the id, the new endpoint goes into a "preparing" queue while the old endpoint stays active; the swap happens when the new one connects. The problem is that this is all asynchronous: if there is a problem with the new endpoint there is no visible error, and the old one continues forever. It is also impossible to know when this swap will happen, or to validate that it happened, without sending a carefully crafted request. The endpoints are also created in lazy mode, so they need to be part of a channel to validate them.
I think the root problem is that this mechanism was designed for load balancing, not dynamic changes; if you are adding more (similar) endpoints with different targets, you can be more relaxed about exactly when the changes happen.
For this reason, I'm forcing the client to fail when there are problems; otherwise they get silently ignored...
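For context, the mechanism under discussion is tonic's balance channel, roughly as follows (a sketch; the key type, buffer size, and target are illustrative):

use tonic::transport::{Channel, Endpoint};
use tower::discover::Change;

async fn swap_by_reusing_key() -> Result<(), Box<dyn std::error::Error>> {
    // The channel load-balances over a dynamic set of endpoints keyed by usize.
    let (_channel, tx) = Channel::balance_channel::<usize>(32);

    // Reusing key 0 queues a replacement: the old endpoint keeps serving until
    // the new one is ready, and a failed connect is never surfaced to the sender.
    let new_ep = Endpoint::from_static("https://new-frontend.example:7233");
    tx.send(Change::Insert(0, new_ep)).await?;
    Ok(())
}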

cretz (Member), Jan 25, 2024:

> The problem is that this is all asynchronous

Yeah, I see that with the balance channel, and it makes sense. But can we wait to even send the insert until we have connected the endpoint? And can we send the remove after the insert? So the logic would be:

let endpoint = connect_new().await?; // connect and validate first
channel_tx.send(Change::Insert(new_seq, endpoint)).await?;
channel_tx.send(Change::Remove(old_endpoint_seq)).await?;

Also curious: if you send an insert then a remove, is it guaranteed that the next call will be on the inserted endpoint? And in the case of what's there now, with remove-then-insert, is it possible in multi-threaded use that there is a window where no endpoint is present?

> I think the root problem is that this mechanism was designed for load balancing, not dynamic changes

I wonder if we can have our own "mutex_channel" or something, or if that's too much (EDIT: I see ::new is crate-private)

> The endpoints are also created in lazy mode, so they need to be part of a channel to validate them.

This is unfortunate, but if we can't eagerly validate, OK. I now wonder if we should go higher than the channel. What I mean is: can we swap our own channel for a completely new one, instead of updating the channel?

antlai-temporal (Contributor, Author):

I don't think that you can easily swap the channel without creating a new client, and the problem with creating new clients is that there is a lot of shared state associated with them, like eager workflow start workers, or metrics,... that needs to be moved across atomically.
The semantics of the control channel are really weak. Remove happens immediately, but inserts are queued in a "preparing" queue, so there is no guarantee that after a successful insert send you are getting the new options.
To avoid the gap you can reuse the id in the insert, and then you don't need a remove, but there is no guarantee when, or if, the swap will happen.

antlai-temporal (Contributor, Author):

The problem is that it is "lazy". It will create the connection with the first request, and we want to make sure that this connection is not created until we finish the update.
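(For reference, the eager vs. lazy distinction in tonic, as a minimal sketch:)

use tonic::transport::{Channel, Endpoint};

async fn demo() -> Result<(), tonic::transport::Error> {
    let ep = Endpoint::from_static("https://localhost:7233");
    // Eager: this fails now if the target is unreachable.
    let _eager: Channel = ep.connect().await?;
    // Lazy: this always succeeds; the dial happens on the first request.
    let _lazy: Channel = ep.connect_lazy();
    Ok(())
}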

Member:

You should attempt a get-system-info call on the client before swapping it. IMO we want to make sure the connection is created and the config is known to be good well before we swap it out. We don't want to swap in a lazy or bad client; the worker will just silently fail for a minute (our default retry, IIRC) and then fatal out.

antlai-temporal (Contributor, Author):

Before the swap we can check the connection, but that doesn't help for lazy. Internally, Tonic defers the connection until it gets the first request. And even if we do a get-system-info, there is a race: someone who grabbed a reference before the update started could make the first call after we drop the lock and get the error...

cretz (Member), Jan 29, 2024:

> Before the swap we can check the connection, but that doesn't help for lazy

This is why I'm suggesting a higher-level client swap. Don't swap things at the Tonic level; swap the connected Temporal client.

> there is a race with someone else doing the first call after dropping the lock, if they grabbed it before the update started, and getting the error

I think we can accept/document this race. If you're really concerned, we can use a separate lock for the entire client update process to prevent concurrent updates and put a short timeout on the get-system-info call, but of course we should never hold a lock during regular client operations.
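A sketch of that shape (connect, get_system_info, swap_in, and UpdateError are hypothetical stand-ins for whatever builds, validates, and installs the new client):

use std::time::Duration;
use tokio::sync::Mutex;

struct Updater {
    /// Serializes updates only; regular requests never touch this lock.
    update_lock: Mutex<()>,
}

impl Updater {
    async fn try_update(&self, opts: ClientOptions) -> Result<(), UpdateError> {
        let _guard = self.update_lock.lock().await;
        let client = connect(&opts).await?; // hypothetical: fully connects first
        // Validate with a short timeout before swapping anything in.
        tokio::time::timeout(Duration::from_secs(5), get_system_info(&client))
            .await
            .map_err(|_| UpdateError::Timeout)??;
        swap_in(client, opts); // e.g. an ArcSwap store, as sketched below
        Ok(())
    }
}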

Member:

I would probably do this:

pub struct ConfiguredClient<C> {
    client_and_options: ArcSwap<ClientAndOptions<C>>,
    headers: Arc<RwLock<HashMap<String, String>>>,
    /// Capabilities as read from the `get_system_info` RPC call made on client connection
    capabilities: Option<get_system_info_response::Capabilities>,
    workers: Arc<SlotManager>,
}

pub struct ClientAndOptions<C> {
    client: C,
    options: ClientOptions,
}

If ArcSwap doesn't work out for whatever reason (it might be hard to use load() references in all cases), then you can do the Arc<RwLock<Arc<ClientAndOptions<C>>>> approach.
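A sketch of the swap with that layout (connect and ConnectError are hypothetical stand-ins for whatever builds and validates the new client before the swap):

use std::sync::Arc;
use arc_swap::ArcSwap;

async fn try_update_options<C>(
    current: &ArcSwap<ClientAndOptions<C>>,
    options: ClientOptions,
) -> Result<(), ConnectError> {
    // Hypothetical: fully connect and validate before touching shared state.
    let client: C = connect(&options).await?;
    // Readers using load() see either the old pair or the new one, never a mix.
    current.store(Arc::new(ClientAndOptions { client, options }));
    Ok(())
}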

@@ -87,6 +91,9 @@ static LONG_POLL_METHOD_NAMES: [&str; 3] = [
const LONG_POLL_TIMEOUT: Duration = Duration::from_secs(70);
const OTHER_CALL_TIMEOUT: Duration = Duration::from_secs(30);

/// Buffer size for the channel that listens to change events
const DEFAULT_BUFFER_SIZE: usize = 1024;
Member:

I suspect this can probably be a lot smaller (though I doubt it matters a whole lot either way).

antlai-temporal (Contributor, Author):

I just copied the constant used internally by tonic; frankly, I have no idea about the right size... But yes, for a control channel, that looks large...


/// Metadata to enable the dynamic configuration of a client.
#[derive(Clone, Debug)]
pub struct DynamicUpdateInfo {
Member:

Doesn't need to be pub

antlai-temporal (Contributor, Author):

Callers may need to access the changeset to get the current options when there is no mutable ref to the client.

Member:

It's not accessible as it stands anyway. It's stored as a private field and there's no accessor.

antlai-temporal (Contributor, Author):

Good point. We need to decide whether to get rid of the changeset and update ClientOptions directly (with a mutex), or not, and then either add the accessor or make everything private.


antlai-temporal (Contributor, Author):

> Agreed. I would call this try_update_options and only swap if the connection succeeds. Also I find leaving the old options to be scary since now things aren't in sync.

This is what I tried first but, as I explained above, it is all asynchronous, with lazy endpoints, and there is no error callback, so it is difficult to apply the update only when it works...
Leaving the old options is really ugly, agreed. I checked that we don't use the possibly outdated fields in any critical manner, but there is no guarantee that something won't eventually break. It is also unrealistic that we will always have a mutable ref to the client, unless we add a top-level mutex, which would mess up backwards compatibility.
Adding a lock on ClientOptions does not completely solve the problem: you need to hold both locks (also for the update field) to safely do the update, and the locks need to be acquired together, and in a consistent order, to avoid deadlocks. I thought that could be quite intrusive too...

If we can live with occasional inconsistencies, we could lock them independently... That would improve on what we have when we have no mutable ref...

> It'd be fine to put a mutex around the options as well, so that they can be updated too. This also fits better with the idea of using the ClientOptions directly rather than ClientOptionsUpdate

See discussion above... The problem is that you need to hold both locks to keep it consistent.
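To illustrate the two-lock concern (a sketch; UpdateState stands in for the channel, version id, and other update bookkeeping mentioned above):

use std::sync::Mutex;

struct ClientInner {
    options: Mutex<ClientOptions>,
    update_state: Mutex<UpdateState>,
}

impl ClientInner {
    fn apply_update(&self, new_options: ClientOptions, new_state: UpdateState) {
        // Every path must take these two locks in this same order;
        // otherwise two concurrent operations can deadlock.
        let mut opts = self.options.lock().unwrap();
        let mut state = self.update_state.lock().unwrap();
        *opts = new_options;
        *state = new_state;
    }
}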
