Shutdown at specific epoch when applying new consensus rules #948

Open
justinmoon opened this issue Nov 21, 2022 · 11 comments
Comments

@justinmoon
Contributor

justinmoon commented Nov 21, 2022

When making an upgrade, we need all peers to shut down at the same epoch. Otherwise, different peers might end up in different states.

Then everyone switches out fedimintd and applies the new rules at the same time.

@jkitman
Contributor

jkitman commented Dec 16, 2022

Peers can signal a desire to shut down via a ConsensusItem, perhaps passing in an optional consensus config hash, and when a threshold of peers are signaling in consensus, we automatically shut down.
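This threshold-signaling idea could look roughly like the following sketch in Rust (all names here, including the `ConsensusUpgrade` variant and `upgrade_agreed`, are illustrative, not the actual fedimint API):

```rust
use std::collections::HashMap;

// Hypothetical consensus item: a peer signals willingness to shut down,
// optionally committing to a hash of the proposed new consensus config.
#[allow(dead_code)]
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum ConsensusItem {
    ConsensusUpgrade(Option<[u8; 32]>),
}

/// Returns the agreed-upon config hash if at least `threshold` peers
/// are signaling for the same upgrade, otherwise `None`.
fn upgrade_agreed(
    signals: &HashMap<u16, Option<[u8; 32]>>, // peer id -> signaled hash
    threshold: usize,
) -> Option<Option<[u8; 32]>> {
    // Count how many peers signaled each distinct hash.
    let mut counts: HashMap<&Option<[u8; 32]>, usize> = HashMap::new();
    for hash in signals.values() {
        *counts.entry(hash).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .find(|(_, n)| *n >= threshold)
        .map(|(h, _)| *h)
}
```

With a 3-of-4 threshold, shutdown triggers only once three peers signal the same config hash; a lone dissenting or lagging peer cannot block or force it.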

@dpc
Contributor

dpc commented Dec 16, 2022

Shutdown and then get restarted automatically by systemd? I don't think the shutdown changes anything here. If anything, fedimintd should just refuse to generate any new epochs and wait (for the operator, or something else, to unblock it).

I guess we either invest in writing proper "migration" logic, where fedimintd can support both the previous and the new consensus rules and switch automatically at a given (probably ConsensusItem-driven) epoch (more work for fedimint devs and module developers, smoother experience for the guardians), or we just stop and have all the guardians coordinate a simultaneous restart.

I wonder if we could even have an easy hybrid. We could have two sets of consensus ports. Before upgrade the guardian would run the fedimintd like this:

fedimintd-0.14 --upgrade-new-bin fedimintd-0.15

After reaching the upgrade epoch, fedimintd-0.14 would restart itself. On start it would detect that it is already at the upgrade epoch and begin operating in read-only mode on an alternative set of ports; it would also start fedimintd-0.15, which would take over normal operation.

Each fedimintd would know that if it is connecting to a peer that reports a higher version than its own, it should try the alternative set of ports.

This way the old version can stick around indefinitely and allow any peers that were down, etc., to finish syncing to the upgrade epoch when possible, while at no point do we need a binary that can support both the old and the new protocol.
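The alternative-port fallback described above might be sketched like this (all names hypothetical, not the real fedimint networking code):

```rust
// Each peer advertises two port sets: the primary one for the currently
// running consensus, and an alternative one served by the read-only
// (legacy) instance after an upgrade.
#[allow(dead_code)]
struct PeerEndpoint {
    host: String,
    primary_port: u16,
    alternative_port: u16,
}

/// Pick which port to dial: if the peer reports a protocol version newer
/// than ours, it has already upgraded, so we (the old binary) should talk
/// to its read-only legacy instance on the alternative port.
fn port_for_peer(our_version: u32, peer_version: u32, ep: &PeerEndpoint) -> u16 {
    if peer_version > our_version {
        ep.alternative_port
    } else {
        ep.primary_port
    }
}
```

A not-yet-upgraded fedimintd-0.14 connecting to an upgraded peer would thus land on the legacy instance and could still sync up to the upgrade epoch.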

After guardians confirm that all peers are up to date, they can turn off the previous version and run just

fedimintd-0.15

It gives a very similar user (guardian) experience, while as fedimintd devs we only ever need to test all peers running the same version at a time.

For end users this will probably not work, but if we have an alternative set of ports for user RPCs as well, then users detecting an unsupported peer version can still try to connect to the old version and at least get some read-only responses and information about the upgrade. Though I still think clients will have to support a window of at least two versions at the same time to allow smooth upgrades.

@jkitman
Contributor

jkitman commented Dec 17, 2022

After reaching the upgrade epoch, fedimintd-0.14 would restart itself. On start it would detect that it is already at the upgrade epoch and begin operating in read-only mode on an alternative set of ports; it would also start fedimintd-0.15, which would take over normal operation.

Yeah, I was imagining something like that would be good to minimize downtime, since the guardians might not be able to manually coordinate the upgrade within a short time window.

@elsirion
Contributor

I don't think we should focus on uncoordinated upgrades too much yet. Being able to upgrade at all would be a huge improvement and is far lower complexity. Adding automatic switchover or even running two federations in parallel can build on top of that and would require far more work right now imo.

@jkitman
Contributor

jkitman commented Dec 26, 2022

@elsirion What do you think of this for the MVP upgrade:

  1. Guardians modify the consensus config to change version number or other consensus values
  2. Hash of config is sent as a consensus item ConsensusUpgrade(hash)
  3. Once a threshold of peers is signaling for the same upgrade, automatically stop consensus at that epoch and shut down
  4. If the version changed, or other items such as pubkeys, start the new binary and run any required DB migrations
  5. Guardians connect to each other with the new consensus config and validate with peers that it is correct

For version upgrades, we could eventually automate the steps with an upgrade script.
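Step 2 of the flow above could be sketched as follows (hypothetical: std's `DefaultHasher` stands in for whatever consensus-grade hash fedimint would actually use, and the config struct is reduced to a single field):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative stand-in for the consensus config; the real one carries
// peer keys, module configs, and other consensus-critical values.
#[derive(Hash)]
struct ConsensusConfig {
    version: u32,
}

/// Hash the proposed config so peers can signal for the *same* upgrade
/// via ConsensusUpgrade(hash) and detect any divergence in their copies.
fn config_hash(cfg: &ConsensusConfig) -> u64 {
    let mut h = DefaultHasher::new();
    cfg.hash(&mut h);
    h.finish()
}
```

Two guardians who modified their config copies identically produce the same hash; any disagreement in consensus values shows up as a different hash and the threshold is never reached.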

@elsirion
Contributor

@jkitman First modifying the config and then stopping consensus at a certain epoch doesn't work imo: you need the old config to keep running till that epoch. We don't need to automate this for now; we can just have an admin command that tells the fedimintd to shut down after epoch x:

  1. Every guardian tells their fedimintd to stop at epoch x (e.g. during epoch ~x-100 so that there's enough time)
  2. All fedimintds stop at epoch x
  3. Everyone upgrades the config and possibly fedimint binaries
  4. fedimintd is restarted, applies DB migrations and continues to run from epoch x+1 with new consensus rules
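The "stop at epoch x" admin command in steps 1-2 might be sketched as (names are illustrative, not the actual fedimintd admin API):

```rust
// Minimal model of a fedimintd that can be told to stop cleanly
// after a given epoch is finalized.
struct Fedimintd {
    shutdown_at_epoch: Option<u64>,
}

impl Fedimintd {
    /// Admin command: schedule a clean stop after `epoch`.
    fn schedule_shutdown(&mut self, epoch: u64) {
        self.shutdown_at_epoch = Some(epoch);
    }

    /// Checked at the top of the consensus loop before producing
    /// `next_epoch`; once it returns false, the daemon exits.
    fn should_run_epoch(&self, next_epoch: u64) -> bool {
        match self.shutdown_at_epoch {
            Some(stop) => next_epoch <= stop,
            None => true,
        }
    }
}
```

Because every guardian schedules the same epoch x well in advance, all peers halt at exactly the same consensus height even if the operators act minutes or hours apart.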

There will be downtime, but there's so much less that can go wrong. Specifying and testing a watertight version of your on-the-fly upgrade approach will be much more involved imo and could introduce quite some instability before we even have an MVP.

@jkitman
Contributor

jkitman commented Jan 3, 2023

@elsirion I meant modify a copy of the consensus config, rather than modify the existing config in-place. A script could replace the existing config with the modified one once enough guardians agree.

@jkitman
Contributor

jkitman commented Mar 9, 2023

@elsirion Was working on this and realized we don't really have an authenticated way for a guardian to send messages to their fedimintd server, do we?

@dpc
Contributor

dpc commented Mar 9, 2023

@elsirion Was working on this and realized we don't really have an authenticated way for a guardian to send messages to their fedimintd server, do we?

My previous attempt at that is #1171. I never had time to get back to it, but there's a discussion about the design there. @jkitman

@elsirion
Contributor

elsirion commented Mar 9, 2023

Ah, so we did post something on GH! I just didn't search in PRs … I opened #1841 to track it.

This was referenced Mar 14, 2023
@justinmoon
Contributor Author

justinmoon commented Nov 16, 2023

dev call: @joschisan says we might close this issue as "not planned". This would mean we can't make consensus changes in a backwards-compatible way without changing the minor consensus version (#3571).
