Shutdown at specific epoch when applying new consensus rules #948

Open
justinmoon opened this issue Nov 21, 2022 · 11 comments
Comments

@justinmoon
Contributor

justinmoon commented Nov 21, 2022

When making an upgrade, we need all peers to shut down at the same epoch. Otherwise, different peers might end up in different states.

Then everyone switches out fedimintd and applies the new rules at the same time.

@jkitman
Contributor

jkitman commented Dec 16, 2022

Peers can signal a desire to shut down via a ConsensusItem, perhaps passing in an optional consensus config hash, and when a threshold of peers are signaling in consensus, we automatically shut down.
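This threshold-signaling idea could look roughly like the following sketch in Rust (all names here, including the `ConsensusUpgrade` variant and `upgrade_agreed`, are illustrative, not the actual fedimint API):

```rust
use std::collections::HashMap;

// Hypothetical consensus item: a peer signals willingness to shut down,
// optionally committing to a hash of the proposed new consensus config.
#[allow(dead_code)]
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum ConsensusItem {
    ConsensusUpgrade(Option<[u8; 32]>),
}

/// Returns the agreed-upon config hash if at least `threshold` peers
/// are signaling for the same upgrade, otherwise `None`.
fn upgrade_agreed(
    signals: &HashMap<u16, Option<[u8; 32]>>, // peer id -> signaled hash
    threshold: usize,
) -> Option<Option<[u8; 32]>> {
    // Count how many peers signaled each distinct hash.
    let mut counts: HashMap<&Option<[u8; 32]>, usize> = HashMap::new();
    for hash in signals.values() {
        *counts.entry(hash).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .find(|(_, n)| *n >= threshold)
        .map(|(h, _)| *h)
}
```

With a 3-of-4 threshold, shutdown triggers only once three peers signal the same config hash; a lone dissenting or lagging peer cannot block or force it.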

@dpc
Contributor

dpc commented Dec 16, 2022

Shutdown and then get restarted automatically by systemd? I don't think the shutdown changes anything here. If anything, fedimintd should just refuse to generate any new epochs and wait (for the operator, or something else, to unblock it).

I guess we either invest in writing proper "migration" logic, where fedimintd can support both the previous and the new consensus rules and switch automatically at a given (probably ConsensusItem-driven) epoch (more work for fedimint devs and module developers, smoother experience for the guardians), or we just stop and have all the guardians coordinate a simultaneous restart.

I wonder if we could even have an easy hybrid. We could have two sets of consensus ports. Before upgrade the guardian would run the fedimintd like this:

fedimintd-0.14 --upgrade-new-bin fedimintd-0.15

After reaching the upgrade epoch, fedimintd-0.14 would restart itself. On start it would detect that it is already at the upgrade epoch and begin operating in read-only mode on an alternative set of ports; it would also start fedimintd-0.15, which would take over normal operation.

Each fedimintd would know that if it is connecting to a peer that reports a higher version than its own, it should try the alternative set of ports.

This way the old version can stick around indefinitely and allow any peers that were down, etc., to finish syncing to the upgrade epoch when possible, while at no point do we need a binary that can support both the old and the new protocol.
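The alternative-port fallback described above might be sketched like this (all names hypothetical, not the real fedimint networking code):

```rust
// Each peer advertises two port sets: the primary one for the currently
// running consensus, and an alternative one served by the read-only
// (legacy) instance after an upgrade.
#[allow(dead_code)]
struct PeerEndpoint {
    host: String,
    primary_port: u16,
    alternative_port: u16,
}

/// Pick which port to dial: if the peer reports a protocol version newer
/// than ours, it has already upgraded, so we (the old binary) should talk
/// to its read-only legacy instance on the alternative port.
fn port_for_peer(our_version: u32, peer_version: u32, ep: &PeerEndpoint) -> u16 {
    if peer_version > our_version {
        ep.alternative_port
    } else {
        ep.primary_port
    }
}
```

A not-yet-upgraded fedimintd-0.14 connecting to an upgraded peer would thus land on the legacy instance and could still sync up to the upgrade epoch.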

After guardians confirm that all peers are up to date, they can turn off the previous version and run just

fedimintd-0.15

It gives a very similar user (guardian) experience, while as fedimintd devs we only ever need to test all peers running the same version at a time.

For end users this will probably not work, but if we have an alternative set of ports for user RPCs as well, then users detecting an unsupported peer version can still try to connect to the old version and at least get some read-only responses and information about the upgrade. Though I still think clients will have to support a window of at least two versions at the same time to allow smooth upgrades.

@jkitman
Contributor

jkitman commented Dec 17, 2022

After reaching the upgrade epoch, fedimintd-0.14 would restart itself. On start it would detect that it is already at the upgrade epoch and begin operating in read-only mode on an alternative set of ports; it would also start fedimintd-0.15, which would take over normal operation.

Yeah, I was imagining something like that would be good to minimize downtime, since the guardians might not be able to manually coordinate the upgrade within a short time window.

@elsirion
Contributor

I don't think we should focus on uncoordinated upgrades too much yet. Being able to upgrade at all would be a huge improvement and is far lower complexity. Adding automatic switchover or even running two federations in parallel can build on top of that and would require far more work right now imo.

@jkitman
Contributor

jkitman commented Dec 26, 2022

@elsirion What do you think of this for the MVP upgrade:

  1. Guardians modify the consensus config to change version number or other consensus values
  2. Hash of config is sent as a consensus item ConsensusUpgrade(hash)
  3. Once a threshold of peers is signaling for the same upgrade, automatically stop consensus at that epoch and shut down
  4. If the version changed, or other items such as pubkeys, start the new binary and run any required DB migrations
  5. Guardians connect to each other with the new consensus config and validate with peers that it is correct

For version upgrades, we could eventually automate the steps with an upgrade script.
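Step 2 of the flow above could be sketched as follows (hypothetical: std's `DefaultHasher` stands in for whatever consensus-grade hash fedimint would actually use, and the config struct is reduced to a single field):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative stand-in for the consensus config; the real one carries
// peer keys, module configs, and other consensus-critical values.
#[derive(Hash)]
struct ConsensusConfig {
    version: u32,
}

/// Hash the proposed config so peers can signal for the *same* upgrade
/// via ConsensusUpgrade(hash) and detect any divergence in their copies.
fn config_hash(cfg: &ConsensusConfig) -> u64 {
    let mut h = DefaultHasher::new();
    cfg.hash(&mut h);
    h.finish()
}
```

Two guardians who modified their config copies identically produce the same hash; any disagreement in consensus values shows up as a different hash and the threshold is never reached.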

@elsirion
Contributor

@jkitman First modifying the config and then stopping consensus at a certain epoch doesn't work imo: you need the old config to keep running till that epoch. We don't need to automate this for now; we can just have an admin command that tells the fedimintd to shut down after epoch x:

  1. Every guardian tells their fedimintd to stop at epoch x (e.g. during epoch ~x-100 so that there's enough time)
  2. All fedimintds stop at epoch x
  3. Everyone upgrades the config and possibly fedimint binaries
  4. fedimintd is restarted, applies DB migrations and continues to run from epoch x+1 with new consensus rules
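The "stop at epoch x" admin command in steps 1-2 might be sketched as (names are illustrative, not the actual fedimintd admin API):

```rust
// Minimal model of a fedimintd that can be told to stop cleanly
// after a given epoch is finalized.
struct Fedimintd {
    shutdown_at_epoch: Option<u64>,
}

impl Fedimintd {
    /// Admin command: schedule a clean stop after `epoch`.
    fn schedule_shutdown(&mut self, epoch: u64) {
        self.shutdown_at_epoch = Some(epoch);
    }

    /// Checked at the top of the consensus loop before producing
    /// `next_epoch`; once it returns false, the daemon exits.
    fn should_run_epoch(&self, next_epoch: u64) -> bool {
        match self.shutdown_at_epoch {
            Some(stop) => next_epoch <= stop,
            None => true,
        }
    }
}
```

Because every guardian schedules the same epoch x well in advance, all peers halt at exactly the same consensus height even if the operators act minutes or hours apart.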

There will be downtime, but there's so much less that can go wrong. Specifying and testing a watertight version of your on-the-fly upgrade approach will be much more involved imo and could introduce quite some instability before we even have an MVP.

@jkitman
Contributor

jkitman commented Jan 3, 2023

@elsirion I meant modify a copy of the consensus config, rather than modify the existing config in-place. A script could replace the existing config with the modified one once enough guardians agree.

@jkitman
Contributor

jkitman commented Mar 9, 2023

@elsirion Was working on this and realized we don't really have an authenticated way for a guardian to send messages to their fedimintd server, do we?

@dpc
Contributor

dpc commented Mar 9, 2023

@elsirion Was working on this and realized we don't really have an authenticated way for a guardian to send messages to their fedimintd server, do we?

My previous attempt at that is #1171. I never had time to get back to it, but there's a discussion about the design there. @jkitman

@elsirion
Contributor

elsirion commented Mar 9, 2023

Ah, so we did post something on GH! I just didn't search in PRs … I opened #1841 to track it.

This was referenced Mar 14, 2023
@justinmoon
Contributor Author

justinmoon commented Nov 16, 2023

dev call: @joschisan says we might close this issue as "not planned". This would mean we can't make consensus changes in a backwards-compatible way without changing the minor consensus version (#3571).
