Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: Allow forcing processing an epoch #2318

Merged
merged 2 commits into from
Apr 26, 2023

Conversation

jkitman
Copy link
Contributor

@jkitman jkitman commented Apr 24, 2023

Fixes #2309

Note we upload the entire serialized epoch (which can be downloaded from the API), because it's both easier to get from the API and doesn't require additional P2P calls (we don't really know who to query).

@jkitman jkitman marked this pull request as ready for review April 24, 2023 20:50
@jkitman jkitman requested review from a team as code owners April 24, 2023 20:50
@codecov
Copy link

codecov bot commented Apr 24, 2023

Codecov Report

Patch coverage: 29.75% and project coverage change: -1.14 ⚠️

Comparison is base (88bec9e) 59.87% compared to head (4aec216) 58.74%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2318       /-   ##
==========================================
- Coverage   59.87%   58.74%   -1.14%     
==========================================
  Files         158      159        1     
  Lines       33438    34224      786     
==========================================
  Hits        20022    20105       83     
- Misses      13416    14119      703     
Impacted Files Coverage Δ
fedimint-bin-tests/src/main.rs 0.10% <0.00%> (-0.01%) ⬇️
fedimint-cli/src/lib.rs 5.24% <0.00%> (-0.26%) ⬇️
fedimint-core/src/admin_client.rs 86.36% <0.00%> (-10.25%) ⬇️
fedimint-core/src/encoding/mod.rs 88.54% <0.00%> (-2.15%) ⬇️
fedimint-server/src/consensus/mod.rs 88.03% <75.00%> ( 1.15%) ⬆️
fedimint-server/src/net/api.rs 91.54% <91.30%> (-0.05%) ⬇️
fedimint-server/src/consensus/server.rs 94.94% <92.30%> (-0.35%) ⬇️

... and 37 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@dpc
Copy link
Contributor

dpc commented Apr 24, 2023

LGTM

How will the flow look from the cli? Do we need another PR with fedimint-cli command to fetch the epoch from one peer, and then force the other to chew it? I guess the fetch part might be there already.

Comment on lines 467 to 476
async fn force_process_epoch(&mut self, outcome: EpochOutcome) {
let convert = ConsensusOutcomeConversion::from(outcome).0;
match self.process_outcome(convert).await {
Ok(_) => {}
Err(err) => warn!("Unable to force process epoch {:?}", err),
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also need to tell HBBFT that it can accept the new epoch now and do we need to ask peers to trigger more epochs to rejoin? Or is it assumed that after that API call the guardian node has to be restarted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realized we already handle that here https://github.com/fedimint/fedimint/blob/master/fedimint-server/src/consensus/server.rs#L334-L335

If an epoch is in the past, it won't be processed, if it's in the future we'll download missing epochs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for it though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this fixes the problem, this comment wasn't about downloading old epochs.

Let's suppose there f servers are permanently lost and t (A, B, C) are trying to restart but one of them (A) is at a wrong epoch:

  • A, B, C start HBBFT at their respective last epochs
  • A, B, C send rejoin requests and generate HBBFT messages
    • A sends message for epoch n
    • B, C send messages for epoch n 1
    • No consensus is found on the epoch, A stays on epoch n
  • No progress is made since only 2 of the required 3 are sending contributions for the correct epoch
  • A forces the processing of epoch n 1, but does not tell HBBFT about it, so no new messages are generated for epoch n 1 and HBBFT remains stuck

I think this is solved by restarting fedimintd afterwards, but ideally there'd be a more elegant solution.

@jkitman jkitman enabled auto-merge (squash) April 25, 2023 17:40
bitcoin.mine_blocks(100).await;
fed.run_consensus_epochs(1).await;

// We cannot process a past each and reverse the block height
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean epoch?

Comment on lines 467 to 476
async fn force_process_epoch(&mut self, outcome: EpochOutcome) {
let convert = ConsensusOutcomeConversion::from(outcome).0;
match self.process_outcome(convert).await {
Ok(_) => {}
Err(err) => warn!("Unable to force process epoch {:?}", err),
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this fixes the problem, this comment wasn't about downloading old epochs.

Let's suppose there f servers are permanently lost and t (A, B, C) are trying to restart but one of them (A) is at a wrong epoch:

  • A, B, C start HBBFT at their respective last epochs
  • A, B, C send rejoin requests and generate HBBFT messages
    • A sends message for epoch n
    • B, C send messages for epoch n 1
    • No consensus is found on the epoch, A stays on epoch n
  • No progress is made since only 2 of the required 3 are sending contributions for the correct epoch
  • A forces the processing of epoch n 1, but does not tell HBBFT about it, so no new messages are generated for epoch n 1 and HBBFT remains stuck

I think this is solved by restarting fedimintd afterwards, but ideally there'd be a more elegant solution.

@jkitman
Copy link
Contributor Author

jkitman commented Apr 25, 2023

@elsirion Right, forgot about that point. See updated code. I need a way to simulate a server restart more easily I think.

@douglaz
Copy link
Contributor

douglaz commented Apr 25, 2023

I tried running on this branch:

❯ fedimint-cli --password pass0  api status --peer-id 0
{
  "error": "CliError",
  "kind": "GeneralFailure",
  "message": "RPC call failed: ErrorObject { code: ServerError(401), message: \"Request missing required authorization\", data: None }",
  "raw_error": "RPC call failed: ErrorObject { code: ServerError(401), message: \"Request missing required authorization\", data: None }"
}

Was this supposed to work?

@jkitman
Copy link
Contributor Author

jkitman commented Apr 25, 2023

@douglaz I've switch to using env vars instead of params to make it easier to call the admin API.

Try setting FM_SALT_PATH, FM_PASSWORD, and FM_OUR_ID

Copy link
Contributor

@elsirion elsirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now :)

@jkitman jkitman merged commit d70b0cc into fedimint:master Apr 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Manual recovery option for failed federations
4 participants