
Fix inconsistent state between WAL and saved Snapshot #3584

Open · wants to merge 2 commits into base: release-2.2

Conversation


@zghh zghh commented Aug 11, 2022

Type of change

  • Bug fix

Description

This is a bug that originates in etcd.

When an orderer has not participated in consensus for some time and crashes while writing the latest snapshot to disk, a panic occurs after the restart.

This bug has been fixed in etcd, but not in Fabric.

This PR ports the fix to Fabric.

@@ -80,7 +80,7 @@ func TestOpenWAL(t *testing.T) {
for i := 0; i < 10; i++ {
store.Store(
Contributor
I don't understand how this test demonstrates the problem you describe.

Can you reproduce the problem in a unit test?

@@ -395,3 +395,82 @@ func TestApplyOutOfDateSnapshot(t *testing.T) {
assertFileCount(t, 12, 1)
})
}

func TestAbortWhenWritingSnapshot(t *testing.T) {
Author
I added a unit test to reproduce the problem.

Contributor
Thanks but I meant a unit test that uses an instance of a Fabric Raft chain, in orderer/consensus/etcdraft/chain.go, not a unit test that uses pure etcd.io/raft packages.

We need to be sure that a Fabric Raft chain instance can encounter the aforementioned problem, so that:

  1. We know we really have a problem, because it might be that Fabric sidesteps this problem via some mechanism even though etcd has it.
  2. If the problem reappears at a later point due to a code change, a unit test will notify us.

Author
The problem can be reproduced with the following steps:

  1. Orderer A has not participated in consensus for a while.

  2. Other orderers generate new blocks in consensus.

  3. Other orderers generate a new snapshot.

  4. Orderer A returns to normal operation and receives the new snapshot from the other orderers.

  5. Orderer A persists the new snapshot but crashes before calling rs.saveSnap(snapshot).

  6. Orderer A restarts.

@zghh zghh requested a review from yacovm August 11, 2022 12:21
Contributor

yacovm commented Aug 11, 2022

@Param-S @jiangyaoguo what is your opinion on this?

Author

zghh commented Aug 16, 2022

What's the progress for this PR?

Contributor

yacovm commented Aug 16, 2022

What's the progress for this PR?

@Param-S and @tock-ibm are looking at it, please standby :)

if err := rs.wal.Save(hardstate, entries); err != nil {
return err
}

if !raft.IsEmptySnap(snapshot) {
if err := rs.saveSnap(snapshot); err != nil {
Contributor

@Param-S Param-S Aug 18, 2022
I think we need to swap the order of writing the WAL snapshot entry and the snapshot file (needs a change in saveSnap), as is done in etcdserver: https://github.com/etcd-io/etcd/blob/6c2f5dc78af6b6970d48cecaac515c58a91efca8/server/storage/storage.go#L66

Contributor

Param-S commented Aug 18, 2022

Able to recreate the issue by terminating the orderer process just after saving walsnap entries and before saving snap file.

func (rs *RaftStorage) saveSnap(snap raftpb.Snapshot) error {

	if err := rs.wal.SaveSnapshot(walsnap); err != nil {
		return errors.Errorf("failed to save snapshot to WAL: %s", err)
	}

	// terminate the process here

	if err := rs.snap.SaveSnap(snap); err != nil {
		return errors.Errorf("failed to save snapshot to disk: %s", err)
	}

2022-08-18 04:04:19.431 PDT 031a PANI [orderer.consensus.etcdraft] loadState -> 5 state.commit 15 is out of range [0, 3] channel=test-system-channel-name node=5
> [unrecovered-panic] runtime.fatalpanic() /usr/local/go/src/runtime/panic.go:1065 (hits goroutine(1):1 total:1) (PC: 0x441b00)
Warning: debugging optimized function
	runtime.curg._panic.arg: interface {}(string) "5 state.commit 15 is out of range [0, 3]"

I will continue to go through the PR changes more and update here.

Author

zghh commented Sep 1, 2022

What's the progress? @Param-S

Contributor

Param-S commented Sep 2, 2022

I could not spend time on this last week. I will work on it in the next couple of days and confirm.

Author

zghh commented Oct 17, 2022

What's the progress for this PR?

Contributor

yacovm commented Oct 24, 2022

What's the progress for this PR?

I'm sorry, but we're just too busy to review it.
We will get there eventually but unfortunately I cannot commit to a deadline.

@denyeart
Contributor

@Mergifyio rebase


mergify bot commented Oct 28, 2022

rebase

✅ Branch has been successfully rebased

@denyeart
Contributor

Trying to get checks re-triggered... let me try to Close and Re-open.

@denyeart denyeart closed this Oct 28, 2022
@denyeart denyeart reopened this Oct 28, 2022
Author

zghh commented Feb 15, 2023

What's the progress for this PR? @Param-S

@denyeart
Contributor

@zghh @Param-S

This one has been stale for some time, what is the status of it?

A few questions that may help to clarify the severity:

Why was this opened against release-2.2 instead of the main branch? Is the problem already resolved on main and release-2.5, given the upgrade of etcdraft in those branches?

What is the overall impact after the problem occurs? The Description says a "panic error occurs after the restart". Will the panic recur on every subsequent restart? Is there any way to recover, or must the orderer node be abandoned and a new one created to replace it?

Has this problem been observed in practice?
