Still return continuous WAL entries when running into ErrSliceOutOfRange #19095

Merged
ahrtr merged 1 commit into etcd-io:main from wal_20241221 on Jan 8, 2025

Conversation

ahrtr
Member

@ahrtr ahrtr commented Dec 21, 2024

@ahrtr
Member Author

ahrtr commented Dec 21, 2024

Confirmed that this PR can fix the error in #19038 (comment). @siyuanfoundation please let me know if you can still reproduce it in your environment.

codecov bot commented Dec 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.71%. Comparing base (40b856e) to head (152de1f).
Report is 24 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
server/storage/wal/wal.go 57.88% <100.00%> (ø)

... and 24 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19095      +/-   ##
==========================================
- Coverage   68.77%   68.71%   -0.06%     
==========================================
  Files         420      420              
  Lines       35642    35642              
==========================================
- Hits        24513    24492      -21     
- Misses       9703     9719       16     
- Partials     1426     1431        5     

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 40b856e...152de1f.

@siyuanfoundation
Contributor

I can confirm this fixes the failure in #19038. Thank you @ahrtr !

@ahrtr
Member Author

ahrtr commented Dec 24, 2024

I can confirm this fixes the failure in #19038. Thank you @ahrtr !

Thanks for the confirmation.

Can we get this merged first? PTAL cc @serathius

@ahrtr
Member Author

ahrtr commented Dec 24, 2024

cc @fuweid @ivanvc @jmhbnz

@siyuanfoundation
Contributor

siyuanfoundation commented Jan 3, 2025

After syncing my repo, I found that the robustness test still fails even with this fix. Because validatePersistedRequestMatchClientRequests requires the lastOp to be persisted, partial WAL entries do not satisfy this check.
I got this error:

last succesful client write {"Type":"txn","LeaseGrant":null,"LeaseRevoke":null,"Range":null,"Txn":{"Conditions":null,"OperationsOnSuccess":[{"Type":"put-operation","Range":{"Start":"","End":"","Limit":0},"Put":{"Key":"tombstone","Value":{"Value":"true","Hash":0},"LeaseID":0},"Delete":{"Key":""}}],"OperationsOnFailure":null},"Defragment":null,"Compact":null} was not persisted, required to validate

@ahrtr
Member Author

ahrtr commented Jan 4, 2025

After syncing my repo, I just found the robustness test still fails even with this fix. Because validatePersistedRequestMatchClientRequests requires the lastOp to be persisted, partial WAL entries would not work for this check. I got the error of:

last succesful client write {"Type":"txn","LeaseGrant":null,"LeaseRevoke":null,"Range":null,"Txn":{"Conditions":null,"OperationsOnSuccess":[{"Type":"put-operation","Range":{"Start":"","End":"","Limit":0},"Put":{"Key":"tombstone","Value":{"Value":"true","Hash":0},"LeaseID":0},"Delete":{"Key":""}}],"OperationsOnFailure":null},"Defragment":null,"Compact":null} was not persisted, required to validate

@siyuanfoundation how often did you see this error? Or in other words, is it easy to reproduce this error?

If I understand correctly, the robustness test error means that the last client write had already received a successful response but was not persisted in the WAL file. Please let me know if I misunderstood.

Each time we see an issue, the first step is to figure out whether it is a real issue from the end user's perspective. Can you manually double-check whether the last successful client write was persisted in the WAL files of a majority of members, and also in the bbolt db?

Also, I see that the robustness test might not process the WAL records correctly: the longest one might not be the correct one. As long as WAL records are not committed yet, they may be overwritten by subsequent WAL records.

if len(memberRequests) > len(persistedRequests) {
	persistedRequests = memberRequests
}

I regard it as a test issue for now, please raise a separate issue to track it. Thanks.
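
To illustrate the concern about length-based selection, here is a minimal sketch in Go. The `request` type, the `pickLongest` function, and the sample member logs are hypothetical; they only mirror the shape of the snippet quoted above and are not the robustness test's actual code.

```go
package main

import "fmt"

// request is a hypothetical stand-in for a request decoded from a WAL entry.
type request struct {
	Index uint64
	Term  uint64
	Key   string
}

// pickLongest mirrors the length-based selection quoted above:
// the member with the most entries wins, regardless of what was committed.
func pickLongest(members [][]request) []request {
	var persisted []request
	for _, memberRequests := range members {
		if len(memberRequests) > len(persisted) {
			persisted = memberRequests
		}
	}
	return persisted
}

func main() {
	// Member A: a former leader that still holds two uncommitted entries from term 2.
	memberA := []request{{1, 1, "a"}, {2, 1, "b"}, {3, 2, "stale-1"}, {4, 2, "stale-2"}}
	// Members B and C: the majority, which committed index 3 in term 3.
	memberB := []request{{1, 1, "a"}, {2, 1, "b"}, {3, 3, "committed"}}
	memberC := []request{{1, 1, "a"}, {2, 1, "b"}, {3, 3, "committed"}}

	chosen := pickLongest([][]request{memberA, memberB, memberC})
	fmt.Println(len(chosen), chosen[2].Key) // prints: 4 stale-1
}
```

In this toy scenario the longest log belongs to the member that still holds uncommitted entries, so a length-only comparison selects data that the majority never committed.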

@serathius
Member

Also, I see that the robustness test might not process the WAL records correctly: the longest one might not be the correct one. As long as WAL records are not committed yet, they may be overwritten by subsequent WAL records.

You are right that normally the longest WAL does not necessarily include the longest commit sequence. However, in the robustness test we explicitly make a single additional transaction after the test is finished; this should ensure that there are no other uncommitted transactions. We require the transaction to succeed and later use it to assert that the WAL is complete.

Member

@serathius serathius left a comment

Change looks good, however I haven't validated how it works with repair.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@siyuanfoundation
Contributor

After syncing my repo, I just found the robustness test still fails even with this fix. Because validatePersistedRequestMatchClientRequests requires the lastOp to be persisted, partial WAL entries would not work for this check. I got the error of:

last succesful client write {"Type":"txn","LeaseGrant":null,"LeaseRevoke":null,"Range":null,"Txn":{"Conditions":null,"OperationsOnSuccess":[{"Type":"put-operation","Range":{"Start":"","End":"","Limit":0},"Put":{"Key":"tombstone","Value":{"Value":"true","Hash":0},"LeaseID":0},"Delete":{"Key":""}}],"OperationsOnFailure":null},"Defragment":null,"Compact":null} was not persisted, required to validate

@siyuanfoundation how often did you see this error? Or in other words, is it easy to reproduce this error?

If I understand correctly, the robustness test error means that the last client write had already received a successful response but was not persisted in the WAL file. Please let me know if I misunderstood.

Each time we see an issue, the first step is to figure out whether it is a real issue from the end user's perspective. Can you manually double-check whether the last successful client write was persisted in the WAL files of a majority of members, and also in the bbolt db?

Also, I see that the robustness test might not process the WAL records correctly: the longest one might not be the correct one. As long as WAL records are not committed yet, they may be overwritten by subsequent WAL records.

if len(memberRequests) > len(persistedRequests) {
	persistedRequests = memberRequests
}

I regard it as a test issue for now, please raise a separate issue to track it. Thanks.

This is the same underlying error as the one below, which I saw before this PR:

failed to read WAL, cannot be repaired, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 7931, Term: 4], len(ents): 7189

I can reproduce the error with the MemberDowngrade failpoint at least 10% of the time.
There does not seem to be any problem with the member data, although in my local tests I cannot find all the WAL files even with --max-wals=0 --max-snapshots=0.
The top of the log dump looks like this:

Snapshot:
term=4 index=14302 nodes=[ac4ec652f10e5b49 bf19ae4419db00dc eabdbb777cf498cb] confstate={"voters":[12416079282240904009,13770228943176794332,16914881897345358027],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=eabdbb777cf498cb clusterID=b3bc0c1919fe5d7e term=4 commitIndex=14526 vote=eabdbb777cf498cb
WAL entries: 225
lastIndex=14527
term         index      type    data
   4         14303      norm    header:<

@ahrtr
Member Author

ahrtr commented Jan 8, 2025

@siyuanfoundation I am a little confused; I probably did not explain it clearly.

There are two errors. One is #19038 (comment), and it's already confirmed that this PR can fix it. Please let me know if you can still see the error with the patch included in this PR.

The second error is #19095 (comment). A successful client write must have been persisted in the WAL files of at least a majority of the members, and probably also in the bbolt DB. This is exactly what I was asking you to double-check manually, as mentioned in #19095 (comment).

Also

You are right that normally the longest WAL does not necessarily include the longest commit sequence. However, in the robustness test we explicitly make a single additional transaction after the test is finished; this should ensure that there are no other uncommitted transactions. We require the transaction to succeed and later use it to assert that the WAL is complete.

@serathius thx for the clarification. But theoretically it's still possible that the longest one isn't the correct one. The single additional successful transaction you mentioned only guarantees that a majority of members have the correct WAL data.

Also note that the current robustness test always reads WAL data starting from a snapshot {0, 0}, as mentioned in #19038 (comment). So if there is a gap in the WAL file of each member, as mentioned in #19038 (comment) and #19038 (comment), you will definitely see the error last succesful client write .... was not persisted, required to validate. It's a test issue which needs to be resolved.
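
The gap scenario can be sketched as follows, assuming a heavily simplified reader. The `entry` type, `readContinuous`, and `buildGappedLog` are hypothetical names introduced for illustration and are not the code in server/storage/wal/wal.go; the indexes (snapshot index 0, 7189 continuous entries, next entry at index 7931) are taken from the error message quoted earlier in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errSliceOutOfRange mirrors the "wal: slice bounds out of range" text seen in this thread.
var errSliceOutOfRange = errors.New("wal: slice bounds out of range")

// entry is a hypothetical, simplified stand-in for a raft log entry.
type entry struct {
	Index uint64
	Term  uint64
}

// readContinuous replays the records that come after the snapshot index start.
// When it hits a gap, it returns the continuous prefix read so far together
// with the error, instead of discarding that prefix.
func readContinuous(start uint64, records []entry) ([]entry, error) {
	var ents []entry
	for _, e := range records {
		if e.Index <= start {
			continue // already covered by the snapshot
		}
		offset := e.Index - start - 1
		if offset > uint64(len(ents)) {
			return ents, fmt.Errorf("%w, snapshot[Index: %d], current entry[Index: %d, Term: %d], len(ents): %d",
				errSliceOutOfRange, start, e.Index, e.Term, len(ents))
		}
		// A record whose index was already seen overwrites the uncommitted tail.
		ents = append(ents[:offset], e)
	}
	return ents, nil
}

// buildGappedLog is a hypothetical helper: entries 1..7189, then a jump to 7931.
func buildGappedLog() []entry {
	var records []entry
	for i := uint64(1); i <= 7189; i++ {
		records = append(records, entry{Index: i, Term: 4})
	}
	return append(records, entry{Index: 7931, Term: 4})
}

func main() {
	ents, err := readContinuous(0, buildGappedLog())
	fmt.Println(len(ents), errors.Is(err, errSliceOutOfRange)) // prints: 7189 true
}
```

Reading from snapshot {0, 0} yields only the first 7189 entries here, so any request stored after the gap, including the final write, would be missing from what the reader returns.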

FYI, the last successful client write:

// Ensure that last operation succeeds
_, err = cc.Put(ctx, "tombstone", "true")
require.NoErrorf(t, err, "Last operation failed, validation requires last operation to succeed")

{
  "Type": "txn",
  "LeaseGrant": null,
  "LeaseRevoke": null,
  "Range": null,
  "Txn": {
    "Conditions": null,
    "OperationsOnSuccess": [
      {
        "Type": "put-operation",
        "Range": {
          "Start": "",
          "End": "",
          "Limit": 0
        },
        "Put": {
          "Key": "tombstone",
          "Value": {
            "Value": "true",
            "Hash": 0
          },
          "LeaseID": 0
        },
        "Delete": {
          "Key": ""
        }
      }
    ],
    "OperationsOnFailure": null
  },
  "Defragment": null,
  "Compact": null
}

@ahrtr
Member Author

ahrtr commented Jan 8, 2025

Let me merge this PR first, since it's already confirmed that it can resolve the first error. The second error should be a test issue.

@ahrtr ahrtr merged commit 00e5b65 into etcd-io:main Jan 8, 2025
34 checks passed
@ahrtr ahrtr deleted the wal_20241221 branch January 8, 2025 14:15
@siyuanfoundation
Contributor

@siyuanfoundation I am a little confused, probably I did not say it clearly.

There are two errors. One is #19038 (comment), and it's already confirmed that this PR can fix it. Please let me know if can still see the error with the patch included in this PR.

The statement "it's already confirmed that this PR can fix it" is no longer true. Before applying this PR, I was seeing the WAL error. With this PR, the same downgrade robustness test no longer passes as it did when I tested before; instead, the error message changed to not finding the last commit. The two errors are the same under the hood, because both come from missing WAL entries in the persisted file.

@ahrtr
Member Author

ahrtr commented Jan 8, 2025

but changed the error msg to not finding the last commit. The two errors are the same under the hood because of missing WAL entries in the persisted file.

They are not the same error. Even without this PR, the robustness test still has the second error. The reason you did not see it before is that it was hidden by the first error. As my previous comment explains, the way the robustness test reads the WAL is wrong.

@ahrtr
Member Author

ahrtr commented Jan 8, 2025

Just raised #19147

4 participants