Recovery process can result in old snapshot getting sent to node #1717
@dwco-z - this may not be solvable through code, and may require some changes to the Recovery process steps.
@otoolep do you mean maybe I should adjust the way I'm doing the recovery? e.g., adding some random timeout during each node's startup to reduce the chances of the race condition?
Not sure, perhaps blow away the snapshots in every node's directory except one. I'm not advocating this right now; I need to look into the code more.
@dwco-z -- maybe you can run the latest code, which will be available once this build completes: https://ci.appveyor.com/project/otoolep/rqlite/builds/49333481. I couldn't reproduce the issue locally, so your help would be good. Re-run your test, but with
Sure, will do
OK, as I suspected, there is more snapshot reaping going on. It's not immediately clear to me how to solve this issue. I'll need to think about it.
A couple of ideas, for my own reference:
OK, any time you need me to test changes, just let me know. I believe you couldn't reproduce the bug because of some timing difference, as you mentioned before.
Thanks, no surprises, what I expected is happening.
Sure
I've actually merged the PR to master, because until we get to the bottom of this, getting an older snapshot breaks rqlite. The Raft paper does reference this possibility (and suggests ways to deal with it), but either the Raft layer itself is sending it incorrectly, or rqlite needs to handle it better. I want to know which it is before removing this check.
That makes sense. As soon as AppVeyor completes I'll send you the logs.
OK, thanks, logs are good to see. However, this issue is quite complex, with a lot of moving parts. Unless I can get it to happen locally, it's going to be difficult to solve. It's very timing-dependent, and I don't know if it's a bug in the Raft code, a bug in rqlite code, or just that recovering a cluster like this is at the edge of what Raft and rqlite support (Recovery is not something that is described in the Raft paper). I'll continue to see if I can reproduce this locally, but unless I do, I do not expect to solve this anytime soon. A more robust Recovery process may simply be to always recover one node first, and then join new nodes to it.
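That "recover one node first, then join the others" idea can be sketched as a plan like the one below. This is only an illustration of the sequencing; the step descriptions and the peers-file detail are assumptions, not rqlite's documented recovery commands:

```python
def recovery_plan(nodes):
    """Recover only the first node, then join the others to it one at a
    time, instead of recovering all nodes concurrently (the concurrent
    recoveries are what race against each other)."""
    first, *rest = nodes
    # Step 1: recover a single node alone, e.g. with a peers file that
    # lists only itself (hypothetical detail, for illustration).
    steps = [f"recover {first} on its own, with a peers file listing only {first}"]
    # Step 2: bring the remaining nodes up with clean state and join them.
    steps += [f"wipe {n}'s state and join it to {first} as a new node" for n in rest]
    return steps

for step in recovery_plan(["node1", "node2", "node3"]):
    print(step)
```

Because only one node ever performs recovery, there is never a second recovered snapshot in flight to race with.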
@dwco-z -- when you run your script, does it exit automatically when the issue happens? Or are you just watching the logs?
It exits automatically.
I'll try running on my Linux machine with Python 3.8 to see if I can reproduce it there.
I pushed some changes to rqlite-throubleshooting to make it compatible with Linux, and I was able to hit the issue on a Linux Mint machine using the latest release, rqlite-v8.22.2-linux-amd64.tar.gz.
Great, it's happening for me now with your latest script.
Took about 1-2 minutes.
Nice!
Sometimes the script is exiting, but the logs don't show the issue. Just an FYI.
The logs are helping to understand this issue. I'm not sure yet, but the RecoverNode documentation (which I got from Consul) does state that uncommitted entries and data loss can result from this process. When the issue happens I am seeing confusing elections taking place, and the logs are not entirely clear -- it could be the side-effects of this process. It may be that the RecoverNode process is not entirely bulletproof, and is a best-effort thing. Even the Raft docs say they've had issues in the past with this area of the code.

I will continue to think about this, and have some ideas for a fix, but I'm reluctant to make changes at this time that affect the core of rqlite for an edge case (which may even be a best-effort thing). The only truly robust way to recover a cluster that has suffered a big loss is to take the SQLite database out from under a node (or use a backup) and rebuild the cluster.
Judging from node3.log, it seems node3 received an interrupt signal ([rqlited] 2024/03/06 15:30:07 received signal "interrupt"), but for some reason it couldn't stop, because it seems to be trying to contact the other two nodes, which had already stopped. So 30 seconds later, at 15:30:37, RqliteManager.stop_rqlited times out and falls back to process.terminate(), which for some reason didn't stop everything until the next start, so the port was still being held. I guess changing to process.kill() might avoid this.
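The escalation described here (ask politely, wait, then force-kill) can be sketched as follows. `stop_process` is a hypothetical stand-in for the script's `RqliteManager.stop_rqlited`, not its actual code:

```python
import subprocess

def stop_process(process: subprocess.Popen, timeout: float = 30.0) -> None:
    """Ask the process to exit, then escalate to a forced kill.

    terminate() sends SIGTERM, which a process may handle (and stall on,
    e.g. while trying to contact peers during shutdown); kill() sends
    SIGKILL, which cannot be caught, so the process -- and the port it
    holds -- is released immediately.
    """
    process.terminate()
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()

# Demo with a long-running process; sleep exits promptly on SIGTERM.
p = subprocess.Popen(["sleep", "60"])
stop_process(p, timeout=2.0)
print(p.returncode is not None)  # True: the process is gone either way
```

Note that `process.kill()` gives rqlited no chance to shut down cleanly, so it simulates a hard crash rather than a graceful stop.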
One thing I find odd is that, although snapshotting and readyz?sync are two completely different concepts, as you already mentioned, I cannot reproduce without the sync flag (I tried 260 times), while if I make other changes, like inserting a random timeout before each node's start, I can still hit the issue eventually.
Sync simply changes the timing of the whole test substantially; that is my suspicion for the difference. Sync absolutely is a read-only operation, and does not change the operation of a node.
So this is what is happening, according to logs from my latest run of this:
So basically all 3 nodes decide to start an election fairly quickly. Then the nodes are stopped, and a recovery process is triggered. As part of the Recovery process, each node creates a snapshot.
Node 2 then triggers an election, bumps its term to 5, and wins with votes from itself and Node 3 (Node 1 hasn't finished its recovery, so doesn't vote). Node 2 starts replicating logs to the other nodes. And then we're stuck.

This is because rqlite assumes that any snapshot received from another node, with instructions to install that snapshot, will be from an index and term at least equal to that of the latest snapshot the receiving node has (this was made explicit on master recently, but was implicit in the snapshot-reaping logic). This is not true on Node 3.

One solution to this issue is to remove this assumption and allow "older" snapshots to be installed. To allow this, the snapshot-reaping process would need to be changed so that it can reap snapshots both older and newer (in terms of index and term) than the most recently installed snapshot (today it just reaps everything but the "newest in index and term", which wipes out the most recently installed snapshot).

The Raft paper indicates that this should be OK, because it is Node 2 that is sending the snapshot, Node 2 is Leader, and the Raft paper says that Node 2 couldn't have become Leader in the first place if it didn't have all the log entries that every other node had too (section 5.4.1). So in that sense it should be "safe" and no data should be lost.

I'll think about this more, but I think snapshot reaping needs to be changed from "reap all but the latest snapshot" to "reap all but the most recently installed snapshot". This should solve this problem.
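The difference between the two reaping rules can be illustrated with a toy model (this is not rqlite's actual snapshot store; the field names, and the specific index/term values, are made up for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    index: int
    term: int
    installed_at: float  # wall-clock time the snapshot was installed locally

def reap_keep_newest(snapshots):
    """Today's rule: keep only the snapshot with the highest (term, index).
    A snapshot installed from the Leader that is "older" in Raft terms
    gets reaped -- wiping out the most recently installed snapshot."""
    keep = max(snapshots, key=lambda s: (s.term, s.index))
    return [s for s in snapshots if s is not keep]

def reap_keep_most_recently_installed(snapshots):
    """Proposed rule: keep the snapshot installed most recently, even if its
    (term, index) is lower than another snapshot on disk."""
    keep = max(snapshots, key=lambda s: s.installed_at)
    return [s for s in snapshots if s is not keep]

# Node 3's situation, roughly: it holds a local snapshot, then the Leader
# installs a snapshot with a lower index and term.
local  = Snapshot(index=10, term=4, installed_at=100.0)
leader = Snapshot(index=8,  term=2, installed_at=200.0)

print(reap_keep_newest([local, leader]))                   # reaps the Leader's snapshot
print(reap_keep_most_recently_installed([local, leader]))  # reaps the stale local one
```

Under the proposed rule, the Leader-installed snapshot survives reaping even though its index and term are lower, which is exactly what the InstallSnapshot path needs here.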
Oh, now I can see why the timing is relevant in leading to this series of events. Yeah, that makes sense.
Yeah, very timing-dependent; that's why it takes a few attempts. I can't rush into fixing this, however. While in principle it's fairly easy to see a possible fix, the hard part is dealing with crash scenarios. rqlite must handle crashing at any time, particularly while taking a snapshot. There is important logic that rqlite runs at start-up, checking for inconsistencies. That logic today assumes that the most recent snapshot it finds is the desired one. That assumption could be changed by any fix for this particular issue -- and that might involve even more complicated consistency-checking code, which can result in new bugs. And crashing during snapshotting is much more likely to happen than this issue.
By crashing I also mean "exiting hard unexpectedly", perhaps due to
I've updated the node-recovery docs due to this issue.
@dwco-z -- if my reasoning is correct here, I suspect you'll see much more robust behavior if you increase the election timeout; maybe try 5 seconds. We want to reduce the rate of elections during the recovery process.
@dwco-z -- correction: increase the election timeout for 2 out of the 3 nodes, thereby ensuring one node will be elected much faster than the other two. Make two of the nodes, say, 10 seconds, and leave the other at 2 seconds. Yes, this is a lot of special casing, but Node Recovery, by its nature, is meant to be something done under very controlled conditions.
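Concretely, the staggering might be wired into a test harness like this. A sketch only: the flag name follows the raft-timeout setting mentioned in this thread, but the exact rqlited command line, and the helper itself, are assumptions:

```python
# Stagger election timeouts so exactly one node can win the first election
# quickly, while the other two hold back and simply grant it their votes.
ELECTION_TIMEOUTS = {
    "node1": "2s",   # fast: intended to become Leader first
    "node2": "10s",  # slow: should still be waiting when node1 campaigns
    "node3": "10s",
}

def build_node_args(node_id: str) -> list:
    # Hypothetical command line; the real flag names used by the script
    # and by rqlited may differ.
    return [
        "rqlited",
        f"-raft-timeout={ELECTION_TIMEOUTS[node_id]}",
        f"data/{node_id}",
    ]

print(build_node_args("node1"))
```

The asymmetry is the point: if all three nodes share one timeout, they campaign at roughly the same moment, which is the race this thread is chasing.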
I understand your point of view. I updated the rqlite-throubleshooting code with your recommendation of having one of the nodes use a smaller raft-timeout (2s) while the others use a higher timeout (10s), but I couldn't notice any difference in stability; it seems to hit the issue with roughly the same frequency. I'll try different settings to see if it improves.
That's interesting. I would definitely be interested in seeing the logs from such a setup, where the problem happens. The logs should show clearly that the other 2 nodes take much longer to trigger an election.
Can you try bumping the timeout to 30s on two of the nodes? Run the test again, and send me the logs if you hit it.
OK, thanks. Well, clearly I don't fully understand the innards of some of the Raft code, because that flag is not having the effect I want, at least not for the very first election. The node appears to be going straight to candidate. I'll just need to keep digging into the Raft code.
Started with #1711
This may be another race between nodes coming up during the Recovery process and pushing various snapshots to each other, with no guarantees about when they arrive and are installed -- and the reaping removing older snapshots.