RFC for supporting checkpoints/clones #278
base: main
Conversation
rfcs/0004-checkpoints.md
## Current Implementation

SlateDB runs a Writer, a Compactor - optionally in a separate process, and a Garbage Collector (GC) - also optionally in a separate process. The writer writes data to L0 SSTs. The Compactor then compacts these L0 SSTs into Sorted Runs (SRs) under the “compacted” directory, and then further compacts those SRs as they accumulate. Updates to the current set of L0s and Sorted Runs are synchronized between the Writer and Compactor through updates to the centralized manifest. When reading, the Writer gets a reference to a copy of the metadata for the current set of sorted runs and searches them for the requested data. Meanwhile, the GC scans the SSTs in Object Storage and deletes any SSTs that are not referenced by the current manifest and are older than some threshold.
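For orientation, here is a minimal sketch of the state that the Writer and Compactor synchronize through the manifest. The type and field names below are illustrative assumptions, not SlateDB's actual definitions.

```rust
// Illustrative sketch only; names are assumptions, not SlateDB's real types.
struct SstHandle {
    id: u64,
    path: String,
}

struct SortedRun {
    id: u32,
    ssts: Vec<SstHandle>,
}

// The Writer appends to `l0`; the Compactor folds L0 SSTs into `compacted`
// sorted runs. Both publish their view by writing a new manifest version.
// The GC deletes SSTs that are not referenced by the current manifest once
// they are older than a threshold.
struct ManifestState {
    l0: Vec<SstHandle>,
    compacted: Vec<SortedRun>,
    writer_epoch: u64,
    compactor_epoch: u64,
}
```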
Nit: the GC currently deletes SSTs that are older than a threshold; there is an issue (#204) tracking deleting an SST only once it has been inactive for the threshold duration. This aspect is relevant for checkpoints.
Ack. I think we can actually build on this proposal to implement the functionality proposed in #204. Left a comment there.
In this RFC, we propose adding checkpoints and clones to address this limitation.
- Clients will be able to establish cheap db checkpoints that reference the current version of SlateDB’s manifest.
- Writers will use checkpoints to “pin” a version of the database for reads and prevent the GC from deleting the SSTs.
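As a rough sketch, a checkpoint record in the manifest might carry something like the following; the field names here are assumptions rather than the final schema.

```rust
// Hypothetical shape of a checkpoint entry; the RFC defines the actual schema.
struct Checkpoint {
    id: u128,                   // handle used to refresh or delete the checkpoint
    manifest_id: u64,           // the manifest version this checkpoint pins
    expire_time_s: Option<u32>, // wallclock expiry in seconds; None = no expiry
}
```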
What would be the intention for writers to pin a version?
The intent was to avoid the GC deleting the SSTs used to serve reads. The GC section of the doc details this - the RFC proposes that the GC will be extended to look for SSTs that are referenced by a checkpoint and to not delete those SSTs. By maintaining a checkpoint we prevent the SSTs in the writer's current view of the db from being deleted.
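A sketch of the retention rule described in this reply; the helper below is illustrative only, not the GC's actual code.

```rust
use std::collections::HashSet;
use std::time::Duration;

// Illustrative only: an SST may be deleted when it is old enough AND it is
// referenced neither by the current manifest nor by any checkpointed manifest.
fn may_delete_sst(
    sst_id: u64,
    sst_age: Duration,
    min_age: Duration,
    current_manifest_ssts: &HashSet<u64>,
    checkpointed_ssts: &HashSet<u64>,
) -> bool {
    sst_age >= min_age
        && !current_manifest_ssts.contains(&sst_id)
        && !checkpointed_ssts.contains(&sst_id)
}
```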
rfcs/0004-checkpoints.md
…
}
/// Called to destroy the database at the given path. If `soft` is true, this method will set the `destroyed_at_s` field in the manifest. The GC will clean up the db after some time has passed and all checkpoints have either been deleted or expired. As part of cleaning up the db, the GC will also remove the database’s checkpoint from the source database. If `soft` is false, all cleanup will be performed by the call to this method, and the destroy will return `SlateDbError::InvalidDeletion` if there are any remaining non-expired checkpoints.
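A hedged sketch of the flow the doc comment describes; only `SlateDbError::InvalidDeletion` comes from the text, and the helper functions are hypothetical stubs.

```rust
// Sketch only: not SlateDB's actual API. The stubs stand in for manifest and
// object-store operations.
#[derive(Debug)]
enum SlateDbError {
    InvalidDeletion,
}

fn destroy(path: &str, soft: bool) -> Result<(), SlateDbError> {
    if soft {
        // Soft delete: set destroyed_at_s in the manifest; the GC finishes
        // cleanup once all checkpoints are deleted or expired, and also removes
        // this db's checkpoint from its source database.
        mark_destroyed_at(path);
        Ok(())
    } else if has_unexpired_checkpoints(path) {
        // A hard delete is refused while non-expired checkpoints remain.
        Err(SlateDbError::InvalidDeletion)
    } else {
        // Hard delete: perform all cleanup in this call.
        delete_all_objects(path);
        Ok(())
    }
}

// Hypothetical stubs so the sketch compiles.
fn mark_destroyed_at(_path: &str) {}
fn has_unexpired_checkpoints(_path: &str) -> bool {
    false
}
fn delete_all_objects(_path: &str) {}
```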
How would we know that there isn't any other app using the checkpoint in the source db?
I'm assuming you mean another db instance? This is covered in the clone section, but when initializing a clone the new db will leave its own checkpoint that uses the same manifest id as the checkpoint used to create the clone. This checkpoint would ensure this db does not get deleted.
rfcs/0004-checkpoints.md
```
Usage: slatedb --path <PATH> create-checkpoint [OPTIONS]

Options:
  -l, --lifetime <LIFETIME>  Optionally specify a lifetime for the created checkpoint. You can specify the lifetime in a human-friendly format that uses years/days/min/s, e.g. "7days 30min". The checkpoint's expiry time will be set to the current wallclock time plus the specified lifetime. If the lifetime is not specified, then the checkpoint is set with no expiry and must be explicitly removed.
```
What would be the convention for the human-friendly format? I see an example; I'm looking for a link or more details about the full definition.
I was thinking we'd use this crate: https://docs.rs/humantime/latest/humantime/. But it doesn't look like it's actively maintained and doesn't document the format rigorously. Maybe we can just roll our own format or just support specifying the expiry in seconds.
For now I think it's fine to just document the format accepted by humantime (e.g. [NUMdays] [NUMmin] [NUMs]). Seems pretty user-friendly to me.
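For reference, if the CLI does end up on humantime, parsing the example lifetime and computing an expiry would look roughly like this:

```rust
use std::time::{Duration, SystemTime};

fn main() {
    // "7days 30min" is the example format from the help text above.
    let lifetime: Duration =
        humantime::parse_duration("7days 30min").expect("invalid lifetime");
    // The checkpoint expiry is the current wallclock time plus the lifetime.
    let expiry: SystemTime = SystemTime::now() + lifetime;
    println!("expires at {}", humantime::format_rfc3339_seconds(expiry));
}
```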
rfcs/0004-checkpoints.md
### Writers

Writers will create checkpoints to “pin” the set of SRs currently being used to serve reads. As the writer consumes manifest updates, it will re-establish its checkpoint to point to the latest manifest version. Eventually, when we support range scans, the writer will retain older checkpoints to ensure that long-lived scans can function without interference from the GC.
Would the writer always persist the checkpoint for range scans? I think you were proposing to keep it only in memory.
Also, wouldn't range scans use a sequence-number-like mechanism instead of checkpoints, so that they can (i) avoid a write for every range scan and (ii) read uncommitted data in range scans?
> Would the writer always persist the checkpoint for range scans? I think you were proposing to keep it only in memory.

The writer always maintains a checkpoint in the manifest for its current view of the database. I was thinking the range scans could hold references to the checkpoint and then the writer can drop the checkpoint when all references have dropped.

> Also, wouldn't range scans use a sequence-number-like mechanism instead of checkpoints, so that they can (i) avoid a write for every range scan and (ii) read uncommitted data in range scans?

The range scans would still use a sequence number to isolate the scan from concurrent writes. The intent of the checkpoint is to protect the scan from the garbage collector. This doesn't require a write for every range scan. The writer maintains a checkpoint for its current view of the manifest, and range scans will use and share this checkpoint. Uncommitted data is read directly from the in-memory table, so it doesn't need to be tied to a checkpoint.
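A sketch of the sharing scheme described here, using an `Arc` reference count as a stand-in for whatever mechanism the implementation actually uses:

```rust
use std::sync::Arc;

// Illustrative only: the writer keeps one checkpoint for its current manifest
// view; range scans clone the Arc instead of writing their own checkpoints.
struct WriterCheckpoint {
    checkpoint_id: u128,
    manifest_id: u64,
}

struct RangeScan {
    // Keeps the pinned manifest safe from the GC for the scan's lifetime.
    checkpoint: Arc<WriterCheckpoint>,
    // A sequence number still isolates the scan from concurrent writes.
    seq: u64,
}

// The writer may delete the old manifest checkpoint once no scan references it.
fn can_drop_checkpoint(current: &Arc<WriterCheckpoint>) -> bool {
    Arc::strong_count(current) == 1
}
```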
rfcs/0004-checkpoints.md
```
table ManifestV1 {
    // Optional path to a source database from which this database was cloned
    source_path: string;
```
Now that we are adding writable clones, should we think about a DB id? It is possible for mistakes to happen, where two DBs are using the same path. A DB id would help us to at least find out that the error happened.
If it's a user provided ID, how would it provide safer access than the path?
rfcs/0004-checkpoints.md
- `compactor_epoch`: copied from the source checkpoint
- `wal_id_last_compacted`: copied from the source checkpoint
- `wal_id_last_seen`: copied from the source checkpoint
- `l0`: copied from the source checkpoint
Why do we need to copy `l0`? Isn't it possible to use it from the source manifest?
I see SST path resolution in the next section. Isn't it needed only for the first manifest? With that, if we don't copy `l0` and `compacted` over to the new manifest, wouldn't it be possible to skip SST path resolution?
> Why do we need to copy `l0`? Isn't it possible to use it from the source manifest?

I don't mean that the actual L0 files are copied. I'm just saying that the contents of the `l0` field are copied into the new manifest.

> I see SST path resolution in the next section. Isn't it needed only for the first manifest? With that, if we don't copy `l0` and `compacted` over to the new manifest, wouldn't it be possible to skip SST path resolution?
The SST files themselves are not copied and remain at their original path in the object store. The clone would need to be able to find SST files from the parent db when doing reads or compacting.
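One possible shape for that lookup, purely as an illustration (the RFC's actual SST path resolution scheme may differ, and the existence check here is an assumption):

```rust
// Illustrative only: SST files are never copied, so an SST that is not found
// under the clone's own prefix is resolved against the source database's prefix.
fn resolve_sst_path(
    clone_path: &str,
    source_path: Option<&str>,
    sst_file: &str,
    exists_in_clone: impl Fn(&str) -> bool,
) -> String {
    let local = format!("{clone_path}/compacted/{sst_file}");
    if exists_in_clone(&local) {
        return local;
    }
    match source_path {
        Some(parent) => format!("{parent}/compacted/{sst_file}"),
        None => local,
    }
}
```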
rfcs/0004-checkpoints.md
If the reader is opened without a checkpoint, it will periodically re-establish a checkpoint by re-running the initialization process described above, with the following modifications:
- It will delete the checkpoint that it had previously created (when we support range scans, we’ll need to reference count the checkpoint).
- It will purge any in-memory tables with content from WAL SSTs lower than the new `wal_id_last_compacted`.
How does it know which entries to purge? Once an entry is written to the memtable, do we still know the WAL ID?
Yes, the memtables all maintain the last WAL ID written to that memtable (see `ImmutableMemtable`).
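A sketch of that purge check, using the last WAL ID each frozen memtable tracks (the struct below is a simplified stand-in for `ImmutableMemtable`):

```rust
// Simplified stand-in for ImmutableMemtable, which records the last WAL SST id
// written into it.
struct FrozenMemtable {
    last_wal_id: u64,
}

// Purge any frozen memtable whose contents came from WAL SSTs lower than the
// new wal_id_last_compacted; that data is already covered by the L0s/SRs in
// the newly pinned manifest.
fn purge_covered_memtables(imm: &mut Vec<FrozenMemtable>, wal_id_last_compacted: u64) {
    imm.retain(|m| m.last_wal_id >= wal_id_last_compacted);
}
```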
#### Garbage Collector |
Is running the GC loop once available in the admin tool? With the functionality added here, it would be a good one to add. It could be run periodically by something like a Lambda function, instead of an always-running process.
That's a good idea. I can include that.
I agree it's useful, but I'm going to leave it out of the RFC as I think it's somewhat orthogonal and is useful generally beyond just dealing with checkpoints/clones. I don't think it really requires its own RFC, so I'll just file an issue to track implementing it.
rfcs/0004-checkpoints.md
1. Update the source db's manifest by removing the checkpoint.
2. Update the db's manifest by removing the `source_checkpoint` field.

Observe that users can now configure a single GC process that can manage GC for multiple databases that use soft deletes. Whenever a new database is created, the user needs to spawn a new GC task for that database. When the GC task completes deleting a database, the task exits. For now, it's left up to the user to spawn GC tasks for databases that they have created.
Can you expand more? How would we configure the GC process to handle multiple paths?
I actually wasn't proposing an explicit mechanism for doing this as part of the RFC. I'm just pointing out that, as a user, you should be able to spawn GC tasks by creating instances of `GarbageCollectorOrchestrator` for each db and calling `run` when a new db instance is created. I suppose we could also come up with some discovery mechanism for detecting dbs under some path and spawning the tasks automatically, but I didn't feel like this was necessary to bring in scope for this RFC.
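A rough sketch of that per-db task model; `run_gc_for_db` is a hypothetical stand-in for constructing a `GarbageCollectorOrchestrator` for the path and calling `run`.

```rust
use std::thread;

// Illustrative only: spawn one GC task per database path. Each task exits on
// its own once its database has been fully cleaned up after a soft delete.
fn spawn_gc_tasks(db_paths: Vec<String>) -> Vec<thread::JoinHandle<()>> {
    db_paths
        .into_iter()
        .map(|path| {
            thread::spawn(move || {
                run_gc_for_db(&path);
            })
        })
        .collect()
}

// Hypothetical stub for the per-db GC loop; it returns when the db is deleted.
fn run_gc_for_db(_path: &str) {}
```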
Added a few comments. It looks good!
Looks great. I can't really find any issues with the design. Just left tons of questions and a few nits.
rfcs/0004-checkpoints.md
- `l0_last_compacted`: copied from the source checkpoint
- `compacted`: copied from the source checkpoint
- `checkpoints`: contains a single entry with the writer’s checkpoint.
5. If CAS fails, go back to (1) (TODO: we can skip step 2 - we just need to re-validate that the source checkpoint is still valid because the GC could clean it up)
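A generic sketch of the CAS-retry shape in step (5); the helper names below are hypothetical stand-ins, not functions defined by the RFC.

```rust
// Illustrative only: retry writing the clone's initial manifest on CAS failure,
// re-validating on each attempt that the source checkpoint still exists.
fn write_initial_manifest_with_cas(max_attempts: usize) -> Result<(), &'static str> {
    for _ in 0..max_attempts {
        let source = read_source_checkpoint();
        if !source_checkpoint_is_valid(&source) {
            return Err("source checkpoint was cleaned up by the GC");
        }
        // Copy l0, compacted, epochs, WAL ids, etc. from the source checkpoint.
        let manifest = build_clone_manifest(&source);
        if try_cas_write(&manifest) {
            return Ok(()); // published successfully
        }
        // CAS failed: someone else wrote a manifest; go back to the start.
    }
    Err("gave up after repeated CAS conflicts")
}

// Hypothetical stubs so the sketch compiles.
struct SourceCheckpoint;
struct CloneManifest;
fn read_source_checkpoint() -> SourceCheckpoint { SourceCheckpoint }
fn source_checkpoint_is_valid(_c: &SourceCheckpoint) -> bool { true }
fn build_clone_manifest(_c: &SourceCheckpoint) -> CloneManifest { CloneManifest }
fn try_cas_write(_m: &CloneManifest) -> bool { true }
```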
Don't we need some mechanism for ensuring a GC process for the old slate instance doesn't clean up SSTs referenced from the cloned database once the checkpoint it cloned from expires? IIUC, a GC process that is triggered on the original manifest path would only consider manifests from the original instance.
Ah, I think I missed this step tucked between all the other comments:

> Create a new checkpoint in the source database with the `manifest_id` as `source_checkpoint`.
I guess the new instance will maintain the expiry of that source checkpoint?
> I guess the new instance will maintain the expiry of that source checkpoint?
No, the clone would create a checkpoint with no expiry and only delete that checkpoint as part of destroying the clone.
1. Read the current manifest M, which may not be present. |
Should open a GH issue for a Fizzbee proof for this.
2. Read the current parent manifest M_p at `parent_path`.
3. If M is present:
   1. If the `parent` field is empty, exit with error.
   2. If `parent.initialized`, then go to step 10.
I think we need to check `parent.parent.initialized` somewhere in this protocol. The situation I'm thinking of is where we clone a clone, which has not yet been fully initialized. I think we should barf in that case because the manifest might not be consistent and the parent's WALs might be missing. Actually, looking at steps (9) and (10), it looks like it's still possible for us to miss some WAL files in this case. I'm thinking of something like:
- clone c1 starts with parent p1
- clone c1 works its way all the way through step (9)
- clone c2 starts with parent c1
- clone c2 works its way all the way through step (10)
- clone c1 runs step (10)
I think in this scenario, clone c2 ends up with an incomplete WAL, right?
I think in general we should just block creating a checkpoint of a given db if `initialized` is false. That should protect us from taking a clone of a partially initialized manifest. For the scenario you outlined, I was thinking step 10 for c2 would fail because it would fail to find a WAL file that it expects - c2 should expect that all WAL SSTs between `wal_id_last_compacted` and `wal_id_last_seen` are present. It feels a bit brittle though - I think swapping steps 9 and 10 and adding the `initialized` check solves that problem. I'll make these updates.
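A sketch of the guard proposed in this reply, with hypothetical names:

```rust
// Illustrative only: refuse to checkpoint (and therefore to clone) a database
// whose manifest still has initialized == false.
#[derive(Debug)]
enum CheckpointError {
    SourceNotInitialized,
}

struct ManifestView {
    initialized: bool,
}

fn create_checkpoint(manifest: &ManifestView) -> Result<(), CheckpointError> {
    if !manifest.initialized {
        return Err(CheckpointError::SourceNotInitialized);
    }
    // ... otherwise write the checkpoint entry into the manifest via CAS
    Ok(())
}
```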