RFC for supporting checkpoints/clones #278
base: main
Conversation
rfcs/0004-checkpoints.md
## Current Implementation

SlateDB runs a Writer, a Compactor - optionally in a separate process, and a Garbage Collector (GC) - also optionally in a separate process. The writer writes data to L0 SSTs. The Compactor then compacts these L0 SSTs into Sorted Runs (SRs) under the “compacted” directory, and then further compacts those SRs as they accumulate. Updates to the current set of L0s and Sorted Runs are synchronized between the Writer and Compactor through updates to the centralized manifest. When reading, the Writer gets a reference to a copy of the metadata for the current set of sorted runs and searches them for the requested data. Meanwhile, the GC scans the SSTs in Object Storage and deletes any SSTs that are not referenced by the current manifest and are older than some threshold.
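For orientation, here is a minimal sketch of the state that the Writer and Compactor synchronize through the manifest. The type and field names below are illustrative assumptions, not SlateDB's actual definitions.

```rust
// Illustrative sketch only; names are assumptions, not SlateDB's real types.
struct SstHandle {
    id: u64,
    path: String,
}

struct SortedRun {
    id: u32,
    ssts: Vec<SstHandle>,
}

// The Writer appends to `l0`; the Compactor folds L0 SSTs into `compacted`
// sorted runs. Both publish their view by writing a new manifest version.
// The GC deletes SSTs that are not referenced by the current manifest once
// they are older than a threshold.
struct ManifestState {
    l0: Vec<SstHandle>,
    compacted: Vec<SortedRun>,
    writer_epoch: u64,
    compactor_epoch: u64,
}
```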
Nit: the GC currently deletes SSTs that are older than a threshold; there is an issue (#204) tracking deleting an SST only once it has been inactive for the threshold duration. This aspect is relevant for checkpoints.
Ack. I think we can actually build on this proposal to implement the functionality proposed in #204. Left a comment there.
In this RFC, we propose adding checkpoints and clones to address this limitation.
- Clients will be able to establish cheap db checkpoints that reference the current version of SlateDB’s manifest.
- Writers will use checkpoints to “pin” a version of the database for reads and prevent the GC from deleting the SSTs.
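As a rough sketch, a checkpoint record in the manifest might carry something like the following; the field names here are assumptions rather than the final schema.

```rust
// Hypothetical shape of a checkpoint entry; the RFC defines the actual schema.
struct Checkpoint {
    id: u128,                   // handle used to refresh or delete the checkpoint
    manifest_id: u64,           // the manifest version this checkpoint pins
    expire_time_s: Option<u32>, // wallclock expiry in seconds; None = no expiry
}
```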
What would be the intention for writers to pin a version?
The intent was to avoid the GC deleting the SSTs used to serve reads. The GC section of the doc details this - the RFC proposes that the GC will be extended to look for SSTs that are referenced by a checkpoint and to not delete those SSTs. By maintaining a checkpoint we prevent the SSTs in the writer's current view of the db from being deleted.
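A sketch of the retention rule described in this reply; the helper below is illustrative only, not the GC's actual code.

```rust
use std::collections::HashSet;
use std::time::Duration;

// Illustrative only: an SST may be deleted when it is old enough AND it is
// referenced neither by the current manifest nor by any checkpointed manifest.
fn may_delete_sst(
    sst_id: u64,
    sst_age: Duration,
    min_age: Duration,
    current_manifest_ssts: &HashSet<u64>,
    checkpointed_ssts: &HashSet<u64>,
) -> bool {
    sst_age >= min_age
        && !current_manifest_ssts.contains(&sst_id)
        && !checkpointed_ssts.contains(&sst_id)
}
```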
rfcs/0004-checkpoints.md
…
}
/// Called to destroy the database at the given path. If `soft` is true, this method will set the `destroyed_at_s` field in the manifest. The GC will clean up the db after some time has passed and all checkpoints have either been deleted or expired. As part of cleaning up the db, the GC will also remove the database’s checkpoint from the source database. If `soft` is false, all cleanup will be performed by the call to this method, and the destroy will return `SlateDbError::InvalidDeletion` if there are any remaining non-expired checkpoints.
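A hedged sketch of the flow the doc comment describes; only `SlateDbError::InvalidDeletion` comes from the text, and the helper functions are hypothetical stubs.

```rust
// Sketch only: not SlateDB's actual API. The stubs stand in for manifest and
// object-store operations.
#[derive(Debug)]
enum SlateDbError {
    InvalidDeletion,
}

fn destroy(path: &str, soft: bool) -> Result<(), SlateDbError> {
    if soft {
        // Soft delete: set destroyed_at_s in the manifest; the GC finishes
        // cleanup once all checkpoints are deleted or expired, and also removes
        // this db's checkpoint from its source database.
        mark_destroyed_at(path);
        Ok(())
    } else if has_unexpired_checkpoints(path) {
        // A hard delete is refused while non-expired checkpoints remain.
        Err(SlateDbError::InvalidDeletion)
    } else {
        // Hard delete: perform all cleanup in this call.
        delete_all_objects(path);
        Ok(())
    }
}

// Hypothetical stubs so the sketch compiles.
fn mark_destroyed_at(_path: &str) {}
fn has_unexpired_checkpoints(_path: &str) -> bool {
    false
}
fn delete_all_objects(_path: &str) {}
```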
How would we know that there isn't any other app using the checkpoint in the source db?
I'm assuming you mean another db instance? This is covered in the clone section, but when initializing a clone the new db will leave its own checkpoint that uses the same manifest id as the checkpoint used to create the clone. This checkpoint would ensure this db does not get deleted.
rfcs/0004-checkpoints.md
```
Usage: slatedb --path <PATH> create-checkpoint [OPTIONS]

Options:
  -l, --lifetime <LIFETIME>  Optionally specify a lifetime for the created checkpoint. You can specify the lifetime in a human-friendly format that uses years/days/min/s, e.g. "7days 30min". The checkpoint's expiry time will be set to the current wallclock time plus the specified lifetime. If the lifetime is not specified, then the checkpoint is set with no expiry and must be explicitly removed.
```
What would be the convention for the human-friendly format? I see an example; I'm looking for a link or more details about the full definition.
I was thinking we'd use this crate: https://docs.rs/humantime/latest/humantime/. But it doesn't look like it's actively maintained and doesn't document the format rigorously. Maybe we can just roll our own format or just support specifying the expiry in seconds.
For now I think it's fine to just document the format accepted by humantime (e.g. [NUMdays] [NUMmin] [NUMs]). Seems pretty user-friendly to me.
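For reference, if the CLI does end up on humantime, parsing the example lifetime and computing an expiry would look roughly like this:

```rust
use std::time::{Duration, SystemTime};

fn main() {
    // "7days 30min" is the example format from the help text above.
    let lifetime: Duration =
        humantime::parse_duration("7days 30min").expect("invalid lifetime");
    // The checkpoint expiry is the current wallclock time plus the lifetime.
    let expiry: SystemTime = SystemTime::now() + lifetime;
    println!("expires at {}", humantime::format_rfc3339_seconds(expiry));
}
```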
rfcs/0004-checkpoints.md
### Writers

Writers will create checkpoints to “pin” the set of SRs currently being used to serve reads. As the writer consumes manifest updates, it will re-establish its checkpoint to point to the latest manifest version. Eventually, when we support range scans, the writer will retain older checkpoints to ensure that long-lived scans can function without interference from the GC.
Would the writer always persist the checkpoint for range scans? I think you were proposing to keep it only in memory.
Also, wouldn't range scans use a sequence-number-like mechanism instead of checkpoints, so that they can (i) avoid a write for every range scan and (ii) read uncommitted data in range scans?
> Would the writer always persist the checkpoint for range scans? I think you were proposing to keep it only in memory.

The writer always maintains a checkpoint in the manifest for its current view of the database. I was thinking the range scans could hold references to the checkpoint and then the writer can drop the checkpoint when all references have dropped.

> Also, wouldn't range scans use a sequence-number-like mechanism instead of checkpoints, so that they can (i) avoid a write for every range scan and (ii) read uncommitted data in range scans?

The range scans would still use a sequence number to isolate the scan from concurrent writes. The intent of the checkpoint is to protect the scan from the garbage collector. This doesn't require a write for every range scan. The writer maintains a checkpoint for its current view of the manifest, and range scans will use and share this checkpoint. Uncommitted data is read directly from the in-memory table, so it doesn't need to be tied to a checkpoint.
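A sketch of the sharing scheme described here, using an `Arc` reference count as a stand-in for whatever mechanism the implementation actually uses:

```rust
use std::sync::Arc;

// Illustrative only: the writer keeps one checkpoint for its current manifest
// view; range scans clone the Arc instead of writing their own checkpoints.
struct WriterCheckpoint {
    checkpoint_id: u128,
    manifest_id: u64,
}

struct RangeScan {
    // Keeps the pinned manifest safe from the GC for the scan's lifetime.
    checkpoint: Arc<WriterCheckpoint>,
    // A sequence number still isolates the scan from concurrent writes.
    seq: u64,
}

// The writer may delete the old manifest checkpoint once no scan references it.
fn can_drop_checkpoint(current: &Arc<WriterCheckpoint>) -> bool {
    Arc::strong_count(current) == 1
}
```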
rfcs/0004-checkpoints.md
```
table ManifestV1 {
    // Optional path to a source database from which this database was cloned
    source_path: string;
```
Now that we are adding writable clones, should we think about a DB id? It is possible for mistakes to happen, where two DBs are using the same path. A DB id would help us to at least find out that the error happened.
If it's a user provided ID, how would it provide safer access than the path?
rfcs/0004-checkpoints.md
- `compactor_epoch`: copied from the source checkpoint
- `wal_id_last_compacted`: copied from the source checkpoint
- `wal_id_last_seen`: copied from the source checkpoint
- `l0`: copied from the source checkpoint
Why do we need to copy `l0`? Isn't it possible to use it from the source manifest?
I see SST path resolution in the next section. Isn't it needed only for the first manifest? With that, if we don't copy `l0` and `compacted` over to the new manifest, wouldn't it be possible to skip SST path resolution?
> Why do we need to copy `l0`? Isn't it possible to use it from the source manifest?

I don't mean that the actual L0 files are copied. I'm just saying that the contents of the `l0` field are copied into the new manifest.

> I see SST path resolution in the next section. Isn't it needed only for the first manifest? With that, if we don't copy `l0` and `compacted` over to the new manifest, wouldn't it be possible to skip SST path resolution?
The SST files themselves are not copied and remain at their original path in the object store. The clone would need to be able to find SST files from the parent db when doing reads or compacting.
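One possible shape for that lookup, purely as an illustration (the RFC's actual SST path resolution scheme may differ, and the existence check here is an assumption):

```rust
// Illustrative only: SST files are never copied, so an SST that is not found
// under the clone's own prefix is resolved against the source database's prefix.
fn resolve_sst_path(
    clone_path: &str,
    source_path: Option<&str>,
    sst_file: &str,
    exists_in_clone: impl Fn(&str) -> bool,
) -> String {
    let local = format!("{clone_path}/compacted/{sst_file}");
    if exists_in_clone(&local) {
        return local;
    }
    match source_path {
        Some(parent) => format!("{parent}/compacted/{sst_file}"),
        None => local,
    }
}
```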
rfcs/0004-checkpoints.md
If the reader is opened without a checkpoint, it will periodically re-establish a checkpoint by re-running the initialization process described above, with the following modifications:
- It will delete the checkpoint that it had previously created (when we support range scans, we’ll need to reference count the checkpoint).
- It will purge any in-memory tables with content from WAL SSTs lower than the new `wal_id_last_compacted`.
How does it know which entries to purge? Once an entry is written to the memtable, do we still know the WAL ID?
Yes, the memtables all maintain the last WAL ID written to that memtable (see `ImmutableMemtable`).
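A sketch of that purge check, using the last WAL ID each frozen memtable tracks (the struct below is a simplified stand-in for `ImmutableMemtable`):

```rust
// Simplified stand-in for ImmutableMemtable, which records the last WAL SST id
// written into it.
struct FrozenMemtable {
    last_wal_id: u64,
}

// Purge any frozen memtable whose contents came from WAL SSTs lower than the
// new wal_id_last_compacted; that data is already covered by the L0s/SRs in
// the newly pinned manifest.
fn purge_covered_memtables(imm: &mut Vec<FrozenMemtable>, wal_id_last_compacted: u64) {
    imm.retain(|m| m.last_wal_id >= wal_id_last_compacted);
}
```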
#### Garbage Collector |
Is running the GC loop once available in the admin tool? With the functionality added here, it would be a good one to add. It could be run periodically by something like a Lambda function, instead of an always-running process.
That's a good idea. I can include that.
I agree it's useful, but I'm going to leave it out of the RFC as I think it's somewhat orthogonal and is useful generally beyond just dealing with checkpoints/clones. I don't think it really requires its own RFC, so I'll just file an issue to track implementing it.
rfcs/0004-checkpoints.md
1. Update the source db's manifest by removing the checkpoint.
2. Update the db's manifest by removing the `source_checkpoint` field.

Observe that users can now configure a single GC process that can manage GC for multiple databases that use soft deletes. Whenever a new database is created, the user needs to spawn a new GC task for that database. When the GC task completes deleting a database, the task exits. For now, it's left up to the user to spawn GC tasks for databases that they have created.
Can you expand more? How would we configure the GC process to handle multiple paths?
I actually wasn't proposing an explicit mechanism for doing this as part of the RFC. I'm just pointing out that, as a user, you should be able to spawn GC tasks by creating instances of `GarbageCollectorOrchestrator` for each db and calling `run` when a new db instance is created. I suppose we could also come up with some discovery mechanism for detecting dbs under some path and spawning the tasks automatically, but I didn't feel like this was necessary to bring in scope for this RFC.
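A rough sketch of that per-db task model; `run_gc_for_db` is a hypothetical stand-in for constructing a `GarbageCollectorOrchestrator` for the path and calling `run`.

```rust
use std::thread;

// Illustrative only: spawn one GC task per database path. Each task exits on
// its own once its database has been fully cleaned up after a soft delete.
fn spawn_gc_tasks(db_paths: Vec<String>) -> Vec<thread::JoinHandle<()>> {
    db_paths
        .into_iter()
        .map(|path| {
            thread::spawn(move || {
                run_gc_for_db(&path);
            })
        })
        .collect()
}

// Hypothetical stub for the per-db GC loop; it returns when the db is deleted.
fn run_gc_for_db(_path: &str) {}
```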
Added a few comments. It looks good!
Looks great. I can't really find any issues with the design. Just left tons of questions and a few nits.
rfcs/0004-checkpoints.md
- `l0_last_compacted`: copied from the source checkpoint
- `compacted`: copied from the source checkpoint
- `checkpoints`: contains a single entry with the writer’s checkpoint.
5. If CAS fails, go back to (1) (TODO: we can skip step 2 - we just need to re-validate that the source checkpoint is still valid because the GC could clean it up)
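A generic sketch of the CAS-retry shape in step (5); the helper names below are hypothetical stand-ins, not functions defined by the RFC.

```rust
// Illustrative only: retry writing the clone's initial manifest on CAS failure,
// re-validating on each attempt that the source checkpoint still exists.
fn write_initial_manifest_with_cas(max_attempts: usize) -> Result<(), &'static str> {
    for _ in 0..max_attempts {
        let source = read_source_checkpoint();
        if !source_checkpoint_is_valid(&source) {
            return Err("source checkpoint was cleaned up by the GC");
        }
        // Copy l0, compacted, epochs, WAL ids, etc. from the source checkpoint.
        let manifest = build_clone_manifest(&source);
        if try_cas_write(&manifest) {
            return Ok(()); // published successfully
        }
        // CAS failed: someone else wrote a manifest; go back to the start.
    }
    Err("gave up after repeated CAS conflicts")
}

// Hypothetical stubs so the sketch compiles.
struct SourceCheckpoint;
struct CloneManifest;
fn read_source_checkpoint() -> SourceCheckpoint { SourceCheckpoint }
fn source_checkpoint_is_valid(_c: &SourceCheckpoint) -> bool { true }
fn build_clone_manifest(_c: &SourceCheckpoint) -> CloneManifest { CloneManifest }
fn try_cas_write(_m: &CloneManifest) -> bool { true }
```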
Don't we need some mechanism for ensuring a GC process for the old slate instance doesn't clean up SSTs referenced from the cloned database once the checkpoint it cloned from expires? IIUC, a GC process that is triggered on the original manifest path would only consider manifests from the original instance.
Ah, I think I missed this step tucked between all the other comments:

> Create a new checkpoint in the source database with the `manifest_id` as `source_checkpoint`.
I guess the new instance will maintain the expiry of that source checkpoint?
> I guess the new instance will maintain the expiry of that source checkpoint?
No, the clone would create a checkpoint with no expiry and only delete that checkpoint as part of destroying the clone.
1. Read the current manifest M, which may not be present. |
Should open a GH issue for a Fizzbee proof for this.
2. Read the current parent manifest M_p at `parent_path`.
3. If M is present:
   1. If the `parent` field is empty, exit with error.
   2. If `parent.initialized`, then go to step 10.
I think we need to check `parent.parent.initialized` somewhere in this protocol. The situation I'm thinking of is where we clone a clone, which has not yet been fully initialized. I think we should barf in that case because the manifest might not be consistent and the parent's WALs might be missing. Actually, looking at steps (9) and (10), it looks like it's still possible for us to miss some WAL files in this case. I'm thinking of something like:
- clone c1 starts with parent p1
- clone c1 works its way all the way through step (9)
- clone c2 starts with parent c1
- clone c2 works its way all the way through step (10)
- clone c1 runs step (10)
I think in this scenario, clone c2 ends up with an incomplete WAL, right?
I think in general we should just block creating a checkpoint of a given db if `initialized` is false. That should protect us from taking a clone of a partially initialized manifest. For the scenario you outlined, I was thinking step 10 for c2 would fail because it would fail to find a WAL file that it expects - c2 should expect that all WAL SSTs between `wal_id_last_compacted` and `wal_id_last_seen` are present. It feels a bit brittle though - I think swapping steps 9 and 10 and adding the `initialized` check solves that problem. I'll make these updates.
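A sketch of the guard proposed in this reply, with hypothetical names:

```rust
// Illustrative only: refuse to checkpoint (and therefore to clone) a database
// whose manifest still has initialized == false.
#[derive(Debug)]
enum CheckpointError {
    SourceNotInitialized,
}

struct ManifestView {
    initialized: bool,
}

fn create_checkpoint(manifest: &ManifestView) -> Result<(), CheckpointError> {
    if !manifest.initialized {
        return Err(CheckpointError::SourceNotInitialized);
    }
    // ... otherwise write the checkpoint entry into the manifest via CAS
    Ok(())
}
```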