Feat: V2.2 Support configuration of hash function for source checksums #181
Conversation
src/v2.rs
Outdated
match self {
  Self::Sha256 => Sha256::digest(bytes).as_slice().to_vec(),
  Self::Crc32 => crc32fast::hash(bytes).to_be_bytes().into(),
  Self::XxHash => todo!(),
Note to reviewer: waiting for confirmation on the general design before finishing this up.
Cargo.toml
Outdated
@@ -46,6 +46,7 @@ serde_json = "1"
 sha2 = "0.10.1"
 thiserror = "1.0.30"
 url = "2.2.2"
+crc32fast = "1"
Question to reviewer: do we want to add the different hashes as optional dependencies? FWIW, the current design degrades to no checksumming (instead of failing the parse) if the configured hash function is not available.
This seems fine if they are both already deps of CLI.
I think so. For example, in Deno we have no need for this dep atm
I've made them all optional, and kept the "sha256" feature as the default for backwards compatibility.
let options_header_length_pos = options_header.len();
const OPTIONS_HEADER_LENGTH_SIZE: usize = size_of::<u32>();
options_header.extend_from_slice(&[0; OPTIONS_HEADER_LENGTH_SIZE]); // Reserve for length
Note to reviewer: right now the length is known, but Options is envisioned as a variable-length section.
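The reserve-then-backfill pattern discussed here can be sketched in isolation. This is a minimal, dependency-free illustration with a hypothetical payload and function name, not the actual eszip code:

```rust
// Sketch of reserving a length slot in a header and backfilling it once
// the variable-length options have been written. Names and payload bytes
// are illustrative only.
fn write_options_header(options: &[u8]) -> Vec<u8> {
    let mut header: Vec<u8> = Vec::new();

    // Reserve 4 bytes for the big-endian u32 length.
    let length_pos = header.len();
    header.extend_from_slice(&[0; std::mem::size_of::<u32>()]);

    // Append the variable-length options payload.
    header.extend_from_slice(options);

    // Backfill the reserved slot now that the length is known.
    let len = (header.len() - length_pos - 4) as u32;
    header[length_pos..length_pos + 4].copy_from_slice(&len.to_be_bytes());
    header
}

fn main() {
    // Hypothetical 4-byte options payload: checksum id 1, checksum size 32.
    let header = write_options_header(&[0, 1, 1, 32]);
    assert_eq!(&header[..4], &4u32.to_be_bytes());
    assert_eq!(&header[4..], &[0, 1, 1, 32]);
}
```

This keeps a single forward pass over the output buffer, at the cost of one in-place write once the section is complete.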
  ESZIP_V2_1_MAGIC.to_vec()
} else {
  ESZIP_V2_MAGIC.to_vec()
};
@lucacasonato Can you confirm this is no longer needed?
src/v2.rs
Outdated
#[default]
Sha256 = 0,
Crc32 = 1,
XxHash = 2,
Maybe we should choose one way to do hashing (on or off) and then remove the options? (In the CLI it could assume the eszips have no hashing.) Obviously backwards compatibility will require sha256 for old eszips.
We might be able to choose the best hash function for today's context (use cases, state-of-the-art hashing, etc.), but we can't predict how that context might change in the future. IMO the explicit & flexible approach is worth it in this case.
It looks good to me, but Luca should review this one before landing.
Cargo.toml
Outdated
@@ -46,6 +50,8 @@ serde_json = "1"
 sha2 = "0.10.1"
 thiserror = "1.0.30"
 url = "2.2.2"
+crc32fast = "1"
+xxhash-rust = { version = "0.8", features = ["xxh3"] }
In Deno we use twox-hash. This seems to have a BSL license.
Added twox-hash as the xxhash feature, and kept xxhash-rust as the xxhash3 feature.
I've confirmed twox-hash's xxhash3 implementation is outdated and does not produce the same hash as xxhash-rust's. On my machine twox-hash's xxhash64 is faster than xxhash-rust's xxhash3, but different infrastructure might yield different results. For this reason, I suggest keeping both.
src/v2.rs
Outdated
NoChecksum = 0,
Sha256 = 1,
Crc32 = 2,
XxHash = 3,
Maybe we should pick just one option between Crc32 and XxHash for now? Is there a clear winner between them or does it depend on module size?
I have updated the benchmark to include xxhash and no-checksum (see PR description). XxHash seems consistently faster. IMO if we make them optional dependencies it should be fine to keep them, but I'm also fine just choosing one.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Checksum {
  NoChecksum = 0,
Also, thanks for adding this! It should speed up running code using deno compile.
benches/source_hash_function.rs
Outdated
async fn parse_sha256(bytes: &[u8]) -> EszipV2 {
  let (eszip, fut) = EszipV2::parse(BufReader::new(AllowStdIo::new(bytes)))
    .await
    .unwrap();
  fut.await.unwrap();
  eszip
}

async fn parse_crc32(bytes: &[u8]) -> EszipV2 {
  let (eszip, fut) = EszipV2::parse(BufReader::new(AllowStdIo::new(bytes)))
    .await
    .unwrap();
  fut.await.unwrap();
  eszip
}
These functions are the same?
They have been removed
@@ -1,6 +1,6 @@
 {
   "tasks": {
-    "build": "cp LICENSE js/LICENSE && RUSTFLAGS=--cfg=web_sys_unstable_apis deno run -A jsr:@deno/[email protected] --out js",
+    "build": "cp LICENSE js/LICENSE && RUSTFLAGS=--cfg=web_sys_unstable_apis deno run -A jsr:@deno/[email protected] --all-features --out js",
@dsherret @lucacasonato FYI I propose to enable all checksums for the wasm build
LGTM, but I think this is too many hashing functions to support. I think we should have SHA256 and then settle on one faster alternative (probably xxh3 since it seems fastest).
3 => Self::XxHash,
#[cfg(feature = "xxhash3")]
4 => Self::XxHash3,
_ => return None,
Perhaps this should error when it finds something it doesn't recognize? Otherwise it might lead to corrupt data being loaded I think.
The idea is to have a forward-compatible design that falls back to no checksumming when encountering an unknown hash function. This applies both to decoding an eszip built with a newer version of the library, and to one built with a hash function not enabled in your build.
This does not lead to corrupt data, because the options header is explicit about the checksum size. Even if the library does not know how to validate a checksum, it still knows how long it is. Users who rely on the integrity guarantee are expected to use EszipV2::is_checksumed() and EszipV2::should_be_checksumed() to ensure the content has been checksummed.
See Lines 2444 to 2482 in fa49cb5:
#[cfg(feature = "sha256")]
#[tokio::test]
#[should_panic]
async fn v2_2_unknown_checksum_function_degrades_to_no_checksum() {
  // checksum 255; checksum_size 32
  let option_bytes = &[0, 255, 1, 32];
  let futuristic_options = [
    4_u32.to_be_bytes().as_slice(),
    option_bytes,
    &<sha2::Sha256 as sha2::Digest>::digest(option_bytes).as_slice(),
  ]
  .concat();
  let mut eszip = main_eszip().await;
  // Using sha256/32Bytes as mock hash.
  eszip.set_checksum(Checksum::Sha256);
  let bytes = eszip.into_bytes();
  let existing_options_size = std::mem::size_of::<u32>()
    + std::mem::size_of::<u8>() * 4
    + <sha2::Sha256 as sha2::Digest>::output_size();
  let options_start = ESZIP_V2_2_MAGIC.len();
  let new_bytes = [
    &bytes[..options_start],
    futuristic_options.as_slice(),
    &bytes[options_start + existing_options_size..],
  ]
  .concat();
  let (new_eszip, fut) = EszipV2::parse(BufReader::new(new_bytes.as_slice()))
    .await
    .unwrap();
  fut.await.unwrap();
  assert_eq!(new_eszip.options.checksum, None);
  assert_eq!(new_eszip.options.checksum_size(), Some(32));
  assert!(!new_eszip.is_checksumed());
  assert!(new_eszip.should_be_checksumed());
  // This should panic, as we cannot re-encode without setting an explicit checksum configuration
  new_eszip.into_bytes();
}
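The fallback exercised by that test can also be sketched in isolation. The ids, names, and toy entry layout below are hypothetical, not the real parser; the point is that an explicit checksum size lets a reader split payload from checksum bytes even when the hash function itself is unrecognized:

```rust
// Toy illustration of "unknown checksum function degrades to no checksum".
fn checksum_name(id: u8) -> Option<&'static str> {
    match id {
        0 => Some("no-checksum"),
        1 => Some("sha256"),
        // Ids from newer versions, or functions compiled out, fall through.
        _ => None,
    }
}

/// Splits an entry into (recognized checksum function, payload),
/// skipping `checksum_size` trailing checksum bytes either way.
fn split_entry(
    entry: &[u8],
    id: u8,
    checksum_size: usize,
) -> (Option<&'static str>, &[u8]) {
    (checksum_name(id), &entry[..entry.len() - checksum_size])
}

fn main() {
    // "hi" followed by a 4-byte checksum from some future hash function (id 255).
    let entry = [b'h', b'i', 0xAA, 0xBB, 0xCC, 0xDD];
    let (func, payload) = split_entry(&entry, 255, 4);
    assert_eq!(func, None); // unknown function: validation is skipped...
    assert_eq!(payload, b"hi"); // ...but the payload is still recovered intact
}
```

Decoding stays lossless in the fallback path; only validation is skipped, which is why the `is_checksumed()`/`should_be_checksumed()` pair is needed to surface that fact to callers.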
#[repr(u8)]
pub enum Checksum {
  NoChecksum = 0,
  #[cfg(feature = "sha256")]
Perhaps we should always compile with all the enum variants and instead just error when a cargo feature is not enabled while serializing or deserializing a variant that doesn't have its feature enabled? For example, when reading an eszip that is crc32 we could error saying to enable the crc32 cargo feature in order to read that eszip.
See #181 (comment)
Related: denoland/eszip#181. eszip < v0.69.0 hashes all its contents to ensure data integrity. This feature is not necessary in Deno CLI, as the binary integrity guarantee is deemed an external responsibility (i.e. it is assumed that, if necessary, the compiled binary will be checksummed externally prior to being executed). eszip >= v0.69.0 no longer performs this checksum by default. This reduces the cold-start time of compiled binaries, proportionally to their size.
Eszip v2 performs a checksum of its contents. Currently it uses SHA256 to produce the checksum hash. This cryptographic hash function is computationally expensive. As long as the goal of the hash is to validate the integrity of the source, and not its authenticity, a cryptographic hash function is unnecessary, and there are alternatives better suited for the job, like CRC32.
This is becoming an issue now that npm support has bloated eszips considerably; it is not uncommon to see >100 MB eszips for simple projects, especially if they use npm. A comparison of encoding/decoding an eszip with the different checksum options, over the size of the eszip, was run on an Apple M1 Pro (benchmark included in the PR).
This PR proposes a mechanism to choose the hash function used for the checksums, and to declare this hash function in the header of the binary encoding.
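To make the cost contrast concrete, here is a dependency-free sketch of the CRC-32 (IEEE) algorithm: each input byte needs only a handful of shifts and XORs, versus SHA-256's 64-round compression per 64-byte block. This is illustrative only; the PR itself uses the crc32fast crate, which computes the same function far faster via tables and SIMD.

```rust
// Minimal bit-at-a-time CRC-32 (IEEE/ISO-HDLC variant: reflected,
// polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFF_u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // If the low bit is set, shift and XOR in the polynomial.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn main() {
    // 0xCBF43926 is the standard CRC-32 "check" value for "123456789".
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    assert_eq!(crc32(b""), 0);
}
```

Note that CRC-32 only detects accidental corruption; an attacker can trivially forge it, which is exactly why the PR argues it is acceptable when authenticity is out of scope.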