Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: V2.2 Support configuration of hash function for source checksums #181

Merged
merged 15 commits into from
May 15, 2024

Conversation

arnauorriols
Copy link
Member

@arnauorriols arnauorriols commented Apr 25, 2024

Eszip v2 performs a checksum of its contents. Currently it uses SHA256 to produce the checksum hash. This cryptographic hash function is very computationally expensive. As long as the goal of the hash is to validate the integrity of the source and not it's authenticity, using a cryptographic hash function is unnecessary, and there are alternatives better suited for the job, like CRC32.

This is becoming an issue now that npm support has bloated eszips considerably, and it is not uncommon to see >100MB eszips for simple projects, specially if they use npm. Here's a comparison of encoding/decoding an eszip using the different checsum options, over the size of the eszip (Apple M1 Pro. Benchmark included in the PR):

image

image

This PR proposes a mechanism to be able to chose the hash function to use for the checksums, and to declare this hash function in the header of the binary encoding.

@CLAassistant
Copy link

CLAassistant commented Apr 25, 2024

CLA assistant check
All committers have signed the CLA.

src/v2.rs Outdated
match self {
Self::Sha256 => Sha256::digest(bytes).as_slice().to_vec(),
Self::Crc32 => crc32fast::hash(bytes).to_be_bytes().into(),
Self::XxHash => todo!(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewer: waiting for confirmation on the general design before finishing this up.

Cargo.toml Outdated
@@ -46,6 46,7 @@ serde_json = "1"
sha2 = "0.10.1"
thiserror = "1.0.30"
url = "2.2.2"
crc32fast = "1"
Copy link
Member Author

@arnauorriols arnauorriols Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question to reviewer: do we want to add the different hashes as optional dependencies? FWIW, the current design will degrade to no checksuming (instead of failing the parsing) if the configured hash function is not available

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine if they are both already deps of CLI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. For example, in Deno we have no need for this dep atm

Copy link
Member Author

@arnauorriols arnauorriols May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made them all optional, and kept the "sha256" feature as default feature for backwards compatibility.


let options_header_length_pos = options_header.len();
const OPTIONS_HEADER_LENGTH_SIZE: usize = size_of::<u32>();
options_header.extend_from_slice(&[0; OPTIONS_HEADER_LENGTH_SIZE]); // Reserve for length
Copy link
Member Author

@arnauorriols arnauorriols Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewer: right now length is known, but the nature of Options is envisioned as variable length.

src/v2.rs Outdated Show resolved Hide resolved
src/v2.rs Show resolved Hide resolved
ESZIP_V2_1_MAGIC.to_vec()
} else {
ESZIP_V2_MAGIC.to_vec()
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucacasonato Can you confirm this is no longer needed?

src/v2.rs Outdated
#[default]
Sha256 = 0,
Crc32 = 1,
XxHash = 2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should choose one way to do hashing (on or off) then remove the options? (in the CLI it could assume the eszips have no hashing) Obviously backwards compatibility will require sha256 for old eszips.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to chose the best hash function for today's context (use cases, hashing state of the art, etc), but we are unable to predict how the context might change in the future. IMO the explicit & flexible approach is worth it in this case.

@arnauorriols arnauorriols requested a review from dsherret May 7, 2024 14:34
Copy link
Member

@dsherret dsherret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, but Luca should review this one before landing.

Cargo.toml Outdated
@@ -46,6 46,7 @@ serde_json = "1"
sha2 = "0.10.1"
thiserror = "1.0.30"
url = "2.2.2"
crc32fast = "1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. For example, in Deno we have no need for this dep atm

Cargo.toml Outdated
@@ -46,6 50,8 @@ serde_json = "1"
sha2 = "0.10.1"
thiserror = "1.0.30"
url = "2.2.2"
crc32fast = "1"
xxhash-rust = {version = "0.8", features = ["xxh3"]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Deno we use twox-hash. This seems to have a BSL license.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added twox-hash as xxhash feature, and kept xxhash-rust as xxhash3 feature.

I've confirmed twox-hash's xxhash3 implementation is outdated and not producing the same hash as xxhash-rust's. In my machine twox-hash's xxhash64 is faster than xxhash-rust's xxhash3, but different infrastructure might have different results. For this reason, I suggest to keep both.

src/v2.rs Outdated
NoChecksum = 0,
Sha256 = 1,
Crc32 = 2,
XxHash = 3,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should pick just one option between Crc32 and XxHash for now? Is there a clear winner between them or does it depend on module size?

Copy link
Member Author

@arnauorriols arnauorriols May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the benchmark to include xxhash and no-checksum (see PR description). XXHash seems consistently faster. IMO if we make them optional dependencies it should be fine to keep them, but I'm also fine just chosing one.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Checksum {
NoChecksum = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, thanks for adding this! It should speed up running code using deno compile.

Cargo.toml Outdated
@@ -46,6 46,7 @@ serde_json = "1"
sha2 = "0.10.1"
thiserror = "1.0.30"
url = "2.2.2"
crc32fast = "1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine if they are both already deps of CLI.

Comment on lines 21 to 35
async fn parse_sha256(bytes: &[u8]) -> EszipV2 {
let (eszip, fut) = EszipV2::parse(BufReader::new(AllowStdIo::new(bytes)))
.await
.unwrap();
fut.await.unwrap();
eszip
}

async fn parse_crc32(bytes: &[u8]) -> EszipV2 {
let (eszip, fut) = EszipV2::parse(BufReader::new(AllowStdIo::new(bytes)))
.await
.unwrap();
fut.await.unwrap();
eszip
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are the same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They have been removed

@@ -1,6 1,6 @@
{
"tasks": {
"build": "cp LICENSE js/LICENSE && RUSTFLAGS=--cfg=web_sys_unstable_apis deno run -A jsr:@deno/[email protected] --out js",
"build": "cp LICENSE js/LICENSE && RUSTFLAGS=--cfg=web_sys_unstable_apis deno run -A jsr:@deno/[email protected] --all-features --out js",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherret @lucacasonato FYI I propose to enable all checksums for the wasm build

Copy link
Member

@dsherret dsherret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think this is too many hashing functions to support. I think we should have SHA256 and then settle on one faster alternative (probably xxh3 since it seems fastest).

3 => Self::XxHash,
#[cfg(feature = "xxhash3")]
4 => Self::XxHash3,
_ => return None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should error when it finds something it doesn't recognize? Otherwise it might lead to corrupt data being loaded I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to have a forward-compatible design that falls back to no checksuming when encountering an unknown hash function. This applies both to decoding an eszip built with a newer version of the library, and built with a hash function not enabled for your library.

This does not lead to corrupt data thanks to being explicit about the checksum size in the options header. Even if the library does not know how to validate it, it still knows how long the checksum is. Users are expected to use EszipV2::is_checksumed() and EszipV2::should_be_checksumed() to ensure the content has been checksumed, if they rely on this guarantee.

See

eszip/src/v2.rs

Lines 2444 to 2482 in fa49cb5

#[cfg(feature = "sha256")]
#[tokio::test]
#[should_panic]
async fn v2_2_unknown_checksum_function_degrades_to_no_checksum() {
// checksum 255; checksum_size 32
let option_bytes = &[0, 255, 1, 32];
let futuristic_options = [
4_u32.to_be_bytes().as_slice(),
option_bytes,
&<sha2::Sha256 as sha2::Digest>::digest(option_bytes).as_slice(),
]
.concat();
let mut eszip = main_eszip().await;
// Using sha256/32Bytes as mock hash.
eszip.set_checksum(Checksum::Sha256);
let bytes = eszip.into_bytes();
let existing_options_size = std::mem::size_of::<u32>()
std::mem::size_of::<u8>() * 4
<sha2::Sha256 as sha2::Digest>::output_size();
let options_start = ESZIP_V2_2_MAGIC.len();
let new_bytes = [
&bytes[..options_start],
futuristic_options.as_slice(),
&bytes[options_start existing_options_size..],
]
.concat();
let (new_eszip, fut) = EszipV2::parse(BufReader::new(new_bytes.as_slice()))
.await
.unwrap();
fut.await.unwrap();
assert_eq!(new_eszip.options.checksum, None);
assert_eq!(new_eszip.options.checksum_size(), Some(32));
assert!(!new_eszip.is_checksumed());
assert!(new_eszip.should_be_checksumed());
// This should panic, as cannot re-encode without setting an explicit checksum configuration
new_eszip.into_bytes();
}

src/v2.rs Outdated Show resolved Hide resolved
#[repr(u8)]
pub enum Checksum {
NoChecksum = 0,
#[cfg(feature = "sha256")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should always compile with the enum variants and instead just error when a cargo feature is not enabled when serializing or deserializing with a variant that doesn't have its feature enabled? For example, when reading an eszip that is crc32 we could error saying to enable the crc32 cargo feature in order to read that eszip.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arnauorriols arnauorriols merged commit abf63ed into main May 15, 2024
3 checks passed
dsherret pushed a commit to denoland/deno that referenced this pull request May 16, 2024
Related: denoland/eszip#181

eszip < v0.69.0 hashes all its contents to ensure data integrity. This
feature is not necessary in Deno CLI as the binary integrity guarantee
is deemed an external responsibility (ie it is to be assumed that, if
necessary, the compiled binary will be checksumed externally prior to
being executed).

eszip >= v0.69.0 no longer performs this checksum by default. This
reduces the cold-start time of the compiled binaries, proportionally to
their size.
bartlomieju pushed a commit to denoland/deno that referenced this pull request May 16, 2024
Related: denoland/eszip#181

eszip < v0.69.0 hashes all its contents to ensure data integrity. This
feature is not necessary in Deno CLI as the binary integrity guarantee
is deemed an external responsibility (ie it is to be assumed that, if
necessary, the compiled binary will be checksumed externally prior to
being executed).

eszip >= v0.69.0 no longer performs this checksum by default. This
reduces the cold-start time of the compiled binaries, proportionally to
their size.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants