Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checksum validation with hf_hub_download on model files. #2364

Open
JGSweets opened this issue Jul 1, 2024 · 1 comment
Open

Checksum validation with hf_hub_download on model files. #2364

JGSweets opened this issue Jul 1, 2024 · 1 comment

Comments

@JGSweets
Copy link

JGSweets commented Jul 1, 2024

Is your feature request related to a problem? Please describe.
After reviewing: #1738 and #2223 it looks like file checksums are only computed on the cache dir in specific conditions. Ideally, a user could knowingly force a checksum post download as well as on retrieval from cache to ensure integrity of the files with any usage.

It's possible I misunderstood the code or discussion though.

Describe the solution you'd like
Add an input arg and environment variable to enforce checksums on files for each hf_hub_download call on the retrieved files.

Describe alternatives you've considered
Pre-downloading files manually and manually checking file integrity before using the cached files.

@Wauplin
Copy link
Contributor

Wauplin commented Jul 2, 2024

Hi @JGSweets, thanks for opening the issue. The 2 PRs you've linked are only related to "downloading to a local directory", not the generic "downloading into the HF cache directory" workflow. If we add such a validation, we would do it for both. The main problem with checking the file integrity after a download is the time it takes to do it:

  • I don't want to do it by default, because of the extra overhead it would have for users (computing a sha256 file on GBs of data can takes minutes)
  • I'm a bit chill to add a new parameter as it would most likely not be consistently propagated to all libraries using huggingface_hub
  • what we could have is to provide an extra utilities that checks the sha256 of files separately. Users wanting extra security would have to execute it before loading files. Would something like this work for you?
  • why not an environment variable but I'm worried it would result in a more logic client side for a feature that will most likely be hidden/not used by end users.
    To be honest I'm still unsure we really need such a feature. We are already checking downloaded filesize to ensure all bytes have been retrieved from server (in case of network issues). Checking checksum is a security extra step to ensure the Hub is not corrupted but that could be checked on our side as well.

cc @Pierrci @julien-c in case you have other opinion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants