
huggingface-cli scan-cache doesn't capture cached datasets #2218

Open
sealad886 opened this issue Apr 11, 2024 · 6 comments
Labels
bug Something isn't working

Comments

sealad886 (Contributor)

Describe the bug

The cache location for datasets varies depending on how you download them from Hugging Face:

  1. Download using the CLI:
> huggingface-cli download 'wikimedia/wikisource' --repo-type dataset

In this case, the default location (I'll use macOS paths since that's what I have, but I'm assuming some level of cross-platform consistency here) is $HOME/.cache/huggingface/hub/. In the above example, the directory created is datasets--wikimedia--wikisource, such that:

datasets--wikimedia--wikisource
|--blobs
    --<blobs>
|--refs
    --<?> # only one file in mine, anyway
|--snapshots
    |--<snapshot hash>
        --<symlinked content to blobs>
  2. Download using the Hugging Face datasets library:
>>> from datasets import load_dataset
>>> ds = load_dataset('wikimedia/wikisource')

In this case, the default location is no longer controlled by the environment variable HF_HUB_CACHE (it is governed by HF_DATASETS_CACHE instead). The naming convention is also slightly different. The default location is $HOME/.cache/huggingface/datasets and the directory structure is:

datasets
|--downloads
    --<shared blobs location>
|--wikimedia___wikisource     # note the 3 underscores
    --<symlinked content to downloads folder>

Using huggingface-cli scan-cache, a user is unable to see the (actually useful) second cache location. I say "actually useful" because to date I haven't been able to figure out how to easily get a dataset cached with the CLI to be used in any models in code.
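The closest thing to a workaround I can suggest (a sketch, not something I've verified against every dataset layout) is to resolve the CLI-cached snapshot path and point load_dataset at the local directory:

>>> from huggingface_hub import snapshot_download
>>> from datasets import load_dataset
>>> # resolves to the existing snapshot under HF_HUB_CACHE; no re-download
>>> # if the CLI already fetched it
>>> local_path = snapshot_download('wikimedia/wikisource', repo_type='dataset')
>>> ds = load_dataset(local_path)  # only works when the snapshot holds plain data files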

Other issues that may or may not need separate tickets

  1. Datasets will be downloaded twice if both methods are used.
  2. Datasets used by one download method are inaccessible (using standard tools and defaults) to the other method.
  3. You can't delete cached datasets in the second method using huggingface-cli delete-cache (a manual workaround is sketched below).
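For that third point, the only workaround I can think of is deleting the folders by hand. A sketch, assuming the default paths described above:

> rm -rf "$HOME/.cache/huggingface/datasets/wikimedia___wikisource"
> # the downloads folder holds blobs shared across datasets, so only clear it wholesale
> rm -rf "$HOME/.cache/huggingface/datasets/downloads"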

Reproduction

Well...use the code and examples above.

Logs

No response

System info

- huggingface_hub version: 0.22.2
- Platform: macOS-14.4.1-arm64-arm-64bit
- Python version: 3.12.2
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /Users/andrew/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: sealad886
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.2.2
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.3.0
- hf_transfer: 0.1.6
- gradio: 4.21.0
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.6.4
- aiohttp: 3.9.3
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /Users/andrew/.cache/huggingface/hub
- HF_ASSETS_CACHE: /Users/andrew/.cache/huggingface/assets
- HF_TOKEN_PATH: /Users/andrew/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
sealad886 added the bug label Apr 11, 2024
sealad886 (Contributor, Author) commented Apr 11, 2024

The offending code can be found here, where the default cache location is sourced from the environment variable HF_HUB_CACHE:
https://github.com/huggingface/huggingface_hub/blame/ebba9ef2c338149783978b489ec142ab122af42a/src/huggingface_hub/utils/_cache_manager.py#L500

I say 'offending code', but that link is just the original commit of that code. That's how it was designed at the time, I suppose, but I imagine it was decided later to give datasets a shared blob download location so that datasets could share files? I'm guessing...
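For reference, the same scan is exposed programmatically, and pointing it at the second cache shows the mismatch. A sketch (the empty-result behavior is my reading of the code, not something the docs promise):

>>> from pathlib import Path
>>> from huggingface_hub import scan_cache_dir
>>> info = scan_cache_dir()  # scans HF_HUB_CACHE; finds datasets--wikimedia--wikisource
>>> [r.repo_id for r in info.repos if r.repo_type == 'dataset']
>>> # pointing at the datasets-library cache should find no repos, since folder
>>> # names like wikimedia___wikisource don't match the layout the scanner expects
>>> info2 = scan_cache_dir(cache_dir=Path('~/.cache/huggingface/datasets').expanduser())
>>> info2.repos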

Wauplin (Contributor) commented Apr 11, 2024

Thanks for pointing that out @sealad886!

The datasets library is indeed managing its own cache and therefore not using the huggingface_hub cache. This problem has already been reported in our ecosystem, but fixing it is not as straightforward as it seems, namely because datasets works with other providers as well. I will keep this issue open as long as the datasets <> huggingface_hub integration is not consistent. Stay tuned 😉

lewtun (Member) commented Aug 21, 2024

I've recently noticed that I'm unable to use huggingface-cli scan-cache to view datasets in my cache folder - see this Colab notebook for an example.

What seems to be happening is the following:

  • HF_DATASETS_CACHE points to ~/.cache/huggingface/datasets
  • HF_HUB_CACHE points to ~/.cache/huggingface/hub
  • Even if we move datasets to HF_HUB_CACHE, the cache scan fails because datasets are downloaded with a triple underscore ___ between the org and dataset name, while hfh looks for folder names like datasets--{org}--{dataset_name}. See this line
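Side by side, the two conventions for the same repo:

wikimedia___wikisource             # what datasets writes
datasets--wikimedia--wikisource    # what scan-cache expects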

Is there a simple workaround via the env vars so that one can use huggingface-cli scan-cache and huggingface-cli delete-cache for both models and datasets?

Wauplin (Contributor) commented Aug 21, 2024

Hi @lewtun, thanks for the feedback. This is something specific to datasets internals that is getting fixed in huggingface/datasets#7105 by @lhoestq and @albertvillanova. Once released, all data will be downloaded to ~/.cache/huggingface/hub by default. The ~/.cache/huggingface/datasets folder will still be used, but only for unzipping content / generating arrow files / etc. All files downloaded from the Hub will then be eligible for scan-cache and delete-cache :)
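Concretely, once that lands, something like this should list cached datasets alongside models (a sketch using the public scan API):

>>> from huggingface_hub import scan_cache_dir
>>> for repo in scan_cache_dir().repos:
...     if repo.repo_type == 'dataset':
...         print(repo.repo_id, repo.size_on_disk_str)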

lewtun (Member) commented Aug 21, 2024

Thanks for the pointer @Wauplin! If I'm not mistaken, there's still an issue with the cache directory naming in datasets from this line, which replaces / with ___, while huggingface-cli scan-cache looks for folders with this format: datasets--{org}--{dataset_name}

lhoestq (Member) commented Aug 21, 2024

That only concerns the ~/.cache/huggingface/datasets cache, which is used only for unzipping content / generating arrow files / etc., and is not eligible for scan-cache.
