[Doc] Update "Download files from the Hub" doc #948

Merged 7 commits, Jul 26, 2022.
4 changes: 2 additions & 2 deletions README.md

The `huggingface_hub` is a client library to interact with the Hugging Face Hub. The Hugging Face Hub is a platform with over 35K models, 4K datasets, and 2K demos in which people can easily collaborate in their ML workflows. The Hub works as a central place where anyone can share, explore, discover, and experiment with open-source Machine Learning.

With `huggingface_hub`, you can easily download and upload models, datasets, and Spaces. You can extract useful information from the Hub, and do much more. Some example use cases:
* Downloading and caching files from a Hub repository.
* Creating repositories and uploading an updated model every few epochs.
* Extracting metadata from all models that match certain criteria (e.g. models for `text-classification`).
We're partnering with cool open source ML libraries to provide free model hosting and versioning.

The advantages are:

- Free model or dataset hosting for libraries and their users.
- Built-in file versioning, even with very large files, thanks to a git-based approach.
- Hosted inference API for all models publicly available.
- In-browser widgets to play with the uploaded models.
68 changes: 22 additions & 46 deletions docs/source/how-to-downstream.mdx
stored on the Hub. You can use these functions independently or integrate them into your
own library, making it more convenient for your users to interact with the Hub. This
guide will show you how to:

* Download and store a file from the Hub.
* Download all the files in a repository.

## Download and store a file from the Hub

The [`hf_hub_download`] function is the main function for downloading files from the Hub.

It downloads the remote file, stores it on disk (in a version-aware way), and returns its local file path.

Use the `repo_id` and `filename` parameters to specify which file to download:

```python
>>> from huggingface_hub import hf_hub_download
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json")
'/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json'
```

Use the `revision` parameter to download a file from a specific version, which can be a
branch name, a tag, or a commit hash. When using the commit hash, it must be the
full-length hash instead of a 7-character commit hash:

```python
>>> hf_hub_download(
... repo_id="lysandre/arxiv-nlp",
... filename="config.json",
... revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a",
... )
'/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/877b84a8f93f2d619faa2a6e514a32beef88ab0a/config.json'
```

To specify a file revision with the branch name:

```python
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="main")
```

To specify a file revision with a tag identifier, for example the `v1.0` version of the
`config.json` file:

```python
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="v1.0")
```
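The local paths returned above follow a predictable, version-aware layout inside the cache directory. As a rough sketch of that layout (the helper below is illustrative only, not part of the library's API):

```python
def cached_snapshot_path(cache_dir: str, repo_id: str, commit_hash: str) -> str:
    # Illustrative only: mirrors the layout visible in the outputs above,
    # where "/" in the repo id becomes "--" and files live under snapshots/<commit>.
    repo_folder = "models--" + repo_id.replace("/", "--")
    return f"{cache_dir}/{repo_folder}/snapshots/{commit_hash}"

print(cached_snapshot_path(
    "/root/.cache/huggingface/hub",
    "lysandre/arxiv-nlp",
    "877b84a8f93f2d619faa2a6e514a32beef88ab0a",
))
# /root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/877b84a8f93f2d619faa2a6e514a32beef88ab0a
```

Because the snapshot folder is keyed by commit hash, downloading the same file at two different revisions stores two separate copies instead of overwriting one.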

## Construct a download URL

If you want to construct the URL used to download a file from a repository, use [`hf_hub_url`], which returns the file's URL.
Note that it is used internally by [`hf_hub_download`].
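The URL follows the pattern shown in the examples above. A minimal sketch of that pattern (illustrative only; [`hf_hub_url`] itself also handles parameters such as `repo_type` and `subfolder`):

```python
def build_hub_url(repo_id: str, filename: str, revision: str = "main") -> str:
    # Illustrative sketch of the URL pattern; use hf_hub_url for real code.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

print(build_hub_url("lysandre/arxiv-nlp", "config.json"))
# https://huggingface.co/lysandre/arxiv-nlp/resolve/main/config.json
```

The `revision` segment accepts the same values as the download functions: a branch name, a tag, or a full commit hash.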

## Download an entire repository

[`snapshot_download`] downloads an entire repository at a given revision. Like
[`hf_hub_download`], all downloaded files are cached on your local disk.

Download a whole repository as shown in the following:

To download from a specific repository revision, pass the `revision` parameter to [`snapshot_download`].

In general, it is usually better to download files with [`hf_hub_download`] if you
already know the file names you need.
[`snapshot_download`] is helpful when you are unaware of which files to download.

However, you don't always want to download the contents of an entire repository with
[`snapshot_download`]. You can use the `allow_regex` and `ignore_regex` parameters to
filter which files are downloaded; for example, you can ignore the `.msgpack` and `.h5` file extensions.
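As a rough sketch of how such filtering behaves, the function below is a hypothetical re-creation for illustration only; in practice the patterns are applied by [`snapshot_download`] itself, e.g. `snapshot_download(repo_id, ignore_regex=["*.msgpack", "*.h5"])`:

```python
from fnmatch import fnmatch

def keep_file(path, allow_patterns=None, ignore_patterns=None):
    # Hypothetical re-creation of glob-style filtering, for illustration only.
    # A file is kept if it matches at least one allow pattern (when given)
    # and matches no ignore pattern (when given).
    if allow_patterns is not None and not any(fnmatch(path, p) for p in allow_patterns):
        return False
    if ignore_patterns is not None and any(fnmatch(path, p) for p in ignore_patterns):
        return False
    return True

files = ["config.json", "pytorch_model.bin", "flax_model.msgpack", "tf_model.h5"]
print([f for f in files if keep_file(f, ignore_patterns=["*.msgpack", "*.h5"])])
# ['config.json', 'pytorch_model.bin']
```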

Passing a regex can be especially useful when repositories contain files that are never
expected to be downloaded by [`snapshot_download`].

2 changes: 0 additions & 2 deletions docs/source/package_reference/file_download.mdx

[[autodoc]] huggingface_hub.snapshot_download

[[autodoc]] huggingface_hub.cached_download
> Contributor: Why are we removing this?
>
> Member (author): It's deprecated so I wouldn't promote the method in the package reference anymore. Do you feel strongly otherwise?

[[autodoc]] huggingface_hub.hf_hub_url

## Caching
78 changes: 31 additions & 47 deletions src/huggingface_hub/README.md

## Download files from the Hub

The `hf_hub_download()` function is the main function to download files from the Hub. One
advantage of using it is that files are cached locally, so you won't have to
download the files multiple times. If there are changes in the repository, the
files will be automatically downloaded again.


### `hf_hub_download`

The function takes the following parameters, downloads the remote file,
stores it to disk (in a version-aware way) and returns its local file path.

Parameters:
- a `repo_id` (a user or organization name and a repo name, separated by `/`, like `julien-c/EsperBERTo-small`)
hf_hub_download("lysandre/arxiv-nlp", filename="config.json")

### `snapshot_download`

Using `hf_hub_download()` works well when you know which files you want to download;
for example a model file alongside a configuration file, both with static names.
There are cases in which you will prefer to download all the files of the remote
repository at a specified revision. That's what `snapshot_download()` does. It
Parameters:
- a `cache_dir` which you can specify if you want to control where on disk the
files are cached

### `hf_hub_url`

Internally, the library uses `hf_hub_url()` to return the URL to download the actual files:
`https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin`


Parameters:
- a `repo_id` (a user or organization name and a repo name separated by a `/`, like `julien-c/EsperBERTo-small`)
- a `filename` (like `pytorch_model.bin`)
- an optional `subfolder`, corresponding to a folder inside the model repo
- an optional `repo_type`, such as `dataset` or `space`
- an optional Git revision id (can be a branch name, a tag, or a commit hash)

If you check out this URL's headers with an HTTP `HEAD` request (which you can do
from the command line with `curl -I`) for a few different files, you'll see
that:
- small files are returned directly
- large files (i.e. the ones stored through
[git-lfs](https://git-lfs.github.com/)) are returned via a redirect to a
Cloudfront URL. Cloudfront is a Content Delivery Network, or CDN, that ensures
that downloads are as fast as possible from anywhere on the globe.

<br>

## Publish files to the Hub
With the `HfApi` class there are methods to query models, datasets, and metrics:
- `list_datasets()`
- `dataset_info()`
- `get_dataset_tags()`

- **Spaces**:
- `list_spaces()`
- `space_info()`

These lightly wrap around the API Endpoints. Documentation for valid parameters and descriptions can be found [here](https://huggingface.co/docs/hub/endpoints).

