[Snapshot download] Allow to load local repo id with snapshot download #505

patrickvonplaten · 2021-11-29T10:45:58Z

This PR adds more function params to the snapshot download method. @LysandreJik @osanseviero - are there any reasons why we wouldn't add those?

patrickvonplaten · 2021-11-29T10:47:12Z

This PR is needed for the integration with kensho-technologies/pyctcdecode#32

LysandreJik · 2021-11-29T11:25:57Z

Nice! We'll need some tests and a bit of context :)

patrickvonplaten · 2021-11-29T11:31:10Z

@LysandreJik @osanseviero - I kind of need to be able to use the snapshot download method for local files only. I've added a working solution to this PR that works as follows:

1. Take repo_id and transform it into a local folder name, e.g. kensho/lm_model into kensho__lm_model
2. Then list all cached local folder names, e.g. [".cache/hub/kensho__lm_model.12341234", ".cache/hub/kensho__lm_model.123411234"]
3. Take the folder (identified by its hash) that was last modified
4. Load all files from there
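
The steps above can be sketched as follows. This is only an illustration of the "last modified folder" heuristic, not the code in this PR: the helper name `latest_cached_folder` and the assumption that cached folders are named `<folder_name>.<hash>` directly under `cache_dir` are hypothetical.

```python
import glob
import os
from typing import Optional


def latest_cached_folder(cache_dir: str, repo_id: str) -> Optional[str]:
    """Return the most recently modified cached folder for repo_id, or None."""
    # Step 1: "kensho/lm_model" -> "kensho__lm_model"
    folder_prefix = repo_id.replace("/", "__")

    # Step 2: list every cached folder assumed to be named "<prefix>.<hash>"
    pattern = os.path.join(glob.escape(cache_dir), glob.escape(folder_prefix) + ".*")
    candidates = [path for path in glob.glob(pattern) if os.path.isdir(path)]
    if not candidates:
        return None

    # Step 3: the "magic" part, pick the folder with the newest modification time
    return max(candidates, key=os.path.getmtime)
```

Step 4 would then load whatever files are found in the returned folder. Note that the modification time says nothing about which revision a folder holds.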

So the somewhat "magic" part here is taking the last modified folder. Would such logic be OK with you, @LysandreJik?

If yes, I'll add tests, comments and make everything clean

LysandreJik · 2021-11-29T11:58:46Z

Thanks for your contribution, @patrickvonplaten! I understand the need for a snapshot_download with a local_files_only parameter.

The only issue I can see here is that it does not play well at all with the revision parameter:

snapshot_download("xxx")
snapshot_download("xxx", revision="yyy")

snapshot_download("xxx", local_files_only=True)  # Will return the one with revision "yyy" even if the "xxx" should have been the one chosen as it was the latest commit

I can't think of a way to go around this limitation, which seems quite troublesome to me. It does seem like a very rare case to me so maybe putting a warning is enough, wdyt?

The following use-case doesn't work either:

snapshot_download("xxx", revision="zzz")
snapshot_download("xxx", revision="yyy")

snapshot_download("xxx", revision="zzz", local_files_only=True)  # Will return the one with revision "yyy"

julien-c · 2021-11-29T11:59:56Z

Out of curiosity, why does the canonical method of loading all files by name (like in transformers for instance) not work in your use case?

patrickvonplaten · 2021-11-29T12:17:32Z

> Out of curiosity, why does the canonical method of loading all files by name (like in transformers for instance) not work in your use case?

The problem is that for https://github.com/kensho-technologies/pyctcdecode the language model is loaded as follows: https://github.com/kensho-technologies/pyctcdecode/blob/b50b5ae39ebb047c42765be0dce2c27137bd9474/pyctcdecode/language_model.py#L352

meaning that it is not loaded by a hard-coded name, but rather as the "last file" that is found in the folder language_model. We could give the file a fixed name, but then we would still have the problem that the file can be in either .arpa or .bin format. So we would require an ugly flag such as format="bin" anyways to be able to load both kenLM.arpa and kenLM.bin.
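
To illustrate why no fixed filename can be assumed, here is a minimal sketch of loading by extension rather than by name. It is not pyctcdecode's actual code: the helper `find_language_model_file` and the sorted "last file" rule are assumptions made for this example; only the .arpa and .bin extensions come from the discussion above.

```python
import os
from typing import Optional

# Extensions mentioned in the thread; anything else in the folder is ignored.
KENLM_EXTENSIONS = (".arpa", ".bin")


def find_language_model_file(folder: str) -> Optional[str]:
    """Return the last (alphabetically) .arpa/.bin file in folder, or None."""
    matches = sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith(KENLM_EXTENSIONS)
    )
    if not matches:
        return None
    return os.path.join(folder, matches[-1])
```

Because the filename is only discovered by listing the folder, the whole repository snapshot has to be available locally, which is what snapshot_download provides.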

patrickvonplaten · 2021-11-29T13:31:41Z

> Thanks for your contribution, @patrickvonplaten! I understand the need for a snapshot_download with a local_files_only parameter.
> The only issue I can see here is that it does not play well at all with the revision parameter:
>
> snapshot_download("xxx")
> snapshot_download("xxx", revision="yyy")
>
> snapshot_download("xxx", local_files_only=True)  # Will return the one with revision "yyy" even if the "xxx" should have been the one chosen as it was the latest commit
>
> I can't think of a way to go around this limitation, which seems quite troublesome to me. It does seem like a very rare case to me so maybe putting a warning is enough, wdyt?
> The following use-case doesn't work either:
>
> snapshot_download("xxx", revision="zzz")
> snapshot_download("xxx", revision="yyy")
>
> snapshot_download("xxx", revision="zzz", local_files_only=True)  # Will return the one with revision "yyy"

Ah yeah, this is indeed a big problem. However, the revision is not saved anywhere locally (it's only encoded in the saved sha, no?), so there is no way to retrieve the correct revision when we don't have an internet connection, right?

We could require the user to provide the repo structure when using local_files_only, or else throw a warning that the last updated folder is used? What do you think @LysandreJik ?

Or we save the revision in the cached name? As follows:

<model_id>.<revision>.<hash>

patrickvonplaten · 2021-11-29T14:18:35Z

@LysandreJik , @julien-c - I now went for the solution where we update the cache folder names to
<repo_id>.<revision>.<hash_of_folder>

I think that being able to load repos from a specific revision locally is quite important and that we should enable this functionality. At the same time, we cannot assume that the filenames in a repo are always hardcoded, since some repositories like https://github.com/kensho-technologies/pyctcdecode/blob/b50b5ae39ebb047c42765be0dce2c27137bd9474/pyctcdecode/language_model.py#L353 load non-hardcoded filenames.

I don't see any way other than saving the revision in the name of the cached folder, which can then be correctly looked up when local_files_only=True.
This change will require a new cached download for people using snapshot_download, which I think is OK given that snapshot_download is not yet used that widely (IMO).
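
As a rough sketch of the naming scheme described above: the function name `storage_folder_name` is hypothetical, the "/" to "__" replacement follows the earlier kensho__lm_model example, and the rule of appending the commit sha only when it differs from the revision mirrors the diff snippet reviewed later in this thread.

```python
def storage_folder_name(repo_id: str, revision: str, commit_sha: str) -> str:
    """Build a cache folder name of the form <repo_id>.<revision>.<hash_of_folder>."""
    name = f"{repo_id.replace('/', '__')}.{revision}"
    # When the revision already is the commit sha, do not repeat it.
    if revision != commit_sha:
        name += f".{commit_sha}"
    return name
```

With this layout, an offline lookup for a branch name or for a commit sha can be answered from the folder names alone.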

What do you think?

LysandreJik

Thanks for iterating on the implementation. I think there's either a small issue with the implementation or that we should align on the behavior of this method.

My understanding is that snapshot_download should always fetch the latest update on the default branch of the repository when no revision is passed (main right now, but this should later be updated to reflect the repository's default git branch). If a revision is passed and it is a branch, it will fetch the latest update on that branch. If it is a tag or a commit hash, it will fetch the exact version referenced.

With local_files_only, this becomes a bit tricky: if we handle main (or any other branch name), then the aforementioned behavior cannot be strictly respected. Let's see why:

# Fetch repository on branch `main`
directory = snapshot_download("xxx")

# <Update of the repository>

directory = snapshot_download("xxx", local_files_only=True)

This will work correctly and return the correct directory.

# Fetch repository on branch `main`
directory = snapshot_download("xxx")

# <Update of the repository with commit #6fr4de on branch main>

directory = snapshot_download("xxx", revision="6fr4de")

directory = snapshot_download("xxx", local_files_only=True)

This will now return the directory with the main identifier, even though the second call to snapshot_download called a more recent update of that same branch (main).

TBH I think this is more of a nit than an actual issue. In most cases, users who specify the revision parameter in a snapshot_download call will also specify it when calling snapshot_download with local_files_only=True.

I guess to be explicit here, I'd vote to log a warning when not specifying a specific revision:

WARNING: You called `snapshot_download` without specifying an explicit revision and with no networking. Will return the latest local download of the `main` branch.
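
A minimal sketch of how such a warning could be emitted, assuming Python's standard `warnings` module is used. The helper `resolve_offline_revision` and its `default_branch` parameter are hypothetical and only illustrate the proposed behavior.

```python
import warnings
from typing import Optional


def resolve_offline_revision(revision: Optional[str], default_branch: str = "main") -> str:
    """Return the revision to look up locally, warning if none was given."""
    if revision is None:
        # No explicit revision and no network: fall back to the default branch.
        warnings.warn(
            "You called `snapshot_download` without specifying an explicit "
            "revision and with no networking. Will return the latest local "
            f"download of the `{default_branch}` branch."
        )
        return default_branch
    return revision
```

Using `warnings.warn` keeps the behavior testable with `assertWarns`, which is what the test snippet reviewed later in this thread relies on.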

Another, smallish issue I have is that the following doesn't work:

snapshot_download("xxx", revision="main")
# Returns ~/.cache/xxx.main.<COMMIT_SHA>

snapshot_download("xxx", revision="main", local_files_only=True)
# Returns ~/.cache/xxx.main.<COMMIT_SHA>

snapshot_download("xxx", revision="<COMMIT_SHA>", local_files_only=True)
# ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False

It seems that most operating systems (except Windows) have a maximum filename length of 255 characters and a maximum path length of 4096 characters. With ~40 characters per hash and a maximum of two hashes per directory name (revision + file hash), it seems we should be good to go. Windows should work too since #40
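
The arithmetic behind that estimate can be checked with a small sketch. The constants and helper names are illustrative assumptions: a 255-character filename limit, 40-character hex hashes, and the `<repo_id>.<hash>.<hash>` worst case with "/" replaced by "__".

```python
MAX_FILENAME_LENGTH = 255  # common limit on most non-Windows file systems
HASH_LENGTH = 40           # length of a full hex commit hash


def folder_name_length(repo_id: str, hashes: int = 2) -> int:
    """Length of "<repo_id>.<hash>.<hash>" after replacing "/" with "__"."""
    base = len(repo_id.replace("/", "__"))
    # Each hash contributes its own length plus one "." separator.
    return base + hashes * (HASH_LENGTH + 1)


def fits_in_filename(repo_id: str) -> bool:
    """True if the worst-case folder name stays within the filename limit."""
    return folder_name_length(repo_id) <= MAX_FILENAME_LENGTH
```

For kensho/lm_model the worst case is 16 + 2 * 41 = 98 characters, which leaves ample room below 255.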


julien-c · 2021-11-30T17:59:17Z

I might be missing something, but why don't we enforce that if local_files_only=True then revision must be a full commit hash?

This way there's no potential indirection. If it's used mostly for offline tests, it will just work out of the box.
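
For illustration, the stricter rule suggested here could look like the sketch below. It was not adopted in this PR; the helper `check_offline_revision` is hypothetical, and it assumes a full commit hash is 40 lowercase hexadecimal characters.

```python
import re
from typing import Optional

# Assumption: a full commit hash is exactly 40 lowercase hex characters.
_FULL_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def check_offline_revision(revision: Optional[str], local_files_only: bool) -> None:
    """Raise ValueError if offline mode is used without a full commit hash."""
    if not local_files_only:
        return
    if revision is None or _FULL_COMMIT_HASH.fullmatch(revision) is None:
        raise ValueError(
            "With local_files_only=True, `revision` must be a full commit hash."
        )
```

Under this rule a plain snapshot_download("xxx", local_files_only=True) would be rejected, since no revision is given.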

What am I missing here?

LysandreJik · 2021-11-30T18:07:12Z

Hmmm if we do that then we lose quite a big portion of the API I think, no? We're mostly dealing with edge cases here, but I think the following is more important than the rest:

snapshot_download("xxx")

# Should work
snapshot_download("xxx", local_files_only=True)

LysandreJik

This looks good to me, only left a few comments. Thanks for working on it and iterating, @patrickvonplaten!

LysandreJik · 2021-12-01T15:53:10Z

src/huggingface_hub/snapshot_download.py

+        if len(repo_folders) == 0:
+            raise ValueError(
+                "Cannot find the requested files in the cached path and outgoing traffic has been"
+                " disabled. To enable model look-ups and downloads online, set 'local_files_only'"
+                " to False."
+            )



LysandreJik · 2021-12-01T16:00:48Z

src/huggingface_hub/snapshot_download.py

-    )
+        # 3) cached repos of format <repo_id>.<any-branch>.{revision}
+        # -> in this case {revision} also has to be a commit sha
+        repo_folders_branch_commit = glob(repo_folders_prefix + ".*." + revision)


And this isn't necessarily a branch + a commit, I'd say it's an explicit_revision + a folder_sha (even if the latter is ultimately a commit hash)

How about repo_folders_explicit_revision_and_sha ?
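
To make the lookup under discussion concrete, here is a hedged sketch of matching a requested revision against cached folder names. Only the third pattern is taken from the diff above; the first two patterns, the function name `find_cached_folders`, and the "/" to "__" replacement are assumptions for this example and may differ from the merged code.

```python
import glob
import os
from typing import List


def find_cached_folders(cache_dir: str, repo_id: str, revision: str) -> List[str]:
    """Return cached folders for repo_id that could correspond to revision."""
    prefix = os.path.join(
        glob.escape(cache_dir), glob.escape(repo_id.replace("/", "__"))
    )
    rev = glob.escape(revision)
    patterns = [
        prefix + "." + rev,         # assumed: <repo_id>.<commit_sha>
        prefix + "." + rev + ".*",  # assumed: <repo_id>.<revision>.<commit_sha>
        prefix + ".*." + rev,       # from the diff: <repo_id>.<any-branch>.<commit_sha>
    ]
    found: List[str] = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            if path not in found:
                found.append(path)
    return found
```

The third pattern is what lets a folder downloaded as revision="main" be found again when only its commit sha is requested offline.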


LysandreJik · 2021-12-01T16:06:12Z

src/huggingface_hub/snapshot_download.py

+        if revision != model_info.sha:
+            storage_folder += f".{model_info.sha}"


I wonder if this is necessary or superfluous - does it hurt to keep things simple at the cost of having a duplicate of sha in the saved folder name?

After f2f explanation I understand why this is necessary :)

LysandreJik · 2021-12-01T16:08:58Z

tests/test_snapshot_download.py

+            )
+
+            # now load from cache and make sure warning to be raised
+            with self.assertWarns(Warning):


[Snapshotdownload] Add more parameter names

51a45ad

patrickvonplaten requested review from LysandreJik and osanseviero November 29, 2021 10:46

up

780c92e

patrickvonplaten changed the title ~~[Snapshotdownload] Add more parameter names~~ [Snapshot download] Allow to load local repo id with snapshot download Nov 29, 2021

fix style

dea93a4

patrickvonplaten mentioned this pull request Nov 29, 2021

[Integration with 🤗 Hugging Face] Add load_from_hub to BeamSearchDecoder kensho-technologies/pyctcdecode#32

Merged

patrickvonplaten added 3 commits November 29, 2021 14:56

add revision to cache

726121c

update tests

449f369

make style

7292e1a

patrickvonplaten requested a review from julien-c November 29, 2021 14:19

LysandreJik reviewed Nov 30, 2021


patrickvonplaten added 3 commits November 30, 2021 18:17

improve snapshot download

b3470b5

add tests

3a3c11b

delete dummy folders

4558ccd

Remove tree

34a8a9c

LysandreJik approved these changes Dec 1, 2021


finish cette mer**

5f27771

LysandreJik merged commit 99bae57 into main Dec 1, 2021

LysandreJik deleted the add_more_input_params_to_snapshot_download branch December 1, 2021 16:59
