Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[Snapshot download] Allow to load local repo id with snapshot download #505

Merged
merged 11 commits into from
Dec 1, 2021

Conversation

patrickvonplaten
Copy link
Contributor

@patrickvonplaten patrickvonplaten commented Nov 29, 2021

This PR adds more function params to the snapshot download method. @LysandreJik @osanseviero - are there any reasons why we wouldn't add those?

@patrickvonplaten
Copy link
Contributor Author

This PR is needed for the integration with kensho-technologies/pyctcdecode#32

@patrickvonplaten patrickvonplaten changed the title [Snapshotdownload] Add more parameter names [Snapshot download] Allow to load local repo id with snapshot download Nov 29, 2021
@LysandreJik
Copy link
Member

LysandreJik commented Nov 29, 2021

Nice! We'll need some tests and a bit of context :)

@patrickvonplaten
Copy link
Contributor Author

patrickvonplaten commented Nov 29, 2021

@LysandreJik @osanseviero - I kind of need to be able to use the snapshot download method for local files only. I've added a working solution to this PR that works as follows:

  • Take repo_id and transform into a local folder name. E.g. kensho/lm_model into kensho__lm_model
  • Then list all cached local folder names, e.g. [".cache/hub/kensho__lm_model.12341234", ".cache/hub/kensho__lm_model.123411234"]
  • Take the folder hash that was last modified
  • Load all files from there

So the somewhat "magic" part here is to take the last modified folder name. Would such a logic be ok for you @LysandreJik ?

If yes, I'll add tests, comments and make everything clean

@LysandreJik
Copy link
Member

LysandreJik commented Nov 29, 2021

Thanks for your contribution, @patrickvonplaten! I understand the need for a snapshot_download with a local_files_only parameter.

The only issue I can see here is that it does not play well at all with the revision parameter:

snapshot_download("xxx")
snapshot_download("xxx", revision="yyy")

snapshot_download("xxx", local_files_only=True)  # Will return the one with revision "yyy" even if the "xxx" should have been the one chosen as it was the latest commit

I can't think of a way to go around this limitation, which seems quite troublesome to me. It does seem like a very rare case to me so maybe putting a warning is enough, wdyt?

The following use-case doesn't work either:

snapshot_download("xxx", revision="zzz")
snapshot_download("xxx", revision="yyy")

snapshot_download("xxx", revision="zzz", local_files_only=True)  # Will return the one with revision "yyy"

@julien-c
Copy link
Member

julien-c commented Nov 29, 2021

Out of curiosity, why does the canonical method of loading all files by name (like in transformers for instance) not work in your use case?

@patrickvonplaten
Copy link
Contributor Author

patrickvonplaten commented Nov 29, 2021

Out of curiosity, why does the canonical method of loading all files by name (like in transformers for instance) not work in your use case?

The problem is that for https://github.com/kensho-technologies/pyctcdecode the language model is loaded as follows: https://github.com/kensho-technologies/pyctcdecode/blob/b50b5ae39ebb047c42765be0dce2c27137bd9474/pyctcdecode/language_model.py#L352

meaning that it is not loaded by a hard-coded name, but rather the "last file" that is found in the folder language_model. We could give the file a name, but then we also have the problem that the file can be both in .arpa format or .bin format. So we would require an ugly flag "format="bin" anymays to be able to load both kenLM.arpa or kenLM.bin.

@patrickvonplaten
Copy link
Contributor Author

patrickvonplaten commented Nov 29, 2021

Thanks for your contribution, @patrickvonplaten! I understand the need for a snapshot_download with a local_files_only parameter.
The only issue I can see here is that it does not play well at all with the revision parameter:

snapshot_download("xxx")
snapshot_download("xxx", revision="yyy")

snapshot_download("xxx", local_files_only=True)  # Will return the one with revision "yyy" even if the "xxx" should have been the one chosen as it was the latest commit

I can't think of a way to go around this limitation, which seems quite troublesome to me. It does seem like a very rare case to me so maybe putting a warning is enough, wdyt?
The following use-case doesn't work either:

snapshot_download("xxx", revision="zzz")
snapshot_download("xxx", revision="yyy")

snapshot_download("xxx", revision="zzz", local_files_only=True)  # Will return the one with revision "yyy"

Ah yeah this is indeed a big problem. The revision is however nowhere locally saved (it's only encoded in the saved sha no?) -> so there is no way that we can retrieve the correct revision when we don't have internet connection no?

We could force the user to have to provide the repo structure when doing local_files_only or else throw a warning that the last updated folder is used? What do you think @LysandreJik ?

Or we save the revision in the cached name? As follows:

<model_id>.<revision>.<hash>

@patrickvonplaten
Copy link
Contributor Author

@LysandreJik , @julien-c - I now went for the solution where we update the cache folder names to
<repo_id>.<revision>.<hash_of_folder>

I think that being able to load repos from a specific revision locally is quite important and that we should enable this functionality. At the same time, we cannot assume that the filenames in a repo are always hardcoded since some repositories
like https://github.com/kensho-technologies/pyctcdecode/blob/b50b5ae39ebb047c42765be0dce2c27137bd9474/pyctcdecode/language_model.py#L353 load non-hardcoded filenames.

I don't see another way as saving the revision into the name of the cached folder which can then be correctly looked up when local_files_only=True.
This change will require a new cached download for people using snapshot_download which I think is ok given that snapshot_download is not yet used that widely (IMO).

What do you think?

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on the implementation. I think there's either a small issue with the implementation or that we should align on the behavior of this method.

My understanding is that snapshot_download should always fetch the latest update on the default branch of the repository (main right now, but should be updated later to reflect the default git branch in the future) when no revision is passed. If a revision is passed and it is a branch, will fetch the latest update on that branch. If it is a tag or a commit hash, then will fetch the exact version referenced.

With local_files_only, this becomes a bit tricky: if we handle main (or any other branch name), then the aforementioned behavior cannot be strictly respected. Let's see why:

# Fetch repository on branch `main`
directory = snapshot_download("xxx")

# <Update of the repository>

directory = snapshot_download("xxx", local_files_only=True)

This will work correctly and return the correct directory.

# Fetch repository on branch `main`
directory = snapshot_download("xxx")

# <Update of the repository with commit #6fr4de on branch main>

directory = snapshot_download("xxx", revision="6fr4de")

directory = snapshot_download("xxx", local_files_only=True)

This will now return the directory with the main identifier, even though the second call to snapshot_download called a more recent update of that same branch (main).

TBH I think this is mostly a nit than an actual issue. In most cases, users that indicate the revision parameter in a snapshot_download will also indicate it in the same snapshot_download when calling it with local_files_only=True.

I guess to be explicit here, I'd vote to log a warning when not specifying a specific revision:

WARNING: You called `snapshot_download` without specifying an explicit revision and with no networking. Will return the latest local download of the `main` branch.

Another, smallish issue I have is that the following doesn't work:

snapshot_download("xxx", revision="main")
# Returns ~/.cache/xxx.main.<COMMIT_SHA>

snapshot_download("xxx", revision="main", local_files_only=True)
# Returns ~/.cache/xxx.main.<COMMIT_SHA>

snapshot_download("xxx", revision="<COMMIT_SHA>", local_files_only=True)
# ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False

It seems that most OS (except Windows) have a maximum filename length of 255, and a maximum path of 4096 characters. With ~40 characters per hash and a maximum of two hashes per directory name (revision + file hash), it seems we should be good to go. Windows should work to since #40

@julien-c
Copy link
Member

I might be missing something, but why don't we enforce that if local_files_only=True then revision must be a full commit hash?

This way there's no potential indirection. If it's used mostly for offline tests, it will just work out of the box.

What am I missing here?

@LysandreJik
Copy link
Member

Hmmm if we do that then we lose quite a big portion of the API I think, no? We're mostly dealing with edge cases here, but I think the following is more important than the rest:

snapshot_download("xxx")

# Should work
snapshot_download("xxx", local_files_only=True)

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, only left a few comments. Thanks for working on it and iterating, @patrickvonplaten!

Comment on lines +95 to +100
if len(repo_folders) == 0:
raise ValueError(
"Cannot find the requested files in the cached path and outgoing traffic has been"
" disabled. To enable model look-ups and downloads online, set 'local_files_only'"
" to False."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

)
# 3) cached repos of format <repo_id>.<any-branch>.{revision}
# -> in this case {revision} also has to be a commit sha
repo_folders_branch_commit = glob(repo_folders_prefix + ".*." + revision)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this isn't necessarily a branch + a commit, I'd say it's an explicit_revision + a folder_sha (even if the latter is ultimately a commit hash)

How about repo_folders_explicit_revision_and_sha ?

Comment on lines +138 to +139
if revision != model_info.sha:
storage_folder += f".{model_info.sha}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is necessary or superfluous - does it hurt to keep things simple at the cost of having a duplicate of sha in the saved folder name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After f2f explanation I understand why this is necessary :)

)

# now load from cache and make sure warning to be raised
with self.assertWarns(Warning):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good

@LysandreJik LysandreJik merged commit 99bae57 into main Dec 1, 2021
@LysandreJik LysandreJik deleted the add_more_input_params_to_snapshot_download branch December 1, 2021 16:59
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants