-
Notifications
You must be signed in to change notification settings - Fork 632
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[Snapshot download] Allow to load local repo id with snapshot download #505
Conversation
This PR is needed for the integration with kensho-technologies/pyctcdecode#32 |
Nice! We'll need some tests and a bit of context :) |
@LysandreJik @osanseviero - I kind of need to be able to use the snapshot download method for local files only. I've added a working solution to this PR that works as follows:
So the somewhat "magic" part here is to take the last modified folder name. Would such a logic be ok for you @LysandreJik ? If yes, I'll add tests, comments and make everything clean |
Thanks for your contribution, @patrickvonplaten! I understand the need for a The only issue I can see here is that it does not play well at all with the snapshot_download("xxx")
snapshot_download("xxx", revision="yyy")
snapshot_download("xxx", local_files_only=True) # Will return the one with revision "yyy" even if the "xxx" should have been the one chosen as it was the latest commit I can't think of a way to go around this limitation, which seems quite troublesome to me. It does seem like a very rare case to me so maybe putting a warning is enough, wdyt? The following use-case doesn't work either: snapshot_download("xxx", revision="zzz")
snapshot_download("xxx", revision="yyy")
snapshot_download("xxx", revision="zzz", local_files_only=True) # Will return the one with revision "yyy" |
Out of curiosity, why does the canonical method of loading all files by name (like in |
The problem is that for https://github.com/kensho-technologies/pyctcdecode the language model is loaded as follows: https://github.com/kensho-technologies/pyctcdecode/blob/b50b5ae39ebb047c42765be0dce2c27137bd9474/pyctcdecode/language_model.py#L352 meaning that it is not loaded by a hard-coded name, but rather the "last file" that is found in the folder |
We could force the user to have to provide the repo structure when doing Or we save the revision in the cached name? As follows:
|
@LysandreJik , @julien-c - I now went for the solution where we update the cache folder names to I think that being able to load repos from a specific revision locally is quite important and that we should enable this functionality. At the same time, we cannot assume that the filenames in a repo are always hardcoded since some repositories I don't see another way as saving the revision into the name of the cached folder which can then be correctly looked up when What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on the implementation. I think there's either a small issue with the implementation or that we should align on the behavior of this method.
My understanding is that snapshot_download
should always fetch the latest update on the default branch of the repository (main
right now, but should be updated later to reflect the default git branch in the future) when no revision
is passed. If a revision
is passed and it is a branch, will fetch the latest update on that branch. If it is a tag or a commit hash, then will fetch the exact version referenced.
With local_files_only
, this becomes a bit tricky: if we handle main
(or any other branch name), then the aforementioned behavior cannot be strictly respected. Let's see why:
# Fetch repository on branch `main`
directory = snapshot_download("xxx")
# <Update of the repository>
directory = snapshot_download("xxx", local_files_only=True)
This will work correctly and return the correct directory
.
# Fetch repository on branch `main`
directory = snapshot_download("xxx")
# <Update of the repository with commit #6fr4de on branch main>
directory = snapshot_download("xxx", revision="6fr4de")
directory = snapshot_download("xxx", local_files_only=True)
This will now return the directory with the main
identifier, even though the second call to snapshot_download
called a more recent update of that same branch (main
).
TBH I think this is mostly a nit than an actual issue. In most cases, users that indicate the revision parameter in a snapshot_download
will also indicate it in the same snapshot_download
when calling it with local_files_only=True
.
I guess to be explicit here, I'd vote to log a warning when not specifying a specific revision:
WARNING: You called `snapshot_download` without specifying an explicit revision and with no networking. Will return the latest local download of the `main` branch.
Another, smallish issue I have is that the following doesn't work:
snapshot_download("xxx", revision="main")
# Returns ~/.cache/xxx.main.<COMMIT_SHA>
snapshot_download("xxx", revision="main", local_files_only=True)
# Returns ~/.cache/xxx.main.<COMMIT_SHA>
snapshot_download("xxx", revision="<COMMIT_SHA>", local_files_only=True)
# ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False
It seems that most OS (except Windows) have a maximum filename length of 255, and a maximum path of 4096 characters. With ~40 characters per hash and a maximum of two hashes per directory name (revision + file hash), it seems we should be good to go. Windows should work to since #40
I might be missing something, but why don't we enforce that if This way there's no potential indirection. If it's used mostly for offline tests, it will just work out of the box. What am I missing here? |
Hmmm if we do that then we lose quite a big portion of the API I think, no? We're mostly dealing with edge cases here, but I think the following is more important than the rest: snapshot_download("xxx")
# Should work
snapshot_download("xxx", local_files_only=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, only left a few comments. Thanks for working on it and iterating, @patrickvonplaten!
if len(repo_folders) == 0: | ||
raise ValueError( | ||
"Cannot find the requested files in the cached path and outgoing traffic has been" | ||
" disabled. To enable model look-ups and downloads online, set 'local_files_only'" | ||
" to False." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
) | ||
# 3) cached repos of format <repo_id>.<any-branch>.{revision} | ||
# -> in this case {revision} also has to be a commit sha | ||
repo_folders_branch_commit = glob(repo_folders_prefix + ".*." + revision) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this isn't necessarily a branch
+ a commit
, I'd say it's an explicit_revision
+ a folder_sha
(even if the latter is ultimately a commit hash)
How about repo_folders_explicit_revision_and_sha
?
if revision != model_info.sha: | ||
storage_folder += f".{model_info.sha}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is necessary or superfluous - does it hurt to keep things simple at the cost of having a duplicate of sha in the saved folder name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After f2f explanation I understand why this is necessary :)
) | ||
|
||
# now load from cache and make sure warning to be raised | ||
with self.assertWarns(Warning): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good
This PR adds more function params to the snapshot download method. @LysandreJik @osanseviero - are there any reasons why we wouldn't add those?