Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Use "hub" directory for cache instead of "diffusers" #2005

Merged
merged 10 commits into from
Feb 28, 2023
Merged

Conversation

pcuenca
Copy link
Member

@pcuenca pcuenca commented Jan 16, 2023

Fixes #1663.

So far I've only changed the cache location. I think maybe we should migrate the existing cached files (so far living in diffusers) upon first launch. If so, how could we go about it? I remember there was a migration process in huggingface_hub when the cache layout was updated, so perhaps @Wauplin can recommend a sensible approach here? My initial idea would be to:

  • Detect whether cached files exist in diffusers.
  • If they do, attempt to move to hub. What happens if there's a directory conflict?
  • When successful, remember somewhere (where?) that a migration already took place so we don't attempt it again.
  • Do we need to deal with "old" versions of the cache layout?

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jan 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@Wauplin
Copy link
Collaborator

Wauplin commented Jan 16, 2023

Hi @pcuenca , thanks for making the change :)

About your questions, I see 2 things here:

  1. Old-format vs new-format of the cache. I think this is a bit orthogonal in the sense that old style cached files are already not used by diffusers right? At least, moving the default cache location shouldn't be change the use of old vs new cache format (I expect).
  2. Moving cache from old location to the new location would definitely be a good idea for users as it would avoid them to have to re-download everything. I remember that @sgugger did a script to convert cache from old-format to new-format when transformers made the change. If I remember correctly, the script was run once when loading transformers to convert all cached files. In your case (changing the location), you could have such a script as well to transfer the data from 1 folder to the other. The advantage is that once the script has run once, you don't have to care about checking the 2 locations each time you load a model. (maybe warn and ask the user to confirm before moving everything ?)

What happens if there's a directory conflict?

Good question. There shouldn't be conflicts in the snapshots/ and blobs/ directories but the refs/ might conflict. Since refs/ is a folder containing only references (e.g. files containing only 1 line -the commit oid-), I would advice to simply delete it. It will result in 1 additional HTTP call when the user loads a model for the first time but that's neglectable. snapshots/ is a folder containing only symlinks so I guess that if you have conflicts in it, it's best to just remove it (will cost extra HTTP calls afterwards but no re-download). Conflicts in the blobs/ folder would be much more problematic but I don't see how it could happen.

Remember somewhere (where?) that a migration already took place so we don't attempt it again

Easiest solution for that would be to delete the previous folder but that would cause issues for users that have both an old and a new version of diffusers on their machine. In general, I don't see how we could avoid any problem with such a change (either files are duplicated between old and new location -that's inefficient + when would you delete the old one ?-, either the user cannot run both a old and a new version of diffusers -would that be a problem ?-).
Maybe the safest way of handling that is to trigger a warning and suggest the user to manually delete the old folder ?

pcuenca and others added 2 commits January 16, 2023 13:36
I verified that the constants are available in huggingface_hub version
0.10.0, which is the minimum we require.

Co-authored-by: Lucain Pouget <lucainp@gmail.com>
@sgugger
Copy link
Contributor

sgugger commented Jan 16, 2023

For information, the whole logic to migrate to the new cache in Transformers is here.

@patrickvonplaten
Copy link
Contributor

@pcuenca, it would be nice to automatically migrate the cache just like we do in transformers - think we can pretty much copy the whole logic :-)

@pcuenca
Copy link
Member Author

pcuenca commented Jan 16, 2023

Yes, I'll use @sgugger's logic. Thanks a lot for the pointer!

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 25, 2023
@keturn
Copy link
Contributor

keturn commented Feb 26, 2023

Wait, this was never merged?

Well that explains some things.

@julien-c
Copy link
Member

yeah i think we do want this merged

@pcuenca pcuenca removed the stale Issues that haven't received updates label Feb 26, 2023
Copy link
Member Author

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplifications with respect to the migration process in transformers:

@pcuenca
Copy link
Member Author

pcuenca commented Feb 26, 2023

One question I have, if we want to be extremely cautious, is whether to ignore the migration until the new version has been released (i.e., don't do anything when installing from main). This would prevent re-downloads if the user installs from main, the files are moved but then the user decides to go back to a PyPi release. I don't think this case was contemplated in transformers either.

@keturn
Copy link
Contributor

keturn commented Feb 27, 2023

One question I have, if we want to be extremely cautious, is whether to ignore the migration until the new version has been released (i.e., don't do anything when installing from main). This would prevent re-downloads if the user installs from main, the files are moved but then the user decides to go back to a PyPi release.

...or multiple venvs with different versions.

can you mitigate this by leaving symlinks at the old location to point to the new one?

pcuenca and others added 3 commits February 27, 2023 17:49
Co-authored-by: Lucain <lucainp@gmail.com>
Especially important if we want to ensure that the user may want to invoke the
process again later, if they are keeping multiple envs with different
versions.
@pcuenca
Copy link
Member Author

pcuenca commented Feb 27, 2023

...or multiple venvs with different versions.
can you mitigate this by leaving symlinks at the old location to point to the new one?

Good idea. Note, however, that the migration process will only run once automatically. If you keep multiple versions, then you'll need to invoke diffusers.utils.hub_utils.move_cache() periodically to move models to the new location.

Copy link
Collaborator

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good to me, thanks for the changes 👍
Maybe another pair of eyes would be good before finally merging.

Comment on lines +178 to +180
"The cache for model files in Diffusers v0.14.0 has moved to a new location. Moving your "
"existing cached models. This is a one-time operation, you can interrupt it or run it "
"later by calling `diffusers.utils.hub_utils.move_cache()`."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to also spit out the actual location in this message for ultra transparency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, perhaps we can include that too.

Copy link
Member

@sayakpaul sayakpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

Copy link
Contributor

@patrickvonplaten patrickvonplaten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - thanks !

@pcuenca pcuenca merged commit 477aaa9 into main Feb 28, 2023
@pcuenca pcuenca deleted the cache-location branch February 28, 2023 19:01
@carson-katri
Copy link
Contributor

carson-katri commented Mar 2, 2023

Only the blobs folder was moved when I ran the migration. Should the other subfolders within the models be moved, or are they not needed?

@Wauplin
Copy link
Collaborator

Wauplin commented Mar 3, 2023

@carson-katri Yes it's normal. The blobs folders are the only ones that really take space on your disk. If the move_cache() script worked properly without warning, all the blob files should have been moved to the new location and a symlink should have been added in the old location. The goal is that both new and old cache should be functional if you have a transition period with multiple versions of diffusers on your machine.
(Note: the refs and snapshots folders will be re-created the first time you load a model (it's only symlinks so real quick). Don't use use_local_files=True on your first run after the moving script)

Hope this clarifies everything :)

@carson-katri
Copy link
Contributor

@Wauplin Got it, thanks. My issue was I was reading the cache to see what models the user has available. But I think I should read the old cache and new cache now until those folders are recreated.

@Wauplin
Copy link
Collaborator

Wauplin commented Mar 3, 2023

@carson-katri yes exactly. Also worth mentioning we have a scan-cache cli to list which models you have in your cache.

huggingface-cli scan-cache
# or
# huggingface-cli scan-cache --dir=path/to/cache

Disclaimer: I'm not really sure it will work on the old cache has the tool is not expecting symlinks in the blobs folder.

lstein added a commit to invoke-ai/InvokeAI that referenced this pull request Mar 5, 2023
…accelerate 0.16 (#2865)

Things to check for in this version:

- `diffusers` cache location is now more consistent with other
huggingface-hub using code (i.e. `transformers`) as of
huggingface/diffusers#2005. I think ultimately
this should make @damian0815 (and other folks with multiple
diffusers-using projects) happier, but it's worth taking a look to make
sure the way @lstein set things up to respect `HF_HOME` is still
functioning as intended.
- I've gone ahead and updated `transformers` to the current version
(4.26), but I have a vague memory that we were holding it back at some
point? Need to look that up and see if that's the case and why.
mengfei25 pushed a commit to mengfei25/diffusers that referenced this pull request Mar 27, 2023
* Use "hub" directory for cache instead of "diffusers"

* Import cache locations from huggingface_hub

I verified that the constants are available in huggingface_hub version
0.10.0, which is the minimum we require.

Co-authored-by: Lucain Pouget <lucainp@gmail.com>

* make style

* Move cached directories to new location.

* make style

* Apply suggestions by @Wauplin

Co-authored-by: Lucain <lucainp@gmail.com>

* Fix is_file

* Ignore symlinks.

Especially important if we want to ensure that the user may want to invoke the
process again later, if they are keeping multiple envs with different
versions.

* Style

---------

Co-authored-by: Lucain Pouget <lucainp@gmail.com>
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* Use "hub" directory for cache instead of "diffusers"

* Import cache locations from huggingface_hub

I verified that the constants are available in huggingface_hub version
0.10.0, which is the minimum we require.

Co-authored-by: Lucain Pouget <lucainp@gmail.com>

* make style

* Move cached directories to new location.

* make style

* Apply suggestions by @Wauplin

Co-authored-by: Lucain <lucainp@gmail.com>

* Fix is_file

* Ignore symlinks.

Especially important if we want to ensure that the user may want to invoke the
process again later, if they are keeping multiple envs with different
versions.

* Style

---------

Co-authored-by: Lucain Pouget <lucainp@gmail.com>
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
* Use "hub" directory for cache instead of "diffusers"

* Import cache locations from huggingface_hub

I verified that the constants are available in huggingface_hub version
0.10.0, which is the minimum we require.

Co-authored-by: Lucain Pouget <lucainp@gmail.com>

* make style

* Move cached directories to new location.

* make style

* Apply suggestions by @Wauplin

Co-authored-by: Lucain <lucainp@gmail.com>

* Fix is_file

* Ignore symlinks.

Especially important if we want to ensure that the user may want to invoke the
process again later, if they are keeping multiple envs with different
versions.

* Style

---------

Co-authored-by: Lucain Pouget <lucainp@gmail.com>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store cached models in location compatible with huggingface-cli
9 participants