
Device agnostic testing #5612
Merged · 12 commits · Dec 5, 2023

Conversation

@arsalanu (Contributor) commented Nov 1, 2023

What does this PR do?

Adds new features to testing_utils.py and import_utils.py to make testing with non-default PyTorch backends (beyond just cuda, cpu and mps) possible. This should not affect any existing tests in the repo or their behaviour on the devices they currently run on.

This is heavily based on similar work we have done for Transformers, see: Transformers PR #25870

Adds device-agnostic helper functions which dispatch to backend-specific implementations. This mainly applies to functions that are device-specific (e.g. torch.cuda.manual_seed). Users can register a new backend, and the backend functions it should dispatch to, by creating a device specification file and pointing the test suite to it with the environment variable DIFFUSERS_TEST_DEVICE_SPEC, and can select a new PyTorch device with DIFFUSERS_TEST_DEVICE.

Example of a device specification to run the tests with an alternative accelerator:

import torch
import torch_npu
# User can add additional imports here

# Specify the device name (e.g. 'cuda', 'cpu')
DEVICE_NAME = 'npu'

# Specify device-specific backends to dispatch to.
# If not specified (i.e. `None`), the test suite falls back to the defaults in `testing_utils.py`.
MANUAL_SEED_FN = torch.npu.manual_seed
EMPTY_CACHE_FN = None
DEVICE_COUNT_FN = torch.npu.device_count
SUPPORTS_TRAINING = True
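To illustrate how the test suite might consume such a spec (a minimal sketch; the helper and table names below, such as backend_manual_seed and BACKEND_MANUAL_SEED, are assumptions rather than the exact code added in testing_utils.py):

import importlib.util
import os

import torch

# Pick the test device from the environment, defaulting to cuda/cpu.
torch_device = os.environ.get("DIFFUSERS_TEST_DEVICE", "cuda" if torch.cuda.is_available() else "cpu")

# Per-device implementations; devices without an entry fall back to "default".
BACKEND_MANUAL_SEED = {"cuda": torch.cuda.manual_seed, "default": torch.manual_seed}

spec_path = os.environ.get("DIFFUSERS_TEST_DEVICE_SPEC")
if spec_path:
    # Load the user-provided spec module and register its backend functions.
    spec = importlib.util.spec_from_file_location("device_spec", spec_path)
    device_spec = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(device_spec)
    torch_device = device_spec.DEVICE_NAME
    if device_spec.MANUAL_SEED_FN is not None:
        BACKEND_MANUAL_SEED[torch_device] = device_spec.MANUAL_SEED_FN


def backend_manual_seed(device: str, seed: int):
    # Dispatch to the registered seeding function for `device`, or the default.
    BACKEND_MANUAL_SEED.get(device, BACKEND_MANUAL_SEED["default"])(seed)

A test would then call backend_manual_seed(torch_device, 0) instead of torch.cuda.manual_seed(0), and the same pattern applies to empty_cache and device_count.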

Implementation details are fully outlined in issue #5562.

I have only modified the UNet2D condition model tests (and test_modeling_common, since it is used by the UNet2D tests) rather than all the tests, as this PR is focused on implementing the features required for device-agnostic testing.


@yiyixuxu (Collaborator) commented Nov 3, 2023

@patrickvonplaten
can you take a look here?

@patrickvonplaten (Contributor) left a comment

The changes look very reasonable to me! Thanks a lot for making everything device-agnostic.

Just a bit worried about is_torch_fp16_available, because we're essentially just saying that if the matmul doesn't work, fp16 is not available. But the matmul might also fail for other reasons (badly installed CUDA, OOM, ...).

In PyTorch there is actually an is_bf16_available(): https://github.com/pytorch/pytorch/blob/d64bc8f0f81bd9b514eb1a5ee6f5b03094e4e6e9/torch/cuda/__init__.py#L141

That function seems to check some device properties, which is probably less brittle. I guess it's hard to do the same for fp16 here, but can we maybe make sure that we don't accidentally misinterpret other errors as fp16 not being available?
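For context, the check under discussion amounts to probing whether a small fp16 op succeeds on the target device, roughly like this (an illustrative sketch, not the exact code in this PR):

import torch

def is_torch_fp16_available(device: str) -> bool:
    # Probe fp16 support by running a tiny half-precision matmul on the device.
    # As noted above, any failure is read as "fp16 unavailable", even if the real
    # cause is unrelated (a broken CUDA install, OOM, ...).
    try:
        x = torch.zeros(2, 2, dtype=torch.float16, device=device)
        _ = x @ x
        return True
    except Exception:
        return False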

@arsalanu (Contributor, Author) commented Nov 15, 2023

Thanks, @patrickvonplaten. I agree that just catching any exception might not be the best way to do this, but I'm not sure if there is a specific exception that would be agnostic to any accelerator or device. On CPU and XLA I believe you get a RuntimeError when trying to perform operations with FP16, but this is vague in itself and may overlap with a different issue.

One suggestion is to log the exception and print the error when the tests are run, to make it explicit why FP16 is not working. This would make it clear to the user whether it is unsupported behaviour or an issue with their setup.

Looking at PyTorch's is_bf16_available(), it runs different checks for different hardware, which makes it less brittle as you said, but that function is also not device-agnostic and would only work for ROCm and CUDA backends.

Another suggestion I have is to add a CUDA-specific check and skip the FP16 matmul check in is_torch_fp16_available() if a GPU is being used (I could do this for MPS and CPU too, which would be set to False by default).

This would still be in line with the changes, as these backends already have defaults specified for them in the custom function dispatch. The matmul check would then only be used when a non-default device is in use. We could do this and also log the error to make it explicit to the user.

Let me know if that makes sense to you and I will add those changes, or any other suggestions you have. Thanks!

@patrickvonplaten (Contributor)

Could we maybe do something like this: https://github.com/huggingface/diffusers/pull/5612/files#r1399038284, just to add an extra safety mechanism so that a user doesn't misinterpret the function in case CUDA is badly set up?

Also can we make the function private for now, e.g. add an underscore so that it's _is_torch_fp16_available()?

@arsalanu (Contributor, Author)

I've added the changes, restructured slightly so that the FP16 op-check happens by default for all accelerators, and the CUDA error is raised only if the device type is cuda.
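Roughly, the restructured check could look like this (an illustrative sketch; the actual private helper in testing_utils.py may differ in naming and error messages):

import logging

import torch

logger = logging.getLogger(__name__)


def _is_torch_fp16_available(device: str) -> bool:
    # Run the fp16 op-check by default for every accelerator.
    device = torch.device(device)
    try:
        x = torch.zeros(2, 2, dtype=torch.float16, device=device)
        _ = x @ x
        return True
    except Exception as e:
        # On CUDA, a failure here more likely means a broken setup than missing fp16
        # support, so surface the error instead of silently returning False.
        if device.type == "cuda":
            raise ValueError(f"fp16 ops failed on a cuda device; your CUDA setup may be broken: {e}") from e
        logger.warning(f"fp16 appears to be unsupported on {device}: {e}")
        return False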

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@patrickvonplaten (Contributor) left a comment

Cool thanks for iterating here. This PR LGTM - should we merge it now or do you want to add tests for other classes directly here?



# Guard for when Torch is not available
if is_torch_available():
A collaborator asked:

Is this meant to run when torch isn't available or if DIFFUSERS_TEST_DEVICE_SPEC is set?

@arsalanu (Contributor, Author) replied:

The guard is there because the function dispatch should only run if torch is available; it doesn't strictly matter whether DIFFUSERS_TEST_DEVICE_SPEC is set. For example, for a GPU, CPU or MPS device a spec doesn't need to be set, but torch must still be available in order to dispatch to the default torch device functions.
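Concretely, the guard just wraps the dispatch-table setup, which references torch.* functions and therefore cannot run without torch (a sketch; the table names are assumptions, not the exact code in the PR):

from diffusers.utils import is_torch_available

if is_torch_available():
    import torch

    # Built only when torch is importable; whether DIFFUSERS_TEST_DEVICE_SPEC is set does not matter here.
    BACKEND_EMPTY_CACHE = {"cuda": torch.cuda.empty_cache, "cpu": None, "default": None}
    BACKEND_DEVICE_COUNT = {"cuda": torch.cuda.device_count, "default": lambda: 0}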

@arsalanu (Contributor, Author)

> should we merge it now or do you want to add tests for other classes directly here?

If that's okay I'll add some more before merging 😄 I had a few other tests ready but removed them from this PR to keep it minimal.

@arsalanu (Contributor, Author)

I've added more test coverage. The latest commit has the changes for most of the model classes (unet, vae, vq, unet2d and some common files) and one pipeline test (SD2). Further tests could be added in future PRs.

@patrickvonplaten (Contributor) left a comment

Cool! The changes look good to me - @DN6 wdyt? Feel free to merge once you're happy with it

@DN6 (Collaborator) left a comment

LGTM 👍🏽 Nice work @arsalanu!

@DN6 merged commit f427345 into huggingface:main on Dec 5, 2023
donhardman pushed a commit to donhardman/diffusers that referenced this pull request Dec 18, 2023
* utils and test modifications to enable device agnostic testing

* device for manual seed in unet1d

* fix generator condition in vae test

* consistency changes to testing

* make style

* add device agnostic testing changes to source and one model test

* make dtype check fns private, log cuda fp16 case

* remove dtype checks from import utils, move to testing_utils

* adding tests for most model classes and one pipeline

* fix vae import
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023 (same commits as above)
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024 (same commits as above)