
fix "Expected all tensors to be on the same device, but found at least two devices" error #11690


Status: Open. Wants to merge 14 commits into base: main.

Conversation

@yao-matrix (Contributor) commented Jun 11, 2025

  1. Running pytest -rA tests/models/unets/test_models_unet_2d_condition.py::UNet2DConditionModelTests::test_load_sharded_checkpoint_device_map_from_hub_local on 8 devices (CUDA, XPU) raises RuntimeError: "Expected all tensors to be on the same device, but found at least two devices, cuda:4 and cuda:1!". Fixed by moving the tensors to the same device.
  2. Enabled one GPU-only case on accelerator (XPU test passed).

@sayakpaul, please help review, thanks.

@sayakpaul sayakpaul requested a review from SunMarc June 11, 2025 02:25
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yao-matrix (Contributor Author)

@SunMarc, could you please help review? Thanks very much.

@SunMarc (Member) left a comment


Thanks! Left a comment.

Comment on lines +2560 to +2561
if hidden_states.device != res_hidden_states.device:
res_hidden_states = res_hidden_states.to(hidden_states.device)
@SunMarc (Member)


We shouldn't need that, since both hidden_states and res_hidden_states should already be on the same device, no? The pre-forward hook added by accelerate should move all the inputs to the same device.
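The mechanism being referred to can be sketched with a plain torch forward pre-hook. This is a minimal illustration, not accelerate's actual hook implementation; the hook name and structure here are assumptions:

```python
import torch
from torch import nn

def align_inputs_pre_hook(module, args):
    # Move every tensor argument onto the device of the module's own
    # parameters before forward runs, mirroring what accelerate's
    # pre-forward hook does for dispatched submodules.
    device = next(module.parameters()).device
    return tuple(a.to(device) if isinstance(a, torch.Tensor) else a for a in args)

layer = nn.Linear(4, 4)
layer.register_forward_pre_hook(align_inputs_pre_hook)

x = torch.randn(2, 4)  # already on the module's device here, so the move is a no-op
out = layer(x)
```

Note that such a hook fires only at module boundaries; a bare function call like torch.cat inside a parent module's forward is never wrapped this way.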

@yao-matrix (Contributor Author)


@SunMarc, I suppose this is a corner case? torch.cat is a weight-less function, so it seems it cannot be covered by the pre-forward hook set by accelerate...

@SunMarc (Member)


I mean that since hidden_states and res_hidden_states_tuple are in the forward definition, they should be moved to the same device by the pre-forward hook added by accelerate.

@yao-matrix (Contributor Author) commented Jun 18, 2025


@SunMarc We run into a corner case here. Since we have 8 cards, the device_map determined by https://github.com/huggingface/diffusers/blob/1bc6f3dc0f21779480db70a4928d14282c0198ed/src/diffusers/models/model_loading_utils.py#L64C5-L64C26 is

device_map: OrderedDict([('conv_in', 0), ('time_proj', 0), ('time_embedding', 0), ('down_blocks.0', 0), ('down_blocks.1.resnets.0', 1), ('up_blocks.0.resnets.0', 1), ('up_blocks.0.resnets.1', 2), ('up_blocks.0.upsamplers', 2), ('up_blocks.1', 3), ('mid_block.attentions', 3), ('conv_norm_out', 4), ('conv_act', 4), ('conv_out', 4), ('mid_block.resnets', 4)])
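One way to see the split, assuming the device_map above, is to collect the devices used under a given module prefix (the helper name here is hypothetical, for illustration only):

```python
from collections import OrderedDict

device_map = OrderedDict([
    ('conv_in', 0), ('time_proj', 0), ('time_embedding', 0), ('down_blocks.0', 0),
    ('down_blocks.1.resnets.0', 1), ('up_blocks.0.resnets.0', 1),
    ('up_blocks.0.resnets.1', 2), ('up_blocks.0.upsamplers', 2),
    ('up_blocks.1', 3), ('mid_block.attentions', 3),
    ('conv_norm_out', 4), ('conv_act', 4), ('conv_out', 4), ('mid_block.resnets', 4),
])

def devices_under(device_map, prefix):
    # Collect every device that hosts a submodule of `prefix`.
    return {dev for name, dev in device_map.items()
            if name == prefix or name.startswith(prefix + '.')}

print(devices_under(device_map, 'up_blocks.0'))    # spans two devices
print(devices_under(device_map, 'down_blocks.0'))  # stays on a single device
```

Here up_blocks.0 spans devices 1 and 2, so any cross-submodule tensor operation inside its forward can see mismatched devices.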

We can see that UpBlock is not an atomic module; its submodules are assigned to different devices (up_blocks.0.resnets.0 on device 1, up_blocks.0.resnets.1 on device 2), so the pre-hook for UpBlock will not help in this case. And since torch.cat is not pre-hooked (and cannot be, since it's a function rather than a module), the issue happens.

If there were no torch.cat between the sub-blocks in UpBlock, everything would be fine.
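The fix in the diff quoted above amounts to a defensive device alignment before the concatenation. A standalone sketch (the function name is hypothetical):

```python
import torch

def cat_skip_connection(hidden_states, res_hidden_states, dim=1):
    # torch.cat requires all inputs on one device, and no accelerate
    # pre-forward hook covers this bare function call, so align manually.
    if hidden_states.device != res_hidden_states.device:
        res_hidden_states = res_hidden_states.to(hidden_states.device)
    return torch.cat([hidden_states, res_hidden_states], dim=dim)

# On a single-device machine both tensors already share a device,
# so the guard is a no-op and only the concatenation happens.
h = torch.randn(1, 3, 8, 8)
res = torch.randn(1, 3, 8, 8)
out = cat_skip_connection(h, res)
```

The device check makes the guard free in the common single-device case while fixing the multi-device corner case.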

@yao-matrix (Contributor Author)


@SunMarc, we need your input on how to proceed with this corner case, thanks.

3 participants