
Revert "Set device for torch tensors with gpu > 1 (#132)" #134

Merged 1 commit into NVIDIA-Merlin:main on Apr 14, 2023

Conversation

@edknv (Contributor) commented Apr 13, 2023

This reverts commit 8782c9d (which fixed #131).

Setting the device via the cupy API causes the horovod (2-GPU) tests to hang with:

[1,1]<stdout>:merlin/models/tf/models/base.py:1387: in fit                                                                                                                          
[1,1]<stdout>:    out = super().fit(**fit_kwargs)                                                                                                                                   
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py:70: in error_handler                                                                            
[1,1]<stdout>:    raise e.with_traceback(filtered_tb) from None                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:78: in __getitem__                                                                             
[1,1]<stdout>:    return self.__next__()                                                                                                                                            
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:82: in __next__                                                                                
[1,1]<stdout>:    converted_batch = self.convert_batch(super().__next__())                                                                                                          
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:261: in __next__                                                                              
[1,1]<stdout>:    return self._get_next_batch()                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:332: in _get_next_batch                                                                       
[1,1]<stdout>:    batch = next(self._batch_itr)                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:369: in make_tensors                                                                          
[1,1]<stdout>:    tensors_by_name = self._convert_df_to_tensors(gdf)                                                                                                                
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner                                                                                                     
[1,1]<stdout>:    result = func(*args, **kwargs)                                                                                                                                    
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:524: in _convert_df_to_tensors                                                                
[1,1]<stdout>:    tensors_by_name[column_name] = self._to_tensor(gdf_i[[column_name]])                                                                                              
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:453: in _to_tensor                                                                            
[1,1]<stdout>:    with cupy.cuda.Device(self.device):                                                                                                                               
[1,1]<stdout>:cupy/cuda/device.pyx:184: in cupy.cuda.device.Device.__enter__                                                                                                        
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:365: in cupy_backends.cuda.api.runtime.setDevice                                                                                   
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:>   ???                                                                                                                                                               
[1,1]<stdout>:E   cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal                                                                   
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:142: CUDARuntimeError                                                                                                              
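
For context, a minimal sketch (not part of this PR or the dataloader code) of why entering cupy.cuda.Device(n) can raise cudaErrorInvalidDevice in a Horovod run: each worker process typically sees only the GPU(s) exposed to it via CUDA_VISIBLE_DEVICES, so a global rank index such as 1 may not be a valid local device ordinal. The function name to_tensor_sketch and its error handling below are illustrative assumptions, not the actual _to_tensor implementation.

```python
# Illustrative sketch only -- not the merlin-dataloader implementation.
# Under Horovod, each rank usually sees a single GPU, so entering
# cupy.cuda.Device() with a global rank index like 1 can fail with
# cudaErrorInvalidDevice because only local ordinal 0 exists in that process.
import cupy


def to_tensor_sketch(column, device):
    """Hypothetical stand-in for device handling in a _to_tensor-style helper."""
    try:
        # Device.__enter__ calls cudaSetDevice(device); an out-of-range
        # ordinal raises CUDARuntimeError: cudaErrorInvalidDevice.
        with cupy.cuda.Device(device):
            return cupy.asarray(column)
    except cupy.cuda.runtime.CUDARuntimeError as err:
        raise RuntimeError(
            f"device ordinal {device} is not visible to this process"
        ) from err
```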

@edknv requested a review from jperez999 Apr 13, 2023 17:42
@edknv self-assigned this Apr 13, 2023
@edknv added the bug (Something isn't working) and chore labels Apr 13, 2023
@edknv added this to the Merlin 23.04 milestone Apr 13, 2023
@edknv merged commit 014b658 into NVIDIA-Merlin:main Apr 14, 2023
Successfully merging this pull request may close these issues.

Device assignment does not work in PyTorch