Description
I am trying to perform multi-GPU training using the DataParallel wrapper from PyTorch. When I try to run the fit method, I encounter a RuntimeError saying that the parameters and buffers must be on the same device.
Here's a snippet of the code that I am using:
import torch
import torch.nn as nn

# Initialize learner and model
learn = Learner(...)
learn.model = ...
# Wrap the model in DataParallel, with cuda:1 as the primary device
model = nn.DataParallel(learn.model, device_ids=[1, 2, 3])
learn.model = model
# Point the DataLoaders at the same primary device
learn.dls.device = torch.device("cuda:1")
# Clear the CUDA cache
torch.cuda.empty_cache()
# Start training
learn.fit(1)
Error Message
The error message I receive is:
RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:3
Environment
PyTorch version: (e.g., 1.9.0)
Library version: (e.g., 0.2.0)
CUDA/cuDNN version: (e.g., CUDA 11.8, cuDNN 8.2.1)
GPU models and configuration: (e.g., 4x Tesla T4)
Operating System: (e.g., Ubuntu 18.04)
Additional Context
I've tried to set both the model and the DataLoaders to the same device, but without success. It seems the model parameters and the batches end up on different devices during training, which causes the error.
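For reference, this is roughly how I tried to align the devices. As I understand the DataParallel docs, the wrapped module's parameters and buffers must already be on device_ids[0] (cuda:1 in my case), so the sketch below moves the model there explicitly before training; learn is the same Learner object as in the snippet above.

import torch
import torch.nn as nn

primary = torch.device("cuda:1")
# Move the model to device_ids[0] first; DataParallel requires all
# parameters and buffers to start on that device.
learn.model = nn.DataParallel(learn.model.to(primary), device_ids=[1, 2, 3])
# Keep the DataLoaders on the same primary device so batches land there too.
learn.dls.device = primary
learn.fit(1)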
Would appreciate any guidance on how to resolve this issue or if it's something that needs to be addressed in the library.
I was able to run training as described in the notebook, though not on multiple GPUs but on a single GPU with 16 GB of memory, by adding quantization to the model:
import torch
from transformers import AutoModelForCausalLM

model_id = 'meta-llama/Llama-2-7b-hf'
llama_base = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
    token=TOKEN,                     # Add your token here
    quantization_config=nf4_config,  # 4-bit NF4 config, sketched below
)
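For reference, nf4_config is a standard 4-bit NF4 BitsAndBytesConfig from transformers/bitsandbytes; the exact options below are just how I typically set it up and may differ slightly from the notebook.

import torch
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)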
I've looked into the library's source a bit, but I'd like to make sure: when I call the fit method, is a plain training loop running behind the scenes, i.e. nothing DeepSpeed- or LoRA-based?
Because it's quite surprising that I was able to train on a single GPU with only the quantization added.
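If it helps narrow this down, a quick sanity check (plain PyTorch, nothing library-specific) is to count how many parameters of the model passed to fit actually require gradients: if only a small fraction is trainable, an adapter method such as LoRA is in play; if essentially all of them are, it is a full fine-tune. llama_base here is the model from my snippet above.

# Compare trainable vs. total parameter counts.
trainable = sum(p.numel() for p in llama_base.parameters() if p.requires_grad)
total = sum(p.numel() for p in llama_base.parameters())
print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.2%})")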