I am using all default configs (input of static RGB image and Proprio data only). I started getting NaN values that crash the training in epoch 3 (and once in epoch 9). Restarting from the prior epoch never helps. I always have to start training from scratch. I tried debugging this but could not figure it out. Has anyone run into this issue before?
ValueError: Expected parameter loc (Tensor of shape (32, 256)) of distribution Normal(loc: torch.Size([32, 256]), scale: torch.Size([32, 256])) to satisfy the constraint Real(), but found invalid values:
  File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 278, in training_step
    kl, act_loss, mod_loss, pp_dist, pr_dist = self.lmp_train(
  File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 134, in lmp_train
    pr_dist = self.plan_recognition(perceptual_emb)  # (batch, 256) each
  File "/---/calvin/calvin_models/calvin_agent/models/plan_encoders/plan_recognition_net.py", line 58, in __call__
    pr_dist = Independent(Normal(mean, std), 1)
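One way to narrow this down is to check the tensors just before the distribution is built, so the failure names which input went bad. The following is a minimal debugging sketch, not code from the CALVIN repository: `checked_normal` and its `name` argument are hypothetical, and it assumes `mean` and `std` are the two tensors passed to `Normal` in the last frame of the traceback.

```python
import torch
from torch.distributions import Independent, Normal


def checked_normal(mean: torch.Tensor, std: torch.Tensor, name: str = "plan_recognition") -> Independent:
    """Hypothetical helper: build a diagonal Gaussian, failing early on NaN/Inf inputs."""
    for label, tensor in (("mean", mean), ("std", std)):
        finite = torch.isfinite(tensor)
        if not finite.all():
            # Count the offending entries so the log shows how widespread the problem is.
            bad = int((~finite).sum())
            raise FloatingPointError(f"{name}: {label} has {bad} non-finite values")
    # Same construction as in the traceback: the last dimension is treated as the event dimension.
    return Independent(Normal(mean, std), 1)
```

If `mean` is already NaN on entry, the cause is upstream (the network weights or `perceptual_emb`), which would also explain why resuming from the previous checkpoint does not help. Enabling `torch.autograd.set_detect_anomaly(True)` for a short run can help locate the first backward operation that produces NaN, at the cost of slower training.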
Hmm, that's weird. What batch size are you using? Are you using the same hyperparameters and GPU settings?
We used PyTorch Lightning's DDP implementation to scale our training to 8x NVIDIA GPUs with 12GB memory each. Since each GPU receives a batch of 64 sequences (32 language + 32 vision), the effective batch size is 512 for all our experiments. If you train with a different effective batch size, you might need to tune your learning rate to stabilize training.
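As a starting point for that tuning, a common heuristic is to scale the learning rate linearly with the effective batch size. The sketch below only illustrates the arithmetic: the function name and the `base_lr` value of `1e-4` are placeholders, not values taken from the CALVIN configs, and the linear rule is a rule of thumb rather than the procedure used in our experiments.

```python
def scaled_learning_rate(base_lr: float, per_gpu_batch: int, num_gpus: int,
                         reference_batch: int = 512) -> float:
    """Linear scaling heuristic: keep lr / effective_batch_size constant."""
    # Under DDP each GPU processes its own batch, so the effective batch is the product.
    effective_batch = per_gpu_batch * num_gpus
    return base_lr * effective_batch / reference_batch


# Reference setup from the reply above: 8 GPUs x 64 sequences = 512.
reference_lr = scaled_learning_rate(base_lr=1e-4, per_gpu_batch=64, num_gpus=8)

# Single-GPU run with the same per-GPU batch: effective batch 64, so 1/8 of the base rate.
single_gpu_lr = scaled_learning_rate(base_lr=1e-4, per_gpu_batch=64, num_gpus=1)
```

Treat the result as an initial guess and verify it against the loss curves; a smaller rate than the heuristic suggests may still be needed if NaNs persist.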