
Running into NaN outputs of plan_recognition_net #64

Open
RamyE opened this issue Dec 5, 2023 · 1 comment

RamyE commented Dec 5, 2023

I am using all default configs (inputs are the static RGB camera and proprioceptive data only). I started getting NaN values that crash the training in epoch 3 (and once in epoch 9). Resuming from the prior epoch's checkpoint never helps; I always have to restart training from scratch. I tried debugging this but could not figure it out. Has anyone run into this issue before?

ValueError: Expected parameter loc (Tensor of shape (32, 256)) of distribution Normal(loc: torch.Size([32, 256]), scale: torch.Size([32, 256])) to satisfy the constraint Real(), but found invalid values:

File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 278, in training_step
    kl, act_loss, mod_loss, pp_dist, pr_dist = self.lmp_train(
  File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 134, in lmp_train
    pr_dist = self.plan_recognition(perceptual_emb)  # (batch, 256) each
  File "/---/calvin/calvin_models/calvin_agent/models/plan_encoders/plan_recognition_net.py", line 58, in __call__
    pr_dist = Independent(Normal(mean, std), 1)
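
As a debugging aid, here is a minimal sketch (the helper name is illustrative, not from the repo) of how one might localize where the NaNs first appear by checking the tensors right before the `Normal` distribution is constructed in plan_recognition_net.py:

```python
import torch

# Illustrative helper (not part of calvin): raise as soon as a tensor
# stops being finite, so the failure points at the producing module
# instead of at the Normal() constructor.
def assert_finite(name: str, t: torch.Tensor) -> torch.Tensor:
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf values")
    return t

# Example placement inside plan_recognition_net's forward pass:
#   assert_finite("perceptual_emb", perceptual_emb)
#   assert_finite("mean", mean)
#   assert_finite("std", std)

# PyTorch's anomaly mode additionally reports which forward op produced
# the first NaN gradient in backward (slow; enable only while debugging):
torch.autograd.set_detect_anomaly(True)
```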
mees (Owner) commented Dec 29, 2023

Hmm, that's weird. What batch size are you using? Are you using the same hyperparameters and GPU setup?
We used PyTorch Lightning's DDP implementation to scale our training to 8x NVIDIA GPUs with 12 GB of memory each. Since each GPU receives a batch of 64 sequences (32 language + 32 vision), the effective batch size is 512 for all our experiments. If you are training with a different batch size, you may need to tune your learning rate to stabilize training.
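
For reference, a minimal sketch of linear learning-rate scaling against that setup (the `base_lr` value below is illustrative, not a value from the CALVIN configs):

```python
# Linear LR scaling sketch: keep lr / effective_batch_size constant
# relative to the reference setup (8 GPUs x 64 sequences = 512).
def scale_lr(base_lr: float, effective_batch_size: int,
             reference_batch_size: int = 512) -> float:
    return base_lr * effective_batch_size / reference_batch_size

# E.g. a single-GPU run with batch size 64 would use 1/8 of the base LR:
print(scale_lr(base_lr=2e-4, effective_batch_size=64))  # -> 2.5e-05
```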
