
Running into NaN outputs of plan_recognition_net #64

Open
RamyE opened this issue Dec 5, 2023 · 1 comment

RamyE commented Dec 5, 2023

I am using all default configs (inputs are the static RGB camera and proprioceptive data only). I started getting NaN values that crash the training in epoch 3 (and once in epoch 9). Resuming from the prior epoch's checkpoint never helps; I always have to restart training from scratch. I tried debugging this but could not figure it out. Has anyone run into this issue before?

ValueError: Expected parameter loc (Tensor of shape (32, 256)) of distribution Normal(loc: torch.Size([32, 256]), scale: torch.Size([32, 256])) to satisfy the constraint Real(), but found invalid values:

File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 278, in training_step
    kl, act_loss, mod_loss, pp_dist, pr_dist = self.lmp_train(
  File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 134, in lmp_train
    pr_dist = self.plan_recognition(perceptual_emb)  # (batch, 256) each
  File "/---/calvin/calvin_models/calvin_agent/models/plan_encoders/plan_recognition_net.py", line 58, in __call__
    pr_dist = Independent(Normal(mean, std), 1)
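
As a debugging aid, here is a minimal sketch (the helper name is illustrative, not from the repo) of how one might localize where the NaNs first appear by checking the tensors right before the `Normal` distribution is constructed in plan_recognition_net.py:

```python
import torch

# Illustrative helper (not part of calvin): raise as soon as a tensor
# stops being finite, so the failure points at the producing module
# instead of at the Normal() constructor.
def assert_finite(name: str, t: torch.Tensor) -> torch.Tensor:
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf values")
    return t

# Example placement inside plan_recognition_net's forward pass:
#   assert_finite("perceptual_emb", perceptual_emb)
#   assert_finite("mean", mean)
#   assert_finite("std", std)

# PyTorch's anomaly mode additionally reports which forward op produced
# the first NaN gradient in backward (slow; enable only while debugging):
torch.autograd.set_detect_anomaly(True)
```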
mees (Owner) commented Dec 29, 2023

Hmm, that's weird. What batch size are you using? Are you using the same hyperparameters and GPU setup?
We used PyTorch Lightning's DDP implementation to scale our training to 8x NVIDIA GPUs with 12 GB of memory each. Since each GPU receives a batch of 64 sequences (32 language + 32 vision), the effective batch size is 512 for all our experiments. If you are training with a different batch size, you may need to tune your learning rate to stabilize training.
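
For reference, a minimal sketch of linear learning-rate scaling against that setup (the `base_lr` value below is illustrative, not a value from the CALVIN configs):

```python
# Linear LR scaling sketch: keep lr / effective_batch_size constant
# relative to the reference setup (8 GPUs x 64 sequences = 512).
def scale_lr(base_lr: float, effective_batch_size: int,
             reference_batch_size: int = 512) -> float:
    return base_lr * effective_batch_size / reference_batch_size

# E.g. a single-GPU run with batch size 64 would use 1/8 of the base LR:
print(scale_lr(base_lr=2e-4, effective_batch_size=64))  # -> 2.5e-05
```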
