An error occurs during Nimble training. #12

Open
jsun94 opened this issue May 10, 2021 · 2 comments

Comments

jsun94 commented May 10, 2021

Hello,

The inference code runs without errors,
but running the training code produces an error.

import torch
import torchvision
import os

os.environ["CUDA_VISIBLE_DEVICES"]="1"

BATCH = 32


model = torchvision.models.resnet50(num_classes=10)
model = model.cuda()
model.train()

loss_fn = torch.nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

input_shape = [BATCH, 3, 32, 32]
dummy_input = torch.randn(*input_shape).cuda()

nimble_model = torch.cuda.Nimble(model)
nimble_model.prepare(dummy_input, training=True)

rand_input = torch.rand(*input_shape).cuda()
output = nimble_model(rand_input)

label = torch.zeros(BATCH, dtype=torch.long).cuda()
loss = loss_fn(output, label)

loss.backward()

optimizer.step()

Running the code above produces the following message during prepare:

TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
With rtol=1e-05 and atol=1e-05, found 297 element(s) (out of 320) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0005762577056884766 (-0.9435920119285583 vs. -0.9441682696342468), which occurred at index (15, 3).

Since I get the message above, I would like to ask what I am doing wrong.

Environment:
Ubuntu: 18.04
Linux: 5.4.0
PyTorch: 1.7.0
Python: 3.7.10
cuDNN and CUDA are the versions required by Nimble.

@gyeongin
Contributor

This is a warning, not an error. The code runs without any problem.

In fact, this warning is not related to Nimble!
You can reproduce the same warning message by running the following code snippet:

import torch
import torchvision

model = torchvision.models.resnet50(num_classes=10)
model = model.cuda()
model.train()
input_shape = [32, 3, 32, 32]
dummy_input = torch.randn(*input_shape).cuda()
jit_model = torch.jit.trace(model, dummy_input)

@gyeongin
Contributor

This is a warning message, not an error, so the code you attached runs without problems.
In fact, this warning message comes from PyTorch's JIT tracing process. If you run the code I posted above, you will see the same kind of message regardless of Nimble.
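
To make the numbers in the warning concrete, the tolerance check can be reproduced by hand. The following is a minimal sketch, assuming the trace check applies the usual allclose rule, |actual - expected| <= atol + rtol * |expected|, with the rtol and atol values printed in the message. The helper name `within_tolerance` is hypothetical, and which of the two reported values counts as "actual" versus "expected" is also an assumption (it barely matters here, since both have nearly the same magnitude).

```python
# Values copied from the TracerWarning in the original report.
RTOL = 1e-05
ATOL = 1e-05
actual = -0.9435920119285583    # assumed: output of the traced function
expected = -0.9441682696342468  # assumed: output of the Python function


def within_tolerance(a, b, rtol=RTOL, atol=ATOL):
    """Hypothetical helper: allclose-style check for a single pair of scalars."""
    return abs(a - b) <= atol + rtol * abs(b)


diff = abs(actual - expected)          # the "greatest difference" in the message
allowed = ATOL + RTOL * abs(expected)  # margin of error for this element
ok = within_tolerance(actual, expected)

print(f"difference: {diff:.3e}")
print(f"allowed:    {allowed:.3e}")
print(f"within tolerance: {ok}")
```

Under these assumptions the difference is well above the allowed margin, which is why the element is reported. The message only says that two runs of the same model disagreed by a small floating-point amount; it does not stop the script.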
