Issues running and changing the backend of the mnist example #84
Comments
For the question about the number provided to TchDevice::Cuda, I was curious as well and looked through the documentation. According to the backend, it is the device index.
I think I saw that same source code, which prompted me to try 0 and 1. It's still a bit unclear from the libraries what the value should be. Is there a way to list valid device indices?
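To illustrate what "device index" means here, the following is a minimal sketch rather than the library's own code. It assumes the indices run from 0 up to one less than the number of visible CUDA devices, which is why 0 can be valid on a single-GPU machine while 1 and 1024 cannot. The helper names `valid_cuda_indices` and `check_cuda_index` are hypothetical, and the device count is passed in as a plain number so the sketch stays self-contained. In a real program it would have to come from the backend; I believe the tch crate exposes this as `tch::Cuda::device_count()`, but treat that as an assumption to check against its documentation.

```rust
/// Hypothetical helper: the valid CUDA device indices on a machine that
/// reports `device_count` devices (assumed to be 0..device_count).
fn valid_cuda_indices(device_count: i64) -> Vec<usize> {
    // A zero or negative count means no usable device, so the list is empty.
    (0..device_count.max(0) as usize).collect()
}

/// Hypothetical helper: validate an index before handing it to the backend,
/// so a bad value produces a readable error instead of a crash later on.
fn check_cuda_index(index: usize, device_count: i64) -> Result<usize, String> {
    if valid_cuda_indices(device_count).contains(&index) {
        Ok(index)
    } else {
        Err(format!(
            "CUDA device index {} is out of range: {} device(s) visible",
            index, device_count
        ))
    }
}
```

With one visible GPU, `check_cuda_index(0, 1)` returns `Ok(0)`, while `check_cuda_index(1, 1)` and `check_cuda_index(1024, 1)` both return an `Err` describing the valid range.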
I made a pull request that should fix the problem. The reason I did not see this before is that I always test using the I also added documentation on the TchDevice struct, the I will work soon on error handling and proper logging to help understand those issues more clearly.
Have you tried this trick in your toml to get the speed optimizations without needing
I didn't think about this, but yeah, this is probably a good default and something to include in the example!
It turns out that you can't set the optimization profile in packages under a workspace, so I'm not sure how to make debug builds use a different optimization level (only for examples).
Ubuntu 20.04 LTS, NVIDIA 3070 GPU (Driver 510.85.02, CUDA Version 11.6)
I am able to run the example as is, and it trains successfully, but it is very slow and does not appear to fully utilize all the cores on my CPU. However, at what appears to be the end of epoch 2 (the last progress printout reports Iteration 80, Epoch 2/6, with 2 full bars), it crashes with this message:
I changed the example to use the Tch backend by changing main to this:
This appears to train using my full CPU at great speed, but it then crashed on both tries, in two different ways. The first crash gave the same message as above; the second, when run under the VS Code debugger, was different:
In that case, epoch was 1 and self.num_keep was 2.
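The crash message itself is not reproduced here, so the following is only one plausible reading, stated as an assumption: if the checkpoint cleanup computes something like `epoch - self.num_keep` on unsigned integers, then an epoch of 1 with a `num_keep` of 2 underflows. Rust panics on that in debug builds and wraps silently in default release builds, which would also explain a failure that only shows up without optimizations. The `Checkpointer` type and `epoch_to_delete` method below are hypothetical stand-ins, not the project's actual code, and the sketch assumes epochs are numbered from 1.

```rust
/// Hypothetical stand-in for a checkpointer that keeps only the most
/// recent `num_keep` epochs on disk.
struct Checkpointer {
    num_keep: usize,
}

impl Checkpointer {
    /// Returns the epoch whose checkpoint can be deleted after saving
    /// `epoch`, or `None` when nothing is old enough to delete yet.
    /// Assumes epochs are numbered starting at 1.
    fn epoch_to_delete(&self, epoch: usize) -> Option<usize> {
        // A plain `epoch - self.num_keep` would panic in a debug build when
        // epoch < num_keep; `checked_sub` yields `None` in that case instead.
        // The filter drops a result of 0, since there is no epoch 0 to delete.
        epoch.checked_sub(self.num_keep).filter(|&old| old > 0)
    }
}
```

For a checkpointer with `num_keep` set to 2, `epoch_to_delete(1)` and `epoch_to_delete(2)` return `None`, and `epoch_to_delete(3)` returns `Some(1)`.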
I changed the example main as follows to try to use my GPU:
My first question is what does the magic number in TchDevice::Cuda(XXX) represent?
Then, even with various values for that number (0, 1, 1024), the application crashes on the line
model.to_device(device);
I always get this error message, which I have been unable to resolve: