[BUG] cudaErrorIllegalAddress: an illegal memory access was encountered Thread #441
Comments
Hi @kangna-qi, thank you for using SOK. This looks like a GPU out-of-bounds memory access. Could you share the code showing how you use SOK so that I can reproduce the problem? Also, does the problem occur at the beginning of training? If not, I recommend using HKV as the backend of the dynamic embedding variable. Here is an example: https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/sparse_operation_kit/sparse_operation_kit/examples/lookup_example_tf1/lookup_sparse_distributed_hkv_test.py
@kanghui0204 Thanks for your reply. I've already solved this problem. I can train the model when TF runs single-threaded. When TF trains the model with multiple threads, cuco requires locking to ensure correct results.
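To illustrate the locking idea mentioned above, here is a minimal sketch in plain Python. It is not SOK or cuco code: `LockedTable` and `run_workers` are hypothetical names, and the dictionary only stands in for a dynamic embedding table that is not safe for concurrent insert-and-update on its own. The point is that every read-modify-write on the shared table goes through a single lock.

```python
import threading


class LockedTable:
    """Hypothetical stand-in for a dynamic embedding table.

    The underlying dict plays the role of a hash table whose
    insert-and-update sequence must not interleave across threads.
    """

    def __init__(self, dim):
        self._dim = dim
        self._rows = {}
        self._lock = threading.Lock()

    def add(self, key, delta):
        # The lock covers the whole insert-if-missing plus update sequence,
        # so no other thread can observe or modify a half-updated row.
        with self._lock:
            row = self._rows.setdefault(key, [0.0] * self._dim)
            for i in range(self._dim):
                row[i] += delta

    def lookup(self, key):
        # Return a copy so callers never hold a reference to shared state.
        with self._lock:
            return list(self._rows[key])

    def size(self):
        with self._lock:
            return len(self._rows)


def run_workers(table, num_threads, keys, steps):
    """Run several threads that all update the same keys in one table."""

    def worker():
        for _ in range(steps):
            for key in keys:
                table.add(key, 1.0)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With the lock in place, four threads each adding 1.0 to the same keys 100 times leave every component at exactly 400.0. The trade-off is that updates to the table are serialized, which is why running single-threaded can be just as acceptable a fix.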
Hi @kangna-qi, may I ask whether the multi-threaded training you mentioned refers to MirroredStrategy, or to concurrency between different ops? If it is MirroredStrategy, I would like to know whether anything in your setup requires it. If not, we recommend using Horovod for multi-GPU training.
Hi @minseokl, because @kangna-qi hasn't responded for 2 weeks, I have decided to close this issue, FYI.
Describe the bug
I use TF1 + SOK to train my own model, and I hit this error.
Environment (please complete the following information):