Problem with finetune model #318
Comments
Has anyone else run into this issue?
There is a small bug in the master-branch fine-tuning path that wastes some memory with the loaded checkpoint. I believe I've fixed it on the dev branch; otherwise I will within a couple of days.
Thanks for the reply. I encountered this issue while using the dev branch. Looking forward to your update! I tried to locate the problem myself, but had no luck with that 😂
now fixed on dev |
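For readers hitting the same thing on older checkouts, a common pattern to keep a restored checkpoint from lingering on the GPU is to load it onto the CPU and drop the reference once the weights are copied in. This is only a sketch of the general technique, not necessarily the actual dev-branch fix; the `load_for_finetune` helper and the `"model"` key are assumptions for illustration:

```python
import torch


def load_for_finetune(model, checkpoint_path, device):
    # Deserialize onto the CPU so the checkpoint never occupies GPU memory.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Copy the saved weights into the live model (key name assumed here).
    model.load_state_dict(checkpoint["model"])
    model.to(device)
    # Drop the reference so the CPU copy of the weights can be freed.
    del checkpoint
    return model
```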
@erogol @reuben @twerkmeister @ekr Hey guys, thanks for your great work here! I'm trying to train Tacotron 2 with a custom dataset. It generally runs well, but there are still some issues I have failed to resolve, and it would be kind of you to share some ideas about them.
![image](https://user-images.githubusercontent.com/33118185/69768025-e1045180-11b9-11ea-868b-f4619c9aae44.png)
The most recent problem, from yesterday, came up when I tried to fine-tune my model with the BN-version prenet, as mentioned by @erogol in another comment. With distributed training launched via `python3.7 distribute.py --restore_path xxxx/best_model.pth.tar`, I soon got a CUDA out-of-memory error, with GPU usage as in the screenshot. If I understand correctly, the main GPU (device 0) was being used by all seven other subprocesses and ran out of memory while the others still had memory free. From some searching, this probably relates to restoring the Adam optimizer, since someone commented that Adam restores all of its parameters onto the main GPU device. Any ideas about this?
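One common cause of exactly this symptom is that `torch.load` by default deserializes tensors onto the devices they were saved from, typically `cuda:0`, so every distributed worker piles its copy of the checkpoint onto GPU 0. Passing a `map_location` for each worker's own device avoids that. This is a generic sketch, not the repo's actual restore code; `restore_checkpoint`, its arguments, and the `"model"`/`"optimizer"` key names are assumptions:

```python
import torch


def restore_checkpoint(model, optimizer, checkpoint_path, local_rank):
    # Map saved tensors straight to this worker's own GPU (or CPU),
    # so all ranks do not deserialize onto cuda:0 at once.
    device = f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu"
    checkpoint = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(checkpoint["model"])
    # Optimizer.load_state_dict casts the restored Adam state
    # (exp_avg, exp_avg_sq) to match each parameter's device.
    optimizer.load_state_dict(checkpoint["optimizer"])
    del checkpoint
    return model, optimizer
```

With `distribute.py`-style launchers, `local_rank` would come from the per-process rank argument or environment variable the launcher sets.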
My other questions are about some training details that I posted in #58 (comment).
I would be grateful if you could share some ideas with me. Thanks in advance!