We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
如题,多机训练失败后,非master node还是存活着一个libai进程,导致会持续向控制台打印日志。类似这样的日志:
The text was updated successfully, but these errors were encountered:
收到,我们尝试复现一下问题。
Sorry, something went wrong.
您好,请问【多机训练失败】是手动CTRL + C结束程序,还是代码异常报错失败呢?
我这里基于:https://libai.readthedocs.io/en/latest/tutorials/get_started/quick_run.html 的bert demo跑了一下2机的,CTRL + C以后,master(node0)结束后,node1的程序是可以正常终止的。
代码异常报错失败哈
No branches or pull requests
如题,多机训练失败后,非master node还是存活着一个libai进程,导致会持续向控制台打印日志。类似这样的日志:

The text was updated successfully, but these errors were encountered: