After multi-node training fails, processes on non-master nodes are not fully killed #416

Open
frankxyy opened this issue Oct 31, 2022 · 3 comments

Comments

@frankxyy

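As the title says, after multi-node training fails, one libai process stays alive on the non-master node and keeps printing logs to the console. The logs look like this:
[image]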

@strint
Collaborator

strint commented Oct 31, 2022

Got it. We will try to reproduce the problem.

@Flowingsun007

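Hello, by "multi-node training failed", do you mean that the program was stopped manually with CTRL + C, or that it failed because the code raised an exception?

I ran the 2-node bert demo from https://libai.readthedocs.io/en/latest/tutorials/get_started/quick_run.html on my side. After pressing CTRL + C, once the master (node0) exited, the program on node1 terminated normally.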

@frankxyy
Author

It failed because the code raised an exception.
