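I set up a model that consists mainly of embedding tables, set ps_memory_m=64G, and tuned the model size so that the placement exactly fills the memory (making it any larger triggers a "cannot placement" error). I then took a checkpoint with the saver, but each server's checkpoint is only about 20G. Why is the checkpoint so much smaller than ps_memory_m? Which part of XDL uses a large amount of memory that does not need to be written to the checkpoint?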
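Not sure. As I recall, sparse embeddings allocate extra memory at initialization, but saving a checkpoint writes only the entries that actually hold values, so PS memory usage and checkpoint size are bound to differ.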
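To make the reply's point concrete, here is a toy sketch of how reserved memory and checkpoint size can diverge. It is not XDL code: the class `ToySparseTable`, its fields, and the byte sizes are all hypothetical, and it assumes only what the reply describes, namely that capacity is reserved up front while a checkpoint holds just the entries that have been written.

```python
from dataclasses import dataclass, field


@dataclass
class ToySparseTable:
    """Hypothetical model of a sparse embedding table (not XDL's implementation)."""
    capacity: int             # rows reserved at initialization
    dim: int                  # embedding dimension
    bytes_per_value: int = 4  # assumed float32 values
    bytes_per_id: int = 8     # assumed int64 feature ids
    seen_ids: set = field(default_factory=set)

    def touch(self, feature_id: int) -> None:
        """Mark a feature id as having a stored value."""
        if feature_id not in self.seen_ids and len(self.seen_ids) >= self.capacity:
            raise MemoryError("table full")
        self.seen_ids.add(feature_id)

    def reserved_bytes(self) -> int:
        """Memory held for the whole table, whether or not rows are used."""
        return self.capacity * self.dim * self.bytes_per_value

    def checkpoint_bytes(self) -> int:
        """Bytes written if only occupied rows (id + vector) are saved."""
        row_bytes = self.bytes_per_id + self.dim * self.bytes_per_value
        return len(self.seen_ids) * row_bytes


table = ToySparseTable(capacity=1000, dim=8)
for fid in range(300):  # only 300 of the 1000 reserved rows receive values
    table.touch(fid)

print(table.reserved_bytes(), table.checkpoint_bytes())
```

Under these assumptions the checkpoint scales with the number of feature ids actually seen, while the reserved memory scales with the configured capacity, which would be consistent with a 64G placement producing a roughly 20G checkpoint.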