
Checkpoint size question #346

Open
DUKaige opened this issue Sep 15, 2020 · 1 comment

Comments

@DUKaige

DUKaige commented Sep 15, 2020

I set up a model dominated by an embedding table, configured ps_memory_m=64G, and tuned the model size so that placement just fills the memory (any larger and it fails with a "cannot placement" error). I then took a checkpoint with the saver, but found that each server's checkpoint is only about 20G. Why is the checkpoint so much smaller than ps_memory_m? Which parts of XDL use a lot of memory but don't need to be written to the checkpoint?

@deerluffy

I'm not certain, but as I recall, sparse embeddings allocate extra memory at initialization, whereas saving a checkpoint only writes the entries that actually hold values, so the PS memory footprint and the checkpoint size will necessarily differ.
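To make that explanation concrete, here is a minimal back-of-envelope sketch (plain Python, not XDL code): the PS pre-allocates capacity for a hash-table-backed sparse embedding plus per-slot overhead, while the checkpoint only serializes rows that were actually written. All numbers, including `embedding_dim`, `allocated_rows`, `touched_rows`, and `overhead_factor`, are hypothetical assumptions chosen for illustration, not values taken from XDL.

```python
# Back-of-envelope sketch: why a checkpoint can be far smaller than the
# memory a parameter server allocates for a sparse embedding table.
# All figures below are assumptions for illustration only.

def gigabytes(n_bytes: float) -> float:
    return n_bytes / (1024 ** 3)

embedding_dim = 64            # floats per embedding row (assumed)
bytes_per_float = 4

# PS side: capacity reserved at initialization, plus overhead for hash
# buckets, optimizer slots, and growth headroom (assumed factor).
allocated_rows = 200_000_000
overhead_factor = 1.3

# Checkpoint side: only rows that received values during training are saved.
touched_rows = 70_000_000

ps_memory = allocated_rows * embedding_dim * bytes_per_float * overhead_factor
ckpt_size = touched_rows * embedding_dim * bytes_per_float

print(f"approx. PS memory:       {gigabytes(ps_memory):.1f} GB")
print(f"approx. checkpoint size: {gigabytes(ckpt_size):.1f} GB")
```

With these assumed numbers the allocated memory comes out around 60 GB while the saved rows amount to roughly 17 GB, which is the same order of gap described in the issue.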
