
Question about the benchmark results #421

Open
frankxyy opened this issue Nov 7, 2022 · 2 comments

Comments

frankxyy commented Nov 7, 2022

(two screenshots of benchmark results attached)

On the same 1n1g (1 node, 1 GPU) machine, why does tensor model parallel get fewer samples/s even though its batch size is larger?


chengtbf commented Nov 7, 2022

1. Look at the `ac` parameter (activation checkpointing). This is backward recomputation: during the backward pass an extra forward pass is run, which cuts activation memory usage substantially (by roughly 40%, allowing a larger batch size) at the cost of roughly 20% performance overhead.

   The exact overhead depends on the forward pass's share of total compute; in the ac scenario that share is larger, about 1/3 = forward / (forward + backward), since for typical networks the backward pass costs about twice the compute of the forward pass.

   Tensor model parallel uses ac here, which is why it can run a batch size as large as 128; the price is the extra forward pass.

2. Increasing the batch size does not always increase speed. Once GPU utilization is saturated, increasing the batch size further does not increase throughput.
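The overhead estimate above can be sketched with back-of-the-envelope arithmetic (illustrative only, assuming backward costs 2× forward; in practice the measured overhead is lower, around 20%, since recomputation need not cover the full forward pass):

```python
# Relative compute costs, assuming backward ≈ 2x forward (typical networks).
forward = 1.0
backward = 2.0 * forward

# Without activation checkpointing: one forward + one backward.
baseline = forward + backward  # 3.0

# With activation checkpointing: an extra forward recompute before backward.
with_ac = forward + forward + backward  # 4.0

overhead = (with_ac - baseline) / baseline
print(f"recompute overhead: {overhead:.0%}")  # prints: recompute overhead: 33%
```

This 1/3 figure is the upper bound when every activation is recomputed; checkpointing only some layers lands between 0% and 33%.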


frankxyy commented Nov 8, 2022

Oh, I see. So for BERT, using tensor parallelism brings no benefit here.
