Test on a real 24GB GPU #6
No. When releasing V0.2, we could not find an RTX 4090 paired with an Intel CPU supporting AVX512. So, as the paper states, we tested decoding performance by using torch.cuda.set_per_process_memory_fraction(). Theoretically, the RTX 4090 should have higher bandwidth than the L20; it would be interesting if that turned out not to be the case.
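For reference, a minimal sketch of how such a memory cap could be applied with torch.cuda.set_per_process_memory_fraction(), assuming PyTorch with CUDA available. The 24GB cap, device index 0, and the helper name limit_gpu_memory are illustrative choices, not values taken from the repository:

```python
# Sketch: simulate a ~24GB GPU on a larger card (e.g. an L20) by capping
# the fraction of device memory this process may allocate.
import torch

def limit_gpu_memory(cap_bytes: int = 24 * 1024**3, device: int = 0) -> None:
    """Cap this process's usable CUDA memory to roughly `cap_bytes`."""
    total = torch.cuda.get_device_properties(device).total_memory
    fraction = min(1.0, cap_bytes / total)
    # Allocations beyond this fraction of total device memory raise OOM,
    # approximating a physically smaller card.
    torch.cuda.set_per_process_memory_fraction(fraction, device)

if __name__ == "__main__":
    limit_gpu_memory()
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Simulating ~24GB on device 0 ({total_gb:.0f}GB total)")
```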
Thank you very much for your response! When I used the offload strategy with a real 24GB GPU, I encountered the following issue: when using the A100 (80GB), the peak memory usage was 20GB. However, when using the same program on the 4090 (24GB), I encountered a “CUDA out of memory” error. This is because the remaining 4GB of memory on the 4090 (24GB - 20GB) is not contiguous. During the prefill phase, a certain computation requires the allocation of a large amount of contiguous memory (due to the long input sequence), and the fragmented 4GB of memory cannot meet this requirement, leading to the error. I would like to ask whether, in your tests, if the peak memory usage is less than 24GB, the program can still run normally on the 4090? |
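As a side note, a small diagnostic sketch for the fragmentation symptom described above, assuming PyTorch: the gap between reserved and allocated memory hints at fragmentation, and the expandable_segments allocator option is one knob that can reduce it. Whether it resolves this particular prefill OOM is untested here; the function name report_fragmentation is purely illustrative:

```python
import os

# Must be set before the first CUDA allocation to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def report_fragmentation(device: int = 0) -> None:
    """Print allocated vs. reserved memory; a large gap suggests fragmentation."""
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated: {allocated / 1024**3:.2f} GB")
    print(f"reserved:  {reserved / 1024**3:.2f} GB")
    print(f"gap (potentially fragmented): {(reserved - allocated) / 1024**3:.2f} GB")

if __name__ == "__main__":
    # Call this right before the prefill step that triggers the OOM.
    report_fragmentation()
```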
Thank you very much for your response. I will try modifying my code.
Thank you again for your help. I have optimized some redundant calculations in the transformer, and it is now running smoothly on a 24GB GPU.
My pleasure. |
Thanks for your great work! In your paper, you mentioned, 'We simulate a 24GB GPU by setting the memory limit with L20.' Could you please clarify if you tested this on a real 24GB GPU (such as RTX 4090)?