
Test on a real 24GB GPU #6

Open

ydyhello opened this issue Jan 14, 2025 · 6 comments

Comments

@ydyhello

Thanks for your great work! In your paper, you mentioned, 'We simulate a 24GB GPU by setting the memory limit with L20.' Could you please clarify whether you tested this on a real 24GB GPU (such as an RTX 4090)?

@dreaming-panda
Contributor

dreaming-panda commented Jan 15, 2025

No. When releasing V0.2, we could not find an RTX 4090 paired with an Intel CPU supporting AVX512, so, as the paper says, we tested decoding performance by capping memory with torch.cuda.set_per_process_memory_fraction(). Theoretically, the RTX 4090 has higher bandwidth than the L20, so it would be interesting if that turned out not to be the case.
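
For concreteness, here is a minimal sketch of simulating a smaller card with a per-process memory cap (the 24GB target and device index are assumptions for illustration, not the repository's exact setup):

```python
import torch

# Cap this process's usable GPU memory so a larger card (e.g. an 80GB A100)
# behaves roughly like a 24GB one. The fraction passed to PyTorch is relative
# to the device's total memory, so it is derived from the target budget.
device = torch.device("cuda:0")
total_bytes = torch.cuda.get_device_properties(device).total_memory
target_bytes = 24 * 1024**3  # assumed 24GB budget

fraction = min(1.0, target_bytes / total_bytes)
torch.cuda.set_per_process_memory_fraction(fraction, device)

# Allocations beyond the cap now raise CUDA out-of-memory errors, which
# approximates a 24GB GPU, although allocator fragmentation can still
# behave differently from a real 24GB card.
```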

@ydyhello
Author

ydyhello commented Jan 15, 2025

Thank you very much for your response!

When I used the offload strategy on a real 24GB GPU, I ran into the following issue: on an A100 (80GB), the peak memory usage was 20GB, but when I ran the same program on a 4090 (24GB), I got a "CUDA out of memory" error. The reason is that the remaining 4GB on the 4090 (24GB - 20GB) is not contiguous: during the prefill phase, one computation needs to allocate a large contiguous block (because the input sequence is long), and the fragmented 4GB cannot satisfy that request, which triggers the error.
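
Here is roughly how I checked that this looks like fragmentation rather than true exhaustion (a minimal sketch; `model.prefill` and `long_input_ids` are placeholders for my actual call, not functions from this repository):

```python
import torch

try:
    out = model.prefill(long_input_ids)  # placeholder for the long-sequence prefill call
except torch.cuda.OutOfMemoryError:
    stats = torch.cuda.memory_stats()
    reserved = stats["reserved_bytes.all.current"]
    allocated = stats["allocated_bytes.all.current"]
    # A large gap between reserved and allocated bytes means the allocator is
    # holding memory that it cannot hand back as one contiguous block.
    print(f"reserved={reserved / 2**30:.2f} GiB, allocated={allocated / 2**30:.2f} GiB")
    print(torch.cuda.memory_summary())
```

(PyTorch also exposes the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True allocator option, which targets fragmentation, but I have not verified whether it helps in this case.)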

I would like to ask: in your tests, as long as the peak memory usage stayed below 24GB, did the program run normally on the 4090?

@dreaming-panda
Contributor

I have never encountered this error (when I ran the prefill stage on a 4090). Could you try making the chunk size smaller?

self.chunk_size = 8192

This makes the peak memory smaller. I hope it solves your problem.
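
For intuition, here is a minimal sketch of why a smaller chunk size lowers the peak: the prompt is processed chunk by chunk while the KV cache carries state between chunks, so intermediate activations only ever cover one chunk. This assumes a Hugging Face-style model(...) interface and is not the repository's actual implementation:

```python
import torch

@torch.no_grad()
def chunked_prefill(model, input_ids, chunk_size=8192):
    # Illustrative only: prefill a long prompt in fixed-size chunks so that
    # activation memory scales with chunk_size rather than the full prompt length.
    past_key_values = None
    out = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(input_ids=chunk, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # KV cache grows; activations stay bounded
    # Decoding starts from the logits of the last prompt token.
    return out.logits[:, -1:, :], past_key_values
```

Lowering chunk_size trades some prefill speed for a smaller single allocation, which is exactly what helps when the remaining free memory is fragmented.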

@ydyhello
Author

Thank you very much for your response. I will try modifying my code.

@ydyhello
Author

Thank you again for your help. I have optimized some redundant calculations in the transformer, and it is now running smoothly on a 24GB GPU.

@dreaming-panda
Contributor

My pleasure.
