
Test on a real 24GB GPU #6

Open

ydyhello opened this issue Jan 14, 2025 · 6 comments

Comments

@ydyhello

Thanks for your great work! In your paper, you mentioned, 'We simulate a 24GB GPU by setting the memory limit with L20.' Could you please clarify whether you tested this on a real 24GB GPU (such as an RTX 4090)?

@dreaming-panda
Contributor

dreaming-panda commented Jan 15, 2025

No. When releasing V0.2, we could not find an RTX 4090 paired with an Intel CPU supporting AVX512, so, as the paper says, we tested decoding performance by capping memory with torch.cuda.set_per_process_memory_fraction(). Theoretically, the RTX 4090 has higher bandwidth than the L20, so it would be interesting if that turned out not to be the case.
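
For concreteness, here is a minimal sketch of simulating a smaller card with a per-process memory cap (the 24GB target and device index are assumptions for illustration, not the repository's exact setup):

```python
import torch

# Cap this process's usable GPU memory so a larger card (e.g. an 80GB A100)
# behaves roughly like a 24GB one. The fraction passed to PyTorch is relative
# to the device's total memory, so it is derived from the target budget.
device = torch.device("cuda:0")
total_bytes = torch.cuda.get_device_properties(device).total_memory
target_bytes = 24 * 1024**3  # assumed 24GB budget

fraction = min(1.0, target_bytes / total_bytes)
torch.cuda.set_per_process_memory_fraction(fraction, device)

# Allocations beyond the cap now raise CUDA out-of-memory errors, which
# approximates a 24GB GPU, although allocator fragmentation can still
# behave differently from a real 24GB card.
```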

@ydyhello
Author

ydyhello commented Jan 15, 2025

Thank you very much for your response!

When I used the offload strategy on a real 24GB GPU, I ran into the following issue: on an A100 (80GB), the peak memory usage was 20GB, but when I ran the same program on a 4090 (24GB), I got a "CUDA out of memory" error. The reason is that the remaining 4GB on the 4090 (24GB - 20GB) is not contiguous: during the prefill phase, one computation needs to allocate a large contiguous block (because the input sequence is long), and the fragmented 4GB cannot satisfy that request, which triggers the error.
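
Here is roughly how I checked that this looks like fragmentation rather than true exhaustion (a minimal sketch; `model.prefill` and `long_input_ids` are placeholders for my actual call, not functions from this repository):

```python
import torch

try:
    out = model.prefill(long_input_ids)  # placeholder for the long-sequence prefill call
except torch.cuda.OutOfMemoryError:
    stats = torch.cuda.memory_stats()
    reserved = stats["reserved_bytes.all.current"]
    allocated = stats["allocated_bytes.all.current"]
    # A large gap between reserved and allocated bytes means the allocator is
    # holding memory that it cannot hand back as one contiguous block.
    print(f"reserved={reserved / 2**30:.2f} GiB, allocated={allocated / 2**30:.2f} GiB")
    print(torch.cuda.memory_summary())
```

(PyTorch also exposes the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True allocator option, which targets fragmentation, but I have not verified whether it helps in this case.)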

I would like to ask: in your tests, as long as the peak memory usage stayed below 24GB, did the program run normally on the 4090?

@dreaming-panda
Contributor

I have never encountered this error (when I ran the prefill stage on a 4090). Could you try making the chunk size smaller?

self.chunk_size = 8192

This makes the peak memory smaller. I hope it solves your problem.
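
For intuition, here is a minimal sketch of why a smaller chunk size lowers the peak: the prompt is processed chunk by chunk while the KV cache carries state between chunks, so intermediate activations only ever cover one chunk. This assumes a Hugging Face-style model(...) interface and is not the repository's actual implementation:

```python
import torch

@torch.no_grad()
def chunked_prefill(model, input_ids, chunk_size=8192):
    # Illustrative only: prefill a long prompt in fixed-size chunks so that
    # activation memory scales with chunk_size rather than the full prompt length.
    past_key_values = None
    out = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(input_ids=chunk, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # KV cache grows; activations stay bounded
    # Decoding starts from the logits of the last prompt token.
    return out.logits[:, -1:, :], past_key_values
```

Lowering chunk_size trades some prefill speed for a smaller single allocation, which is exactly what helps when the remaining free memory is fragmented.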

@ydyhello
Author

Thank you very much for your response. I will try modifying my code.

@ydyhello
Author

Thank you again for your help. I have optimized some redundant calculations in the transformer, and it is now running smoothly on a 24GB GPU.

@dreaming-panda
Contributor

My pleasure.
