Consider the following task: running multiple prompts through the same LLM.
The LLM weights do not fit into VRAM, but they can be loaded partially, layer by layer or in groups of layers. Of course, loading weights into VRAM is too slow to stream them back and forth between CPU and GPU for a single request, but... we can save the intermediate results of the computation between layers.
So we can load part of the layers into VRAM, run a partial forward pass for every task in the batch, save the activations, load the next group of layer weights, continue the computation for the waiting tasks, and repeat. This way all layer computations run on the GPU, and the cost of moving weights from CPU to GPU is amortized by increasing the number of tasks in the batch.
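A minimal sketch of that schedule, assuming the model is just a list of `torch.nn.Module` layers kept in CPU RAM (this is an illustration of the idea, not the API of any particular library):

```python
import torch

def run_batch(layers, activations, group_size=8, device="cuda"):
    """layers: list of nn.Module kept on CPU; activations: one hidden-state tensor per request."""
    for start in range(0, len(layers), group_size):
        group = layers[start:start + group_size]

        # One host-to-device weight transfer per group, amortized over the whole batch.
        for layer in group:
            layer.to(device)

        # Reuse the resident weights for every request before evicting them.
        for i, act in enumerate(activations):
            h = act.to(device)
            for layer in group:
                h = layer(h)
            activations[i] = h.cpu()        # park the intermediate result in RAM

        # Evict the group to make room for the next one.
        for layer in group:
            layer.to("cpu")

    return activations
```

In practice the per-request activations would be stacked into one batch tensor so each layer runs a single large matmul instead of a loop, but the scheduling idea is the same.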
A desktop GPU is roughly 6-10x faster than a CPU for this. The weight-transfer overhead is comparable to 1-2 CPU runs, so with more than about 10 tasks in a batch the streamed GPU run can come out ahead regardless of VRAM size (the minimum being enough VRAM for the largest single layer)! Only the PCIe bus bandwidth really matters.
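A rough back-of-envelope check of that break-even point, with assumed (not benchmarked) numbers and ignoring any overlap of transfer with compute; the CPU baseline is taken as processing the batch sequentially at its own per-request token rate:

```python
# All numbers below are assumptions for illustration, not measurements.
weights_gb  = 14.0    # e.g. ~7B parameters in fp16
pcie_gbps   = 25.0    # effective PCIe 4.0 x16 host-to-device bandwidth, GB/s
cpu_tok_s   = 5.0     # assumed CPU decode speed, tokens/s per request
gpu_speedup = 8.0     # assumed GPU-vs-CPU compute ratio

transfer_s  = weights_gb / pcie_gbps      # streaming all weights once per token step
cpu_token_s = 1.0 / cpu_tok_s             # CPU time per token per request
gpu_token_s = cpu_token_s / gpu_speedup   # GPU time per token per request

# Streaming pays off once batch_size * (cpu - gpu time) exceeds the transfer cost:
# transfer_s + b * gpu_token_s < b * cpu_token_s
break_even = transfer_s / (cpu_token_s - gpu_token_s)
print(f"break-even batch size ≈ {break_even:.1f} requests per streamed pass")
```

With these particular numbers the break-even is only a few requests per token step, which is consistent with "more than 10 tasks" being comfortably faster; slower PCIe or a faster CPU pushes the threshold up.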
How much RAM is required to store the context of a single request in the batch? Is there data that can be shared or compressed without sacrificing speed? Can ~100 contexts be stored in 64 GB of RAM?
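A rough per-request estimate of the KV cache, assuming a LLaMA-7B-like architecture (32 layers, 32 KV heads, head dim 128, fp16) and a full 2048-token context; these numbers are assumptions for illustration:

```python
# Per-request KV-cache size estimate; architecture numbers are assumptions.
layers     = 32
kv_heads   = 32        # no grouped-query attention assumed
head_dim   = 128
seq_len    = 2048
bytes_elem = 2         # fp16

# K and V each store (layers * kv_heads * head_dim) values per token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_elem
print(f"{kv_bytes / 2**20:.0f} MiB per request")          # ~1024 MiB
print(f"{100 * kv_bytes / 2**30:.0f} GiB for 100 requests")
```

Under these assumptions 100 full-length fp16 contexts need roughly 100 GiB, so 64 GB of RAM fits on the order of 60 such contexts; shorter prompts or a quantized (8-bit/4-bit) KV cache would bring ~100 within reach.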
As far as I understand, this is the approach implemented in this project:
https://github.com/FMInference/FlexGen/blob/main/docs/block_schedule.jpg
p.s. Speeding this up with multiple GPUs is much more complex; as I understand it, a single matrix can be divided into parts and computed on multiple GPUs simultaneously.
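A toy illustration of that matrix-splitting idea (column-parallel matmul, as in tensor parallelism), using NumPy arrays in place of real devices; in practice each shard would live on a different GPU and the partial outputs would be gathered over the interconnect:

```python
import numpy as np

x = np.random.randn(4, 512)                   # activations (batch, hidden)
W = np.random.randn(512, 2048)                # one weight matrix

shards = np.split(W, 4, axis=1)               # split output columns across 4 "devices"
partials = [x @ shard for shard in shards]    # each device computes its slice
y = np.concatenate(partials, axis=1)          # gather: concatenate along columns

assert np.allclose(y, x @ W)                  # same result as the unsplit matmul
```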