Question regarding distributed computing... #946
Comments
Consider the discussion in this PR. They're discussing limiting even highly integrated, high-core-count CPUs to only 8 (or 4) threads, since more cores do not seem to correlate with better performance. I might be misunderstanding, but I think you need faster threads, not more of them.
This is somewhat misleading. The issue in #934 was about the interference of hyperthreaded logical "cores" and efficiency cores (E-cores) on M1 and recent Intel chips (Alder Lake and above).
I think it's a better idea to stick to a single node. Distributed inference has high overhead and is generally a bad idea unless you have an HPC setup. I would suggest sticking to a model (e.g. 30B, 4-bit quantized) that can run on a single node with 32 GB RAM, and then load-balancing your requests across those nodes.
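If each node runs its own full copy of a model that fits in 32 GB, the request distribution in front of them can stay very simple. A minimal round-robin sketch of that idea (the node list, ports, and `send_to_node` are made-up placeholders, not anything in llama.cpp):

```cpp
// Hypothetical front-end that spreads prompts round-robin over nodes,
// each of which runs its own full copy of the model (e.g. 30B, 4-bit).
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct Node { std::string host; int port; };

// Made-up addresses; in practice this would be the 20 machines.
static const std::vector<Node> nodes = {
    {"10.0.0.1", 8080}, {"10.0.0.2", 8080}, {"10.0.0.3", 8080},
};

static std::atomic<std::size_t> next_node{0};

// Placeholder: a real dispatcher would make an HTTP/RPC call here.
static void send_to_node(const Node & n, const std::string & prompt) {
    std::printf("prompt \"%s\" -> %s:%d\n", prompt.c_str(), n.host.c_str(), n.port);
}

static void dispatch(const std::string & prompt) {
    const Node & n = nodes[next_node++ % nodes.size()];
    send_to_node(n, prompt);
}

int main() {
    dispatch("first prompt");
    dispatch("second prompt");
    dispatch("third prompt");
    dispatch("fourth prompt"); // wraps back around to the first node
}
```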
I understand the single-inference case, but wouldn't it be possible to distribute it across 20 computers? (Image from Wikipedia.) What I mean is, for example: PC1 provides the input embedding, the last PC provides the decoder and softmax output, and every PC in between runs one or more transformer blocks. Network-wise, only layer-to-layer transfers would happen (at least from my noob understanding), which are very small (just the input and output of each transformer block).

I understand there is no speedup for a single inference, but if that works I could issue thousands of requests in parallel, which speeds up total compute. For the 65B model, for example, there should be around 10 trillion calculations required per token, so a single output token can at best be as fast as those operations and the read speed of the disk. What the multi-computer setup allows is building an API where we can run multiple "Auto-GPT" instances, or even distribute the work like a SETI@home-style system where a huge number of requests are processed in parallel.

Even assuming one token takes 5 seconds, if 20 computers can process 5000 requests in parallel, that is 1000 tokens/s in aggregate, which is pretty fast, although each individual request then takes roughly 10 minutes to complete. Just my 2 cents on why I think this would be nice to have.
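To make those numbers concrete, here is a small back-of-the-envelope sketch of the split and the throughput claim. The 80-block layer count for 65B, the 5 s/token latency, and the 120-token reply length are assumptions plugged in to reproduce the figures above, not measurements:

```cpp
// Back-of-the-envelope numbers for the pipeline idea above.
// Assumed: LLaMA-65B has 80 transformer blocks; 5 s/token; 120-token replies.
#include <cstdio>

int main() {
    const int    n_layers        = 80;     // transformer blocks in the 65B model
    const int    n_nodes         = 20;     // PCs in the pipeline
    const double sec_per_token   = 5.0;    // assumed end-to-end latency per token
    const int    concurrent_reqs = 5000;   // requests kept in flight
    const int    tokens_per_reply = 120;   // assumed average reply length

    std::printf("layers per node      : %d\n", n_layers / n_nodes);
    std::printf("aggregate throughput : %.0f tokens/s\n",
                concurrent_reqs / sec_per_token);
    std::printf("time per reply       : %.1f minutes\n",
                tokens_per_reply * sec_per_token / 60.0);
}
```

With those inputs this prints 4 layers per node, 1000 tokens/s in aggregate, and about 10 minutes per reply, matching the figures in the comment.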
There already exist many ways to distribute a model across devices at the tensor and operator level. See e.g. https://alpa.ai/index.html. I believe this is out of scope for llama.cpp.
Thank you very much, I will check out alpa.ai and see if it fits my needs :-)
The main thing to solve is making the nodes communicate with each other - for example, over the network. Unless you find a very elegant way to pass and queue messages between the nodes that fits in a few hundred lines of C/C++ code. In that case, this can become a …
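For the "few hundred lines of C/C++" message-passing idea, the core would just be length-prefixed send/receive between neighbouring nodes. A minimal sketch of such framing helpers (plain POSIX sockets, nothing llama.cpp-specific; connection setup is omitted):

```cpp
// Minimal length-prefixed framing over a connected TCP socket (POSIX).
// Only the send/recv helpers are shown; the payload could be e.g. the
// activations for one token, serialized as raw floats.
#include <cstdint>
#include <string>
#include <sys/types.h>
#include <sys/socket.h>

// Write exactly len bytes, retrying on short writes.
static bool write_all(int fd, const void * buf, size_t len) {
    const char * p = static_cast<const char *>(buf);
    while (len > 0) {
        ssize_t n = ::send(fd, p, len, 0);
        if (n <= 0) return false;
        p += n; len -= (size_t) n;
    }
    return true;
}

// Read exactly len bytes, retrying on short reads.
static bool read_all(int fd, void * buf, size_t len) {
    char * p = static_cast<char *>(buf);
    while (len > 0) {
        ssize_t n = ::recv(fd, p, len, 0);
        if (n <= 0) return false;
        p += n; len -= (size_t) n;
    }
    return true;
}

// Message = 8-byte length header (host byte order - fine as long as all
// nodes share the same architecture), followed by the payload bytes.
bool send_msg(int fd, const std::string & payload) {
    uint64_t len = payload.size();
    return write_all(fd, &len, sizeof(len)) &&
           write_all(fd, payload.data(), payload.size());
}

bool recv_msg(int fd, std::string & payload) {
    uint64_t len = 0;
    if (!read_all(fd, &len, sizeof(len))) return false;
    payload.resize(len);
    return len == 0 || read_all(fd, &payload[0], len);
}
```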
If you accept MPI as a dependency, this is actually very possible. The test should be written using multiple processes to simulate multiple nodes.
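A rough sketch of how that could look: each MPI rank stands in for one pipeline stage, receives an activation vector from the previous rank, does some placeholder work where that rank's transformer blocks would actually run, and forwards the result. `mpirun -np 4 ./pipeline` then simulates four nodes with four processes on one machine; none of this is existing llama.cpp code, and the 8192 hidden size is just an assumed value for 65B.

```cpp
// pipeline.cpp - toy MPI pipeline: rank 0 feeds tokens in, the last rank
// collects the result, and each rank stands in for a few transformer blocks.
// Build: mpicxx pipeline.cpp -o pipeline    Run: mpirun -np 4 ./pipeline
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_embd   = 8192;  // assumed hidden size of the 65B model
    const int n_tokens = 4;     // how many tokens to push through the pipeline
    std::vector<float> act(n_embd, 0.0f);

    for (int t = 0; t < n_tokens; ++t) {
        if (rank == 0) {
            act.assign(n_embd, (float) t);   // stand-in for the input embedding
        } else {
            MPI_Recv(act.data(), n_embd, MPI_FLOAT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        for (float & x : act) x += 1.0f;     // stand-in for this rank's layers

        if (rank < size - 1) {
            MPI_Send(act.data(), n_embd, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
        } else {
            std::printf("token %d left the pipeline, act[0] = %.1f\n", t, act[0]);
        }
    }

    MPI_Finalize();
}
```

With an MPI host file, the same binary would in principle run across the 20 physical machines instead of 20 local processes.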
This issue was closed because it has been inactive for 14 days since being marked as stale.
I currently have access to 20 old computers, each with 32 GB RAM, 4 cores, a 256 GB SSD, and 1 Gbit networking, connected to a 48-port switch. (I could get a lot more computers, but I don't currently have enough electricity.)
Would it somehow be possible to distribute the LLaMA model with llama.cpp across the 20 computers, so that the 65B model runs at a moderate speed?
What would I have to do to distribute the model across many computers and run it on CPU?
I am only interested in inference, not training; for training I can rent cloud GPUs.
Thanks for any input, recommendations, or warnings about problems.
What I see as a problem is how to split the model (or models, in case I use others) efficiently so that network bandwidth isn't the limiting factor.
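For reference, a rough estimate of the layer-to-layer traffic such a split would generate, assuming a hidden size of 8192 for the 65B model, fp16 activations, and the 1000 tokens/s aggregate rate mentioned above (all assumptions, not measured numbers):

```cpp
// Rough estimate of layer-to-layer network traffic for a pipeline split.
// Assumptions: hidden size 8192, fp16 activations, 1 Gbit/s links, 1000 tok/s.
#include <cstdio>

int main() {
    const int    n_embd        = 8192;    // assumed hidden size of the 65B model
    const int    bytes_per_val = 2;       // fp16
    const double link_bytes_s  = 125e6;   // 1 Gbit/s is roughly 125 MB/s
    const double tokens_per_s  = 1000.0;  // aggregate rate from the idea above

    const double bytes_per_hop = (double) n_embd * bytes_per_val;  // per token
    const double hop_traffic   = bytes_per_hop * tokens_per_s;     // per link

    std::printf("per token per hop : %.1f KB\n", bytes_per_hop / 1024.0);
    std::printf("per link at %.0f tok/s : %.1f MB/s (%.1f%% of 1 Gbit/s)\n",
                tokens_per_s, hop_traffic / 1e6,
                100.0 * hop_traffic / link_bytes_s);
}
```

If those assumptions are roughly right, this works out to about 16 KB per token per hop and around 16 MB/s per link, so raw bandwidth would not be the limiting factor at that rate; per-hop latency and keeping all 20 nodes busy look like the harder problems.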