Question regarding distributed computing... #946
Comments
Consider the discussion in this PR. They're discussing limiting even highly integrated, high-core-count CPUs to only 8 (or 4) threads, since more cores do not seem to correlate with better performance. I might be misunderstanding, but I think you need faster threads, not more of them.
This is somewhat misleading. The issue in #934 was about the interference of hyperthreaded logical "cores" and efficiency cores (E-cores) on M1 and recent Intel chips (Alder Lake and above).
I think it's a better idea to stick to a single node. Distributed inference has high overhead and is generally a bad idea unless you have an HPC setup. I would suggest sticking to a model (e.g. 30B, 4-bit quantized) that can run on a single node with 32 GB RAM, and then load-balancing your requests across those nodes.
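If each node runs its own full copy of a model that fits in 32 GB, the request distribution in front of them can stay very simple. A minimal round-robin sketch of that idea (the node list, ports, and `send_to_node` are made-up placeholders, not anything in llama.cpp):

```cpp
// Hypothetical front-end that spreads prompts round-robin over nodes,
// each of which runs its own full copy of the model (e.g. 30B, 4-bit).
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct Node { std::string host; int port; };

// Made-up addresses; in practice this would be the 20 machines.
static const std::vector<Node> nodes = {
    {"10.0.0.1", 8080}, {"10.0.0.2", 8080}, {"10.0.0.3", 8080},
};

static std::atomic<std::size_t> next_node{0};

// Placeholder: a real dispatcher would make an HTTP/RPC call here.
static void send_to_node(const Node & n, const std::string & prompt) {
    std::printf("prompt \"%s\" -> %s:%d\n", prompt.c_str(), n.host.c_str(), n.port);
}

static void dispatch(const std::string & prompt) {
    const Node & n = nodes[next_node++ % nodes.size()];
    send_to_node(n, prompt);
}

int main() {
    dispatch("first prompt");
    dispatch("second prompt");
    dispatch("third prompt");
    dispatch("fourth prompt"); // wraps back around to the first node
}
```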
I understand the single-inference case, but wouldn't it be possible to distribute it across 20 computers? (Image from Wikipedia.) What I mean is, for example: PC1 provides the input embedding, the last PC provides the decoder and softmax output, and every PC in between runs one or more transformer blocks. Network-wise, only layer-to-layer transfers would happen (at least from my noob understanding), which are very small (just the input and output of each transformer block).

I understand there is no speedup for a single inference, but if that works I could issue thousands of requests in parallel, which speeds up total compute. For the 65B model, for example, there should be around 10 trillion calculations required per token, so a single output token can at best be as fast as those operations and the read speed of the disk. What the multi-computer setup allows is building an API where we can run multiple "Auto-GPT" instances, or even distribute the work like a SETI@home-style system where a huge number of requests are processed in parallel.

Even assuming one token takes 5 seconds, if 20 computers can process 5000 requests in parallel, that is 1000 tokens/s in aggregate, which is pretty fast, although each individual request then takes roughly 10 minutes to complete. Just my 2 cents on why I think this would be nice to have.
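To make those numbers concrete, here is a small back-of-the-envelope sketch of the split and the throughput claim. The 80-block layer count for 65B, the 5 s/token latency, and the 120-token reply length are assumptions plugged in to reproduce the figures above, not measurements:

```cpp
// Back-of-the-envelope numbers for the pipeline idea above.
// Assumed: LLaMA-65B has 80 transformer blocks; 5 s/token; 120-token replies.
#include <cstdio>

int main() {
    const int    n_layers        = 80;     // transformer blocks in the 65B model
    const int    n_nodes         = 20;     // PCs in the pipeline
    const double sec_per_token   = 5.0;    // assumed end-to-end latency per token
    const int    concurrent_reqs = 5000;   // requests kept in flight
    const int    tokens_per_reply = 120;   // assumed average reply length

    std::printf("layers per node      : %d\n", n_layers / n_nodes);
    std::printf("aggregate throughput : %.0f tokens/s\n",
                concurrent_reqs / sec_per_token);
    std::printf("time per reply       : %.1f minutes\n",
                tokens_per_reply * sec_per_token / 60.0);
}
```

With those inputs this prints 4 layers per node, 1000 tokens/s in aggregate, and about 10 minutes per reply, matching the figures in the comment.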
There already exist many ways to distribute a model across devices at the tensor and operator level. See e.g. https://alpa.ai/index.html. I believe this is out of scope for llama.cpp.
Thank you very much, I will check out alpa.ai and see if it fits my needs :-)
The main thing to solve is making the nodes communicate with each other - for example, over the network. Unless you find a very elegant way to pass and queue messages between the nodes that fits in a few hundred lines of C/C++ code. In that case, this can become a …
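For the "few hundred lines of C/C++" message-passing idea, the core would just be length-prefixed send/receive between neighbouring nodes. A minimal sketch of such framing helpers (plain POSIX sockets, nothing llama.cpp-specific; connection setup is omitted):

```cpp
// Minimal length-prefixed framing over a connected TCP socket (POSIX).
// Only the send/recv helpers are shown; the payload could be e.g. the
// activations for one token, serialized as raw floats.
#include <cstdint>
#include <string>
#include <sys/types.h>
#include <sys/socket.h>

// Write exactly len bytes, retrying on short writes.
static bool write_all(int fd, const void * buf, size_t len) {
    const char * p = static_cast<const char *>(buf);
    while (len > 0) {
        ssize_t n = ::send(fd, p, len, 0);
        if (n <= 0) return false;
        p += n; len -= (size_t) n;
    }
    return true;
}

// Read exactly len bytes, retrying on short reads.
static bool read_all(int fd, void * buf, size_t len) {
    char * p = static_cast<char *>(buf);
    while (len > 0) {
        ssize_t n = ::recv(fd, p, len, 0);
        if (n <= 0) return false;
        p += n; len -= (size_t) n;
    }
    return true;
}

// Message = 8-byte length header (host byte order - fine as long as all
// nodes share the same architecture), followed by the payload bytes.
bool send_msg(int fd, const std::string & payload) {
    uint64_t len = payload.size();
    return write_all(fd, &len, sizeof(len)) &&
           write_all(fd, payload.data(), payload.size());
}

bool recv_msg(int fd, std::string & payload) {
    uint64_t len = 0;
    if (!read_all(fd, &len, sizeof(len))) return false;
    payload.resize(len);
    return len == 0 || read_all(fd, &payload[0], len);
}
```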
If you accept MPI as a dependency, this is actually very possible. The test should be written using multiple processes to simulate multiple nodes.
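A rough sketch of how that could look: each MPI rank stands in for one pipeline stage, receives an activation vector from the previous rank, does some placeholder work where that rank's transformer blocks would actually run, and forwards the result. `mpirun -np 4 ./pipeline` then simulates four nodes with four processes on one machine; none of this is existing llama.cpp code, and the 8192 hidden size is just an assumed value for 65B.

```cpp
// pipeline.cpp - toy MPI pipeline: rank 0 feeds tokens in, the last rank
// collects the result, and each rank stands in for a few transformer blocks.
// Build: mpicxx pipeline.cpp -o pipeline    Run: mpirun -np 4 ./pipeline
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_embd   = 8192;  // assumed hidden size of the 65B model
    const int n_tokens = 4;     // how many tokens to push through the pipeline
    std::vector<float> act(n_embd, 0.0f);

    for (int t = 0; t < n_tokens; ++t) {
        if (rank == 0) {
            act.assign(n_embd, (float) t);   // stand-in for the input embedding
        } else {
            MPI_Recv(act.data(), n_embd, MPI_FLOAT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        for (float & x : act) x += 1.0f;     // stand-in for this rank's layers

        if (rank < size - 1) {
            MPI_Send(act.data(), n_embd, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
        } else {
            std::printf("token %d left the pipeline, act[0] = %.1f\n", t, act[0]);
        }
    }

    MPI_Finalize();
}
```

With an MPI host file, the same binary would in principle run across the 20 physical machines instead of 20 local processes.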
This issue was closed because it has been inactive for 14 days since being marked as stale.
I currently have access to 20 old computers, each with 32 GB RAM, 4 cores, a 256 GB SSD, and 1 Gbit networking, connected to a 48-port switch. (I could get a lot more computers, but I don't currently have enough electricity.)
Would it somehow be possible to distribute the LLaMA model with llama.cpp across the 20 computers, so that the 65B model runs at a moderate speed?
What would I have to do to distribute the model across many computers and run it on CPU?
I am only interested in inference, not training; for training I can rent cloud GPUs.
Thanks for any input, recommendations, or warnings about problems.
What I see as a problem is how to split the model (or models, in case I use others) efficiently so that network bandwidth isn't the limiting factor.
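For reference, a rough estimate of the layer-to-layer traffic such a split would generate, assuming a hidden size of 8192 for the 65B model, fp16 activations, and the 1000 tokens/s aggregate rate mentioned above (all assumptions, not measured numbers):

```cpp
// Rough estimate of layer-to-layer network traffic for a pipeline split.
// Assumptions: hidden size 8192, fp16 activations, 1 Gbit/s links, 1000 tok/s.
#include <cstdio>

int main() {
    const int    n_embd        = 8192;    // assumed hidden size of the 65B model
    const int    bytes_per_val = 2;       // fp16
    const double link_bytes_s  = 125e6;   // 1 Gbit/s is roughly 125 MB/s
    const double tokens_per_s  = 1000.0;  // aggregate rate from the idea above

    const double bytes_per_hop = (double) n_embd * bytes_per_val;  // per token
    const double hop_traffic   = bytes_per_hop * tokens_per_s;     // per link

    std::printf("per token per hop : %.1f KB\n", bytes_per_hop / 1024.0);
    std::printf("per link at %.0f tok/s : %.1f MB/s (%.1f%% of 1 Gbit/s)\n",
                tokens_per_s, hop_traffic / 1e6,
                100.0 * hop_traffic / link_bytes_s);
}
```

If those assumptions are roughly right, this works out to about 16 KB per token per hop and around 16 MB/s per link, so raw bandwidth would not be the limiting factor at that rate; per-hop latency and keeping all 20 nodes busy look like the harder problems.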