feat: splitting multihead attention into all nodes. #46
Conversation
To merge this PR I need to fix the Mixtral & Grok architectures.
I changed the implementation a bit; now there is no synchronization between […].
The final state of the attention synchronization looks like this for a single block:
The previous implementation:
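In rough pseudocode, the head-splitting idea suggested by the title looks something like the sketch below: each node computes attention (scores, softmax, weighted sum over the KV cache) only for its own slice of heads, so per block only that slice of the attention output has to be synchronized instead of the full Q/K/V tensors. The function name, shapes, and cache layout are illustrative assumptions, not the PR's actual code.

```cpp
// Minimal sketch, assuming heads are split evenly across nodes and a simple
// row-major KV cache layout. Not the PR's real implementation.
#include <cmath>
#include <vector>

// Attention for the heads owned by one node, for a single query position `pos`.
// q:      [nHeads * headSize]                 query of the current token
// kCache: [(pos + 1) * nHeads * headSize]     cached keys, row-major by position
// vCache: same layout as kCache
// out:    [nHeads * headSize]                 only this node's slice is written
void attentionForNode(int nodeIndex, int nNodes, int nHeads, int headSize, int pos,
                      const float* q, const float* kCache, const float* vCache,
                      float* out) {
    int headsPerNode = nHeads / nNodes;   // assumes nHeads % nNodes == 0
    int firstHead = nodeIndex * headsPerNode;
    float scale = 1.0f / std::sqrt((float)headSize);

    for (int h = firstHead; h < firstHead + headsPerNode; h++) {
        const float* qh = q + h * headSize;
        std::vector<float> scores(pos + 1);

        // Scaled dot-product scores against every cached position.
        float maxScore = -1e30f;
        for (int t = 0; t <= pos; t++) {
            const float* kh = kCache + (t * nHeads + h) * headSize;
            float s = 0.0f;
            for (int i = 0; i < headSize; i++) s += qh[i] * kh[i];
            scores[t] = s * scale;
            if (scores[t] > maxScore) maxScore = scores[t];
        }

        // Softmax over positions.
        float sum = 0.0f;
        for (int t = 0; t <= pos; t++) {
            scores[t] = std::exp(scores[t] - maxScore);
            sum += scores[t];
        }

        // Weighted sum of cached values -> this head's output slice.
        float* oh = out + h * headSize;
        for (int i = 0; i < headSize; i++) oh[i] = 0.0f;
        for (int t = 0; t <= pos; t++) {
            const float* vh = vCache + (t * nHeads + h) * headSize;
            float w = scores[t] / sum;
            for (int i = 0; i < headSize; i++) oh[i] += w * vh[i];
        }
    }
    // After this, only headsPerNode * headSize values per node need to be
    // gathered, which is what removes the extra synchronization per block.
}
```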
Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here. I thought it might be the changes I was working on, since I was cleaning up server.cpp, but then I tried it on main and I get the same behavior.

sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990

Then nothing happens; CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?
@DifferentialityDevelopment have you pulled up to this commit? I accidentally disabled memory allocation.
No, I think it might have been my bad, as I just realized I forgot to rebuild the worker with the latest code.
Test
Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 x Raspberry Pi 5 8GB + TP-Link LS1008G Switch
Transfer size / token
Avg tokens / second
* I think the switch used is completely non-deterministic; it achieves a random speed at different times, so I recommend comparing only the avg inference time.
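Since the reported transfer size per token depends on the model shape, node count, and buffer quantization, a rough back-of-the-envelope estimator is sketched below. The hidden size, layer count, Q80 block layout, and the two-syncs-per-block pattern are all assumptions for illustration, not values taken from this PR.

```cpp
// Hedged estimate of per-token network traffic for a tensor-parallel
// transformer block. Every constant below is an assumption.
#include <cstdio>

int main() {
    const double dim = 4096;      // Llama 3 8B hidden size (assumed)
    const double layers = 32;     // Llama 3 8B block count (assumed)
    const double nodes = 4;       // 4 x Raspberry Pi 5 in the test setup
    const double bytesPerValue = 34.0 / 32.0; // Q80: 32 int8 + fp16 scale per block (assumed layout)

    // Assumed pattern: per block, each worker sends its slice of the attention
    // output and the FFN output, and receives the combined vectors back.
    const double syncsPerBlock = 2;
    const double sliceSize = dim / nodes;
    const double sendBytes = layers * syncsPerBlock * sliceSize * bytesPerValue;
    const double recvBytes = layers * syncsPerBlock * dim * bytesPerValue;

    printf("approx send per worker per token: %.0f kB\n", sendBytes / 1024.0);
    printf("approx recv per worker per token: %.0f kB\n", recvBytes / 1024.0);
    return 0;
}
```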