feat: splitting multihead attention into all nodes. #46

Merged: 6 commits merged into main from feat/qkv on May 13, 2024

Conversation

b4rtaz (Owner) commented May 11, 2024

Test

Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 x Raspberry Pi 5 8GB + TP-Link LS1008G Switch

Transfer size / token

| Devices | 0.3.0 | This PR | Percentage change |
|---|---|---|---|
| 2 x Raspberry Pi 5 | S 646 kB + R 476 kB = 1122 kB | S 578 kB + R 442 kB = 1020 kB | -9.09% |
| 4 x Raspberry Pi 5 | S 2295 kB + R 714 kB = 3009 kB | S 2193 kB + R 663 kB = 2856 kB | -5.08% |
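
(For reference, the percentage change is computed from the totals, e.g. for 2 devices: (1020 − 1122) / 1122 ≈ −9.09%.)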

Avg time / token

| Devices | Metric | 0.3.0 | This PR | Percentage change |
|---|---|---|---|---|
| 2 x Raspberry Pi 5 | Avg generation time | 444.27 ms | 381.81 ms | |
| 2 x Raspberry Pi 5 | Avg inference time | 362.73 ms | 349.94 ms | -3.53% |
| 2 x Raspberry Pi 5 | Avg transfer time | 80.11 ms | 30.31 ms* | |
| 4 x Raspberry Pi 5 | Avg generation time | 331.47 ms | 359.44 ms | |
| 4 x Raspberry Pi 5 | Avg inference time | 267.62 ms | 258.00 ms | -3.59% |
| 4 x Raspberry Pi 5 | Avg transfer time | 62.34 ms | 99.69 ms | |

* I think the switch used here is quite non-deterministic; it achieves different speeds at different times. So I recommend comparing only the avg inference time.
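
As a rough consistency check, avg inference time + avg transfer time approximately matches avg generation time (e.g. 349.94 ms + 30.31 ms ≈ 380 ms vs. the reported 381.81 ms for 2 devices with this PR).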

b4rtaz (Owner, Author) commented May 11, 2024

To merge this PR I need to fix the Mixtral & Grok architectures.

b4rtaz (Owner, Author) commented May 11, 2024

I changed the implementation a bit; now there is no synchronization step between llamaQuantizeMultiheadAtt and llamaAtt.

Transfer size / token

| Devices | 0.3.0 | This PR v2 | Percentage change |
|---|---|---|---|
| 2 devices | S 646 kB + R 476 kB = 1122 kB | S 510 kB + R 442 kB = 952 kB | -15.15% |
| 4 devices | S 2295 kB + R 714 kB = 3009 kB | S 1887 kB + R 867 kB = 2754 kB | -8.47% |
| 8 devices | S 5771 kB + R 833 kB = 6604 kB | S 4819 kB + R 1487 kB = 6306 kB | -4.51% |

The final state of the attention synchronization looks like this for a single block:

root --- xb  ---> node
root <-- xbv ---- node
merge att

The previous implementation:

root --- xb  --> node
root <-- q  ---- node
root <-- k  ---- node
root <-- v  ---- node
root --- xb ---> node
root <-- xb2 --- node
merge att
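
To make the new pattern concrete, here is a minimal single-process C++ sketch (not the actual distributed-llama code; NodeSlice and computeAttentionSlice are hypothetical names, and the per-head math is just a stand-in): the root hands each node its head range plus xb, each node computes attention for its own heads locally, and the root merges the returned xbv slices.

```cpp
// Minimal sketch of the "root --- xb ---> node / root <-- xbv ---- node / merge att"
// pattern, simulated in one process. NodeSlice and computeAttentionSlice are
// hypothetical illustration names, not the real distributed-llama API.
#include <algorithm>
#include <cstdio>
#include <vector>

struct NodeSlice {
    int headStart;  // first attention head owned by this node
    int headCount;  // number of heads owned by this node
};

// Stand-in for the node's local work. In the real implementation this would be
// the Q/K/V projections and softmax over the KV cache for the node's own heads;
// here every head just averages xb so the example stays self-contained.
static std::vector<float> computeAttentionSlice(const NodeSlice& s,
                                                const std::vector<float>& xb,
                                                int headDim) {
    float mean = 0.0f;
    for (float v : xb) mean += v;
    mean /= (float)xb.size();
    return std::vector<float>(s.headCount * headDim, mean);
}

int main() {
    const int nHeads = 8, headDim = 4, nNodes = 2;
    const int headsPerNode = nHeads / nNodes;
    std::vector<float> xb(nHeads * headDim, 1.0f);      // root's activation for one block
    std::vector<float> merged(nHeads * headDim, 0.0f);  // merged attention output

    for (int n = 0; n < nNodes; n++) {
        NodeSlice s { n * headsPerNode, headsPerNode };

        // root --- xb ---> node : in the real setup this is the only send per block
        std::vector<float> xbv = computeAttentionSlice(s, xb, headDim);

        // root <-- xbv ---- node : root drops the slice in at the node's head offset
        std::copy(xbv.begin(), xbv.end(), merged.begin() + s.headStart * headDim);
    }

    // "merge att": merged now holds the concatenated per-node attention outputs
    printf("merged %zu floats, merged[0] = %.2f\n", merged.size(), merged[0]);
    return 0;
}
```

The point of the change is visible in the loop: per block, each node exchanges only xb and its xbv slice with the root, instead of the q/k/v tensors plus a second xb round trip as in the previous implementation.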

b4rtaz marked this pull request as ready for review on May 13, 2024 21:26
b4rtaz merged commit af8b317 into main on May 13, 2024
2 checks passed

DifferentialityDevelopment (Contributor) commented

Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here:
`float* logits = inference->infer(token, pos);`

I thought it might be the changes I was working on, as I was cleaning up server.cpp, but then I tried it on main and I get the same behavior.

sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 16384 kB
⏩ Loaded 6175568 kB

Then nothing happens: CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?

b4rtaz (Owner, Author) commented May 14, 2024

@DifferentialityDevelopment have you pulled up to this commit? I accidentally disabled memory allocation.

DifferentialityDevelopment (Contributor) commented

No, I think it might have been my bad; I just realized I forgot to rebuild the worker with the latest code.

b4rtaz deleted the feat/qkv branch on May 18, 2024 11:50