Use AVX2 to speedup matmulQ40 #54
Merged
Conversation
This project has taught me that I definitely need a faster networking setup. I'm looking at connecting my machines with SFP+ to a switch that has 4 or more SFP+ ports.
Merged. Great job!
Confirmed the speed up. Setup: GitHub Codespaces, 4-core AMD EPYC 7763 64-Core Processor, 16 GB RAM.
0.5.0:
Your PR:
Hi @b4rtaz
I managed to get a significant speed-up on my machine with the following changes: I added AVX2 instructions to speed up matmulQ40 in funcs.cpp.
From my initial testing it definitely appears to be faster.
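For context, here is a minimal sketch of the general shape of an AVX2 Q4_0 × Q8_0 dot-product kernel (the weights are q40 and the buffers q80 in the runs below). This is not the code from this PR: the BlockQ40/BlockQ80 layouts, the scale fields, and the function names are assumptions made for illustration, and it assumes FMA and F16C are available alongside AVX2.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical block layouts used only for this sketch: 32 values per block,
// with element j in the low nibble of qs[j] and element j+16 in the high nibble.
struct BlockQ40 { uint16_t d; uint8_t qs[16]; }; // fp16 scale + 32 packed 4-bit weights
struct BlockQ80 { uint16_t d; int8_t  qs[32]; }; // fp16 scale + 32 int8 activations

// fp16 -> fp32 via F16C (normally present on AVX2-capable CPUs).
static inline float halfToFloat(uint16_t h) {
    return _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(h)));
}

// Dot product of nBlocks Q4_0 weight blocks against matching Q8_0 activation blocks.
// A matmul output element is then just this dot product over one weight row.
float dotQ40Q80(const BlockQ40* w, const BlockQ80* a, int nBlocks) {
    const __m256i lowMask = _mm256_set1_epi8(0x0F);
    const __m256i offset  = _mm256_set1_epi8(8);
    __m256 acc = _mm256_setzero_ps();

    for (int i = 0; i < nBlocks; i++) {
        // Unpack 16 bytes of nibbles into 32 signed bytes in [-8, 7]:
        // low nibbles land in the low 128-bit lane, high nibbles in the high lane.
        __m128i packed = _mm_loadu_si128((const __m128i*)w[i].qs);
        __m256i nibbles = _mm256_insertf128_si256(
            _mm256_castsi128_si256(packed), _mm_srli_epi16(packed, 4), 1);
        __m256i wq = _mm256_sub_epi8(_mm256_and_si256(nibbles, lowMask), offset);

        __m256i aq = _mm256_loadu_si256((const __m256i*)a[i].qs);

        // int8 x int8 dot product: maddubs wants one unsigned operand, so take
        // |wq| and move wq's sign onto aq before multiplying.
        __m256i prod16 = _mm256_maddubs_epi16(_mm256_sign_epi8(wq, wq),
                                              _mm256_sign_epi8(aq, wq));
        __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));

        // Scale the integer partial sum by both block scales and accumulate (FMA).
        __m256 scale = _mm256_set1_ps(halfToFloat(w[i].d) * halfToFloat(a[i].d));
        acc = _mm256_fmadd_ps(scale, _mm256_cvtepi32_ps(prod32), acc);
    }

    // Horizontal sum of the 8 float lanes.
    __m128 sum = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}
```

Built with something like `g++ -O3 -mavx2 -mfma -mf16c`, a kernel of this shape replaces the scalar nibble-unpack-and-multiply loop with 32-wide integer math, which is presumably where the inference-time drop below comes from.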
With 1 worker:
sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
[sudo] password for azamorn:
Using AVX2 instructions
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 16384 kB
⏩ Loaded 6175568 kB
🔶 G 358 ms I 147 ms T 211 ms S 1917438 kB R 442 kB Hello
🔶 G 352 ms I 133 ms T 219 ms S 510 kB R 442 kB World
🔶 G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB !
🔶 G 369 ms I 145 ms T 224 ms S 510 kB R 442 kB
🔶 G 339 ms I 140 ms T 198 ms S 510 kB R 442 kB I
🔶 G 347 ms I 148 ms T 198 ms S 510 kB R 442 kB 'm
🔶 G 368 ms I 150 ms T 218 ms S 510 kB R 442 kB a
🔶 G 361 ms I 137 ms T 223 ms S 510 kB R 442 kB bot
🔶 G 380 ms I 137 ms T 242 ms S 510 kB R 442 kB .
🔶 G 365 ms I 143 ms T 221 ms S 510 kB R 442 kB
🔶 G 356 ms I 139 ms T 217 ms S 510 kB R 442 kB I
🔶 G 356 ms I 145 ms T 211 ms S 510 kB R 442 kB 'm
🔶 G 364 ms I 143 ms T 221 ms S 510 kB R 442 kB here
🔶 G 375 ms I 136 ms T 239 ms S 510 kB R 442 kB to
🔶 G 345 ms I 132 ms T 212 ms S 510 kB R 442 kB help
🔶 G 367 ms I 140 ms T 227 ms S 510 kB R 442 kB you
🔶 G 343 ms I 134 ms T 208 ms S 510 kB R 442 kB with
🔶 G 352 ms I 144 ms T 208 ms S 510 kB R 442 kB any
🔶 G 362 ms I 145 ms T 217 ms S 510 kB R 442 kB questions
🔶 G 344 ms I 143 ms T 200 ms S 510 kB R 442 kB you
Generated tokens: 20
Avg tokens / second: 2.80
Avg generation time: 357.35 ms
Avg inference time: 141.20 ms
Avg transfer time: 215.70 ms
Without a worker:
sudo nice -n -20 ./main inference --steps 20 --prompt "Hello World! " --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8
Using AVX2 instructions
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 32768 kB
⏩ Loaded 6175568 kB
🔶 G 232 ms I 232 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 256 ms I 255 ms T 1 ms S 0 kB R 0 kB World
🔶 G 235 ms I 234 ms T 1 ms S 0 kB R 0 kB !
🔶 G 223 ms I 222 ms T 1 ms S 0 kB R 0 kB
🔶 G 230 ms I 229 ms T 0 ms S 0 kB R 0 kB I
🔶 G 244 ms I 243 ms T 0 ms S 0 kB R 0 kB am
🔶 G 235 ms I 233 ms T 1 ms S 0 kB R 0 kB an
🔶 G 232 ms I 231 ms T 0 ms S 0 kB R 0 kB AI
🔶 G 228 ms I 227 ms T 1 ms S 0 kB R 0 kB designed
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB to
🔶 G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB generate
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB text
🔶 G 225 ms I 224 ms T 0 ms S 0 kB R 0 kB based
🔶 G 229 ms I 228 ms T 0 ms S 0 kB R 0 kB on
🔶 G 232 ms I 230 ms T 1 ms S 0 kB R 0 kB the
🔶 G 227 ms I 225 ms T 1 ms S 0 kB R 0 kB input
🔶 G 228 ms I 227 ms T 0 ms S 0 kB R 0 kB I
🔶 G 228 ms I 226 ms T 1 ms S 0 kB R 0 kB receive
🔶 G 228 ms I 228 ms T 0 ms S 0 kB R 0 kB .
🔶 G 226 ms I 224 ms T 1 ms S 0 kB R 0 kB I
Generated tokens: 20
Avg tokens / second: 4.33
Avg generation time: 231.20 ms
Avg inference time: 229.90 ms
Avg transfer time: 0.60 ms
So it does seem to be working correctly at least, and it's definitely much faster than without AVX2.
For reference, previously I was getting:
With worker:
Avg tokens / second: 2.60
Avg generation time: 384.90 ms
Avg inference time: 184.65 ms
Avg transfer time: 199.60 ms
Without worker:
Avg tokens / second: 3.69
Avg generation time: 271.15 ms
Avg inference time: 269.80 ms
Avg transfer time: 0.90 ms
So with a worker it went up to 2.80 from 2.60 t/s (about 8% faster).
Without a worker it went up to 4.33 from 3.69 t/s (about 17% faster).
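Working those out from the averages above: (2.80 − 2.60) / 2.60 ≈ 7.7% with a worker, and (4.33 − 3.69) / 3.69 ≈ 17.3% without one.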