~2x perf improvement on Apple Silicon by changing state_shared.has_work access from atomic to mutex/condition variable #633


Closed
gjmulder opened this issue on Mar 30, 2023 (discussed in #616) · 5 comments
Labels: enhancement (New feature or request), performance (Speed related topics), stale

Comments

@gjmulder
Collaborator

Discussed in #616

Originally posted by izard March 30, 2023
I profiled on a recent MacBook Pro and found that significantly more time is spent in the atomic checks of state_shared.has_work in the while loops than in doing the actual work in the matrix multiply.
So I changed the busy waits to something like:

pthread_mutex_lock(&state->shared->mutex);
while (state->shared->has_work) {
    pthread_cond_wait(&state->shared->cond, &state->shared->mutex);
}
pthread_mutex_unlock(&state->shared->mutex);

and setting has_work to:

pthread_mutex_lock(&state_shared.mutex);
state_shared.has_work = true;
pthread_cond_broadcast(&state_shared.cond);
pthread_mutex_unlock(&state_shared.mutex);

Got a nice 2x speedup in time per token.

I can't post a patch/pull request because everything I do in my spare time still belongs to my employer, but the change is trivial, as described above. It probably won't provide much benefit (if any) on other platforms, though.
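
For anyone who wants to try the idea outside the llama.cpp source, here is a minimal, self-contained sketch of the pattern described above. The names (shared_state, worker) are illustrative, not the actual ggml symbols: the worker blocks on a condition variable until has_work is set, instead of spinning on an atomic flag.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

// Illustrative stand-in for ggml's shared state; names are hypothetical.
struct shared_state {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work;
    bool            done;
};

// Worker: block on the condition variable until has_work is set,
// instead of spinning on an atomic flag.
static void *worker(void *arg) {
    struct shared_state *s = arg;
    for (;;) {
        pthread_mutex_lock(&s->mutex);
        while (!s->has_work && !s->done) {
            pthread_cond_wait(&s->cond, &s->mutex);  // sleeps instead of busy-waiting
        }
        if (!s->has_work && s->done) {
            pthread_mutex_unlock(&s->mutex);
            return NULL;
        }
        s->has_work = false;
        pthread_mutex_unlock(&s->mutex);

        // ... the actual compute (e.g. one slice of the matrix multiply) goes here ...
        printf("worker: processed one batch of work\n");
    }
}

int main(void) {
    struct shared_state s = { .has_work = false, .done = false };
    pthread_mutex_init(&s.mutex, NULL);
    pthread_cond_init(&s.cond, NULL);

    pthread_t t;
    pthread_create(&t, NULL, worker, &s);

    // Publish work under the mutex and wake the worker,
    // mirroring the "setting has_work" snippet above.
    pthread_mutex_lock(&s.mutex);
    s.has_work = true;
    pthread_cond_broadcast(&s.cond);
    pthread_mutex_unlock(&s.mutex);

    // Ask the worker to exit once the pending work is done, then join.
    pthread_mutex_lock(&s.mutex);
    s.done = true;
    pthread_cond_broadcast(&s.cond);
    pthread_mutex_unlock(&s.mutex);

    pthread_join(t, NULL);
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.mutex);
    return 0;
}

Compile with cc -pthread. Whether this wins over the atomic spin depends on how long the threads actually wait: if waits are short, the cost of putting threads to sleep and waking them can exceed the cost of spinning, which would be consistent with the mixed results reported in the comments below.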

gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels on Mar 30, 2023
@izard

izard commented Mar 31, 2023

I tested further with different model sizes, different prompts, and on Linux. The 2x on the MBP was an outlier; different configurations show different speedups or slowdowns. So the change as-is can't be recommended, though it is a place to look at in further performance analysis.

@prusnak
Collaborator

prusnak commented Mar 31, 2023

> Tested more with different model sizes,

Can you create a draft pull request anyway (with a note that you don't want it merged, just sharing the code with others), so we can test as well?

@bogdad
Contributor

bogdad commented Mar 31, 2023

^ Gave it a try: it runs a bit slower on my machine (7B with a very small n_predict), but maybe it can be improved; I did a mechanical rewrite without much thinking. It's a draft and does not work on Windows, so I'm keeping it in my fork. Feel free to use it :)

@bogdad
Contributor

bogdad commented Apr 2, 2023

#710 - this one tries switching to an existing C thread pool. Eval timings drop slightly, but CPU usage goes from 700% to 400% on 8 threads, which is probably nice for mobile usage. The PR also includes timings; hope that's useful.
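
For context on why CPU usage drops with a pool: idle workers block on a condition variable instead of spinning, so they burn no cycles between tasks. Below is a minimal illustrative pool in the same spirit; it is not the code from #710, and the pool struct, task count, and names are made up for the example.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define N_THREADS 4
#define N_TASKS   8

// Minimal fixed-size "queue": tasks are just indices handed out in order.
struct pool {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             next_task;   // index of the next task to hand out
    int             n_tasks;
    bool            shutdown;
};

static void *pool_worker(void *arg) {
    struct pool *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->mutex);
        while (p->next_task >= p->n_tasks && !p->shutdown) {
            pthread_cond_wait(&p->cond, &p->mutex);   // idle threads sleep here
        }
        if (p->next_task >= p->n_tasks && p->shutdown) {
            pthread_mutex_unlock(&p->mutex);
            return NULL;
        }
        int task = p->next_task++;
        pthread_mutex_unlock(&p->mutex);

        // ... run the task (e.g. one chunk of a matmul) ...
        printf("task %d done\n", task);
    }
}

int main(void) {
    struct pool p = { .next_task = 0, .n_tasks = N_TASKS, .shutdown = false };
    pthread_mutex_init(&p.mutex, NULL);
    pthread_cond_init(&p.cond, NULL);

    pthread_t threads[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&threads[i], NULL, pool_worker, &p);

    // Tasks are already "queued"; request shutdown so workers exit
    // once the queue drains, and wake anyone who is sleeping.
    pthread_mutex_lock(&p.mutex);
    p.shutdown = true;
    pthread_cond_broadcast(&p.cond);
    pthread_mutex_unlock(&p.mutex);

    for (int i = 0; i < N_THREADS; i++)
        pthread_join(threads[i], NULL);

    pthread_cond_destroy(&p.cond);
    pthread_mutex_destroy(&p.mutex);
    return 0;
}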

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
GGUF (Breaking Change to Model Files)
@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.
