Fixes:

- `ggml_rope()` when not inplace: ggml-org/ggml@788381e
- `ggml_rope()` GPT-NeoX mode (hopefully): ggml-org/ggml@788381e
- `ggml_diag_mask_inf()` operator: ggml-org/ggml@a483bb2

The `ggml_rope()` fixes are irrelevant for LLaMA since `n_rot == (n_embd / n_head)`, but they make a difference for other models like GPT-J and GPT-NeoX where `n_rot < (n_embd / n_head)`. I'm still not sure if this is the correct implementation, especially for the GPT-NeoX mode, but the results seem a bit better than before.

The non-inplace multi-threaded `ggml_diag_mask_inf()` was broken here: #1428. Again, this is irrelevant for LLaMA since the forward pass uses `ggml_diag_mask_inf_inplace()`. Might be relevant to @xaedes.

The "scratch buffers" fix might be relevant for LLaMA. See the new `ggml_scratch_save()` and `ggml_scratch_load()` functions and their usage in `ggml.c`: https://github.com/ggerganov/llama.cpp/blob/fixes/ggml.c#LL3925C1-L3939C1

The scratch buffers are a mechanism for reusing memory from previous ops when it is no longer needed. The current way of using them is manual and very error-prone. I will hopefully come up with something better in the future.
More info here: ggml-org/whisper.cpp#431