
[Enhancement]: Implement optimizations used in CTranslate2 #811

Closed
janekb04 opened this issue Apr 6, 2023 · 3 comments
Labels: enhancement (New feature or request), stale

Comments


janekb04 commented Apr 6, 2023

CTranslate2 is a "competitor" to llama.cpp that advertises itself with:

Fast and efficient execution on CPU and GPU

The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc.

I am no expert in LLMs and I don't know what these optimizations are, but would it be possible, feasible, and/or desirable to implement them in llama.cpp or GGML?

ggerganov added the enhancement (New feature or request) label on Apr 7, 2023

guillaumekln commented Apr 8, 2023

(Hi there, I'm the author of CTranslate2.)

llama.cpp already implements similar optimizations. They often come naturally when reimplementing a model in C/C++.

In my experience the most impactful optimization is to integrate vendor-specific libraries to run the matrix multiplications, which are usually the bottleneck for these models. For example, Apple Accelerate was a huge win for performance when it was first integrated into whisper.cpp. For x64 processors I recommend oneDNN, which has a very good 8-bit GEMM implementation (as fast as Intel MKL).

However, I'm not aware of similar libraries providing efficient 4-bit GEMM at this time, and I also understand that llama.cpp is trying to avoid additional dependencies as much as possible.
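
For illustration, a minimal sketch (not actual llama.cpp or CTranslate2 code) of handing a matrix multiplication off to a vendor BLAS, assuming the standard CBLAS interface that Apple Accelerate exposes; the wrapper name `matmul_blas` and the row-major layout are just for the example:

```cpp
// Sketch: offload C = A * B to a vendor BLAS (e.g. Apple Accelerate,
// which exposes the standard CBLAS interface).
// Row-major float matrices: A is m x k, B is k x n, C is m x n.
#include <cblas.h>  // on macOS: #include <Accelerate/Accelerate.h>

void matmul_blas(const float *A, const float *B, float *C,
                 int m, int n, int k) {
    // C = 1.0f * A * B + 0.0f * C
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A, /*lda=*/k,
                      B, /*ldb=*/n,
                0.0f, C, /*ldc=*/n);
}
```

The 8-bit path mentioned above would look similar in spirit, with quantized inputs fed to an integer GEMM such as oneDNN's, plus a scale/dequantization step on the output.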


jon-chuang commented Apr 12, 2023

So are we already fusing and tiling the attention layer to fit in CPU SRAM, à la FlashAttention?

Edit: I guess it is currently being experimented on: #778
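
For context, a rough sketch of the FlashAttention-style idea being referenced: process K/V in blocks small enough to stay in cache and keep a running (online) softmax, so the full n×n score matrix is never materialized. This is purely illustrative C++, not the code from #778; the shapes, names, and single-query simplification are assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Attention for one query vector q[d] against K[n*d], V[n*d] (row-major),
// processing keys/values in blocks and keeping a running softmax so only one
// block of K/V needs to be resident in cache at a time.
void attention_tiled(const float *q, const float *K, const float *V,
                     float *out, int n, int d, int block) {
    const float scale = 1.0f / std::sqrt((float)d);
    float m = -INFINITY;               // running max of scores
    float l = 0.0f;                    // running sum of exp(score - m)
    std::vector<float> acc(d, 0.0f);   // running weighted sum of V rows

    for (int j0 = 0; j0 < n; j0 += block) {        // one K/V block per iteration
        const int j1 = std::min(j0 + block, n);
        for (int j = j0; j < j1; ++j) {
            float s = 0.0f;                        // score = scale * (q . K[j])
            for (int t = 0; t < d; ++t) s += q[t] * K[j*d + t];
            s *= scale;

            const float m_new = std::max(m, s);
            const float corr  = std::exp(m - m_new);   // rescale old state
            const float p     = std::exp(s - m_new);

            for (int t = 0; t < d; ++t)
                acc[t] = acc[t] * corr + p * V[j*d + t];
            l = l * corr + p;
            m = m_new;
        }
    }
    for (int t = 0; t < d; ++t) out[t] = acc[t] / l;  // normalize
}
```

A real implementation would also tile over queries and vectorize the inner loops; the point here is only that the softmax can be computed incrementally, block by block.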

github-actions bot added the stale label on Mar 25, 2024
github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.
