Is your feature request related to a problem? Please describe.
Currently, to test the model one needs an Nvidia GPU. In principle it should be possible to run it with an Apple GPU via Metal as well; this is already implemented in llama.cpp.
Being able to run locally on a Mac would be useful for development. But how sustainable would this solution be, and how much time would it take to make it work?
In the pyproject.toml, flash_attn is listed as optional, so one could just replace it with flex_attn. If we don't plan to do that, I think it should be changed, as it can be confusing.
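To make this concrete, here is a minimal sketch of what an attention-backend fallback could look like when flash_attn is not installed (e.g. on a Mac). It is only an illustration under assumptions: the `attention` wrapper and its tensor layout are hypothetical, and PyTorch's built-in `scaled_dot_product_attention` stands in for the non-CUDA path as a simpler baseline than flex_attention, which would be the more involved replacement mentioned above.

```python
# Minimal sketch of an attention-backend fallback; the `attention` wrapper
# and its tensor layout are hypothetical, not part of the current codebase.
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # CUDA-only wheel
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False


def attention(q, k, v, causal: bool = True):
    """q, k, v: (batch, heads, seq, head_dim)."""
    if HAS_FLASH_ATTN and q.is_cuda:
        # flash_attn_func expects (batch, seq, heads, head_dim)
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=causal
        )
        return out.transpose(1, 2)
    # On MPS (Apple GPU) or CPU, fall back to PyTorch's built-in SDPA.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```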
Describe the solution you'd like
When the model is installed on a Mac, it should use this dependency: https://github.com/philipturner/metal-flash-attention
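Note that metal-flash-attention is a Metal/Swift library rather than a pip-installable Python package, so wiring it in would need dedicated bridging work. Independent of which Metal kernel ends up being used, a first step could be runtime device selection via PyTorch's MPS backend; a minimal sketch (the helper name is illustrative, not part of the existing codebase):

```python
# Minimal device-selection sketch for local testing on a Mac; `pick_device`
# is an illustrative helper, not part of the existing codebase.
import torch


def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        # Apple-GPU path through PyTorch's Metal Performance Shaders backend.
        return torch.device("mps")
    return torch.device("cpu")


# Usage: model = MyModel().to(pick_device())
```

On the packaging side, a PEP 508 environment marker (e.g. `flash-attn; sys_platform == 'linux'`) could keep the CUDA-only extra from being pulled in on macOS, though how that fits depends on how the optional dependency is currently declared in pyproject.toml.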
Describe alternatives you've considered
No response
Additional context
No response
Organisation
AWI