Introduction
LLamaSharp uses llama.cpp as a backend and has introduced dynamic native library loading, which allows us to choose which DLL to load at runtime. However, users still need to install the backend packages unless they have exactly one DLL to use. The problem is that, most of the time, a user only needs one DLL, for example the CUDA11 one. Nevertheless, many DLLs have to be included, especially if we support CUDA combined with AVX in the future.
Splitting into backend packages that each contain a single file, as previously discussed in other issues, appears to be a solution. However, if the user has already chosen a specific backend, what is the purpose of our backend selection strategy? Furthermore, this approach may lead to an excessive number of backend packages, causing potential difficulties.
Is it possible to select the native library based on the configuration and system information, download only the selected one, and avoid having too many backend packages? That is the point of this proposal.
Brief Description
My idea is to put all the native library files on HuggingFace, then download the selected one according to the configuration and system information at runtime. That's all!
APIs
The following APIs will be exposed to users for this feature.
// Use along with other strategies such as `WithCuda`.
NativeLibraryConfig NativeLibraryConfig::WithAutoDownload(bool enable = true, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
// Explicitly download the library with filename and version.
void NativeLibraryConfig::DownloadNativeLibrary(string filename, string? version = null, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
// Explicitly download the library with specified configurations.
void NativeLibraryConfig::DownloadNativeLibrary(bool useCuda, AvxLevel avxLevel, string os = "auto", string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
// Explicitly download the best library (for efficiency) selected by LLamaSharp according to detected system info.
void NativeLibraryConfig::DownloadBestNativeLibrary(string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
P.S. To be honest, I don't think it's ideal to put the download methods in `NativeLibraryConfig`, but I haven't come up with a better idea yet.
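For illustration, here is a minimal usage sketch of the proposed `WithAutoDownload` method, assuming it is chained from the existing `NativeLibraryConfig.Instance` entry point together with the existing `WithCuda` strategy; the parameter values are placeholders rather than recommendations:

```csharp
using LLama.Native;

// Hedged sketch: `WithAutoDownload` is the proposed method; `WithCuda` and the
// `Instance` entry point are part of the existing configuration API.
NativeLibraryConfig.Instance
    .WithCuda()
    .WithAutoDownload(
        enable: true,
        cacheDir: null,     // use the default cache directory
        endPoint: null,     // use the default Huggingface endpoint
        proxy: null,
        timeout: -1);       // no timeout
```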
Behaviors
Priorities
The most important question is: what should the behavior be when this feature is used while a backend package is also installed?
My answer is that we'll follow the priorities below (a sketch of this flow is given after the list).
1. If a local file is specified by `WithLibrary`, just load it.
2. If a backend package has been installed, try to load a library matching the configuration. If no matching file can be found, fall back to 3.
3. Search the default native library cache directory first. If no matching file can be found, try to download it.
4. If there is still no matching file, throw an exception.
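A rough sketch of how this priority chain could look is shown below; everything here (the method, its parameters, and the way the download step is injected) is hypothetical and only illustrates the control flow, not LLamaSharp's actual loader:

```csharp
using System;
using System.IO;

// Illustrative sketch of the proposed loading priorities.
static string ResolveNativeLibrary(string? specifiedPath, string? backendPath,
                                   string cachePath, Func<string> download)
{
    // 1. A file explicitly specified via `WithLibrary` always wins.
    if (specifiedPath is not null)
        return specifiedPath;

    // 2. If a backend package is installed and contains a matching file, use it.
    if (backendPath is not null && File.Exists(backendPath))
        return backendPath;

    // 3. Otherwise look in the cache directory, downloading the selected file if necessary.
    if (File.Exists(cachePath))
        return cachePath;
    var downloaded = download();
    if (File.Exists(downloaded))
        return downloaded;

    // 4. Nothing matched: fail with a clear error.
    throw new DllNotFoundException("No native library matched the current configuration.");
}
```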
Directory structure
We will cache the files in a default directory (perhaps `~/.llama-sharp`) or in one specified by the user. In this directory, we will create subdirectories named by version, which contain the downloaded files.
This gives two possible directory structures, listed below.
The first option flattens all the files:
Root
|------v0.11.2
|       |------llama-cuda11-win-x64.dll
|       |------libllama-avx512-linux-x64.so
|------v0.12.0
|       |------llama-cuda12-win-x64.dll
|       |------libllama-metal-osx-x64.dylib
The second option keeps the current structure:
Root
|------v0.11.2
|       |------cuda11
|       |       |------llama.dll
|       |       |------libllama.so
|       |------cpu
|       |       |------llama.dll
|       |       |------libllama.so
|       |       |------libllama.dylib
|------v0.12.0
|       ... ...
I'm open on this point and will leave the decision until the end, depending on the discussion.
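To make the flattened layout (the first option) concrete, a cached file could be located roughly as in the sketch below; the `~/.llama-sharp` directory name and the file-naming scheme are taken from the example above and are not final:

```csharp
using System;
using System.IO;

// Hedged sketch: resolve a cached native library path under the flattened layout,
// e.g. ~/.llama-sharp/v0.11.2/llama-cuda11-win-x64.dll
static string GetCachedLibraryPath(string version, string fileName, string? cacheDir = null)
{
    string root = cacheDir ?? Path.Combine(
        Environment.GetFolderPath(Environment.SpecialFolder.UserProfile),
        ".llama-sharp");
    return Path.Combine(root, $"v{version}", fileName);
}
```

The second layout would only change the last `Path.Combine` to include the backend subdirectory (e.g. `cuda11`) and the plain file name.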
How to implement
Downloading files from Huggingface
It will not be implemented in LLamaSharp. I'll create a repo named `HuggingfaceHub`, and I'm already working on it. I'm pretty sure the downloading can be implemented without too many difficulties.
As evidence, llama.cpp already has an example function to download model files from Huggingface. In this proposal the downloading will be more complex, because we are building a library API rather than an example, but I think I can handle it.
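For reference, downloading a single file from Huggingface is essentially one HTTPS request against the well-known `https://huggingface.co/{repo}/resolve/{revision}/{filename}` URL scheme. A minimal sketch is shown below; this is not the actual `HuggingfaceHub` API (which doesn't exist yet), and the repo, revision and file names are placeholders:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal sketch: stream one file from Huggingface to disk.
static async Task DownloadFileAsync(string repo, string revision, string fileName, string destination)
{
    var url = $"https://huggingface.co/{repo}/resolve/{revision}/{fileName}";
    using var client = new HttpClient();
    using var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();
    await using var source = await response.Content.ReadAsStreamAsync();
    await using var target = File.Create(destination);
    await source.CopyToAsync(target);
}
```

A real implementation would also need proxies, timeouts, retries/resume and auth tokens, which is exactly why a dedicated library makes sense.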
After this library is completed, LLamaSharp could depend on it to download files. The reasons why I won't put it in LLamaSharp are:
- It's not strictly necessary for LLM inference, though it makes things more convenient.
- It does not need frequent changes.
- If the Huggingface API changes, the user can install a new version of `HuggingfaceHub` while keeping an old version of LLamaSharp.
Pushing files to Huggingface
I'll do this in our CI. We only need to push files when we publish a new release. I'll add a secret key to the GitHub Actions secrets and use huggingface-cli to push the files.
Advantages
I believe this feature will bring the following advantages:
- Making LLamaSharp easier to use. The backend packages will no longer be necessary, though we'll keep publishing them.
- Offering another option for developers who want to publish an app built with LLamaSharp. The native library files can be downloaded automatically after installation, instead of the developer managing those files themselves.
- Allowing us (the core development team) to introduce more native library files without worrying about the growing package size.
- With the Huggingface download library in place, it will also be easy for us to support downloading models, providing a more convenient experience for users, e.g. `new LLamaWeights("Facebook/LLaMA", "llama2.gguf")`.
Potential risks
- More complex logic for native library loading.
- More work for us to do if we want to add a new native library file.
I would appreciate any suggestions on this proposal!