
[Proposal] Backend-free support #670

@AsakusaRinne

Description


Introduction

LLamaSharp uses llama.cpp as a backend and has introduced dynamic native library loading, which allows us to choose which DLL to load at runtime. However, users still need to install a backend package unless they have exactly one DLL to use. The problem is that most of the time a user only needs one DLL, for example the CUDA11 one, yet many DLLs have to be included, especially if we support CUDA with AVX in the future.

Splitting into backend packages that each contain a single file, as previously discussed in other issues, appears to be a solution. However, if the user has already chosen a specific backend, what is the purpose of our backend selection strategy? Furthermore, this approach may lead to an excessive number of backend packages, causing potential difficulties.

Is it possible to select the native library based on the configuration and system information, and download only the selected one, without having too many backend packages? That is the point of this proposal.

Brief Description

My idea is to put all the native library files on HuggingFace, then download the selected one according to the configuration and system information at runtime. That's all!

APIs

The following APIs will be exposed for users to use this feature.

// Use along with other strategies such as `WithCuda`.
NativeLibraryConfig NativeLibraryConfig::WithAutoDownload(bool enable = true, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

// Explicitly download the library with filename and version.
void NativeLibraryConfig::DownloadNativeLibrary(string filename, string? version = null, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

// Explicitly download the library with specified configurations.
void NativeLibraryConfig::DownloadNativeLibrary(bool useCuda, AvxLevel avxLevel, string os = "auto", string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

// Explicitly download the best library (for efficiency) selected by LLamaSharp according to detected system info.
void NativeLibraryConfig::DownloadBestNativeLibrary(string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);

P.S. To be honest, I don't think NativeLibraryConfig is the right place for the download methods, but I haven't come up with a better idea yet.
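
For illustration, a minimal usage sketch of the proposed API (assuming the configuration is reached through a singleton such as NativeLibraryConfig.Instance, as in the current dynamic-loading API; WithCuda is an existing strategy, and the download methods are the proposed ones above):

// Opt in to auto-download alongside the existing selection strategies.
NativeLibraryConfig.Instance
    .WithCuda(true)
    .WithAutoDownload(true, cacheDir: "~/.llama-sharp");

// Or explicitly download the best matching library before any native call is made.
NativeLibraryConfig.Instance.DownloadBestNativeLibrary(cacheDir: "~/.llama-sharp");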

Behaviors

Priorities

The most important question is what the behavior should be when this feature is used while a backend package is also installed.

My answer is that we'll follow the priorities below (a rough sketch of this order follows the list).

  1. If a local file is specified by WithLibrary, just load it.
  2. If a backend has been installed, try to load a library matching the configuration. If no matched file could be found, fall back to 3.
  3. Search the default native library cache directory first. If no matched file could be found, try to download it.
  4. If there's still no matched file, throw an exception.
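
A rough sketch of this resolution order (helper names such as TryLoadFromBackendPackage and the config properties are placeholders for illustration, not real LLamaSharp internals):

using System;
using System.Runtime.InteropServices;

IntPtr ResolveNativeLibrary(NativeLibraryConfig config)
{
    // 1. A file explicitly specified by WithLibrary wins unconditionally.
    if (config.LibraryPath is not null)
        return NativeLibrary.Load(config.LibraryPath);

    // 2. Try an installed backend package with the current configuration.
    if (TryLoadFromBackendPackage(config, out var handle))
        return handle;

    // 3. Search the native library cache directory; download on a miss.
    var cached = SearchCacheDirectory(config);
    if (cached is null && config.AutoDownloadEnabled)
        cached = DownloadNativeLibrary(config);
    if (cached is not null)
        return NativeLibrary.Load(cached);

    // 4. Nothing matched.
    throw new DllNotFoundException("No matched native library could be found or downloaded.");
}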

Directory structure

We will cache the files in a default directory (maybe ~/.llama-sharp) or one specified by the user. In this directory, we will create subdirectories named by version, which contain the downloaded files.

In this way, there are two possible directory structures, listed below.

the first one, flatten all the files

Root
  |------v0.11.2
            |------llama-cuda11-win-x64.dll
            |------libllama-avx512-linux-x64.so
  |------v0.12.0
            |------llama-cuda12-win-x64.dll
            |------libllama-metal-osx-x64.dylib

the second one, keep the current structure

Root
  |------v0.11.2
            |------cuda11
                   |------llama.dll
                   |------libllama.so
            |------cpu
                   |------llama.dll
                   |------libllama.so
                   |------libllama.dylib
  |------v0.12.0
           ... ...

I'm open on this and will leave the decision until later, depending on the discussion.
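
For illustration, composing the cache path of the CUDA11 Windows x64 library under the two layouts could look like this (~/.llama-sharp, the version and the file names are just the examples from the trees above, not a final decision):

using System;
using System.IO;

var root    = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.UserProfile), ".llama-sharp");
var version = "v0.11.2";

// Layout 1: all files flattened, with the variant encoded in the file name.
var flat   = Path.Combine(root, version, "llama-cuda11-win-x64.dll");

// Layout 2: keep the current backend structure, with the variant as a subdirectory.
var nested = Path.Combine(root, version, "cuda11", "llama.dll");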

How to implement

Downloading files from Huggingface

This will not be implemented in LLamaSharp itself. I'll create a repo named HuggingfaceHub, and I'm already working on it. I'm pretty sure the downloading can be implemented without too many difficulties.

As evidence, llama.cpp already has an example function to download model files from Huggingface. In this proposal the downloading will be more complex because we are building a library API rather than an example, but I think I can handle it.
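
For reference, the core of such a download is just an HTTP GET against the resolve endpoint of a HuggingFace repo, so a minimal sketch could look like the snippet below (the repo and file names are whatever we publish; the real HuggingfaceHub library would add proxy, timeout, retry and resumable-download support on top of this):

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static async Task DownloadFromHuggingFaceAsync(string repo, string filename, string destination, string revision = "main")
{
    // https://huggingface.co/{repo}/resolve/{revision}/{filename} serves raw files.
    var url = $"https://huggingface.co/{repo}/resolve/{revision}/{filename}";
    using var client = new HttpClient();
    using var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();
    await using var source = await response.Content.ReadAsStreamAsync();
    await using var target = File.Create(destination);
    await source.CopyToAsync(target);
}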

After this library is complete, we can depend on it in LLamaSharp to download files. The reasons why I won't put it in LLamaSharp are:

  • It's not necessary for LLM itself, though it will make things more convenient.
  • It does not need frequent changes.
  • Once the API from Huggingface changes, the user could install a new version of HuggingfaceHub while keeping an old version of LLamaSharp.

Pushing files to Huggingface

I'll do this in our CI. We only need to push files when we are going to publish a new release. I'll add a secret key to the GitHub Actions secrets and use huggingface-cli to push the files.

Advantages

I believe this feature will bring the following advantages:

  • Making LLamaSharp easier to use. Backend packages will no longer be necessary, though we'll keep publishing them.
  • Offering one more choice for developers who want to publish an app built with LLamaSharp: the native library files can be auto-downloaded after installation, instead of being managed by the developer.
  • Allowing us (the core development team) to introduce more native files without worrying about the increasing package size.
  • Benefiting from the Huggingface downloading library, it will be easy for us to support downloading models as well, providing a more convenient experience for users, such as new LLamaWeights("Facebook/LLaMA", "llama2.gguf").

Potential risks

  • More complex logic for native library loading.
  • More work for us to do if we want to add a new native library file.

I would appreciate any suggestions on this proposal!
