Introduction
LLamaSharp uses llama.cpp as a backend and has introduced dynamic native library loading, which allows us to choose which DLL to load at runtime. However, users still need to install the backend packages unless they have exactly one DLL to use. The problem is that, most of the time, a user only needs one DLL, for example the CUDA11 one. Nevertheless, many DLLs have to be included, especially if we support CUDA combined with AVX in the future.
Splitting into backend packages that each contain a single file, as previously discussed in other issues, appears to be a solution. However, if the user has already chosen a specific backend, what is the purpose of our backend selection strategy? Furthermore, this approach may lead to an excessive number of backend packages, causing potential difficulties.
Is it possible to select the native library based on the configuration and system information, download only the selected one, and avoid having too many backend packages? That is the point of this proposal.
Brief Description
My idea is to put all the native library files on HuggingFace, then download the selected one according to the configuration and system information at runtime. That's all!
APIs
The following APIs will be exposed to users for this feature.
// Use along with other strategies such as `WithCuda`.
NativeLibraryConfig NativeLibraryConfig::WithAutoDownload(bool enable = true, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
// Explicitly download the library with filename and version.
void NativeLibraryConfig::DownloadNativeLibrary(string filename, string? version = null, string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
// Explicitly download the library with specified configurations.
void NativeLibraryConfig::DownloadNativeLibrary(bool useCuda, AvxLevel avxLevel, string os = "auto", string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
// Explicitly download the best library (for efficiency) selected by LLamaSharp according to detected system info.
void NativeLibraryConfig::DownloadBestNativeLibrary(string? cacheDir = null, string? endPoint = null, string? proxy = null, int timeout = -1);
P.S. To be honest, I don't think it's ideal to put the download methods in `NativeLibraryConfig`, but I haven't come up with a better idea yet.
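For illustration, here is a minimal usage sketch of the proposed `WithAutoDownload` method, assuming it is chained from the existing `NativeLibraryConfig.Instance` entry point together with the existing `WithCuda` strategy; the parameter values are placeholders rather than recommendations:

```csharp
using LLama.Native;

// Hedged sketch: `WithAutoDownload` is the proposed method; `WithCuda` and the
// `Instance` entry point are part of the existing configuration API.
NativeLibraryConfig.Instance
    .WithCuda()
    .WithAutoDownload(
        enable: true,
        cacheDir: null,     // use the default cache directory
        endPoint: null,     // use the default Huggingface endpoint
        proxy: null,
        timeout: -1);       // no timeout
```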
Behaviors
Priorities
The most important question is: what should the behavior be when this feature is used while a backend package is also installed?
My answer is that we'll follow the priorities below (a sketch of this flow is given after the list).
1. If a local file is specified by `WithLibrary`, just load it.
2. If a backend package has been installed, try to load a library matching the configuration. If no matching file can be found, fall back to 3.
3. Search the default native library cache directory first. If no matching file can be found, try to download it.
4. If there is still no matching file, throw an exception.
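A rough sketch of how this priority chain could look is shown below; everything here (the method, its parameters, and the way the download step is injected) is hypothetical and only illustrates the control flow, not LLamaSharp's actual loader:

```csharp
using System;
using System.IO;

// Illustrative sketch of the proposed loading priorities.
static string ResolveNativeLibrary(string? specifiedPath, string? backendPath,
                                   string cachePath, Func<string> download)
{
    // 1. A file explicitly specified via `WithLibrary` always wins.
    if (specifiedPath is not null)
        return specifiedPath;

    // 2. If a backend package is installed and contains a matching file, use it.
    if (backendPath is not null && File.Exists(backendPath))
        return backendPath;

    // 3. Otherwise look in the cache directory, downloading the selected file if necessary.
    if (File.Exists(cachePath))
        return cachePath;
    var downloaded = download();
    if (File.Exists(downloaded))
        return downloaded;

    // 4. Nothing matched: fail with a clear error.
    throw new DllNotFoundException("No native library matched the current configuration.");
}
```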
Directory structure
We will cache the files in a default directory (perhaps `~/.llama-sharp`) or in one specified by the user. In this directory, we will create subdirectories named by version, which contain the downloaded files.
This gives two possible directory structures, listed below.
The first option flattens all the files:
Root
|------v0.11.2
|       |------llama-cuda11-win-x64.dll
|       |------libllama-avx512-linux-x64.so
|------v0.12.0
|       |------llama-cuda12-win-x64.dll
|       |------libllama-metal-osx-x64.dylib
The second option keeps the current structure:
Root
|------v0.11.2
|       |------cuda11
|       |       |------llama.dll
|       |       |------libllama.so
|       |------cpu
|       |       |------llama.dll
|       |       |------libllama.so
|       |       |------libllama.dylib
|------v0.12.0
|       ... ...
I'm open on this point and will leave the decision until the end, depending on the discussion.
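To make the flattened layout (the first option) concrete, a cached file could be located roughly as in the sketch below; the `~/.llama-sharp` directory name and the file-naming scheme are taken from the example above and are not final:

```csharp
using System;
using System.IO;

// Hedged sketch: resolve a cached native library path under the flattened layout,
// e.g. ~/.llama-sharp/v0.11.2/llama-cuda11-win-x64.dll
static string GetCachedLibraryPath(string version, string fileName, string? cacheDir = null)
{
    string root = cacheDir ?? Path.Combine(
        Environment.GetFolderPath(Environment.SpecialFolder.UserProfile),
        ".llama-sharp");
    return Path.Combine(root, $"v{version}", fileName);
}
```

The second layout would only change the last `Path.Combine` to include the backend subdirectory (e.g. `cuda11`) and the plain file name.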
How to implement
Downloading files from Huggingface
It will not be implemented in LLamaSharp. I'll create a repo named `HuggingfaceHub`, and I'm already working on it. I'm pretty sure the downloading can be implemented without too many difficulties.
As evidence, llama.cpp already has an example function to download model files from Huggingface. In this proposal the downloading will be more complex, because we are building a library API rather than an example, but I think I can handle it.
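For reference, downloading a single file from Huggingface is essentially one HTTPS request against the well-known `https://huggingface.co/{repo}/resolve/{revision}/{filename}` URL scheme. A minimal sketch is shown below; this is not the actual `HuggingfaceHub` API (which doesn't exist yet), and the repo, revision and file names are placeholders:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal sketch: stream one file from Huggingface to disk.
static async Task DownloadFileAsync(string repo, string revision, string fileName, string destination)
{
    var url = $"https://huggingface.co/{repo}/resolve/{revision}/{fileName}";
    using var client = new HttpClient();
    using var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();
    await using var source = await response.Content.ReadAsStreamAsync();
    await using var target = File.Create(destination);
    await source.CopyToAsync(target);
}
```

A real implementation would also need proxies, timeouts, retries/resume and auth tokens, which is exactly why a dedicated library makes sense.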
After this library is completed, LLamaSharp could depend on it to download files. The reasons why I won't put it in LLamaSharp are:
- It's not strictly necessary for LLM inference, though it makes things more convenient.
- It does not need frequent changes.
- If the Huggingface API changes, the user can install a new version of `HuggingfaceHub` while keeping an old version of LLamaSharp.
Pushing files to Huggingface
I'll do this in our CI. We only need to push files when we publish a new release. I'll add a secret key to the GitHub Actions secrets and use huggingface-cli to push the files.
Advantages
I believe this feature will bring the following advantages:
- Making LLamaSharp easier to use. The backend packages will no longer be necessary, though we'll keep publishing them.
- Offering another option for developers who want to publish an app built with LLamaSharp. The native library files can be downloaded automatically after installation, instead of the developer managing those files themselves.
- Allowing us (the core development team) to introduce more native library files without worrying about the growing package size.
- With the Huggingface download library in place, it will also be easy for us to support downloading models, providing a more convenient experience for users, e.g. `new LLamaWeights("Facebook/LLaMA", "llama2.gguf")`.
Potential risks
- More complex logic for native library loading.
- More work for us to do if we want to add a new native library file.
I would appreciate any suggestions on this proposal!