LLM Server is a Ruby Rack API that hosts the llama.cpp
binary in memory(1) and provides an endpoint for text completion using the configured Language Model (LLM).
(1) The server now provides an interactive configuration key, which defaults to true. I have found that this mode works well with models such as Llama, Open Llama, and Vicuna.
Other models, such as Orca, tend to hallucinate in interactive mode; turning interactive mode off and loading the model on each request works for Orca, especially for the smaller 3b model, which responds very fast.
LLM Server serves as a convenient wrapper for the llama.cpp binary, allowing you to interact with it through a simple API. It exposes a single endpoint that accepts text input and returns the completion generated by the Language Model.
The llama.cpp process is kept in memory to provide a better experience. You can use any Language Model supported by llama.cpp.
See the configuration section below to set up your model.
To use LLM Server, ensure that you have the following components installed:
- Ruby (version 3.2.2 or higher)
- A llama.cpp binary. The llama.cpp repository has instructions to build the binary.
- A Language Model (LLM) compatible with the llama.cpp binary. Hugging Face is a good place to look for a model.
Follow these steps to set up and run the LLM Server:
- Clone the LLM Server repository:
$ git clone https://github.com/your-username/llm-server.git
- Change to the project directory:
$ cd llm-server
- Install the required dependencies:
$ bundle install
- Copy the file config/config.yml.sample to config/config.yml. The sample file is a template to configure your models. See below for more information.
- Start the server:
$ bin/server
This starts the server on the default port (9292). Export a PORT variable before starting the server to use a different port. Puma starts in single mode with one thread to protect the llama.cpp process from parallel inferences, so it queues incoming requests and serves them first in, first out.
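To illustrate why a single thread yields first-in, first-out behavior, here is a minimal Ruby sketch. It is not Puma's implementation; it only mimics the effect with one worker thread draining a queue, so jobs never overlap and finish in arrival order. The names `jobs` and `results` are made up for the illustration.

```ruby
# Illustrative only: one worker thread draining a FIFO queue,
# mimicking how a single-threaded server serializes inferences.
jobs = Queue.new
results = []

worker = Thread.new do
  # Queue#pop blocks until a job is available; :stop ends the loop.
  while (job = jobs.pop) != :stop
    results << "completed #{job}"
  end
end

%w[request-1 request-2 request-3].each { |name| jobs << name }
jobs << :stop
worker.join

puts results
```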
Before looking into the server configuration, remember that you need at least one Large Language Model compatible with llama.cpp.
Place your models inside the ./models folder.
Update the configuration file to fit your model.
current_model: "vic-13b-1.3"
llama_bin: "../llama.cpp/main"
models_path: "./models"
models:
  "orca-3b":
    model: "orca-mini-3b.ggmlv3.q4_0.bin"
    interactive: false
    strip_before: "respuesta: "
    parameters: >
      -n 2048 -c 2048 --top_k 40 --temp 0.1 --repeat_penalty 1.2 -t 6 -ngl 1
    timeout: 90
  "vic-13b-1.3":
    model: "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin"
    suffix: "Asistente:"
    reverse_prompt: "Usuario:"
    parameters: >
      -n 2048 -c 2048 --top_k 10000 --temp 0 --repeat_penalty 1.2 -t 4 -ngl 1
    timeout: 90
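As a rough sketch of how such a file could be read, the Ruby snippet below parses a trimmed copy of the sample and looks up the entry named by current_model. The helper `current_model_settings` is hypothetical and is not taken from the server's code; it only shows how the keys relate to each other.

```ruby
require "yaml"

# Hypothetical helper: returns the settings of the model named by
# current_model, plus the path to its file under models_path.
def current_model_settings(config)
  name = config.fetch("current_model")
  settings = config.fetch("models").fetch(name)
  path = File.join(config.fetch("models_path"), settings.fetch("model"))
  settings.merge("name" => name, "path" => path)
end

# Trimmed copy of the sample configuration, inlined so the sketch is self-contained.
config = YAML.safe_load(<<~CONFIG)
  current_model: "vic-13b-1.3"
  models_path: "./models"
  models:
    "vic-13b-1.3":
      model: "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin"
      suffix: "Asistente:"
      reverse_prompt: "Usuario:"
      parameters: >
        -n 2048 -c 2048 --top_k 10000 --temp 0 --repeat_penalty 1.2 -t 4 -ngl 1
      timeout: 90
CONFIG

selected = current_model_settings(config)
puts selected["path"]
puts selected["parameters"].split.inspect
```

Using `fetch` rather than `[]` makes a misspelled current_model fail immediately with a KeyError instead of surfacing later as a nil.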
The models key allows you to configure one or more models for the server. Note that the server does not use all of them at the same time; it runs only the one named in current_model.
To configure a model, use a unique, helpful name, e.g. open-llama-7b. Then add the following keys:
- model: The file name of the model.
- suffix: String that is appended to the prompt as a suffix. This is required for interactive mode.
- reverse_prompt: Halts generation at this string and returns control in interactive mode. This is required for interactive mode.
- interactive: Tells the server how to load the model. When true, the model is loaded in interactive mode and kept in memory. When false, the model is loaded on each request, which works fine for small models. By default, this value is true.
- strip_before: When running a model in non-interactive mode, you can use this to strip any unwanted text from the response.
- parameters: The parameters passed to the llama.cpp process to load and run your model. It is important that the model runs in interactive mode to take advantage of staying in memory all the time. See the llama.cpp documentation to learn what other parameters to pass to the process.
- timeout: How long, in seconds, the server waits for the model to produce a response before it assumes that the model didn't respond.
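The rules for these keys can be summarized in a small check. The sketch below is a hypothetical validator, not part of the server: it applies the documented default for interactive and reports a missing suffix or reverse_prompt when a model runs in interactive mode.

```ruby
# Hypothetical validator: applies the documented default for `interactive`
# and checks that interactive models define `suffix` and `reverse_prompt`.
def validate_model(name, settings)
  settings = { "interactive" => true }.merge(settings)
  errors = []
  errors << "#{name}: missing model file name" unless settings["model"]
  if settings["interactive"]
    %w[suffix reverse_prompt].each do |key|
      errors << "#{name}: #{key} is required in interactive mode" unless settings[key]
    end
  end
  errors
end

orca_errors = validate_model("orca-3b", { "model" => "orca-mini-3b.ggmlv3.q4_0.bin", "interactive" => false })
vicuna_errors = validate_model("vic-13b-1.3", { "model" => "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin" })

puts orca_errors.inspect
puts vicuna_errors.inspect
```

The second call omits suffix and reverse_prompt on purpose, so it reports both as missing, because interactive defaults to true.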
The first three keys tell the server how to start the Large Language Model process.
- current_model: The key of a model defined under the models key. This is the model the server runs.
- llama_bin: Path to the llama.cpp binary, relative to the server path.
- models_path: Path where the models are saved, relative to the server path.
The API is simple: you send a JSON object as the payload and receive a JSON object as the response. You can include the Accept and Content-Type headers in every request with the value application/json, or you can omit them and the server will assume that value for both.
If your request has a different value for Accept or Content-Type, you will receive a 406 - Not Acceptable status code.
Requesting an endpoint that is not available produces a 404 - Not Found response. If there is trouble with the Large Language Model, you receive a 503 - Service Unavailable status code.
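The header rule can be pictured as a small function. This is a hypothetical sketch and not the server's code: a missing header defaults to application/json and any other value yields 406. Treating `*/*` as acceptable for Accept is an assumption, added because curl sends that value by default.

```ruby
JSON_TYPE = "application/json"

# Hypothetical helper, not the server's code. A missing header defaults to
# application/json. Accepting "*/*" (curl's default Accept) is an assumption.
def negotiation_status(headers)
  accept = headers.fetch("Accept", JSON_TYPE)
  content_type = headers.fetch("Content-Type", JSON_TYPE)
  accept_ok = [JSON_TYPE, "*/*"].include?(accept)
  (accept_ok && content_type == JSON_TYPE) ? 200 : 406
end

puts negotiation_status({})
puts negotiation_status({ "Accept" => "text/html" })
```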
Endpoint: POST /completion
Request Body: The request body should contain a JSON object with the following key:
- prompt: The input text for which completion is requested.
Example request body:
{
  "prompt": "Who created Ruby language?"
}
Response: The response will be a JSON object containing the completion generated by the LLM and the used model.
Example response body:
{
  "model": "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin",
  "response": "The Ruby programming language was created by Yukihiro Matsumoto in the late 1990s. He wanted to create a simple, intuitive and dynamic language that could be used for various purposes such as web development, scripting and data analysis."
}
Here's an example using curl to make a completion request:
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Who created Ruby language?"}' http://localhost:9292/completion
The response will be:
{
  "model": "vicuna-13b-v1.3.0.ggmlv3.q4_0.bin",
  "response": "The Ruby programming language was created by Yukihiro Matsumoto in the late 1990s. He wanted to create a simple, intuitive and dynamic language that could be used for various purposes such as web development, scripting and data analysis."
}
Feel free to modify the request body and experiment with different input texts or to provide a more complex prompt for the model.
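The same request can be made from Ruby with the standard library. The following is a minimal sketch; the helper names `build_completion_request` and `complete` are made up for the example, and the default base URL simply mirrors the curl call. The read timeout of 120 seconds is an assumption chosen to exceed the 90 second model timeout in the sample configuration, since Net::HTTP waits only 60 seconds by default.

```ruby
require "json"
require "net/http"
require "uri"

# Builds the POST /completion request described above.
# The default base URL is an assumption matching the curl example.
def build_completion_request(prompt, base_url: "http://localhost:9292")
  uri = URI.join(base_url, "/completion")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request["Accept"] = "application/json"
  request.body = JSON.generate(prompt: prompt)
  [uri, request]
end

# Sends the request and returns the "response" text, raising on non-200 codes.
def complete(prompt, **options)
  uri, request = build_completion_request(prompt, **options)
  response = Net::HTTP.start(uri.hostname, uri.port, read_timeout: 120) do |http|
    http.request(request)
  end
  raise "LLM Server returned #{response.code}" unless response.code == "200"
  JSON.parse(response.body).fetch("response")
end

# Requires a running server:
# puts complete("Who created Ruby language?")
```

Building the request separately from sending it keeps the payload easy to inspect without a running server.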
There is a gem, llm_client, that you can use to interact with the LLM Server.
Here is an example of how to use the gem.
require "llm_client"

# The call returns a result object that is either a success or a failure.
result = LlmClient.completion("Who is the creator of Ruby language?")
if result.success?
  puts "Completions generated successfully"
  response = result.success
  puts "Status: #{response.status}"
  puts "Body: #{response.body}"
  puts "Headers: #{response.headers}"
  calculated_response = response.body[:response]
  puts "Calculated Response: #{calculated_response}"
else
  puts "Failed to generate completions"
  error = result.failure
  puts "Error: #{error}"
end
Bug reports and pull requests are welcome on GitHub at https://github.com/mariochavez/llm_server. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
The gem is available as open source under the terms of the MIT License.
Everyone interacting in the LLM Server project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the code of conduct.
LLM Server provides a simple way to interact with the llama.cpp
binary and leverage the power of your configured Language Model. You can integrate this server into your applications to facilitate text completion tasks.