Releases: c0sogi/llama-api
v0.1.8
This release includes several hotfixes.
New Features
- Max Tokens Limit Option: Added the `--max-tokens-limit MAX_TOKENS_LIMIT` option. You can now adjust the upper limit of max tokens. If exceeded, a pydantic validation error will be triggered.
Enhancements
- Docker Image Update: Removed the `PORT` environment variable. You can now customize the port using the `docker run` command and the `--port` option.
Bug Fixes
- CUDA Memory Error: If a CUDA-related error occurs, a `MemoryError` is raised to automatically terminate the worker process, and a replacement worker process is spawned automatically.
- Unix Lifespan Bug: Fixed a bug where the process pool would not close and a deadlock would occur when terminating the FastAPI app in a Unix environment.
- Langchain Compatibility: Resolved a type conflict that caused a pydantic validation error when using ChatOpenAI in Langchain if the request body contained `None`. `None` values are now ignored (see the sketch below).
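Usage Example for Langchain Compatibility
A minimal, hedged sketch of the Langchain setup mentioned above. It assumes the legacy (pre-1.0) Langchain import paths, a server running locally on port 8000, and a model name mapped via `openai_replacement_models`; all of these are illustrative, not part of the release.
```python
# Hedged sketch: pointing Langchain's ChatOpenAI at a local llama-api server.
from langchain.chat_models import ChatOpenAI  # legacy import path
from langchain.schema import HumanMessage

chat = ChatOpenAI(
    openai_api_base="http://localhost:8000/v1",  # assumed local endpoint
    openai_api_key="dummy",                      # ignored unless --api-key is set
    model="gpt-3.5-turbo",                       # assumed entry in openai_replacement_models
)

# Optional fields left as None in the request body no longer cause a validation error.
print(chat([HumanMessage(content="Hello!")]))
```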
Usage Example for Docker
```bash
docker run -d --name my-container --port 8080:8080 my-image
```
Usage Example for Max Tokens Limit Option
```bash
python -m main --max-tokens-limit 500
```
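For reference, a hedged sketch of a client request that would exceed the limit set above; the endpoint URL and model name are assumptions for illustration.
```python
# Hedged sketch: a request whose max_tokens exceeds the configured limit of 500.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local endpoint
    json={
        "model": "gpt-3.5-turbo",  # assumed model name defined on the server
        "prompt": "Hello",
        "max_tokens": 1000,        # above the 500 limit, so validation should fail
    },
)
print(resp.status_code, resp.json())  # expect a pydantic validation error response
```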
v0.1.7
This release introduces the following changes:
- Added Instruction Templates: We have added instruction templates to the model definition. You can now explicitly provide an `instruction_template` in `LlamaCppModel` or `ExllamaModel`, which helps generate more accurate prompts in the chat completion endpoint. A hedged Python sketch combining the new options appears after the shell example below.
- Streaming Response Timeout: Streaming responses now time out automatically if the next chunk is not received within 30 seconds.
- Semaphore Bug Fix: Fixed a bug where a semaphore would not be properly released after being acquired.
- Auto Truncate in Model Definition: If `auto_truncate` is set to `True` in the model definition, past prompts are automatically truncated to fit within the context window, preventing errors. The default setting is `True`.
- Automatic RoPE Parameter Adjustment: If explicit settings for `rope_freq_base` and `rope_freq_scale` (llama.cpp) or `alpha_value` and `compress_pos_emb` (exllama) are not provided, the RoPE frequency and scaling factor are adjusted automatically. This default behavior is based on the llama2 model with a training token count of 4096.
- Dynamic Model Definition Parsing: Model definitions are primarily configured in `model_definitions.py`. However, parsing is now also attempted for Python script files in the root directory containing the words 'model' and 'def'. Environment variables containing 'model' and 'def' are also parsed automatically. This applies to `openai_replacement_models` as well. For example, you can set environment variables as shown below:
```bash
#!/bin/bash
export MODEL_DEFINITIONS='{
"gptq": {
"type": "exllama",
"model_path": "TheBloke/MythoMax-L2-13B-GPTQ",
"max_total_tokens": 4096
},
"ggml": {
"type": "llama.cpp",
"model_path": "TheBloke/Airoboros-L2-13B-2.1-GGUF",
"max_total_tokens": 8192,
"rope_freq_base": 26000,
"rope_freq_scale": 0.5,
"n_gpu_layers": 50
}
}'
export OPENAI_REPLACEMENT_MODELS='{
"gpt-3.5-turbo": "ggml",
"gpt-3.5-turbo-0613": "ggml",
"gpt-3.5-turbo-16k": "ggml",
"gpt-3.5-turbo-16k-0613": "ggml",
"gpt-3.5-turbo-0301": "ggml",
"gpt-4": "gptq",
"gpt-4-32k": "gptq",
"gpt-4-0613": "gptq",
"gpt-4-32k-0613": "gptq",
"gpt-4-0301": "gptq"
}'
echo "MODEL_DEFINITIONS: $MODEL_DEFINITIONS"
echo "OPENAI_REPLACEMENT_MODELS: $OPENAI_REPLACEMENT_MODELS"
v0.1.6
🚀 Overview
This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:
⚡ 1. Function calling support for llama.cpp
- Request a function call using the same schema as the OpenAI API, as sketched below.
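A hedged sketch of such a request, assuming the legacy (pre-1.0) OpenAI Python client pointed at a local llama-api server; the URL, model name, and function definition are illustrative.
```python
# Hedged sketch: an OpenAI-style function-calling request against a local llama-api server.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed local endpoint
openai.api_key = "dummy"                      # ignored unless --api-key is set

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # assumed to map to a llama.cpp model via openai_replacement_models
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
    functions=[{
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    function_call="auto",
)
print(response["choices"][0]["message"])
```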
🌐 2. Fixed compatibility with newer versions of llama.cpp
- The recent llama.cpp update that added GGUF support introduced some API changes; this release adapts to them.
v0.1.5
🚀 Overview
This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:
⚡ 1. Optimized performance: llama.cpp & exllama
- Made performance improvements by changing the text generation logic.
🌐 2. Tunnel through Cloudflare
- Expose this API to the external network using the `--tunnel` option.
⚙️ 3. CLI args Refinement
- Moved `argparse.ArgumentParser` to `config.py`.
🐞 4. Bugfix: niceness of process
- Fixed a bug where the niceness of the process couldn't be modified in a docker environment.
🔜 5. Enhancement: `required` option in function call schema
- The function call feature is not yet implemented. Stay tuned!
v0.1.4
🚀 This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:
🌟 1. Exllama Module - LoRA Integration
- By placing `adapter_config.json` and `adapter_model.bin` in the `./models/gptq/YOUR_MODEL` directory, the system will now seamlessly initialize LoRA.
🔗 2. OpenAI Logit Bias Support
- For API queries to models specified within the `openai_replacement_models` dictionary, OpenAI token IDs in `logit_bias` are automatically converted to Llama token IDs, courtesy of the Tiktoken tokenizer, as sketched below.
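A hedged sketch, again assuming the legacy (pre-1.0) OpenAI Python client and a locally running server; the token ID and bias value are purely illustrative.
```python
# Hedged sketch: passing OpenAI-tokenizer logit_bias to a replaced model.
# The server converts the OpenAI (tiktoken) token IDs to Llama token IDs internally.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed local endpoint
openai.api_key = "dummy"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",       # assumed entry in openai_replacement_models
    messages=[{"role": "user", "content": "Name a fruit."}],
    logit_bias={"19031": -100},  # illustrative tiktoken token ID, strongly discouraged
)
print(response["choices"][0]["message"]["content"])
```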
⚖ 3. Optimized Worker Load Balancing
- Workers within the process pool have a revamped load-balancing algorithm. Based on the computed `worker_rank`, they now allocate clients more efficiently; when ranks tie, a random worker is selected. A hedged sketch of the selection rule follows.
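A minimal, hedged sketch of the described policy. The release does not specify how `worker_rank` is computed or whether a lower rank means less loaded, so both are assumptions here.
```python
import random
from typing import Dict

def pick_worker(worker_ranks: Dict[int, float]) -> int:
    """Pick the worker with the best (assumed lowest) rank; break ties randomly."""
    best_rank = min(worker_ranks.values())
    candidates = [wid for wid, rank in worker_ranks.items() if rank == best_rank]
    return random.choice(candidates)  # random choice among tied workers

# Workers 0 and 2 tie on the best rank, so one of them is chosen at random.
print(pick_worker({0: 1.0, 1: 3.0, 2: 1.0}))
```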
📜 4. Enhanced Logging Mechanism
- Expect crisper log messages henceforth. Additionally, both user prompts and response prompts from Chat Completion and Text Completion operations are archived in `logs/chat.log`.
🔥 5. Docker Image Upgrades
- The previous Docker image relied on the CPU version of llama.cpp, which cannot use CUDA acceleration. Given the constraints on using the CUDA compiler during the build phase, JIT compilation now ensures the library is compiled automatically.
v0.1.3
This release encompasses several enhancements to usability and code refactoring. The primary changes include:
- Skip compilation: You can skip compiling the llama.cpp shared library when running the server with `--install-pkgs`; just add the `--skip-compile` option.
- Removed auto process kill feature: Killing the process when unloading a model was introduced to prevent memory leaks, but it sometimes made the program exit for no reason, so this feature has been removed.
- API key checker: The API key checker is activated if you start the server with the `--api-key YOUR_API_KEY` option. Clients must include an `Authorization` header with `Bearer YOUR_API_KEY`, as sketched below.
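A hedged client sketch; the endpoint URL, model name, and request body are assumptions for illustration.
```python
# Hedged sketch: calling the server with the required Authorization header.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",           # assumed local endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # must match the --api-key value
    json={"model": "gpt-3.5-turbo", "prompt": "Hello", "max_tokens": 16},
)
print(resp.json())
```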
v0.1.2
This release encompasses several enhancements to usability and code refactoring. The primary changes include:
- Automatic Model Downloader: In our previous implementation, the `model_path` attribute in `model_definitions.py` required an actual filename of a model. It now also accepts the name of a HuggingFace repository, and the specified model is downloaded automatically when needed. For instance, if you define `TheBloke/NewHope-GPTQ` as the `model_path`, the necessary files are downloaded into `models/gptq/thebloke_newhope_gptq`. This functionality works similarly for GGML.
- Simpler Log Messages: Log messages are now more concise when using the Completions, Chat Completions, or Embeddings endpoints. They now primarily display elapsed time, token usage, and tokens generated per second.
- Improved Responsiveness for Job Cancellation: The `Event` object in `SyncManager` now sends an interrupt signal to worker processes. The worker checks the `is_interrupted` property at the lowest accessible level and tries to cancel the operation; a hedged sketch of the pattern follows.
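A minimal, hedged sketch of this interrupt pattern using Python's multiprocessing manager; everything other than the shared `Event` and the early-exit check is illustrative.
```python
# Hedged sketch: an interrupt-aware generation loop driven by a shared manager Event.
import multiprocessing


def generate_tokens(prompt: str, interrupt_event) -> str:
    """Generate tokens, checking for interruption at the lowest-level loop."""
    output = []
    for token in ["Hello", ",", " world"]:  # stand-in for the real generation loop
        if interrupt_event.is_set():        # analogous to the is_interrupted check
            break                           # cancel the operation as early as possible
        output.append(token)
    return "".join(output)


if __name__ == "__main__":
    manager = multiprocessing.Manager()  # a started SyncManager
    interrupt_event = manager.Event()    # shared between the server and worker processes
    print(generate_tokens("Hi", interrupt_event))
    manager.shutdown()
```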
v0.1.1
This release incorporates various convenience improvements and code refactoring. The main changes are as follows:
- Dependencies are installed automatically. By providing the `--install-pkgs` option when running the server, not only the packages of this project but also the packages of all related repositories are installed. This process includes detecting the appropriate CUDA version and installing the corresponding PyTorch, as well as installing TensorFlow. Please refer to the README for more details.
- The need to install the pytest package has been eliminated by using unittest instead of pytest.
- The docker-compose file is configured to fetch the prebuilt docker image from Dockerhub.
- The poetry dependencies are included in pyproject.toml. However, installing dependencies directly with poetry is not recommended: when the server runs, the toml file is converted to a requirements.txt file and the necessary packages are installed via pip install.
- Semaphores are not strictly necessary, because concurrent use of the model is already limited by the scheduling of the process pool's workers. However, using semaphores lets the worker scheduler build a queue that efficiently reuses the model cached in an existing worker, so the feature has been retained.