
Releases: c0sogi/llama-api

v0.1.8

04 Sep 02:33
2e8cead

This release includes several hotfixes.

New Features

  • Max Tokens Limit Option: Added the --max-tokens-limit MAX_TOKENS_LIMIT option. You can now adjust the upper limit of max_tokens; if a request exceeds it, a pydantic validation error is raised.

Enhancements

  • Docker Image Update: Removed the PORT environment variable. You can now customize the port by publishing it in the docker run command and passing the --port option to the server.

Bug Fixes

  • CUDA Memory Error: If a CUDA-related error occurs, a MemoryError is raised so that the worker process terminates automatically. A replacement worker process is then spawned.

  • Unix Lifespan Bug: Fixed a bug where the process pool would not close and a deadlock would occur when terminating the FastAPI app in a Unix environment.

  • LangChain Compatibility: Resolved a type conflict that caused a pydantic validation error when using ChatOpenAI in LangChain if the request body contained None. None values are now ignored, as shown in the example below.
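
A minimal sketch of such a request, assuming the server is listening on port 8000 and that gpt-3.5-turbo is mapped to a local model via openai_replacement_models; the null fields are now simply dropped instead of triggering a validation error:

# Fields sent as null (e.g. by LangChain's ChatOpenAI) are now ignored by the server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}],
    "functions": null,
    "logit_bias": null
  }'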


Usage Example for Docker

docker run -d --name my-container -p 8080:8080 my-image --port 8080

Usage Example for Max Tokens Limit Option

python -m main --max-tokens-limit 500
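
If a request asks for more tokens than the configured limit, it fails pydantic validation instead of being processed (FastAPI typically reports this as an HTTP 422 response). A minimal sketch, assuming the server above is listening on port 8000 and gpt-3.5-turbo is a defined replacement model:

# max_tokens exceeds the --max-tokens-limit of 500, so the request is rejected with a validation error
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "prompt": "Hello", "max_tokens": 1000}'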

v0.1.7

03 Sep 04:49
97f08d6

This release introduces the following changes:

  1. Added Instruction Templates: We have added instruction templates to the model definitions. You can now explicitly provide an instruction_template in LlamaCppModel or ExllamaModel, which helps generate more accurate prompts in the chat completion endpoint (see the example after the script below).

  2. Streaming Response Timeout: Streaming responses now time out automatically if the next chunk is not received within 30 seconds.

  3. Semaphore Bug Fix: Fixed a bug where a semaphore would not be properly released after being acquired.

  4. Auto Truncate in Model Definition: If auto_truncate is set to True in the model definition, past prompts will be automatically truncated to fit within the context window, thus preventing errors. The default setting is True.

  5. Automatic RoPE Parameter Adjustment: If rope_freq_base and rope_freq_scale (llama.cpp) or alpha_value and compress_pos_emb (exllama) are not explicitly set, the RoPE frequency and scaling factor are adjusted automatically. This default behavior assumes a llama2-style model trained with a 4096-token context.

  6. Dynamic Model Definition Parsing: Model definitions are primarily configured in model_definitions.py. However, parsing is now also attempted for Python script files in the root directory whose names contain the words 'model' and 'def', and for environment variables whose names contain 'model' and 'def'. This applies to openai_replacement_models as well. For example, you can set environment variables as shown below:

#!/bin/bash

export MODEL_DEFINITIONS='{
  "gptq": {
    "type": "exllama",
    "model_path": "TheBloke/MythoMax-L2-13B-GPTQ",
    "max_total_tokens": 4096
  },
  "ggml": {
    "type": "llama.cpp",
    "model_path": "TheBloke/Airoboros-L2-13B-2.1-GGUF",
    "max_total_tokens": 8192,
    "rope_freq_base": 26000,
    "rope_freq_scale": 0.5,
    "n_gpu_layers": 50
  }
}'

export OPENAI_REPLACEMENT_MODELS='{
  "gpt-3.5-turbo": "ggml",
  "gpt-3.5-turbo-0613": "ggml",
  "gpt-3.5-turbo-16k": "ggml",
  "gpt-3.5-turbo-16k-0613": "ggml",
  "gpt-3.5-turbo-0301": "ggml",
  "gpt-4": "gptq",
  "gpt-4-32k": "gptq",
  "gpt-4-0613": "gptq",
  "gpt-4-32k-0613": "gptq",
  "gpt-4-0301": "gptq"
}'

echo "MODEL_DEFINITIONS: $MODEL_DEFINITIONS"
echo "OPENAI_REPLACEMENT_MODELS: $OPENAI_REPLACEMENT_MODELS"

v0.1.6

27 Aug 03:14
067d7d8

🚀 Overview
This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:


1. Function calling support for llama.cpp

  • Request a function call with the same schema used in the OpenAI API, as shown below.
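
A minimal sketch of such a request, assuming a llama.cpp-backed model named ggml and a server listening on port 8000; the function definition itself is made up for illustration:

# The "functions" and "function_call" fields follow the OpenAI chat completion schema
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml",
    "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
    "functions": [{
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "function_call": "auto"
  }'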

🌐 2. Fixed compatibility with newer versions of llama.cpp

  • The recent llama.cpp update that added GGUF support introduced some API changes; this release adapts to them.

v0.1.5

22 Aug 12:43
61385d4

🚀 Overview
This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:


1. Optimized performance: llama.cpp & exllama

  • Made performance improvements by changing the text generation logic.

🌐 2. Tunnel through Cloudflare

  • Expose this API to the external network using the --tunnel option, as shown below.
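
For example, a minimal sketch using the python -m main entry point shown elsewhere in these notes:

# Start the server and expose it through a Cloudflare tunnel
python -m main --tunnel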

⚙️ 3. CLI args Refinement

  • Moved argparse.ArgumentParser to config.py.

🐞 4. Bugfix: niceness of process

  • Fixed a bug where the niceness of the process couldn't be modified in a docker environment.

🔜 5. Enhancement: required option in function call schema

  • Added the required field to the function call schema. The function call feature itself is not yet implemented. Stay tuned!

v0.1.4

17 Aug 03:39
023fb40

🚀 This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:


🌟 1. Exllama Module - LoRA Integration

  • By placing adapter_config.json and adapter_model.bin in the ./models/gptq/YOUR_MODEL directory, the system will now seamlessly initialize LoRA, as sketched below.
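
A sketch of the expected layout; the source paths are placeholders:

# Place the LoRA adapter files alongside the GPTQ model weights (placeholder paths)
mkdir -p ./models/gptq/YOUR_MODEL
cp /path/to/adapter_config.json ./models/gptq/YOUR_MODEL/
cp /path/to/adapter_model.bin ./models/gptq/YOUR_MODEL/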

🔗 2. OpenAI Logit Bias Support

  • For API queries to models specified in the openai_replacement_models dictionary, OpenAI token IDs are automatically converted to Llama token IDs, courtesy of the tiktoken tokenizer. See the sketch below.
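
A minimal sketch of a request that uses this, assuming gpt-3.5-turbo is one of the replacement models and the server listens on port 8000; the token ID and bias value are arbitrary:

# logit_bias keys are OpenAI (tiktoken) token IDs; they are remapped to Llama token IDs server-side
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}],
    "logit_bias": {"15043": -100}
  }'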

3. Optimized Worker Load Balancing

  • Workers within the process pool have undergone a revamp in their load balancing algorithm. Based on the computed worker_rank, they now allocate clients more efficiently. In scenarios where ranks tie, a random worker is selected.

📜 4. Enhanced Logging Mechanism

  • Expect crisper log messages henceforth. Additionally, both user prompts and response prompts stemming from Chat Completion and Text Completion operations are archived in logs/chat.log.

🔥 5. Docker Image Upgrades

  • The previous Docker image relied on the CPU build of llama.cpp, which cannot use CUDA acceleration. Since the CUDA compiler cannot be used during the image build phase, the shared library is now JIT-compiled automatically at runtime.

v0.1.3

09 Aug 13:40
178fe3e

This release encompasses several enhancements to usability and code refactoring. The primary changes include:

  1. Skip compilation: You can skip compiling the llama.cpp shared library when running the server with --install-pkgs. Just add the --skip-compile option.
  2. Removed auto process kill feature: Killing the process when unloading a model was introduced to prevent memory leaks, but it sometimes made the program exit for no reason, so this feature has been removed.
  3. API key checker: The API key checker is activated if you start the server with the --api-key YOUR_API_KEY option. Clients must then include an Authorization header with Bearer YOUR_API_KEY, as in the example below.
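
A sketch combining these options; the key, model name, and port are placeholders:

# Install dependencies, skip compiling the llama.cpp shared library, and require an API key
python -m main --install-pkgs --skip-compile --api-key YOUR_API_KEY

# Authenticated request against the chat completion endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}'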

v0.1.2

03 Aug 01:32
344ab12

This release encompasses several enhancements to usability and code refactoring. The primary changes include:

  1. Automatic Model Downloader: In our previous implementation, the model_path attribute in model_definitions.py required an actual filename of a model. We have now upgraded this to accept the name of a HuggingFace repository instead. As a result, the specified model is automatically downloaded when needed. For instance, if you define TheBloke/NewHope-GPTQ as the model_path, the necessary files will be downloaded into models/gptq/thebloke_newhope_gptq. This functionality works similarly for GGML.

  2. Simpler Log Messages: We've made our log messages more concise for the Completions, Chat Completions, and Embeddings endpoints. These logs now display elapsed time, token usage, and tokens generated per second.

  3. Improved Responsiveness for Job Cancellation: The Event object in SyncManager now sends an interrupt signal to worker processes. The worker checks the is_interrupted property at the lowest accessible level and tries to cancel the operation.

v0.1.1

31 Jul 05:25

This release incorporates various convenience improvements and code refactoring. The main changes are as follows:

  1. Dependencies are automatically installed. By providing the --install-pkgs option when running the server, not only the packages of this project but also the packages of all related repositories are installed. This process includes identifying the appropriate version of CUDA and installing the corresponding PyTorch, as well as the installation of TensorFlow. Please refer to the README for more details.

  2. The pytest package is no longer required, as tests are run with unittest instead of pytest.

  3. The docker-compose file is configured to fetch the pre-built docker image from Docker Hub.

  4. Poetry dependencies are included in pyproject.toml. However, installing dependencies directly with poetry is not recommended. When the server runs, the toml file is converted to a requirements.txt file, and the necessary packages are installed via pip install.

  5. Semaphores are not strictly necessary, because concurrent use of the model is already limited by the scheduling of the process pool's workers. However, semaphores make it possible to build a queue so the worker scheduler can efficiently reuse the model cached in an existing worker, so the feature has been retained.

v0.1.0

30 Jul 05:36

First release