v0.1.4
This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:
1. Exllama Module - LoRA Integration
- By placing `adapter_config.json` and `adapter_model.bin` in the `./models/gptq/YOUR_MODEL` directory, the system will now seamlessly initialize LoRA (illustrated in the sketch below).
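A minimal sketch of the detection idea, assuming a hypothetical helper `find_lora_adapter`; the file names and directory come from this release note, but the function itself is illustrative and not the project's actual API.

```python
from pathlib import Path
from typing import Optional

def find_lora_adapter(model_dir: str) -> Optional[Path]:
    """Return the model directory if both LoRA adapter files are present.

    Hypothetical illustration: the server could decide to initialize LoRA
    for an Exllama model by checking for these two files.
    """
    model_path = Path(model_dir)
    config = model_path / "adapter_config.json"
    weights = model_path / "adapter_model.bin"
    if config.is_file() and weights.is_file():
        return model_path
    return None

# Example: find_lora_adapter("./models/gptq/YOUR_MODEL")
```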
2. OpenAI Logit Bias Support
- For API queries to models listed in the `openai_replacement_models` dictionary, OpenAI token IDs are automatically converted to Llama token IDs via the Tiktoken tokenizer, as sketched below.
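A minimal sketch of the conversion idea for a `logit_bias` mapping. The Tiktoken usage is standard; `convert_logit_bias` and the `llama_encode` callable are placeholders for illustration, not the project's actual functions.

```python
import tiktoken

def convert_logit_bias(openai_bias: dict, llama_encode, model: str = "gpt-3.5-turbo") -> dict:
    """Map OpenAI token IDs in a logit_bias dict to Llama token IDs.

    `llama_encode` stands in for the target model's own tokenizer
    (e.g. a SentencePiece encode function returning token IDs).
    """
    enc = tiktoken.encoding_for_model(model)
    converted = {}
    for openai_id, bias in openai_bias.items():
        text = enc.decode([int(openai_id)])      # OpenAI token ID -> text
        for llama_id in llama_encode(text):      # text -> Llama token ID(s)
            converted[llama_id] = bias
    return converted
```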
3. Optimized Worker Load Balancing
- The load-balancing algorithm for workers in the process pool has been reworked. Clients are now assigned based on the computed `worker_rank`, which spreads load more efficiently; when ranks tie, a worker is chosen at random (see the sketch below).
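A short sketch of the selection logic under the assumption that `worker_rank` reflects current load (e.g. the number of clients already assigned); the helper name and the exact rank metric are illustrative, not taken from the codebase.

```python
import random

def pick_worker(worker_ranks: dict) -> str:
    """Choose the lowest-ranked (least-loaded) worker; break ties randomly."""
    lowest = min(worker_ranks.values())
    candidates = [w for w, rank in worker_ranks.items() if rank == lowest]
    return random.choice(candidates)

# Example: pick_worker({"worker-0": 2, "worker-1": 1, "worker-2": 1})
# returns "worker-1" or "worker-2" at random.
```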
4. Enhanced Logging Mechanism
- Log messages are now clearer. In addition, both user prompts and model responses from Chat Completion and Text Completion requests are archived in `logs/chat.log`, as in the sketch below.
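A hedged sketch of a dedicated prompt/response logger writing to the `logs/chat.log` path mentioned above; the logger name, message format, and `log_chat` helper are assumptions for illustration only.

```python
import logging
from pathlib import Path

# Dedicated file logger for chat archiving (illustrative setup).
Path("logs").mkdir(exist_ok=True)
chat_logger = logging.getLogger("chat")
chat_logger.setLevel(logging.INFO)
chat_logger.addHandler(logging.FileHandler("logs/chat.log", encoding="utf-8"))

def log_chat(prompt: str, response: str) -> None:
    """Archive one prompt/response pair from a completion request."""
    chat_logger.info("PROMPT: %s | RESPONSE: %s", prompt, response)
```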
5. Docker Image Upgrades
- The previous Docker image shipped the CPU build of llama.cpp, which cannot use CUDA acceleration. Since the CUDA compiler cannot be used during the image build phase, the CUDA build is now compiled just-in-time at runtime instead; a rough sketch of the idea follows.
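A rough sketch of the JIT fallback idea only: the library path and build script named here are assumptions, not the project's actual build tooling.

```python
import shutil
import subprocess
from pathlib import Path

def jit_build_if_needed(lib: Path = Path("llama_cpp/libllama_cuda.so")) -> None:
    """Compile the CUDA llama.cpp library at container start-up if missing."""
    if lib.exists():
        return  # a previous run already compiled it
    if shutil.which("nvcc") is None:
        print("No CUDA compiler found; falling back to the CPU build.")
        return
    # Hypothetical build script; the real project may drive this differently.
    subprocess.run(["bash", "scripts/build_llama_cpp_cuda.sh"], check=True)
```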