
v0.1.4

@c0sogi c0sogi released this 17 Aug 03:39
· 63 commits to master since this release
023fb40

πŸš€ This release introduces a series of improvements aimed at enhancing user experience and refining the codebase. Here's a breakdown of the changes:


🌟 1. Exllama Module - LoRA Integration

  • Place adapter_config.json and adapter_model.bin in the ./models/gptq/YOUR_MODEL directory and the server will initialize the LoRA automatically; a sketch of the detection check follows.
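The detection step presumably amounts to checking for both adapter files next to the model weights. A minimal sketch of that check, where find_lora is a hypothetical name rather than the project's actual API:

```python
from pathlib import Path
from typing import Optional

def find_lora(model_name: str) -> Optional[Path]:
    """Return the model directory if both LoRA adapter files are present.

    Hypothetical helper for illustration; the project's real loader
    may use different names and paths.
    """
    model_dir = Path("./models/gptq") / model_name
    has_config = (model_dir / "adapter_config.json").exists()
    has_weights = (model_dir / "adapter_model.bin").exists()
    # Only initialize LoRA when both files are present.
    return model_dir if has_config and has_weights else None
```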

πŸ”— 2. OpenAI Logit Bias Support

  • For API requests targeting a model listed in the openai_replacement_models dictionary, logit_bias token IDs are now converted automatically from OpenAI IDs to Llama IDs via the Tiktoken tokenizer; see the sketch below.
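The conversion can be pictured as decoding each OpenAI token ID back to text with tiktoken and re-encoding it with the Llama tokenizer. A minimal sketch, where llama_encode stands in for the model's own tokenizer (the project's actual conversion may differ in detail):

```python
from typing import Callable

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI chat models

def convert_logit_bias(
    openai_bias: dict[str, float],
    llama_encode: Callable[[str], list[int]],
) -> dict[int, float]:
    """Translate an OpenAI-style logit_bias dict onto Llama token IDs."""
    llama_bias: dict[int, float] = {}
    for token_id, bias in openai_bias.items():
        text = enc.decode([int(token_id)])   # OpenAI token ID -> text
        for llama_id in llama_encode(text):  # text -> Llama token ID(s)
            llama_bias[llama_id] = bias
    return llama_bias
```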

βš– 3. Optimized Worker Load Balancing

  • The load-balancing algorithm for workers in the process pool has been reworked: each incoming client is assigned to the worker with the best computed worker_rank, and ties between ranks are broken by selecting one of the tied workers at random (sketched below).
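The selection logic is easy to picture. A minimal sketch, assuming lower worker_rank means less load (the real ranking formula lives in the project's process-pool code):

```python
import random
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    worker_rank: int  # lower = less loaded (assumed convention)

def pick_worker(workers: list[Worker]) -> Worker:
    """Pick the best-ranked worker, breaking ties at random."""
    best = min(w.worker_rank for w in workers)
    return random.choice([w for w in workers if w.worker_rank == best])

# A tie between w1 and w3 is resolved by random choice:
pool = [Worker("w1", 0), Worker("w2", 2), Worker("w3", 0)]
print(pick_worker(pool).name)
```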

πŸ“œ 4. Enhanced Logging Mechanism

  • Log messages are now clearer. In addition, the user prompts and generated responses from Chat Completion and Text Completion requests are recorded in logs/chat.log; a sketch of such a file logger follows.
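A file logger along these lines would produce that behavior; this is a sketch only, and the project's actual logger and format may differ:

```python
import logging
from pathlib import Path

Path("logs").mkdir(exist_ok=True)

chat_logger = logging.getLogger("chat")
handler = logging.FileHandler("logs/chat.log", encoding="utf-8")
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
chat_logger.addHandler(handler)
chat_logger.setLevel(logging.INFO)

# After serving a completion, record both sides of the exchange:
user_prompt = "Hello!"                   # sample values for illustration
generated_text = "Hi, how can I help?"
chat_logger.info("prompt=%r response=%r", user_prompt, generated_text)
```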

πŸ”₯ 5. Docker Image Upgrades

  • The previous Docker image relied on the CPU-only build of llama.cpp and therefore could not use CUDA acceleration. Since the CUDA compiler cannot be run during the image build itself, the library is now compiled just-in-time (JIT) at container startup; see the sketch below.
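The JIT idea amounts to checking for the compiled library at startup and building it on the spot if it is missing. A minimal sketch, assuming a vendor/llama.cpp submodule and the LLAMA_CUBLAS CMake flag (the actual paths and flags are defined by the project's build scripts):

```python
import subprocess
from pathlib import Path

LIB = Path("llama_cpp/libllama.so")  # assumed install location
SRC = "vendor/llama.cpp"             # assumed submodule path

def ensure_llama_lib() -> None:
    """Compile llama.cpp with CUDA support on first start if needed."""
    if LIB.exists():
        return  # already built during a previous start
    subprocess.run(["cmake", "-B", "build", "-DLLAMA_CUBLAS=ON"],
                   check=True, cwd=SRC)
    subprocess.run(["cmake", "--build", "build", "--config", "Release"],
                   check=True, cwd=SRC)
```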