
Development Manual

William edited this page Jun 16, 2024 · 8 revisions

Quick Start

Create your own project and import the llama-java-core framework using Maven or Gradle.

For the latest version, check the GitHub Releases page or search the Maven repository.

Maven

<dependency>
    <groupId>chat.octet</groupId>
    <artifactId>llama-java-core</artifactId>
    <version>LAST_RELEASE_VERSION</version>
</dependency>

Gradle

implementation group: 'chat.octet', name: 'llama-java-core', version: 'LAST_RELEASE_VERSION'

Examples

  • Chat Console Example

A simple chat console example is provided here.

// Framework classes (Model, ModelParameter, GenerateParameter) come from the
// chat.octet.model packages; see the project Javadoc for the exact import paths.
import org.apache.commons.lang3.StringUtils;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ConsoleExample {
    private static final String MODEL_PATH = "/octet-chat/models/llama2/ggml-model-7b-q6_k.gguf";

    public static void main(String[] args) {
        ModelParameter modelParams = ModelParameter.builder()
                .modelPath(MODEL_PATH)
                .threads(6)
                .contextSize(4096)
                .verbose(true)
                .build();

        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
             Model model = new Model(modelParams)) {

            GenerateParameter generateParams = GenerateParameter.builder().build();
            String system = "Answer the questions.";

            while (true) {
                System.out.print("\nQuestion: ");
                String input = bufferedReader.readLine();
                if (StringUtils.trimToEmpty(input).equalsIgnoreCase("exit")) {
                    break;
                }
                model.chat(generateParams, system, input).output();
                System.out.print("\n");
                model.metrics();
            }
        } catch (Exception e) {
            System.err.println("Error: " + e);
            System.exit(1);
        }
    }
}
  • Continuous Chat Example
public class ContinuousChatExample {

    private static final String MODEL_PATH = "/octet-chat/models/llama2/ggml-model-7b-q6_k.gguf";

    public static void main(String[] args) {
        String system = "You are a helpful assistant. ";
        String[] questions = new String[]{
                "List five emojis about food and explain their meanings",
                "Write a fun story based on the third emoji",
                "Continue this story and refine it",
                "Summarize a title for this story, extract five keywords, and the keywords should not exceed five words",
                "Mark the characters, time, and location of this story",
                "Great, translate this story into Chinese",
                "Who are you and why are you here?",
                "Summarize today's conversation"
        };

        GenerateParameter generateParams = GenerateParameter.builder()
                .verbosePrompt(true)
                .user("William")
                .build();

        try (Model model = new Model(MODEL_PATH)) {
            for (String question : questions) {
                //Example: Continuous chat example
                model.chat(generateParams, system, question).output();
                System.out.println("\n");
                model.metrics();
            }
        }
    }
}

Tip

More examples: chat.octet.examples.*

Inference Components

  • LogitsProcessor
  • StoppingCriteria

LogitsProcessor and StoppingCriteria can be used to customize control over the model inference process.

Note: if you need to perform matrix calculations in Java, use openblas.

  • chat.octet.model.components.processor.LogitsProcessor

Customize a processor to adjust the token probability distribution and control the generated results of model inference. Here is an example: NoBadWordsLogitsProcessor.java

LogitBias logitBias = new LogitBias();
logitBias.put(5546, "false");
logitBias.put(12113, "5.89");
LogitsProcessorList logitsProcessorList = new LogitsProcessorList().add(new CustomBiasLogitsProcessor(logitBias, model.getVocabSize()));

GenerateParameter generateParams = GenerateParameter.builder()
        .logitsProcessorList(logitsProcessorList)
        .build();
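Independent of the library API, the core idea of a bias-style logits processor can be sketched in plain Java: add an offset to selected token ids in the logits array before sampling, where a bias of Float.NEGATIVE_INFINITY effectively bans a token. The class and method names below are illustrative only; they are not part of llama-java-core.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of what a logits processor conceptually does.
// NOT the llama-java-core API; all names here are hypothetical.
public class LogitBiasSketch {

    // Add a fixed bias to selected token ids in a copy of the logits.
    static float[] applyBias(float[] logits, Map<Integer, Float> bias) {
        float[] adjusted = logits.clone();
        for (Map.Entry<Integer, Float> e : bias.entrySet()) {
            int tokenId = e.getKey();
            if (tokenId >= 0 && tokenId < adjusted.length) {
                adjusted[tokenId] += e.getValue();
            }
        }
        return adjusted;
    }

    public static void main(String[] args) {
        float[] logits = {0.1f, 2.0f, -1.0f};
        Map<Integer, Float> bias = new HashMap<>();
        bias.put(1, Float.NEGATIVE_INFINITY); // ban token 1
        bias.put(2, 5.89f);                   // boost token 2
        float[] out = LogitBiasSketch.applyBias(logits, bias);
        System.out.println(out[1] == Float.NEGATIVE_INFINITY);
        System.out.println(Math.abs(out[2] - 4.89f) < 1e-6);
    }
}
```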
  • chat.octet.model.components.criteria.StoppingCriteria

Customize a criteria class to implement stop-rule control over model inference, for example to enforce a maximum generation timeout. Here is an example: MaxTimeCriteria

long maxTime = TimeUnit.MINUTES.toMillis(Optional.ofNullable(params.getTimeout()).orElse(10L));
StoppingCriteriaList stopCriteriaList = new StoppingCriteriaList().add(new MaxTimeCriteria(maxTime));

GenerateParameter generateParams = GenerateParameter.builder()
        .stoppingCriteriaList(stopCriteriaList)
        .build();
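The stopping-rule idea can likewise be sketched without the library: a criterion is a predicate checked on every generation step, and generation halts once it returns true. The names below are illustrative only, not the llama-java-core interface.

```java
// Illustrative sketch of a max-time stopping rule; NOT the llama-java-core API.
public class MaxTimeSketch {

    static final class MaxTimeCriterion {
        private final long deadlineMillis;

        MaxTimeCriterion(long maxMillis) {
            this.deadlineMillis = System.currentTimeMillis() + maxMillis;
        }

        // Returns true once generation should stop.
        boolean shouldStop() {
            return System.currentTimeMillis() >= deadlineMillis;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MaxTimeCriterion criterion = new MaxTimeCriterion(50);
        System.out.println(criterion.shouldStop()); // false right after creation
        Thread.sleep(80);
        System.out.println(criterion.shouldStop()); // true once the deadline passed
    }
}
```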

For the complete documentation, see the Java docs.

Model Quantization

Download the original model files

Search huggingface for open-source models. Llama2 and GPT models as well as other open-source models are supported, e.g. Baichuan 7B and Qwen 7B.

Convert to a GGUF-format model

# Clone the llama.cpp project
git clone https://github.com/ggerganov/llama.cpp.git

# Install the Python dependencies
cd llama.cpp
pip3 install -r requirements.txt

# Convert the model to GGUF format
python3 convert.py SOURCE_MODEL_PATH --outfile OUTPUT_MODEL_PATH/model-f16.gguf
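Note: in more recent llama.cpp revisions the generic convert.py script has been removed in favor of convert_hf_to_gguf.py. If convert.py is missing from your checkout, the analogous invocation is:

```shell
# Newer llama.cpp checkouts ship convert_hf_to_gguf.py instead of convert.py
python3 convert_hf_to_gguf.py SOURCE_MODEL_PATH --outfile OUTPUT_MODEL_PATH/model-f16.gguf
```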

Quantize the model

Use LlamaService.llamaModelQuantize to quantize the model, and set ModelFileType.LLAMA_FTYPE_MOSTLY_Q8_0 to adjust the quantization precision.

public class QuantizeExample {
    public static void main(String[] args) {
        int status = LlamaService.llamaModelQuantize("OUTPUT_MODEL_PATH/model-f16.gguf",
                "OUTPUT_MODEL_PATH/model.gguf",
                ModelFileType.LLAMA_FTYPE_MOSTLY_Q8_0
        );
        System.out.println("Quantize status: " + status);
    }
}

Alternatively, quantize with the quantize tool:

quantize OUTPUT_MODEL_PATH/model-f16.gguf OUTPUT_MODEL_PATH/model.gguf q8_0

How to Build

Tip

Prebuilt native libraries for each supported platform are already included and can be used directly.

To recompile, use llama-java-core/build.sh:

  • On Linux
linux-x86-64/libllamajava.so
  • On macOS
darwin-x86-64/default.metallib
darwin-x86-64/libllamajava.dylib
darwin-aarch64/default.metallib
darwin-aarch64/libllamajava.dylib
  • On Windows
win32-x86-64/llamajava.dll

To load an external native library file, use the following JVM parameter.

-Doctet.llama.lib=<YOUR_LIB_PATH>
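For example, when launching an application (the jar name is illustrative):

```shell
# Point the JVM at your native llamajava library location
java -Doctet.llama.lib=<YOUR_LIB_PATH> -jar your-app.jar
```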