Server example with API Rest #1443
Conversation
I feel like this goes pretty directly against the no dependencies rule.
@CRD716 why?
Because this adds extra headers and an external library in the form of cpp-httplib.
So, would it be better if this were a separate project that will never be part of the master repository? Is there no other way? Or should I just close this PR? cuBLAS and CLBlast are third-party libraries.
I personally think this would be better as a separate project, but it's really up to ggerganov whether this is acceptable within the examples or not.
@CRD716 this has already been discussed in #1025 (which this PR supersedes) and we want to have a server example in this repository. What I am not sure about is introducing further dependencies (json11 in this PR), especially since json11 is no longer being developed/maintained. If we really want JSON support, maybe we can use https://github.com/nlohmann/json which is a) maintained and b) a single-file, include-only library?
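For reference (not part of this PR), a minimal sketch of what request handling with nlohmann/json could look like, assuming its single header is vendored as json.hpp; the field names and values are placeholders:

```cpp
// Hypothetical illustration of parsing a request body with nlohmann/json.
// The single header json.hpp is the only file that would need to be vendored.
#include "json.hpp"
#include <iostream>
#include <string>

using json = nlohmann::json;

int main() {
    // Example request body as it might arrive from the HTTP layer.
    std::string body = R"({"message": "Write 3 random words", "temperature": 0.2})";

    json req = json::parse(body);

    // value() falls back to a default when the key is missing.
    std::string message = req.value("message", "");
    float temperature   = req.value("temperature", 0.8f);

    // Building a JSON response is just as direct.
    json res = { {"content", "..."}, {"stop", true} };
    std::cout << message << " (temp " << temperature << ")\n"
              << res.dump() << std::endl;
    return 0;
}
```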
Yes, what @prusnak says. In general, for this kind of example that depends on something extra, make sure to:
Keep in mind that we might decide to remove such examples at any point if the maintenance effort becomes too big and there is nobody to do it. An alternative approach is to make a fork and add a README. Note that the rules in the README refer to third-party dependencies to the core.
I will change it
examples/server/server.cpp
Outdated
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for < 0)\n");
fprintf(stderr, " --memory_f32 use f32 instead of f16 for memory key+value\n");
fprintf(stderr, " --keep number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);
n_keep should be passed to the API, since it determines how any given input is evaluated.
You're right
examples/server/server.cpp
Outdated
auto role = ctx_msg["role"].get<std::string>();
if (role == "system")
{
    llama->params.prompt = ctx_msg["content"].get<std::string>() + "\n\n";
}
else if (role == "user")
{
    llama->params.prompt += llama->user_tag + " " + ctx_msg["content"].get<std::string>() + "\n";
}
else if (role == "assistant")
{
    llama->params.prompt += llama->assistant_tag + " " + ctx_msg["content"].get<std::string>() + "\n";
}
This formatting seems too fixed, with the spaces and newlines. I don't think it can work for Alpaca at all. What happens if the role is not system, user or assistant?
When role is not user, assistant or system, the content is ignored. Regarding the Alpaca model, passing the following options to setting-context should work:
const axios = require('axios');

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/setting-context", {
        context: [
            { role: "system", content: "Below is an instruction that describes a task. Write a response that appropriately completes the request." },
            { role: "user", content: "Write 3 random words" },
            { role: "assistant", content: "1. Sunshine\n2. Elephant\n3. Symphony" }
        ],
        tags: { user: "### Instruction:", assistant: "### Response:" },
        batch_size: 256,
        temperature: 0.2,
        top_k: 40,
        top_p: 0.9,
        n_predict: 2048,
        threads: 5
    });
    result = await axios.post("http://127.0.0.1:8080/set-message", {
        message: 'How to do a Hello word in C++, step by step'
    });
    if (result.data.can_inference) {
        result = await axios.get("http://127.0.0.1:8080/completion?stream=true", { responseType: 'stream' });
        result.data.on('data', (data) => {
            let dat = JSON.parse(data.toString());
            // token by token completion
            process.stdout.write(dat.content);
            if (dat.stop) {
                console.log("Completed");
            }
        });
    }
}

Test();
Output:
node test.js
1. Include the <iostream> header file
2. Use the "cout" object to output the text "Hello World!"
3. Add a new line character at the end of the output
4. Display the output on the screen
Example code:
'''c++
#include <iostream>
int main() {
cout << "Hello World!" << endl;
return 0;
}
'''
Completed
I mean it will actually format it as ### Instruction: Write 3 random words\n### Response: 1. Sunshine\n2. Elephant\n3. Symphony, when Alpaca is trained on ### Instruction:\nWrite 3 random words\n\n### Response:\n1. Sunshine\n2. Elephant\n3. Symphony. It does work, but that's because LLMs are flexible; it's not optimal.
Don't get me wrong, I am trying to look out for you: if you decide to add this prompt templating into the API, then you will be fixing and adding features to it for a long time as people come up with new models and ideas.
How about this?
const context = [
    { user: "Write 3 random words", assistant: "1. Sunshine\n2. Elephant\n3. Symphony" }
]
const system = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'
const message = 'How to do a Hello word in C++, step by step'

const prompt = `${system}
${context.map(c => `### Instruction:
${c.user}
### Response:
${c.assistant}`).join('\n')}
### Instruction:
${message}
`

// I think these two calls could be just merged into one endpoint
await axios.post("http://127.0.0.1:8080/setting-context", {
    content: prompt,
    batch_size: 256, // actually these could be command line arguments, since
    threads: 5,      // they are relevant to the server CPU usage
})

let result = await axios.get("http://127.0.0.1:8080/completion?stream=true", {
    responseType: 'stream',
    temperature: 0.2, // sampling parameters are used only for
    top_k: 40,        // generating new text
    top_p: 0.9,
    n_predict: 2048,
    // this is why it's nice to be able to get tokens
    keep: (await axios.post("http://127.0.0.1:8080/tokenize", { content: system })).length,
    // alternative is to pass keep as a string itself, but then you need to check if its tokens match what was given before
    stop: ['###', '\n'], // stop generating when any of these strings is generated,
                         // when streaming you have to make sure that '#' or '##' is not
                         // shown to the user because it could be a part of the stop keyword
                         // so there needs to be a little buffer.
})

let answer = ''
result.data.on('data', (data) => {
    let dat = JSON.parse(data.toString())
    // token by token completion
    process.stdout.write(dat.content)
    answer += dat.content
    if (dat.stop) {
        console.log("Completed")
        // save into chat log for the next question
        context.push({ user: message, assistant: answer })
    }
})
Actually, there are libraries for this and other kinds of prompts, in LangChain for example.
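The stop-keyword buffering mentioned in the comments of the sketch above could be handled on the server side roughly like this; a minimal C++ sketch (the helper name is illustrative, not from this PR) that holds back however many trailing characters could still grow into a stop string:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return the length of the longest suffix of `text` that is a prefix of any
// stop string. That many characters must be held back from the client, since
// the next tokens might complete the stop keyword.
static size_t partial_stop_len(const std::string & text,
                               const std::vector<std::string> & stops) {
    size_t held = 0;
    for (const auto & stop : stops) {
        const size_t max_len = std::min(stop.size(), text.size());
        for (size_t len = max_len; len > 0; --len) {
            if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
                held = std::max(held, len);
                break;
            }
        }
    }
    return held;
}

// Usage idea: after appending each new token to `generated`, stream out only
// generated.substr(sent, generated.size() - sent - partial_stop_len(generated, stops)).
```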
I have a problem with cpp-httplib: it only allows streaming of data with GET requests. I tried to create a single POST endpoint called 'completion' to allow defining options, and it works. However, when it comes to streaming, it doesn't work for some reason.
The idea you propose seems interesting, but there are limitations such as the time it takes to evaluate the prompt, especially if the prompts are very long. It also involves restarting the prompt, and I'm not sure how that could be done.
Currently in this server, setting-context needs to be called only once; set-message and completion need to be called whenever a response is desired, which allows for fast responses. The interaction context is being stored in "embd_inp" in the server instance.
I'm thinking of something like changing the behavior of the API with an option called behavior that has the choices instruction, chat and generation. There could also be an endpoint for performing embeddings and tokenization.
If the client is sending a string that starts with the same text as last time, there is no need to evaluate it again: tokenize the new string, find how many tokens are the same at the start, and set that as n_past. Anyway, at this point with a GPU we can evaluate even 512 tokens in less than 10 seconds (depending on hardware).
For the streaming, maybe the easiest solution is to have no streaming and just return the generated text. If the client wants to show individual tokens appearing to the user, it could just limit n_predict to 1 and call the API in a loop.
I will try it. Good idea. I have to compare the new string tokens with embd_inp?
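A minimal sketch of that comparison, assuming embd_inp holds the tokens that have already been evaluated and new_tokens is the tokenized incoming prompt (names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Count how many leading tokens of the new prompt match what has already been
// evaluated; that count can be reused as n_past so only the tail is evaluated.
static size_t common_prefix_len(const std::vector<int> & embd_inp,
                                const std::vector<int> & new_tokens) {
    size_t n = 0;
    while (n < embd_inp.size() && n < new_tokens.size() &&
           embd_inp[n] == new_tokens[n]) {
        ++n;
    }
    return n;
}

// e.g.:
//   size_t n_past = common_prefix_len(embd_inp, new_tokens);
//   // evaluate only new_tokens[n_past..] and append them to embd_inp
```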
What is this API based on? Does it follow some existing standard or service so that it could be used with other software? Overall, there is a lot of code dealing with prompt generation, when there should really just be two operations: evaluate text, and generate text (with stop keywords). It's not that hard to generate a prompt from a past chat, for example in JavaScript. There could also be a tokenization endpoint so that the client can estimate the need for truncation or summarization. But these are my opinions.
Initially, this server implementation was geared more towards serving as a chatbot-like interaction. It was roughly based on the OpenAI ChatGPT API, although there are unfortunately many limitations, such as the fact that prompt evaluation is a bit slow, so the initial context is constant from the beginning. It is also not possible to reset the context (without reloading the model), or at least I haven't found a way to do so.
examples/server/server.cpp
Outdated
gpt_params params;
params.model = "ggml-model.bin";

std::string hostname = "0.0.0.0";
It may be better to only listen on 127.0.0.1 by default; exposing the server to the network can be dangerous.
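A minimal sketch of that default, assuming a hypothetical --host flag for opting in to wider exposure (the flag name is not from this PR):

```cpp
#include <cstring>
#include <string>

int main(int argc, char ** argv) {
    // Bind to loopback by default so the server is only reachable locally.
    std::string hostname = "127.0.0.1";

    for (int i = 1; i < argc; i++) {
        // Hypothetical flag: the user must explicitly ask to listen on other
        // interfaces, e.g. --host 0.0.0.0
        if (std::strcmp(argv[i], "--host") == 0 && i + 1 < argc) {
            hostname = argv[++i];
        }
    }
    // ... pass `hostname` to the HTTP server's listen() call ...
    (void) hostname;
    return 0;
}
```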
You're right. I will fix it.
If you can simulate the OpenAI REST API, this example will become much more useful (even with some limitations).
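For reference, a request to OpenAI's chat completions endpoint (POST /v1/chat/completions) looks roughly like the following; a sketch built with nlohmann/json as suggested earlier, with placeholder values:

```cpp
#include "json.hpp"
#include <iostream>

using json = nlohmann::json;

int main() {
    // Body of POST /v1/chat/completions as OpenAI clients send it.
    json body = {
        {"model", "gpt-3.5-turbo"},           // would map to the loaded ggml model
        {"messages", json::array({
            { {"role", "system"}, {"content", "You are a helpful assistant."} },
            { {"role", "user"},   {"content", "Write 3 random words"} }
        })},
        {"temperature", 0.2},
        {"max_tokens", 256},                  // analogous to n_predict
        {"stream", true}
    };
    std::cout << body.dump(2) << std::endl;
    return 0;
}
```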
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 212. Check the log or trigger a new build to see more.
inline bool parse_multipart_boundary(const std::string &content_type,
                                     std::string &boundary) {
  auto boundary_keyword = "boundary=";
warning: 'auto boundary_keyword' can be declared as 'const auto *boundary_keyword' [readability-qualified-auto]
- auto boundary_keyword = "boundary=";
+ const auto *boundary_keyword = "boundary=";
auto len = static_cast<size_t>(m.length(1));
bool all_valid_ranges = true;
split(&s[pos], &s[pos + len], ',', [&](const char *b, const char *e) {
  if (!all_valid_ranges) return;
warning: statement should be inside braces [readability-braces-around-statements]
- if (!all_valid_ranges) return;
+ if (!all_valid_ranges) { return; }
bool start_with_case_ignore(const std::string &a,
                            const std::string &b) const {
warning: method 'start_with_case_ignore' can be made static [readability-convert-member-functions-to-static]
- bool start_with_case_ignore(const std::string &a,
-                             const std::string &b) const {
+ static bool start_with_case_ignore(const std::string &a,
+                                    const std::string &b) {
bool start_with(const std::string &a, size_t spos, size_t epos,
                const std::string &b) const {
warning: method 'start_with' can be made static [readability-convert-member-functions-to-static]
- bool start_with(const std::string &a, size_t spos, size_t epos,
-                 const std::string &b) const {
+ static bool start_with(const std::string &a, size_t spos, size_t epos,
+                        const std::string &b) {
inline std::string to_lower(const char *beg, const char *end) {
  std::string out;
  auto it = beg;
warning: 'auto it' can be declared as 'const auto *it' [readability-qualified-auto]
- auto it = beg;
+ const auto *it = beg;
} else if (n <= static_cast<ssize_t>(size)) {
  memcpy(ptr, read_buff_.data(), static_cast<size_t>(n));
  return n;
} else {
  memcpy(ptr, read_buff_.data(), size);
  read_buff_off_ = size;
  read_buff_content_size_ = static_cast<size_t>(n);
  return static_cast<ssize_t>(size);
}
warning: do not use 'else' after 'return' [readability-else-after-return]
- } else if (n <= static_cast<ssize_t>(size)) {
+ }
+ if (n <= static_cast<ssize_t>(size)) {
    memcpy(ptr, read_buff_.data(), static_cast<size_t>(n));
    return n;
  } else {
    memcpy(ptr, read_buff_.data(), size);
    read_buff_off_ = size;
    read_buff_content_size_ = static_cast<size_t>(n);
    return static_cast<ssize_t>(size);
  }
  return *this;
}

inline Server &Server::set_error_handler(Handler handler) {
warning: the parameter 'handler' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]
examples/server/httplib.h:707:
- Server &set_error_handler(Handler handler);
+ Server &set_error_handler(const Handler& handler);
- inline Server &Server::set_error_handler(Handler handler) {
+ inline Server &Server::set_error_handler(const Handler& handler) {
inline bool Server::bind_to_port(const std::string &host, int port,
                                 int socket_flags) {
  if (bind_internal(host, port, socket_flags) < 0) return false;
warning: statement should be inside braces [readability-braces-around-statements]
if (bind_internal(host, port, socket_flags) < 0) return false;
^
this fix will not be applied because it overlaps with another fix
if (bind_internal(host, port, socket_flags) < 0) return false;
return true;
warning: redundant boolean literal in conditional return statement [readability-simplify-boolean-expr]
- if (bind_internal(host, port, socket_flags) < 0) return false;
- return true;
+ return bind_internal(host, port, socket_flags) >= 0;
}
}

inline bool Server::parse_request_line(const char *s, Request &req) {
warning: method 'parse_request_line' can be made static [readability-convert-member-functions-to-static]
examples/server/httplib.h:786:
- bool parse_request_line(const char *s, Request &req);
+ static bool parse_request_line(const char *s, Request &req);
Hi. These days I have been working on adding an API to llama.cpp using cpp-httplib. I think it could help some people implement Llama in their projects easily. I know that there are already several alternatives, such as bindings to Node.js and Python, but with this example I intend to implement it natively in C++.
For now, it can only be compiled with CMake on Windows, Linux, and macOS. All usage and API information can be found in the README.md file inside the examples/server directory. It doesn't require an external dependency.

Edit:
Current available features:
Any suggestions or contributions to this PR are welcome.