Server example with API Rest #1443
Conversation
I feel like this goes pretty directly against the no dependencies rule.
@CRD716 why?
Because this adds extra headers and an external library in the form of cpp-httplib.
So, would it be better if this were a separate project that will never be part of the master repository? Is there no other way? Or should I just close this PR? cuBLAS and CLBlast are third-party libraries.
I personally think this would be better as a separate project, but it's really up to ggerganov whether this is acceptable within the examples or not.
@CRD716 this has already been discussed in #1025 (which this PR supersedes) and we want to have a server example in this repository. What I am not sure about is introducing further dependencies (json11 in this PR), especially since json11 is no longer being developed/maintained. If we really want JSON support, maybe we can use https://github.com/nlohmann/json which is a) maintained and b) a single-file, include-only library?
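For reference (not part of this PR), a minimal sketch of what request handling with nlohmann/json could look like, assuming its single header is vendored as json.hpp; the field names and values are placeholders:

```cpp
// Hypothetical illustration of parsing a request body with nlohmann/json.
// The single header json.hpp is the only file that would need to be vendored.
#include "json.hpp"
#include <iostream>
#include <string>

using json = nlohmann::json;

int main() {
    // Example request body as it might arrive from the HTTP layer.
    std::string body = R"({"message": "Write 3 random words", "temperature": 0.2})";

    json req = json::parse(body);

    // value() falls back to a default when the key is missing.
    std::string message = req.value("message", "");
    float temperature   = req.value("temperature", 0.8f);

    // Building a JSON response is just as direct.
    json res = { {"content", "..."}, {"stop", true} };
    std::cout << message << " (temp " << temperature << ")\n"
              << res.dump() << std::endl;
    return 0;
}
```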
Yes, what @prusnak says. In general, for this kind of example that depends on something extra, make sure to:
Keep in mind that we might decide to remove such examples at any point if the maintenance effort becomes too big and there is nobody to do it. An alternative approach is to make a fork and add a README. Note that the rules in the README refer to third-party dependencies to the core.
I will change it
examples/server/server.cpp
Outdated
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for < 0)\n");
fprintf(stderr, " --memory_f32 use f32 instead of f16 for memory key+value\n");
fprintf(stderr, " --keep number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);
n_keep should be passed to the API, since it determines how any given input is evaluated.
You're right
examples/server/server.cpp
Outdated
auto role = ctx_msg["role"].get<std::string>();
if (role == "system")
{
    llama->params.prompt = ctx_msg["content"].get<std::string>() + "\n\n";
}
else if (role == "user")
{
    llama->params.prompt += llama->user_tag + " " + ctx_msg["content"].get<std::string>() + "\n";
}
else if (role == "assistant")
{
    llama->params.prompt += llama->assistant_tag + " " + ctx_msg["content"].get<std::string>() + "\n";
}
This formatting seems too fixed, with the spaces and newlines. I don't think it can work for Alpaca at all. What happens if the role is not system, user or assistant?
When role is not user, assistant or system, the content is ignored. Regarding the Alpaca model, passing the following options to setting-context should work:
const axios = require('axios');

async function Test() {
    let result = await axios.post("http://127.0.0.1:8080/setting-context", {
        context: [
            { role: "system", content: "Below is an instruction that describes a task. Write a response that appropriately completes the request." },
            { role: "user", content: "Write 3 random words" },
            { role: "assistant", content: "1. Sunshine\n2. Elephant\n3. Symphony" }
        ],
        tags: { user: "### Instruction:", assistant: "### Response:" },
        batch_size: 256,
        temperature: 0.2,
        top_k: 40,
        top_p: 0.9,
        n_predict: 2048,
        threads: 5
    });
    result = await axios.post("http://127.0.0.1:8080/set-message", {
        message: 'How to do a Hello word in C++, step by step'
    });
    if (result.data.can_inference) {
        result = await axios.get("http://127.0.0.1:8080/completion?stream=true", { responseType: 'stream' });
        result.data.on('data', (data) => {
            let dat = JSON.parse(data.toString());
            // token by token completion
            process.stdout.write(dat.content);
            if (dat.stop) {
                console.log("Completed");
            }
        });
    }
}

Test();
Output:
node test.js
1. Include the <iostream> header file
2. Use the "cout" object to output the text "Hello World!"
3. Add a new line character at the end of the output
4. Display the output on the screen
Example code:
'''c++
#include <iostream>
int main() {
cout << "Hello World!" << endl;
return 0;
}
'''
Completed
I mean it will actually format it as ### Instruction: Write 3 random words\n### Response: 1. Sunshine\n2. Elephant\n3. Symphony, when Alpaca is trained on ### Instruction:\nWrite 3 random words\n\n### Response:\n1. Sunshine\n2. Elephant\n3. Symphony. It does work, but that's because LLMs are flexible; it's not optimal.
Don't get me wrong, I am trying to look out for you: if you decide to add this prompt templating into the API, then you will be fixing and adding features to it for a long time as people come up with new models and ideas.
How about this?
const context = [
    { user: "Write 3 random words", assistant: "1. Sunshine\n2. Elephant\n3. Symphony" }
]
const system = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.'
const message = 'How to do a Hello word in C++, step by step'

const prompt = `${system}
${context.map(c => `### Instruction:
${c.user}
### Response:
${c.assistant}`).join('\n')}
### Instruction:
${message}
`

// I think these two calls could be just merged into one endpoint
await axios.post("http://127.0.0.1:8080/setting-context", {
    content: prompt,
    batch_size: 256, // actually these could be command line arguments, since
    threads: 5,      // they are relevant to the server CPU usage
})

let result = await axios.get("http://127.0.0.1:8080/completion?stream=true", {
    responseType: 'stream',
    temperature: 0.2, // sampling parameters are used only for
    top_k: 40,        // generating new text
    top_p: 0.9,
    n_predict: 2048,
    // this is why it's nice to be able to get tokens
    keep: (await axios.post("http://127.0.0.1:8080/tokenize", { content: system })).length,
    // alternative is to pass keep as a string itself, but then you need to check if its tokens match what was given before
    stop: ['###', '\n'], // stop generating when any of these strings is generated,
                         // when streaming you have to make sure that '#' or '##' is not
                         // shown to the user because it could be a part of the stop keyword
                         // so there needs to be a little buffer.
})

let answer = ''
result.data.on('data', (data) => {
    let dat = JSON.parse(data.toString())
    // token by token completion
    process.stdout.write(dat.content)
    answer += dat.content
    if (dat.stop) {
        console.log("Completed")
        // save into chat log for the next question
        context.push({ user: message, assistant: answer })
    }
})
Actually, there are libraries for this and other kinds of prompts, in LangChain for example.
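The stop-keyword buffering mentioned in the comments of the sketch above could be handled on the server side roughly like this; a minimal C++ sketch (the helper name is illustrative, not from this PR) that holds back however many trailing characters could still grow into a stop string:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return the length of the longest suffix of `text` that is a prefix of any
// stop string. That many characters must be held back from the client, since
// the next tokens might complete the stop keyword.
static size_t partial_stop_len(const std::string & text,
                               const std::vector<std::string> & stops) {
    size_t held = 0;
    for (const auto & stop : stops) {
        const size_t max_len = std::min(stop.size(), text.size());
        for (size_t len = max_len; len > 0; --len) {
            if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
                held = std::max(held, len);
                break;
            }
        }
    }
    return held;
}

// Usage idea: after appending each new token to `generated`, stream out only
// generated.substr(sent, generated.size() - sent - partial_stop_len(generated, stops)).
```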
I have a problem with cpp-httplib: it only allows streaming of data with GET requests. I tried to create a single POST endpoint called 'completion' to allow defining options, and it works. However, when it comes to streaming, it doesn't work for some reason.
The idea you propose seems interesting, but there are limitations such as the time it takes to evaluate the prompt, especially if the prompts are very long. It also involves restarting the prompt, and I'm not sure how that could be done.
Currently in this server, setting-context needs to be called only once; set-message and completion need to be called whenever a response is desired, which allows for fast responses. The interaction context is being stored in "embd_inp" in the server instance.
I'm thinking of something like changing the behavior of the API with an option called behavior that has the choices instruction, chat and generation. There could also be an endpoint for performing embeddings and tokenization.
If the client is sending a string that starts with the same text as last time, there is no need to evaluate it again: tokenize the new string, find how many tokens are the same at the start, and set that as n_past. Anyway, at this point with a GPU we can evaluate even 512 tokens in less than 10 seconds (depending on hardware).
For the streaming, maybe the easiest solution is to have no streaming and just return the generated text. If the client wants to show individual tokens appearing to the user, it could just limit n_predict to 1 and call the API in a loop.
I will try it. Good idea. I have to compare the new string tokens with embd_inp?
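A minimal sketch of that comparison, assuming embd_inp holds the tokens that have already been evaluated and new_tokens is the tokenized incoming prompt (names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Count how many leading tokens of the new prompt match what has already been
// evaluated; that count can be reused as n_past so only the tail is evaluated.
static size_t common_prefix_len(const std::vector<int> & embd_inp,
                                const std::vector<int> & new_tokens) {
    size_t n = 0;
    while (n < embd_inp.size() && n < new_tokens.size() &&
           embd_inp[n] == new_tokens[n]) {
        ++n;
    }
    return n;
}

// e.g.:
//   size_t n_past = common_prefix_len(embd_inp, new_tokens);
//   // evaluate only new_tokens[n_past..] and append them to embd_inp
```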
What is this API based on? Does it follow some existing standard or service so that it could be used with other software? Overall, there is a lot of code dealing with prompt generation, when there should really just be two operations: evaluate text, and generate text (with stop keywords). It's not that hard to generate a prompt from a past chat, for example in JavaScript. There could also be a tokenization endpoint so that the client can estimate the need for truncation or summarization. But these are my opinions.
Initially, this server implementation was geared more towards serving as a chatbot-like interaction. It was roughly based on the OpenAI ChatGPT API, although there are unfortunately many limitations, such as the fact that prompt evaluation is a bit slow, so the initial context is constant from the beginning. It is also not possible to reset the context (without reloading the model), or at least I haven't found a way to do so.
examples/server/server.cpp
Outdated
gpt_params params;
params.model = "ggml-model.bin";

std::string hostname = "0.0.0.0";
It may be better to only listen on 127.0.0.1 by default; exposing the server to the network can be dangerous.
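A minimal sketch of that default, assuming a hypothetical --host flag for opting in to wider exposure (the flag name is not from this PR):

```cpp
#include <cstring>
#include <string>

int main(int argc, char ** argv) {
    // Bind to loopback by default so the server is only reachable locally.
    std::string hostname = "127.0.0.1";

    for (int i = 1; i < argc; i++) {
        // Hypothetical flag: the user must explicitly ask to listen on other
        // interfaces, e.g. --host 0.0.0.0
        if (std::strcmp(argv[i], "--host") == 0 && i + 1 < argc) {
            hostname = argv[++i];
        }
    }
    // ... pass `hostname` to the HTTP server's listen() call ...
    (void) hostname;
    return 0;
}
```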
You're right. I will fix it.
If you can simulate the OpenAI REST API, this example will become much more useful (even with some limitations).
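For reference, a request to OpenAI's chat completions endpoint (POST /v1/chat/completions) looks roughly like the following; a sketch built with nlohmann/json as suggested earlier, with placeholder values:

```cpp
#include "json.hpp"
#include <iostream>

using json = nlohmann::json;

int main() {
    // Body of POST /v1/chat/completions as OpenAI clients send it.
    json body = {
        {"model", "gpt-3.5-turbo"},           // would map to the loaded ggml model
        {"messages", json::array({
            { {"role", "system"}, {"content", "You are a helpful assistant."} },
            { {"role", "user"},   {"content", "Write 3 random words"} }
        })},
        {"temperature", 0.2},
        {"max_tokens", 256},                  // analogous to n_predict
        {"stream", true}
    };
    std::cout << body.dump(2) << std::endl;
    return 0;
}
```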
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 212. Check the log or trigger a new build to see more.
inline bool parse_multipart_boundary(const std::string &content_type,
                                     std::string &boundary) {
  auto boundary_keyword = "boundary=";
warning: 'auto boundary_keyword' can be declared as 'const auto *boundary_keyword' [readability-qualified-auto]
- auto boundary_keyword = "boundary=";
+ const auto *boundary_keyword = "boundary=";
auto len = static_cast<size_t>(m.length(1));
bool all_valid_ranges = true;
split(&s[pos], &s[pos + len], ',', [&](const char *b, const char *e) {
  if (!all_valid_ranges) return;
warning: statement should be inside braces [readability-braces-around-statements]
- if (!all_valid_ranges) return;
+ if (!all_valid_ranges) { return; }
bool start_with_case_ignore(const std::string &a,
                            const std::string &b) const {
warning: method 'start_with_case_ignore' can be made static [readability-convert-member-functions-to-static]
- bool start_with_case_ignore(const std::string &a,
-                             const std::string &b) const {
+ static bool start_with_case_ignore(const std::string &a,
+                                    const std::string &b) {
bool start_with(const std::string &a, size_t spos, size_t epos,
                const std::string &b) const {
warning: method 'start_with' can be made static [readability-convert-member-functions-to-static]
- bool start_with(const std::string &a, size_t spos, size_t epos,
-                 const std::string &b) const {
+ static bool start_with(const std::string &a, size_t spos, size_t epos,
+                        const std::string &b) {
inline std::string to_lower(const char *beg, const char *end) {
  std::string out;
  auto it = beg;
warning: 'auto it' can be declared as 'const auto *it' [readability-qualified-auto]
- auto it = beg;
+ const auto *it = beg;
} else if (n <= static_cast<ssize_t>(size)) {
  memcpy(ptr, read_buff_.data(), static_cast<size_t>(n));
  return n;
} else {
  memcpy(ptr, read_buff_.data(), size);
  read_buff_off_ = size;
  read_buff_content_size_ = static_cast<size_t>(n);
  return static_cast<ssize_t>(size);
}
warning: do not use 'else' after 'return' [readability-else-after-return]
- } else if (n <= static_cast<ssize_t>(size)) {
+ }
+ if (n <= static_cast<ssize_t>(size)) {
    memcpy(ptr, read_buff_.data(), static_cast<size_t>(n));
    return n;
  } else {
    memcpy(ptr, read_buff_.data(), size);
    read_buff_off_ = size;
    read_buff_content_size_ = static_cast<size_t>(n);
    return static_cast<ssize_t>(size);
  }
  return *this;
}

inline Server &Server::set_error_handler(Handler handler) {
warning: the parameter 'handler' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]
examples/server/httplib.h:707:
- Server &set_error_handler(Handler handler);
+ Server &set_error_handler(const Handler& handler);
- inline Server &Server::set_error_handler(Handler handler) {
+ inline Server &Server::set_error_handler(const Handler& handler) {
inline bool Server::bind_to_port(const std::string &host, int port,
                                 int socket_flags) {
  if (bind_internal(host, port, socket_flags) < 0) return false;
warning: statement should be inside braces [readability-braces-around-statements]
if (bind_internal(host, port, socket_flags) < 0) return false;
^
this fix will not be applied because it overlaps with another fix
if (bind_internal(host, port, socket_flags) < 0) return false;
return true;
warning: redundant boolean literal in conditional return statement [readability-simplify-boolean-expr]
- if (bind_internal(host, port, socket_flags) < 0) return false;
- return true;
+ return bind_internal(host, port, socket_flags) >= 0;
}
}

inline bool Server::parse_request_line(const char *s, Request &req) {
warning: method 'parse_request_line' can be made static [readability-convert-member-functions-to-static]
examples/server/httplib.h:786:
- bool parse_request_line(const char *s, Request &req);
+ static bool parse_request_line(const char *s, Request &req);
Hi. These days I have been working on adding an API to llama.cpp using cpp-httplib. I think it could help some people implement Llama in their projects easily. I know that there are already several alternatives, such as bindings to Node.js and Python, but with this example I intend to implement it natively in C++.
For now, it can only be compiled with CMake on Windows, Linux, and macOS. All usage and API information can be found in the README.md file inside the examples/server directory. It doesn't require an external dependency.

Edit:
Current available features:
Any suggestions or contributions to this PR are welcome.