
Commit f66e519

ngxson authored and mglambda committed
server : fix logprobs, make it OAI-compatible (ggml-org#10783)
* server : fix logprobs, make it openai-compatible
* update docs
* add std::log
* return pre-sampling p
* sort before apply softmax
* add comment
* fix test
* set p for sampled token
* update docs
* add --multi-token-probs
* update docs
* add `post_sampling_probs` option
* update docs [no ci]
* remove --multi-token-probs
* "top_probs" with "post_sampling_probs"
* resolve review comments
* rename struct token_prob to prob_info
* correct comment placement
* fix setting prob for sampled token
1 parent: fdd47b7 · commit: f66e519
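
As a rough illustration of the pipeline the commit message sketches (sort candidates, apply softmax, take `std::log` of the pre-sampling probability), here is a minimal Python analogue; the function and its shapes are illustrative only, not the server's actual C++ implementation:

```python
import math

def top_logprobs(logits: list[float], n_probs: int) -> list[dict]:
    # Sort token ids by logit, highest first ("sort before apply softmax").
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    # Softmax over the full distribution (stabilized by subtracting the max).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Pre-sampling probability of each candidate, plus its natural log.
    return [
        {"id": i, "prob": exps[i] / total, "logprob": math.log(exps[i] / total)}
        for i in order[:n_probs]
    ]

# Toy example: 4-token vocabulary, report the top 2 candidates.
print(top_logprobs([2.0, 1.0, 0.5, -1.0], n_probs=2))
```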

File tree: 6 files changed (+396, −107 lines)


examples/server/README.md (+56, −22)
````diff
@@ -343,6 +343,10 @@ node index.js
 
 ### POST `/completion`: Given a `prompt`, it returns the predicted completion.
 
+> [!IMPORTANT]
+>
+> This endpoint is **not** OAI-compatible
+
 *Options:*
 
 `prompt`: Provide the prompt for this completion as a string or as an array of strings or numbers representing tokens. Internally, if `cache_prompt` is `true`, the prompt is compared to the previous completion and only the "unseen" suffix is evaluated. A `BOS` token is inserted at the start, if all of the following conditions are true:
````
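
As a usage sketch for the endpoint documented above: a minimal non-streaming request against a local `llama-server`, assuming the default `localhost:8080` address (the URL and prompt text are illustrative):

```python
import json
import urllib.request

# Assumption: a llama-server instance is listening on localhost:8080.
payload = {
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 16,
    "n_probs": 3,  # also return top-3 candidate tokens per generated token
}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
print(result["content"])
```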
````diff
@@ -444,38 +448,68 @@ These words will not be included in the completion, so make sure to add them to
 
 `timings_per_token`: Include prompt processing and text generation speed information in each response. Default: `false`
 
+`post_sampling_probs`: Returns the probabilities of top `n_probs` tokens after applying sampling chain.
+
 **Response format**
 
 - Note: In streaming mode (`stream`), only `content`, `tokens` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.
 
-- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:
-
-  ```json
-  {
-    "content": "<the token generated by the model>",
-    "tokens": [ generated token ids if requested ],
-    "probs": [
-      {
-        "prob": float,
-        "tok_str": "<most likely token>"
-      },
-      {
-        "prob": float,
-        "tok_str": "<second most likely token>"
-      },
+- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has a nested array `top_logprobs`. It contains at **maximum** `n_probs` elements:
+  ```json
+  {
+    "content": "<the generated completion text>",
+    "tokens": [ generated token ids if requested ],
       ...
-    ]
-  },
-  ```
-
-  Notice that each `probs` is an array of length `n_probs`.
+    "probs": [
+      {
+        "id": <token id>,
+        "logprob": float,
+        "token": "<most likely token>",
+        "bytes": [int, int, ...],
+        "top_logprobs": [
+          {
+            "id": <token id>,
+            "logprob": float,
+            "token": "<token text>",
+            "bytes": [int, int, ...],
+          },
+          {
+            "id": <token id>,
+            "logprob": float,
+            "token": "<token text>",
+            "bytes": [int, int, ...],
+          },
+          ...
+        ]
+      },
+      {
+        "id": <token id>,
+        "logprob": float,
+        "token": "<most likely token>",
+        "bytes": [int, int, ...],
+        "top_logprobs": [
+          ...
+        ]
+      },
+      ...
+    ]
+  },
+  ```
+  Please note that if `post_sampling_probs` is set to `true`:
+  - `logprob` will be replaced with `prob`, with the value between 0.0 and 1.0
+  - `top_logprobs` will be replaced with `top_probs`. Each element contains:
+    - `id`: token ID
+    - `token`: token in string
+    - `bytes`: token in bytes
+    - `prob`: token probability, with the value between 0.0 and 1.0
+  - Number of elements in `top_probs` may be less than `n_probs`
 
 - `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
 - `tokens`: Same as `content` but represented as raw token ids. Only populated if `"return_tokens": true` or `"stream": true` in the request.
 - `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
 - `generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`. These options may differ from the original ones in some way (e.g. bad values filtered out, strings converted to tokens, etc.).
-- `model`: The path to the model loaded with `-m`
-- `prompt`: The provided `prompt`
+- `model`: The model alias (for model path, please use `/props` endpoint)
+- `prompt`: The processed `prompt` (special tokens may be added)
 - `stop_type`: Indicating whether the completion has stopped. Possible values are:
   - `none`: Generating (not stopped)
   - `eos`: Stopped because it encountered the EOS token
````
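
To make the new response shape concrete, here is a short Python sketch that walks `completion_probabilities` from a parsed non-streaming response such as the `result` from the request sketch earlier; it assumes the default `post_sampling_probs: false`, so values are natural-log probabilities:

```python
import math

def print_token_probs(result: dict) -> None:
    """Walk completion_probabilities from a parsed /completion response
    (requested with "n_probs" > 0 and default post_sampling_probs=false)."""
    for item in result.get("completion_probabilities", []):
        # Each item describes one sampled token; logprob is a natural log.
        p_sampled = math.exp(item["logprob"])
        alternatives = ", ".join(
            f"{alt['token']!r}={math.exp(alt['logprob']):.3f}"
            for alt in item["top_logprobs"]
        )
        print(f"{item['token']!r} p={p_sampled:.3f} top: {alternatives}")
```

With `post_sampling_probs: true`, the same loop would read `prob` and `top_probs` instead; those values are already plain probabilities in [0.0, 1.0] taken after the sampling chain, and `top_probs` may hold fewer than `n_probs` entries.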
