Smarter slot handling #5737

Closed
enn-nafnlaus opened this issue Feb 26, 2024 · 11 comments

Comments

@enn-nafnlaus commented Feb 26, 2024

The current system of parallel slots with -np is frustrating in that it forces every query down to a greatly reduced maximum token count. For example, if you have a context length of 16k and you want four slots, each slot gets only 4k, and you can no longer run any 16k queries at all without heavy truncation.

While a partial solution would be to allow the operator to specify the number of tokens in each slot, so that they could at least leave one high-token-count slot, an ideal solution would be to have the server be adaptive - to look at what's in the queue and, using a combination of how long each query has been waiting and how well different queries could be packed into the max context length, determine which to run and how many slots to use of what size.

While I wouldn't be an ideal person to write the slot-handling side of things, I'd be more than happy to write the queueing mechanism for you if this were of interest. I would just need to know what sort of data structure you could provide for the queue and what limitations there would be on slots (including any performance considerations).
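To make the idea concrete, here's the rough shape of the queueing logic I have in mind - purely an illustrative sketch on my end; none of these names reflect anything in llama.cpp:

```python
# Purely a sketch of the proposed scheduler, NOT actual llama.cpp code:
# rank queued queries by a mix of waiting time and how tightly they pack
# into the free context budget (roughly first-fit-decreasing plus an age bonus).
import time
from dataclasses import dataclass, field

@dataclass
class QueuedQuery:
    prompt_tokens: int
    enqueued_at: float = field(default_factory=time.monotonic)

def pick_next_batch(queue, free_ctx_tokens, wait_weight=0.05):
    """Greedily fill the free context, without starving old queries."""
    now = time.monotonic()
    ranked = sorted(
        queue,
        key=lambda q: q.prompt_tokens / free_ctx_tokens       # bigger queries pack first
                      + wait_weight * (now - q.enqueued_at),  # age bonus against starvation
        reverse=True,
    )
    picked, used = [], 0
    for q in ranked:
        if used + q.prompt_tokens <= free_ctx_tokens:
            picked.append(q)
            used += q.prompt_tokens
    return picked

# e.g. pick_next_batch([QueuedQuery(12000), QueuedQuery(3000), QueuedQuery(800)], 16384)
```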

@ggerganov (Member)

> to look at what's in the queue and, using a combination of how long each query has been waiting and how well different queries could be packed into the max context length, determine which to run and how many slots to use of what size.

How would that work? We don't know in advance how long a query will be, because generation can proceed indefinitely.

@enn-nafnlaus (Author) commented Feb 26, 2024

> to look at what's in the queue and, using a combination of how long each query has been waiting and how well different queries could be packed into the max context length, determine which to run and how many slots to use of what size.

> How would that work? We don't know in advance how long a query will be, because generation can proceed indefinitely.

Fully honest here - I didn't even think about that aspect, as in general my use cases involve a high ratio of input context to output context - e.g. "Here's a big block of stuff, perform some operation on it and get some useful result."

That said, there are numerous possible solutions, including:

  • Allowing the server administrator to specify minimum sequence lengths on slots (via server command-line parameters)
  • Allowing the server administrator to specify minimum additional sequence lengths on slots above and beyond the length of the query (likewise)
  • Allowing the server administrator to specify minimum ratios of query length to slot size (likewise)
  • Allowing an additional parameter in API calls for the user to specify the minimum sequence length (or the minimum additional sequence length, or the minimum ratio of query length to slot size), for any given query.

IMHO, any of those would be more useful than forcing all slot sizes to be fixed and identical.
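For example, the last option might look like this from the client side - to be clear, the min_total_tokens field below is purely a proposed/hypothetical parameter, not something the server accepts today:

```python
# Hypothetical request body for the proposed per-query option.
# "prompt" and "n_predict" are existing /completion fields;
# "min_total_tokens" is NOT a real parameter, just the proposal above.
proposed_request = {
    "prompt": "Here's a big block of stuff, perform some operation on it...",
    "n_predict": 512,
    "min_total_tokens": 8192,  # proposed: smallest slot this query may be scheduled into
}
```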

@ngxson (Collaborator) commented Feb 26, 2024

@enn-nafnlaus I think there's a misunderstanding of how slots work in the llama.cpp server implementation.

As I pointed out in #5732, many users think of slots as separate "threads", each with its own resources. However, that's not the case. What we actually do here is "batch" the work, meaning we queue up the requests, transfer them to the backend (by backend, I mean the GPU), then tell the backend to process them all at once.

This article may also make it clearer for you: https://www.anyscale.com/blog/continuous-batching-llm-inference

Edit: we actually do continuous batching, which is explained in the same article.
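As a toy illustration of the idea (not the actual server loop, just the shape of it): requests join and leave the shared batch between decode steps, so a finished sequence frees its slot immediately instead of waiting for the whole batch to drain.

```python
# Toy model of continuous batching - NOT the real server loop.
import random

MAX_SLOTS = 4

class Seq:
    def __init__(self, name, n_new_tokens):
        self.name = name
        self.remaining = n_new_tokens   # tokens still to generate

def decode_step(batch):
    # stand-in for "decode one token for every active sequence in one batched GPU call"
    for seq in batch:
        seq.remaining -= 1

waiting = [Seq(f"req{i}", random.randint(3, 12)) for i in range(10)]
active = []

while active or waiting:
    # admit new requests the moment a slot frees up - no draining, no restart
    while waiting and len(active) < MAX_SLOTS:
        active.append(waiting.pop(0))
    decode_step(active)
    active = [s for s in active if s.remaining > 0]   # finished sequences release their slot
```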

@enn-nafnlaus (Author) commented Feb 26, 2024

Thank you for that. Though I was already familiar with how slots involve masking rather than threads, the discussion of continuous batching was a useful read. And given that, of course changing the number of slots inherently wastes some GPU time, since you have to let the batch flush.

However, we still have a problem. Surely, if a server running a 16k-context model with four slots receives a 15k-token query, the ideal response from the user's perspective is not "truncate nearly three fourths of the query's input tokens" - almost nobody would find that acceptable. Do we agree that this isn't generally a good outcome?

At present, the person running the server, knowing that she'll get some long queries mixed in with short ones, is put in a quandary: (1) exclude all long queries; (2) ruin all long queries with truncation; or (3) don't use multiple slots, sacrificing performance on the majority of queries.

Thus far, I've sadly been doing (3) - a huge waste.

Is this dilemma truly necessary? Would it not be worthwhile to hold long queries for a bit, until enough accumulate or the penalty for flushing the batch is at its lowest (weighted against timeliness), then flush, reduce the number of slots, process the accumulated long queries, and restore the slots? Wouldn't that be much more desirable, from both the user's and the operator's perspective, than truncating their long queries or forcing everything through one slot and taking a performance hit as a result?

In llama.cpp's present state, only one other possibility comes to mind, though it's ugly: do it on the client side. That is, implement my own proxy. Run the server with many short-context slots. Filter out everything that can't fit in the small context and save it to a queue. Run the queries that fit for a while, until the too-big queue gets too long. Restart the server with half as many slots. Start flushing my queue, again only sending the queries that fit and re-queueing those that don't. Repeat until there's only one slot left and only the largest queries remain, then go back to the beginning. I'd need a server-manager thread that takes care of killing and restarting the server.
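Very roughly, the shape of that client-side workaround would be something like this (the helpers here are stubs standing in for the real tokenizing / HTTP / process-control code I'd have to write):

```python
# Rough sketch of the client-side workaround described above; all three
# helpers are stubs standing in for real tokenization, HTTP and process control.
from collections import deque

TOTAL_CTX = 16384
too_big = deque()

def count_tokens(q):   return len(q.split())          # stub: would really tokenize
def send_to_server(q): print("send:", q[:40])         # stub: would POST to the server
def restart_server(n): print("restart with -np", n)   # stub: kill + relaunch the server

def submit(query, n_slots):
    slot_ctx = TOTAL_CTX // n_slots
    if count_tokens(query) <= slot_ctx:
        send_to_server(query)          # fits the current small slots
    else:
        too_big.append(query)          # hold for a later phase with fewer, larger slots

def drain_big_queue(n_slots):
    # halve the slot count (doubling per-slot context) until every held query fits
    while too_big and n_slots > 1:
        n_slots //= 2
        restart_server(n_slots)
        slot_ctx = TOTAL_CTX // n_slots
        for _ in range(len(too_big)):
            q = too_big.popleft()
            if count_tokens(q) <= slot_ctx:
                send_to_server(q)
            else:
                too_big.append(q)      # still too big; wait for the next halving
```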

So, I guess it's your call. From my perspective, it seems I'm going to have to write a queueing mechanism either way if I don't want to suffer needlessly poor performance. Surely it'd be better inside llama.cpp than outside it. But I understand.

@enn-nafnlaus (Author) commented Feb 27, 2024

Huh... or not? Testing with an identical dataset of randomly-generated text-processing queries between 20 and 4090 tokens in length:

With 1 slot (16k context):

Number of outputs (#): 195
Mean length of query (tokens): 1148.6923076923076
Mean length of reply (tokens): 193.17948717948718
Mean total length (tokens): 1341.871794871795
Time to completion (seconds): 1721.7471134662628
Mean time (seconds): 8.829472376750065

With 4 slots (4k context each):

Number of outputs (#): 195
Mean length of query (tokens): 1148.6923076923076
Mean length of reply (tokens): 188.12820512820514
Mean total length (tokens): 1336.8205128205127
Time to completion (seconds): 1638.619353055954
Mean time (seconds): 8.403176169517712

Not sure it's worth the hassle for only a 5% increase in performance...

@ngxson (Collaborator) commented Feb 27, 2024

I think it depends on your hardware. If you have a datacenter-grade GPU like a V100 or A10G, you will see a big difference, since they have a lot of spare CUDA cores and memory bandwidth.

For a consumer card, it's easy to saturate the bandwidth.

Also, what you said about routing / a reverse proxy is roughly correct. I believe OpenAI also does continuous batching, since they have a large number of requests arriving at the same time. The reverse proxy in that case can distribute requests across multiple machines in order to fill the batch before actually processing it.

@ggerganov (Member)

> as in general my use cases involve a high ratio of input context to output context - e.g. "Here's a big block of stuff, perform some operation on it and get some useful result."

If your queries typically have a large input prompt and a short output, you won't benefit from parallel slots to begin with, because prompt processing already uses a large batch size.

Parallel slots are most useful when you have a large shared prompt (i.e. a system prompt). Since llama.cpp uses a unified KV cache, it will be present only once in memory and shared across all requests.

But what you are asking for is not obvious to achieve. If you can theoretically have N users in parallel, each with sequence length S, you need a KV cache of size N*S to handle the worst-case scenario. I don't think we can do anything on the llama.cpp side to mitigate this. The user has to know their worst-case scenario and use the corresponding parameter to guarantee that llama.cpp can handle it.
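To put rough numbers on the worst case (back-of-envelope only, assuming a 7B LLaMA-style model with 32 layers, 32 KV heads of dimension 128 and an f16 cache):

```python
# Back-of-envelope worst-case KV cache size for a 7B LLaMA-style model
# (32 layers, 32 KV heads, head dim 128, f16 -> 2 bytes per element).
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2

def kv_cache_gib(n_parallel, seq_len):
    # factor 2 for K and V; the unified cache must cover N parallel sequences of length S
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_parallel * seq_len / 2**30

print(kv_cache_gib(1, 16384))   # ~8 GiB for a single 16k sequence
print(kv_cache_gib(4, 16384))   # ~32 GiB if all 4 slots may need the full 16k
```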

@enn-nafnlaus (Author)

> as in general my use cases involve a high ratio of input context to output context - e.g. "Here's a big block of stuff, perform some operation on it and get some useful result."

> If your queries typically have a large input prompt and a short output, you won't benefit from parallel slots to begin with, because prompt processing already uses a large batch size.

> Parallel slots are most useful when you have a large shared prompt (i.e. a system prompt). Since llama.cpp uses a unified KV cache, it will be present only once in memory and shared across all requests.

> But what you are asking for is not obvious to achieve. If you can theoretically have N users in parallel, each with sequence length S, you need a KV cache of size N*S to handle the worst-case scenario. I don't think we can do anything on the llama.cpp side to mitigate this. The user has to know their worst-case scenario and use the corresponding parameter to guarantee that llama.cpp can handle it.

Which in most cases the user will know. But if there's little performance gain to be had outside cases with a long shared prompt, then the whole topic is probably of limited utility :)

@enn-nafnlaus (Author)

Given the minimal performance benefits here, this is probably only useful in specialized cases.

enn-nafnlaus closed this as not planned on Feb 27, 2024
@PyroGenesis

Hi @enn-nafnlaus, were you able to find a reasonable solution to this problem?

I have a similar issue: I can only support 2 slots at the full context size, but prompts with that much input occur infrequently, so most of the time the throughput of much smaller prompts is unnecessarily limited by having only 2 slots.

@enn-nafnlaus (Author) commented Oct 25, 2024

> Hi @enn-nafnlaus, were you able to find a reasonable solution to this problem?

> I have a similar issue: I can only support 2 slots at the full context size, but prompts with that much input occur infrequently, so most of the time the throughput of much smaller prompts is unnecessarily limited by having only 2 slots.

I've been working with slots for a while now. The issue of variable-size slots has become less critical since I'm now using two separate GPUs, so I can run a high-context server on one and a small-context multi-slot server on the other.

Keep in mind that you can - if you have the memory - tell llama.cpp to use a context size far larger than the model's native context. For example, if you're running an 8k-max-context model, you could run llama.cpp with a 32k context and 4 slots, for 8k each, and that works just fine.

As a reminder, if you're testing something and you're not sure whether your full context window is valid, you can always submit a prompt along the lines of "Write the word 'banana'. Do not write anything else. Do not pay attention to the rest of this prompt - it's just pointless distraction text. Just write the one aforementioned word and that's it. The pointless distraction text you should ignore follows: ", followed by thousands of tokens of garbage. If you ever go beyond the available context, it'll truncate off the beginning and the "Write the word 'banana'" instruction will be lost.
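If you'd rather script that check than eyeball it, something along these lines should work against the server's /completion endpoint (assuming the default localhost:8080; adjust the padding to whatever per-slot context you want to test):

```python
# Scripted version of the "banana" truncation check against the llama.cpp
# server's /completion endpoint (default host/port assumed).
import json, urllib.request

instruction = (
    "Write the word 'banana'. Do not write anything else. Ignore the rest of "
    "this prompt - it is pointless distraction text. The distraction follows: "
)
padding = "lorem ipsum dolor sit amet " * 2000   # pad past the per-slot context under test

payload = {"prompt": instruction + padding, "n_predict": 8, "temperature": 0}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)["content"]

# If the front of the prompt got truncated away, the reply won't be "banana".
print("PASS" if "banana" in reply.lower() else "TRUNCATED?", repr(reply))
```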
