Smarter slot handling #5737
How would that work? We don't know in advance how long a query will be, because generation can proceed indefinitely.
Fully honest here - I didn't even think about that aspect, as in general my use cases involve a high ratio of input context to output context - e.g. "Here's a big block of stuff, perform some operation on it and get some useful result." That said, there are numerous possible solutions, including:
IMHO, any of those would be more useful than forcing all slot sizes to be fixed and identical.
@enn-nafnlaus I think there is a misunderstanding of how slots work in the llama.cpp server implementation. As I pointed out in #5732, many users consider slots to be separate "threads", each with its own resources. However, that's not the case. What we actually do here is "batch" the processing, meaning we queue up the work, transfer it to the backend (by backend, I mean the GPU), then tell the backend to process it all at once. This article may also make it clearer for you: https://www.anyscale.com/blog/continuous-batching-llm-inference

Edit: we actually do continuous batching, which is explained in the same article.
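To make the batching idea concrete, here is a minimal, illustrative sketch (not llama.cpp's actual code) of a continuous-batching loop: all active slots advance by one token per backend call, and a finished sequence frees its place for a waiting request immediately. The `decode_batch` callable is a stand-in for the real batched decode.

```python
from collections import deque

class Slot:
    def __init__(self, prompt_tokens, max_new_tokens):
        self.tokens = list(prompt_tokens)   # context accumulated so far
        self.remaining = max_new_tokens     # generation budget left
        self.done = False

def serve(pending_requests, n_slots, decode_batch):
    """Toy continuous-batching loop.

    decode_batch(slots) stands in for one batched backend call that
    returns the next token for every active slot at once (the real
    server builds a single batch and decodes it on the GPU).
    """
    queue = deque(pending_requests)         # (prompt_tokens, max_new) pairs
    active = []
    while queue or active:
        # Fill any free slots from the queue between decode steps.
        while queue and len(active) < n_slots:
            prompt, max_new = queue.popleft()
            active.append(Slot(prompt, max_new))

        # One batched step: every active slot advances by one token.
        for slot, token in zip(active, decode_batch(active)):
            slot.tokens.append(token)
            slot.remaining -= 1
            if token == "<eos>" or slot.remaining == 0:
                slot.done = True

        # Finished slots are released immediately, so waiting requests
        # can join the batch on the very next iteration.
        active = [s for s in active if not s.done]
```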
Thank you for that. Though I was already familiar with how slots involve masking, not threads, the discussion on continuous batching was a useful read. And given that, of course changing the number of slots inherently means some wasted GPU time, since you have to let the batch flush.

However, we still have a problem. Surely, if a 16k-context model running on a server with four slots receives a 15k-token query, the ideal response from the user's perspective is not "truncate nearly three fourths of the query's input tokens" - almost nobody would find that an ideal situation. Do we agree that this isn't generally a good outcome?

From the perspective of the person running the server, who knows she'll get some long queries mixed in with short ones, the present situation is a quandary: (1) exclude all long queries; (2) ruin all long queries with truncation; or (3) don't use multiple slots, at the cost of a performance hit to the majority of queries. Thus far, I've sadly been doing #3 - a huge waste.

Is this dilemma truly necessary? Or would it not be worthwhile to hold long queries for a bit until enough accumulate or the penalty for flushing the batches is at its lowest (weighted against timeliness), flush, reduce the number of slots, process the accumulated long queries, then restore the slots? Wouldn't that be much more desirable from the user's perspective than truncating their long queries, or forcing them onto a single slot and suffering a performance hit as a result?

In llama.cpp's present state, only one other possibility comes to mind, though it's ugly: do it on the client side. That is, implement my own proxy. Run the server with a lot of slots with short contexts. Filter out everything that can't fit in the small context and save it to a queue. Run those that fit for a while until the too-big queue gets too long. Restart the server with half as many slots. Start flushing my queue, again only sending the queries that fit and re-queueing those that don't. Repeat until there's only one slot and only the largest queries remain, then go back to the beginning. I'd also need a server-manager thread that takes care of killing and restarting the server.

So, I guess it's your call. From my perspective, it seems I'm going to have to write a queueing mechanism either way if I don't want to suffer from needlessly poor performance. Surely it'd be better inside llama.cpp than out of it. But I understand.
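For illustration, a rough sketch of that client-side workaround, with hypothetical `restart_server` / `send_to_server` helpers and an `n_tokens` field standing in for whatever token counting the proxy would do. It only shows the queue/re-queue logic; restoring the slot count once the backlog drains is left out.

```python
import queue

def run_proxy(requests_in, total_ctx, restart_server, send_to_server,
              max_backlog=32):
    """Toy version of the client-side scheme sketched above.

    restart_server(n_slots) and send_to_server(req) are hypothetical
    helpers: the first kills and relaunches the server with the given
    slot count, the second forwards a request and returns its result.
    Each request is assumed to carry a precomputed req.n_tokens.
    """
    n_slots = 8                        # start with many small slots
    too_big = queue.Queue()            # requests that don't fit right now
    restart_server(n_slots)

    for req in requests_in:
        slot_ctx = total_ctx // n_slots
        if req.n_tokens <= slot_ctx:
            send_to_server(req)
        else:
            too_big.put(req)

        # Once the backlog of oversized requests grows too long, halve
        # the slot count (doubling per-slot context) and drain whatever
        # now fits; anything still too large is re-queued.
        if too_big.qsize() >= max_backlog and n_slots > 1:
            n_slots //= 2
            restart_server(n_slots)
            slot_ctx = total_ctx // n_slots
            still_too_big = queue.Queue()
            while not too_big.empty():
                queued = too_big.get()
                if queued.n_tokens <= slot_ctx:
                    send_to_server(queued)
                else:
                    still_too_big.put(queued)
            too_big = still_too_big
```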
Huh... or not? Testing with an identical dataset of randomly-generated text-processing queries between 20 and 4090 tokens in length:

With 1 slot (16k context): number of outputs: 195
With 4 slots (4k context each): number of outputs: 195

Not sure it's worth the hassle for only a 5% increase in performance...
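For reference, a benchmark along these lines can be sketched against the server's `/completion` endpoint roughly as below; the prompt generator, request count, and concurrency level here are placeholders, not the dataset actually used above.

```python
import concurrent.futures
import random
import string
import time

import requests  # pip install requests

SERVER = "http://localhost:8080/completion"  # llama.cpp server endpoint

def random_query(n_words):
    """Build a throwaway text-processing prompt of roughly n_words words."""
    words = ("".join(random.choices(string.ascii_lowercase, k=5))
             for _ in range(n_words))
    return "Summarize the following text: " + " ".join(words)

def run_one(prompt):
    r = requests.post(SERVER, json={"prompt": prompt, "n_predict": 128})
    r.raise_for_status()
    return r.json()["content"]

def benchmark(n_requests=195, concurrency=8):
    prompts = [random_query(random.randint(20, 3000))
               for _ in range(n_requests)]
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        outputs = list(pool.map(run_one, prompts))
    elapsed = time.time() - start
    print(f"{len(outputs)} outputs in {elapsed:.1f}s "
          f"({len(outputs) / elapsed:.2f} requests/s)")

if __name__ == "__main__":
    benchmark()
```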
I think it depends on your hardware. If you have a datacenter-grade GPU like a V100 or A10G, you will see a big difference, since they have so much spare CUDA capacity and memory bandwidth. On a consumer card, it's easy to saturate the bandwidth.

Also, what you said about routing / a reverse proxy is roughly correct. I believe OpenAI also does continuous batching, since they have a large number of requests arriving at the same time. The reverse proxy in that case can distribute the requests across multiple machines, in order to fill each batch before actually processing it.
If your queries typically have a large input prompt and a short output, you won't benefit from parallel slots to begin with, because prompt processing already uses a large batch size. Parallel slots are most useful when you have a large shared prompt (i.e. a system prompt). But what you are asking for is not obvious how to achieve.
Which in most cases the user will know. But if there's little performance gain to be had outside of cases involving a long shared prompt, then the whole topic is probably of limited utility :)
Given the minimal performance benefits here, it's probably only useful in specialized cases.
Hi @enn-nafnlaus, were you able to find a reasonable solution to this problem? I have a similar issue where I can only support 2 slots at the full context size, but prompts with that much input occur infrequently, so most of the time the performance of much smaller prompts is degraded to 2 slots unnecessarily.
I've been working with slots for a while now. The issue of variable-size slots has become less critical for me since I'm now using two separate GPUs, so I can run a high-context server on one and a small-context multi-slot server on the other.

Keep in mind that you can - if you have the memory - tell llama.cpp to use a context size far larger than the model's context size. For example, if you're running an 8k-max-context model, you can run llama.cpp with a 32k context and 4 slots, for 8k each, and that works just fine.

As a reminder, if you're testing something and you're not sure whether your full context window is valid, you can always submit a prompt like: "Write the word 'banana'. Do not write anything else. Do not pay attention to the rest of this prompt - it's just pointless distraction text. Just write the one aforementioned word and that's it. The pointless distraction text you should ignore follows:", followed by thousands of tokens of garbage. If the prompt exceeds the context actually available, the beginning gets truncated and the "Write the word 'banana'" instruction is lost.
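A hedged sketch of that probe against a local server's `/completion` endpoint (assuming it was started with something like `./server -m model.gguf -c 32768 -np 4`; the filler sizes are arbitrary):

```python
import requests  # pip install requests

# Assumes the server was started with something like:
#   ./server -m model-8k.gguf -c 32768 -np 4    # 4 slots of 8k each
SERVER = "http://localhost:8080/completion"

def probe_context(n_filler_words):
    """Send the 'banana' probe: instruction first, garbage after.

    If the prompt exceeds the context actually available to the slot,
    the beginning gets truncated, the instruction is lost, and the
    reply no longer contains the expected word.
    """
    instruction = (
        "Write the word 'banana'. Do not write anything else. "
        "Ignore the rest of this prompt - it is pointless distraction text: "
    )
    filler = " ".join("blah" for _ in range(n_filler_words))
    r = requests.post(SERVER, json={"prompt": instruction + filler,
                                    "n_predict": 8})
    r.raise_for_status()
    reply = r.json()["content"]
    print(f"{n_filler_words} filler words -> {reply!r}")
    return "banana" in reply.lower()

if __name__ == "__main__":
    for n in (1000, 4000, 7000, 10000):
        probe_context(n)
```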
The current system of available slots with -np is frustrating in that it forces one to only allow queries of greatly reduced maximum token count. For example, if you have a context length of 16k and you want four slots, each will only be 4k, and you can no longer run any 16k queries at all without them being heavily truncated.
While a partial solution would be to allow the operator to specify the number of tokens in each slot, so that they could at least leave one high-token-count slot, an ideal solution would be an adaptive server: look at what's in the queue and, using a combination of how long each query has been waiting and how well different queries could be packed into the max context length, determine which to run and how many slots of what size to use.
While I wouldn't be an ideal person to write the slot-handling side of things, I'd be more than happy to write the queueing mechanism for you if this were of interest. I would just need to know what sort of data structure you could provide for the queue and what limitations there would be on slots (including any performance considerations).
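As a sketch of the kind of queueing logic meant here (names and the packing heuristic are purely illustrative, not a proposed API): pick the set of waiting queries that fits into the shared context, taking into account how long each has waited.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pending:
    prompt: str
    n_tokens: int                      # prompt length plus generation budget
    arrived: float = field(default_factory=time.time)

def pick_batch(waiting: List[Pending], total_ctx: int) -> List[Pending]:
    """Greedily pack waiting queries into the shared context window.

    The oldest requests are considered first so that large queries are
    not starved forever; a query is admitted only while it still fits in
    the remaining context. The caller would then size the slots around
    the chosen queries and submit them as one batch.
    """
    batch, used = [], 0
    for req in sorted(waiting, key=lambda p: p.arrived):
        if used + req.n_tokens <= total_ctx:
            batch.append(req)
            used += req.n_tokens
    return batch

# Example: with a 16k context, one 15k query and several 1k queries in
# the queue, the oldest 15k query plus one 1k query would run now and
# the rest would wait for the next round.
```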