[RFC]: Distribute LoRA adapters across deployment #12174
Comments
I like the first option; I'm not sure it makes sense to have multiple instances of vLLM know about each other. The routing service could periodically poll the replicas. I think the solution to this would be helpful in other scenarios too, such as prefix-aware routing, where we could route requests to vLLM instances that already have their prefixes cached.
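A minimal sketch of that polling approach, assuming each replica exposes its loaded adapters via the OpenAI-compatible `/v1/models` endpoint (the replica URLs below are placeholders):

```python
import requests

# Hypothetical replica addresses; in a real deployment these would come
# from service discovery (e.g. Kubernetes endpoints).
REPLICAS = ["http://vllm-0:8000", "http://vllm-1:8000"]

def build_routing_table() -> dict[str, list[str]]:
    """Map each served model/adapter name to the replicas that report it."""
    table: dict[str, list[str]] = {}
    for replica in REPLICAS:
        resp = requests.get(f"{replica}/v1/models", timeout=5)
        resp.raise_for_status()
        for model in resp.json().get("data", []):
            table.setdefault(model["id"], []).append(replica)
    return table

# A router would refresh this table periodically and pick a replica from
# table[adapter_name] when a request for that adapter arrives.
```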
I agree! Though also having been into the internals of
Exactly, though as pointed out in the RFC, I think all of these routing optimizations would need to be handled together: you can't have multiple routers disagreeing about where to place a request.
cc @Jeffwan
Hi @joerunde @Akshat-Tripathi, great proposal. Did you get a chance to take a look at my earlier proposal here? I shared it with the Kubernetes community folks earlier: https://docs.google.com/document/d/125RfImuvCds2UWrWj4mj7mHWpQOYtaR0nkXxWWZaxDs/edit?tab=t.0#heading=h.fxzzb3b7imjf We already built one that is running internally, and if there are common needs, I can publish our solution. Do you want to build this into kserve, or do you just want to use this capability? /cc @jeejeelee
I think option 2 solves all these problems in the cleanest way. Assuming that the instances poll…
@Jeffwan Thanks for the heads up! I haven't seen the proposal; I just sent you an access request.
I only put kserve as an example; I don't think we necessarily have to land anything there. If we want to work together on it, putting it in the vllm org might be the way to go.
@maxdebayser yeah, I think it's a nice middle ground, and maybe the watch APIs in etcd can provide more immediate consistency across the cluster as well. What do you think about implementing that, even if an open source routing solution is available? I think I'd prefer to focus on just one solution to reduce complexity, but I could also see the value in having multiple options, say if a user also built their own router and doesn't want to use the open source one.
Hey all! I'm from https://github.com/kubernetes-sigs/gateway-api-inference-extension. We're trying to tackle multi-LoRA serving as well; we use vLLM as our current de facto model server and would love to work with y'all on this!
@joerunde Agreed! We would love to collaborate and work with ya; our meetings are every Thursday at 10 AM PDT if you have time to join us!
Yeah, agreed. Our thinking was that separation of concerns dictates that model servers should be focused on serving, and that a load balancer should worry about global state.
Def agreed that prefix-aware routing is powerful; it's something we are looking at also.
Are we thinking this is something vLLM would manage? I liked @Jeffwan's paper on the topic; managing replicas seems well within the K8s skillset.
Operationally, it might be simpler to go one step further and enable true dynamic loading on demand, with a GC mechanism to purge least-recently-used / unused adapters. I.e. a static definition of a catalog of adapter sources (and a loading mechanism), with new requests queued until the adapter is loaded. GC would be balanced against KV cache usage, subject to operational thresholds ("don't load more than N% of KV cache with adapters"). That "option 4" would leave the existing load/unload available for more sophisticated behavior / preloading, but would let a broader range of environments use dynamic LoRA without having to build a controller to manage LoRA load state. Perhaps this has already been considered elsewhere (I looked, but couldn't find a concrete discussion)?
@smarterclayton, in option 4, do you mean loading any LoRA adapter that is specified in the completion request if it's not already loaded? If my understanding is correct, how would we restrict the permissible set of LoRA adapters? In addition, the router would also need to keep track of which pods have handled requests for adapter X recently to take advantage of pre-loaded adapters.

@Jeffwan, @joerunde, so Jeff's proposal would essentially be Option 2, where the external state is tracked by the CR controller, but with an inverted control flow, because the controller would push changes to the pods instead of the pods polling external state, right?
@maxdebayser Yeah, the design principle I feel here is not to introduce traffic flow from the pod (vLLM) to the control plane. It's fine to send requests to, or pull status from, the engine. I think this issue talks about a few things together: routing, LoRA orchestration, and artifact management. These need to be decoupled first. (To me, option 1 is about routing or service discovery, while option 3 focuses more on artifact management.) The LoRA source-of-truth information is stored in, and can be exposed from, the vLLM engine. Any external state needs to be synced with the engine anyway (for example, on engine crash, preemption, etc.). Any solution that can achieve such a mechanism would be solid.
@kfswain Thanks! I missed today's meeting, but I can join next week!

@smarterclayton I think that most of "Option 4" is implemented in the vLLM engine already: it will load LoRA adapters on demand, using a fixed-size LRU cache whose vRAM has already been budgeted for when sizing the KV cache. The only remaining problem is that the vLLM frontend won't know what to do when it receives a request where the model name is a LoRA adapter.

@Jeffwan I love the proposal, and agree with all the points about how vLLM should not be handling cluster resources, nor should it be making requests to the control plane.
To be clear, though, I was intending option 3 to be about discovery, i.e. using the filesystem to track which LoRA adapters have been added and should be available for any pod to load on request. (Though yes, the FS would probably also hold the LoRA adapters themselves.) I think option 3 is objectively worse than what your paper proposes, since it has no way to balance loaded adapters across pods or reconcile state when it changes.
Yes, that's in my mind a gap that can and should be resolved in vLLM. Model artifacts are overwhelmingly stored in object storage / shared file systems with write-once semantics and unique naming; finding patterns that avoid the need for data-plane components to coordinate with the control plane on restart or during long outages is a best practice that we should encourage. Since model servers will have the highest replication factor of all components in the system, and the majority of model consumers will have proxy infrastructure in front of the model servers, it's worth removing the need for vLLM to manage a model-name-to-adapter-name map and all of the operational complexity Jiaxin mentioned (a separate etcd, Redis, a controller querying Kubernetes).
💯 A couple concrete examples:
It seems to me that supporting those use cases requires integration outside of vLLM, but to be fair, these only really matter for enterprise products. I'm still just debating whether we should support the 80% use case with a simple solution that has these known limitations. Maybe you're right and I'm just overthinking it 🤷
Motivation.
Production LoRA serving
This RFC lays out the current limitations in online LoRA serving, potential solutions, and a proposal for implementation.
Context
What we would like to offer SaaS products is a way to serve a single, multi-replica deployment of an LLM, where multiple tenants can each load or unload their own LoRA adapters for that LLM as needed without requiring downtime or redeployment.
However, the only "non-development" way to serve LoRA adapters for online inference with vLLM today is to tell vLLM about them ahead of time with the `--lora-modules` CLI argument. This presents a problem for products that want to adopt vLLM for multi-tenant LoRA serving, as the only way to load a new adapter is to redeploy the entire service.

There is a "development mode" method to dynamically load LoRA adapters: setting `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` will enable the `/v1/load_lora_adapter` and `/v1/unload_lora_adapter` endpoints, which can be used to load or unload new LoRA adapters at runtime. However, this is currently inappropriate for production use, because it neither propagates adapter changes to the other replicas of a deployment, nor persists them so that they survive a restart. Solving both of these problems is necessary to offer multi-tenant LoRA serving in production settings.
The rest of this RFC makes the same assumptions as the `/v1/load_lora_adapter` endpoint about where the LoRA adapters in question come from.

The problem described here is tracking the metadata of which adapters should be loaded at any point in time across a deployment. Storing and loading the adapter artifacts themselves is yet another problem; other updates can be made to vLLM to address that, such as changes to the `/v1/load_lora_adapter` payloads, etc.

Proposed Change.
General Solution Ideas
Option 1: Handle externally with smart routing
One option is to ignore the problem entirely at the vLLM level, and have an external routing component ensure that requests are only routed to replicas which have the adapter loaded. For example, kserve/modelmesh-serving provides a general purpose solution to this problem.
It would be possible to implement the internal APIs required for `modelmesh` in vLLM so that kserve could handle loading and routing for LoRA adapters without any extra state management in vLLM. There are probably some other third-party components that could be used in the same way, or we could write our own routing component.

Pros:
Cons:
Option 2: Use external state management to track adapters
Another option is to have vLLM use an external data store like etcd directly to track loaded adapters. This would be a lighter-weight option than relying on a third-party model management solution, but would still introduce extra deployment dependencies and overhead. A rough sketch of this interaction is shown below.
Pros:
Cons:
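To make Option 2 a bit more concrete, here is a minimal sketch of the kind of interaction it implies, using the `etcd3` Python client; the key prefix and payload shape are assumptions for illustration, not a settled design:

```python
import json
import etcd3

# Assumed key layout for a single deployment's adapters.
ADAPTER_PREFIX = "/vllm/deployments/my-deployment/adapters/"

client = etcd3.client(host="etcd", port=2379)

def register_adapter(name: str, path: str) -> None:
    """Record that an adapter should be loaded everywhere in the deployment."""
    client.put(ADAPTER_PREFIX + name,
               json.dumps({"lora_name": name, "lora_path": path}))

def list_adapters() -> list[dict]:
    """Each replica would poll (or watch) this prefix to reconcile its state."""
    return [json.loads(value) for value, _meta in client.get_prefix(ADAPTER_PREFIX)]
```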
Option 3: Use simple disk-based storage to track loaded adapters
Often, replicas of a deployment will mount a shared filesystem to access common artifacts. For example, in kubernetes deployments an S3 bucket can be mounted as network-attached storage with a persistent volume claim using S3FS. This shared filesystem space can be used to write simple files that track metadata for the adapters to be loaded for a given deployment.
Pros:
Cons:
Proposal
These options aren't necessarily mutually exclusive, so we propose implementing Option 3 as a short term solution.
The simplest implementation would be to store the payloads from `/v1/load_lora_adapter` as JSON files in a configurable directory. The name of the file should be the adapter name, to easily identify whether an adapter is loaded without reading file contents. At load time, this file should be written after the adapter successfully loads. When the `/v1/models` endpoint is called, these files should be used to determine the full set of available adapters so that responses from all replicas are consistent (within the constraints of the underlying filesystem used).

The entire implementation can be contained within the API layer of vLLM (`vllm.entrypoints.openai.*`), and no changes would be required to the LoRA caching mechanism in the engine.
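A minimal sketch of the proposed file-based tracking, assuming a shared directory mounted at the same path on every replica; the directory path and helper names are illustrative only:

```python
import json
from pathlib import Path

ADAPTER_DIR = Path("/mnt/shared/vllm-adapters")  # assumed shared mount

def record_loaded_adapter(payload: dict) -> None:
    """Write the /v1/load_lora_adapter payload after the adapter loads successfully."""
    ADAPTER_DIR.mkdir(parents=True, exist_ok=True)
    # File name == adapter name, so existence checks need no file reads.
    (ADAPTER_DIR / f"{payload['lora_name']}.json").write_text(json.dumps(payload))

def list_available_adapters() -> list[dict]:
    """Used when serving /v1/models so every replica reports the same set of adapters."""
    return [json.loads(p.read_text()) for p in ADAPTER_DIR.glob("*.json")]
```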
Assumptions:
Open questions:
Where should the adapter files be stored?
One option is the existing configuration directory, e.g. `${VLLM_CONFIG_ROOT}/adapters`. This seems appropriate for storing metadata files about loaded adapters, but may be inappropriate for later expansion to caching the actual adapter artifacts, if we end up going that route. We could introduce a new `VLLM_LORA_ADAPTER_CACHE` environment variable for clarity about where this data is stored.

How should we handle deleting adapters, i.e. `/v1/unload_lora_adapter`?

Deleting the metadata files seems appropriate, but propagating the deletion across replicas seems tricky. Filesystem-watch APIs are both OS- and filesystem-dependent and have limitations on some network-backed storage (e.g. you can't use inotify with S3FS). We could check file existence on every inference API call for each adapter, but that would add overhead to the critical path. It may be sufficient to not attempt to unload an adapter from all replicas, instead allowing each replica to eventually unload it later (for example, when the replica restarts and re-reads the metadata directory).
This would mean an adapter could remain available for inference on some replicas after unload; however, since we assume that the consuming application is providing access control, this may not be a problem.
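Purely to illustrate the trade-off above, one possible middle ground is a periodic background reconciliation that unloads adapters whose metadata file has disappeared, rather than checking on every request; the `unload_adapter` hook below is a hypothetical stand-in for the existing unload path:

```python
import asyncio
from pathlib import Path

ADAPTER_DIR = Path("/mnt/shared/vllm-adapters")  # same assumed shared mount as above

async def reconcile_deletions(loaded: set[str], unload_adapter, interval_s: float = 60.0):
    """Periodically unload adapters whose metadata file was removed by another replica."""
    while True:
        on_disk = {p.stem for p in ADAPTER_DIR.glob("*.json")}
        for name in loaded - on_disk:
            await unload_adapter(name)  # hypothetical hook into the existing unload logic
            loaded.discard(name)
        await asyncio.sleep(interval_s)
```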
Should adapters be loaded at boot time?
Currently we validate all LoRA adapters given statically by the `--lora-modules` CLI arg by loading them at boot time. With dynamically loaded adapters, there may be an unbounded number of adapters to potentially load. The LRU caching mechanism in the engine ensures only the ones in use stay loaded, but there could be far more adapters "loaded" than can fit in the cache. We can assume that if adapter metadata is in the cache, then it has been successfully loaded before, so we don't need to re-load it to verify at boot time, and we should lazily load at inference time instead. But if the number of adapters is low (less than the cache size), we might still want to eagerly load at boot to trade a slightly longer boot time for lower latency on first inference.

Feedback Period.
Through 2/1/25
CC List.
@njhill @wangchen615 @tjohnson31415 @maxdebayser
Any Other Things.
No response