
[RFC]: Distribute LoRA adapters across deployment #12174

Open · 1 task done
joerunde opened this issue Jan 17, 2025 · 15 comments

@joerunde
Collaborator
Motivation.

Production LoRA serving

This RFC lays out the current limitations in online LoRA serving, potential solutions, and a proposal for implementation.

Context

We would like to offer SaaS products a way to serve a single, multi-replica deployment of an LLM where multiple tenants can each load or unload their own LoRA adapters for that LLM as needed, without downtime or redeployment.

However, the only "non-development" way to serve LoRA adapters for online inference with vLLM today is to tell vLLM about them ahead of time with the --lora-modules CLI argument. This presents a problem for products that want to adopt vLLM for multi-tenant LoRA serving, as the only way to load a new adapter is to redeploy the entire service.

There is a "development mode" method to dynamically load LoRA adapters: Setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=True will enable the /v1/load_lora_adapter and /v1/unload_lora_adapter endpoints, which can be used to load or unload new LoRA adapters at runtime. However this is currently inappropriate for production use, because it neither:

  • Ensures the adapter is loaded across all replicas of the deployment
  • Guarantees that the adapter will be available on a new replica, or after a replica restart


Solving both of these problems is necessary to offer multi-tenant LoRA serving in production settings.
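
For context, here is a minimal sketch of that development-mode flow against a single replica. The payload field names follow the existing /v1/load_lora_adapter request schema; the base URL and adapter path are illustrative:

```python
# Sketch: dynamically loading a LoRA adapter on ONE replica in development mode.
# Requires the server to run with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True.
# The base URL and adapter path below are illustrative.
import requests

BASE_URL = "http://localhost:8000"  # a single replica; other replicas are unaffected

resp = requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "customer-a-adapter", "lora_path": "/mnt/adapters/customer-a"},
)
resp.raise_for_status()

# The adapter now shows up in /v1/models on this replica only, which is exactly
# the gap described above.
model_ids = [m["id"] for m in requests.get(f"{BASE_URL}/v1/models").json()["data"]]
print(model_ids)
```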

The rest of this RFC makes the same assumptions as the /v1/load_lora_adapter endpoint, i.e. that the LoRA adapters in question are either:

  1. To be downloaded from HF Hub, or
  2. Available on disk to the vLLM process

The problem described here is tracking the metadata of which adapters should be loaded at any point in time across a deployment. Storing and loading the adapter artifacts themselves is a separate problem; other updates could be made to vLLM to address that, such as:

  • Accepting generic URLs in /v1/load_lora_adapter payloads
  • Accepting a tar archive upload in /v1/load_lora_adapter, etc.

Proposed Change.

General Solution Ideas

Option 1: Handle externally with smart routing


One option is to ignore the problem entirely at the vLLM level, and have an external routing component ensure that requests are only routed to replicas which have the adapter loaded. For example, kserve/modelmesh-serving provides a general purpose solution to this problem.

It would be possible to implement the internal APIs required for modelmesh in vLLM so that kserve could handle loading and routing for LoRA adapters without any extra state management in vLLM. There are probably some other third-party components that could be used in the same way, or we could write our own routing component.

Pros:

  • No extra state management required in vLLM
  • Third party model management systems already offer compliance-ready solutions, handling issues like data security and backup and disaster recovery
  • Addressing this at the routing layer can allow us to manage which adapters, and how many, are loaded per replica. For large numbers of adapters, this could become necessary to avoid cache-thrashing issues in each replica.

Cons:

  • Doesn't offer a vLLM-native solution to the problem
  • Increases deployment complexity
  • Introduces deployment dependency on a third party component
  • Would collide with other routing strategies like
    • session-aware routing
    • prefill/decode disaggregation

Option 2: Use external state management to track adapters


Another option is to have vLLM use an external data store like etcd directly to track loaded adapters. This would be a lighter-weight option than relying on a third party model management solution, but would still introduce extra deployment dependencies and overhead.

Pros:

  • Distributed data stores like etcd are well-understood and production-tested
  • Backup and restore operations are relatively easy for service operators
  • Logic can be wrapped in atomic transactions to ensure consistency across a deployment
  • "Watch" apis can be used to push updates immediately to all replicas in a deployment
  • No routing changes needed, wouldn't collide with any other routing work

Cons:

  • Requires writing state management logic into vLLM
  • We have to maintain a database client, and any schema changes would need to be carefully considered
  • Introduces extra deployment dependency
  • Increases deployment complexity
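
To make Option 2 more concrete, here is a rough sketch of what the shared state could look like, assuming the third-party `etcd3` Python client; the key prefix, endpoint, and helper names are illustrative assumptions rather than an agreed design:

```python
# Rough sketch of Option 2 (not an agreed design), assuming the third-party
# `etcd3` client package; key layout and endpoint are illustrative.
import json

import etcd3

PREFIX = "/vllm/lora_adapters/"
client = etcd3.client(host="etcd", port=2379)

def register_adapter(lora_name: str, lora_path: str) -> None:
    # Record that this adapter should be loadable everywhere; each replica
    # reads this shared source of truth instead of keeping local-only state.
    client.put(
        PREFIX + lora_name,
        json.dumps({"lora_name": lora_name, "lora_path": lora_path}),
    )

def registered_adapters() -> dict:
    # Poll on boot (and periodically): every replica converges on the same set.
    return {
        meta.key.decode().removeprefix(PREFIX): json.loads(value)
        for value, meta in client.get_prefix(PREFIX)
    }

# Optionally, watch for push-style updates instead of polling.
events_iterator, cancel = client.watch_prefix(PREFIX)
```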

Option 3: Use simple disk-based storage to track loaded adapters


Often, replicas of a deployment will mount a shared filesystem to access common artifacts. For example, in Kubernetes deployments an S3 bucket can be mounted as network-attached storage with a persistent volume claim using S3FS. This shared filesystem space can be used to write simple files that track metadata for the adapters to be loaded for a given deployment.

Pros:

  • Simplest option, no additional code or deployment dependencies
  • Works anywhere you can mount a filesystem
  • Easy to implement and test locally

Cons:

  • Simple disk storage leaves encryption, backup and restore as exercises for service operators
  • Requires file write permissions, which may be a security risk
  • NAS systems are generally non-atomic: concurrent writes may appear to succeed but the last one will win
  • Consistency can be an issue depending on the filesystem used; writes may not be visible to other replicas for some time

Proposal

These options aren't necessarily mutually exclusive, so we propose implementing Option 3 as a short term solution.

The simplest implementation would be to store the payloads from /v1/load_lora_adapter as json files in a configurable directory. The name of the file should be the adapter name, so that we can easily tell whether an adapter is loaded without reading file contents. This file should be written only after the adapter successfully loads. When the /v1/models endpoint is called, these files should be used to determine the full set of available adapters, so that responses from all replicas are consistent (within the constraints of the underlying filesystem used).

The entire implementation can be contained within the API layer of vLLM (vllm.entrypoints.openai.*) and no changes would be required to the lora caching mechanism in the engine.
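
A minimal sketch of that API-layer logic (the directory location, the .json suffix, and the helper names are illustrative assumptions, not final API):

```python
# Sketch of the proposed metadata files; all of this lives in the API layer.
import json
from pathlib import Path

ADAPTER_DIR = Path("/mnt/shared/vllm-adapters")  # e.g. a mounted shared filesystem

def record_adapter(lora_name: str, lora_path: str) -> None:
    """Called by /v1/load_lora_adapter, only after the engine load succeeds."""
    if "/" in lora_name or lora_name in (".", ".."):
        raise ValueError("invalid adapter name")  # basic path-traversal guard
    ADAPTER_DIR.mkdir(parents=True, exist_ok=True)
    (ADAPTER_DIR / f"{lora_name}.json").write_text(
        json.dumps({"lora_name": lora_name, "lora_path": lora_path})
    )

def registered_adapters() -> list[dict]:
    """Called by /v1/models so every replica reports the same adapter set."""
    return [json.loads(p.read_text()) for p in sorted(ADAPTER_DIR.glob("*.json"))]

def forget_adapter(lora_name: str) -> None:
    """Called by /v1/unload_lora_adapter; other replicas unload lazily."""
    (ADAPTER_DIR / f"{lora_name}.json").unlink(missing_ok=True)
```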

Assumptions:

  • The consuming application is responsible for providing access control and guarding against misuse, i.e. not allowing one user to register 10000 adapters at once
    • We can handle some basic security checks at load time like denying path traversal
  • We won't be providing per-adapter authorization

Open questions:

  1. Where should the adapter files be stored?

    One option is the existing configuration directory, e.g. "${VLLM_CONFIG_ROOT}/adapters". This seems appropriate for storing metadata files about loaded adapters, but may be inappropriate if we later expand this to cache the actual adapter artifacts as well. We could introduce a new VLLM_LORA_ADAPTER_CACHE environment variable for clarity about where this data is stored.

  2. How should we handle deleting adapters - i.e. /v1/unload_lora_adapter?

    Deleting the metadata files seems appropriate, but propagating the deletion across replicas seems tricky. Filesystem-watch APIs are both OS- and filesystem-dependent and have limitations on some network-backed storage (e.g. you can't use inotify with S3FS). We could check file existence on every inference API call for each adapter, but that would add overhead to the critical path. It may be sufficient to not attempt to unload an adapter from all replicas, instead allowing adapters to eventually be unloaded when:

    • They are evicted from the LRU cache
    • The /models endpoint is accessed and we check all the loaded adapters
    • The process ends

    This would mean an adapter could remain available for inference on some replicas after unload, however since we assume that the consuming application is providing access control, this may not be a problem.

  3. Should adapters be loaded at boot time?

    Currently, we validate all LoRA adapters given statically by the --lora-modules CLI arg by loading them at boot time. With dynamically loaded adapters, there may be an unbounded number of adapters to potentially load. The LRU caching mechanism in the engine ensures only the ones in use stay loaded, but there could be far more adapters "loaded" than can fit in the cache. We can assume that if adapter metadata is in the cache, then it has been successfully loaded before, so we don't need to re-load it to verify at boot time and should lazily load at inference time instead. But if the number of adapters is low (less than the cache size), we might still want to eagerly load at boot to trade a slightly longer boot time for lower latency on first inference.
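
A sketch of that boot-time decision from open question 3. Here `adapters` is the list of metadata records read from the shared directory, `max_loras` mirrors the engine's LRU cache size, and `load_adapter` stands in for whatever call the frontend already uses to load a single adapter; all of these names are assumptions for illustration:

```python
# Sketch of the boot-time policy: eager-load only when everything fits in the cache.
from typing import Callable

def warm_up_adapters(
    adapters: list[dict],
    max_loras: int,
    load_adapter: Callable[[dict], None],
) -> None:
    if len(adapters) <= max_loras:
        # Few enough to fit in the engine's cache: eager-load at boot to trade a
        # slightly longer startup for low latency on the first inference.
        for meta in adapters:
            load_adapter(meta)
    # Otherwise defer entirely to lazy loading at inference time; a metadata
    # file's existence already tells us the adapter loaded successfully once.
```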

Feedback Period.

Through 2/1/25

CC List.

@njhill @wangchen615 @tjohnson31415 @maxdebayser

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@joerunde joerunde added the RFC label Jan 17, 2025
@Akshat-Tripathi
Contributor

I like the first option; I'm not sure if it makes sense to have multiple instances of vLLM know about each other.

The routing service could periodically poll the replicas at /v1/models and forward incoming requests accordingly.

I think the solution to this would be helpful in other scenarios too, such as prefix aware routing, where we could route requests to vLLM instances that already have their prefixes cached.

@joerunde
Collaborator Author

I like the first option; I'm not sure if it makes sense to have multiple instances of vLLM know about each other.

I agree! Though, also having been into the internals of kserve/model-mesh, I know that there's a fair bit of complex logic and state to handle when it comes to balancing models across a deployment and routing appropriately. It's a problem I want to tackle, but not all on my own 😅

I think the solution to this would be helpful in other scenarios too, such as prefix aware routing, where we could route requests to vLLM instances that already have their prefixes cached.

Exactly, though as pointed out in the RFC, I think all of these routing optimizations would need to be handled together: you can't have multiple routers disagreeing about where to place a request.

@jeejeelee
Collaborator

cc @Jeffwan

@Jeffwan
Contributor

Jeffwan commented Jan 20, 2025

Hi @joerunde @Akshat-Tripathi, great proposal. Did you get a chance to take a look at my earlier proposal? I shared it with the Kubernetes community folks earlier: https://docs.google.com/document/d/125RfImuvCds2UWrWj4mj7mHWpQOYtaR0nkXxWWZaxDs/edit?tab=t.0#heading=h.fxzzb3b7imjf

We already built one that is running internally, and if there are common needs, I can publish our solution. Do you want to build this into kserve, or do you just want to use this capability?

/cc @jeejeelee

@maxdebayser
Contributor

I think option 2 solves all these problems in the cleanest way. Assuming that the instances poll etcd on boot and periodically after that, any instance that gets /v1/load_lora_adapter or /v1/unload_lora_adapter can update etcd and the other instances will be eventually consistent.

@joerunde
Collaborator Author

@Jeffwan Thanks for the heads up! I haven't seen the proposal, I just sent you an access request

We already built one that is running internally, and if there are common needs, I can publish our solution. Do you want to build this into kserve, or do you just want to use this capability?

I only put kserve in as an example; I don't think we necessarily have to land anything there. If we want to work together on it, putting it in the vllm org might be the way to go.

@joerunde
Collaborator Author

I think option 2 solves all these problems in the cleanest way. Assuming that the instances poll etcd on boot and periodically after that, any instance that gets /v1/load_lora_adapter or /v1/unload_lora_adapter can update etcd and the other instances will be eventually consistent.

@maxdebayser yeah I think it's a nice middle ground, and maybe the watch APIs in etcd can provide more immediate consistency across the cluster as well. What do you think about implementing that, even if an open source routing solution is available? I think I'd prefer to focus on just one solution to reduce complexity, but I could also see the value in having multiple options, say if a user also built their own router and doesn't want to use the open source one.

@kfswain

kfswain commented Jan 22, 2025

Hey all! I'm from https://github.com/kubernetes-sigs/gateway-api-inference-extension. We try to tackle multi-LoRA serving as well, we use vLLM as our current de facto model server, and we'd love to work with y'all on this!

I know that there's a fair bit of complex logic and state to handle when it comes to balancing models across a deployment and routing appropriately. It's a problem I want to tackle, but not all on my own 😅

@joerunde Agreed! We would love to collaborate and work with ya, our meetings are every Th 10AM PDT if you have time to join us!

I'm not sure if it makes sense to have multiple instances of vLLM know about each other.

Yeah, agreed, our thinking was that separation of concerns dictates that model servers should be focused on serving and let a load balancer worry about global state.

I think the solution to this would be helpful in other scenarios too, such as prefix aware routing, where we could route requests to vLLM instances that already have their prefixes cached.

Def agreed that prefix-aware routing is powerful, something we are looking at also.

@kfswain

kfswain commented Jan 22, 2025

Deleting the metadata files seems appropriate, but propagating the deletion across replicas seems tricky.

Are we thinking this is something vLLM would manage?

I liked @Jeffwan's paper on the topic; managing replicas seems well within K8s' skill set.

@smarterclayton

smarterclayton commented Jan 22, 2025

Operationally, it might be simpler to go one step further and enable true dynamic loading on demand, with a GC mechanism to purge least recently used / unused adapters. I.e. a static definition of a catalog of adapter sources (and a loading mechanism), with new requests queued until the adapter is loaded. GC would be balanced against KV cache usage, subject to operational thresholds ("don't load more than N% of KV cache with adapters").

That "option 4" would leave the existing load/unload available for more sophisticated behavior / preloading, but let a broader range of environments use dynamic lora without having to build a controller to manage LoRA load state. Perhaps this has already been considered elsewhere (I looked, but couldn't find a concrete discussion)?

@maxdebayser
Contributor

@smarterclayton , in option 4, do you mean loading any lora adapter that is specified in the completion request if it's not already loaded? If my understanding is correct, how would we restrict the permissible set of LoRA adapters? In addition, the router would also need to keep track of which pods have handled requests for adapter X recently to take advantage of pre-loaded adapters.

@Jeffwan , @joerunde , so Jeff's proposal would essentially be Option 2, where the external state is tracked by the CR controller, but with an inverted control flow because the controller would push changes to the pods instead of the pods polling external state, right?

@Jeffwan
Contributor

Jeffwan commented Jan 23, 2025

@maxdebayser Yeah, the design principle I feel here is not to introduce traffic flow from the pod (vLLM) to the control plane. It's fine to send requests to or pull status from the engine.

I think this issue talks about a few things together: routing, LoRA orchestration, and artifact management. These need to be decoupled first. (To me, option 1 talks about a routing or service discovery issue, while option 3 focuses more on artifact management.)

The LoRA source-of-truth information is stored in and can be exposed from the vLLM engine. Any external state needs to be synced with the engine anyway (for example, on engine crash, preemption, etc.). Any solution that can achieve such a mechanism would be solid.

  1. Service discovery problem: I think https://github.com/kubernetes-sigs/gateway-api-inference-extension exposes some APIs; we internally have another K8s-service-native solution leveraging fine-grained endpoint slice control.

  2. Orchestration: Kubernetes-native design prefers a controller; it's easy to track the pods associated with a LoRA, and any pod event (like a crash) can trigger a reconcile. Relying on another metadata service like etcd or Redis would work as well, I think.

  3. Artifact management: a shared file system definitely works. At this moment, most LoRA adapters (low rank) are not that large, so local disk or cloud disk should be large enough to manage them separately as well. We do have a cold-start manager to give artifact hints, but this is mainly for the base model, not for LoRA yet.

@joerunde
Collaborator Author

@joerunde Agreed! We would love to collaborate and work with ya, our meetings are every Th 10AM PDT if you have time to join us!

@kfswain Thanks! I missed today's but I can join next week!

@smarterclayton I think that most of "Option 4" is implemented in the vllm engine already: it will load LoRA adapters on demand, using a fixed-size LRU cache whose VRAM was already budgeted for when sizing the KV cache. The only remaining problem is that the vllm frontend won't know what to do when it receives a request where the model name is a lora adapter.

@Jeffwan I love the proposal, and agree with all the points about how vLLM should not be handling cluster resources, nor should it be making requests to the control plane.

I think this issue talk about few things together. routing, lora orchestration, artifact management. These needs to be decoupled first. (To me, option 1 talks about some routing or service discovery issue, option 3 focus more on artifact mangement)

To be clear, though, I was intending option 3 to be about discovery, i.e. using the filesystem to track which LoRA adapters have been added and should be available for any pod to load on request. (Though yes, the fs would probably also hold the LoRA adapters themselves.) I think option 3 is objectively worse than what your paper proposes, since it has no way to balance loaded adapters across pods or reconcile state when it changes.

@smarterclayton

The only remaining problem is that the vllm frontend won't know what to do when it receives a request where the model name is a lora adapter.

Yes, that's in my mind a gap that can and should be resolved in vLLM. Model artifacts are overwhelmingly stored in object storage / shared file systems with write-once semantics and unique naming - finding patterns that avoid the need for data plane components to coordinate with the control plane on restart or during long outages is a best practice that we should encourage. Since model servers will have the highest replication factor of all components in the system, and the majority of model consumers will have a proxy infrastructure in front of the model servers, it's worth removing the need for vLLM to manage a model name -> adapter name map and all of the operational complexity Jiaxin mentioned (a separate etcd, Redis, a controller querying Kubernetes).

@joerunde
Collaborator Author

joerunde commented Feb 5, 2025

finding patterns that avoid the need for data plane components to coordinate with the control plane on restart or during long outages is a best practice that we should encourage.

💯
I think that the 80% use case is covered by a simple solution that doesn't involve the control plane or any routing logic (i.e. options 3/4). The 20% that I think isn't covered are the use cases where the number of adapters pushes the limits of the filesystem, and where security requirements break the assumption that the file system is shared.

A couple concrete examples:

  1. Back when prompt tuning was all the rage, we had a product with 20k+ prompt-tune adapters for one base model in an object storage bucket. The pods mounted this using S3FS, and the results from Path.exists("/path/to/adapter") were flaky: it would sometimes report that an adapter did not exist when it did in fact exist in the bucket.
  2. For some products, we have requirements where the customer's data and artifacts need to remain on their own storage. In this case, when the artifacts need to be accessed by multi-tenant systems, a short-lived token can be generated to download them from the customer's storage on demand, but the data cannot be copied into shared buckets. We've generally had to work around this problem by requiring that the model servers be single-tenanted to support this use case, but it could be solved by some control plane integration that can handle authn/z and pass vLLM the proper credentials to download a customer's adapter.

It seems to me that supporting those use cases requires integration outside of vLLM, but to be fair these only really matter for enterprise products. I'm still just debating whether we should support the 80% use case with a simple solution with these known limitations. Maybe you're right and I'm just overthinking it 🤷
