Run a custom model with Petals
Starting with Petals 1.2.0, you don't have to convert a model to a special Petals-compatible format and can serve it directly from a Hugging Face Hub repository (e.g., you can host smaller versions of BLOOM and LLaMA off the shelf).
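For example, a small off-the-shelf checkpoint can be used directly by the Petals client, provided that at least one Petals server (public or one you run yourself) is hosting its blocks. A minimal sketch; the repo name below is just an illustration:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Illustrative small BLOOM checkpoint served straight from its Hugging Face Hub repo;
# it must be hosted by at least one Petals server (e.g., one you start yourself).
MODEL_NAME = "bigscience/bloom-560m"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]
print(tokenizer.decode(model.generate(inputs, max_new_tokens=5)[0]))
```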
Still, Petals supports only a predefined set of model architectures defined in the `petals.models` package. If you'd like to support a new architecture, you need to copy the `src/petals/models/bloom` or `src/petals/models/llama` directory and update all files to work with your new model. We recommend doing that in the following order:
1. Prerequisites:
    - Ensure that the model weights are available on the Hugging Face Hub (if necessary, you can use a private repo and the `use_auth_token` argument in both the Petals client and server).
    - Ensure that you have a small version of the model, so you can compare the Petals outputs to the outputs of the model running locally on your GPU (a sketch of loading such a reference model follows this step).
    - If you're stuck, don't hesitate to reach out to us on Discord!
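    A minimal sketch of loading that small reference model locally with 🤗 Transformers, so you have a baseline to compare against later (the repo name is a placeholder for your own checkpoint):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "your-org/your-small-model"  # placeholder: the small version of your model
    # Pass use_auth_token=True here and on the server if the repo is private.

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    reference_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)

    inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]
    with torch.no_grad():
        reference_logits = reference_model(inputs).logits  # baseline for later comparisons
    ```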
2. Edit `config.py` and `__init__.py`:
    - Make sure that the config is correctly loaded from a Hugging Face Hub repo when using `AutoDistributedConfig.from_pretrained(...)`; a quick check is sketched after this step.
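    A quick way to check this is to load the config through the auto class and confirm that it resolves to the class you registered (the repo name is a placeholder):

    ```python
    from petals import AutoDistributedConfig

    # Placeholder repo name: point this at the small version of your model.
    config = AutoDistributedConfig.from_pretrained("your-org/your-small-model")

    # This should print the Distributed...Config class you added in config.py
    # and a sensible number of layers for your checkpoint.
    print(type(config).__name__, config.num_hidden_layers)
    ```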
3. Edit `block.py`:
    - Make sure that you can run a Petals server with your model's blocks.
    - Make sure the server returns correct results for forward and backward passes (the outputs are close to the ones of a locally hosted block); a comparison sketch follows this step.
    - Pay attention to the dimension order in attention caches (both keys and values), since many implementations use different dimension orders (e.g., see the dimension reordering code in llama/block.py).
    - Run the server with `--throughput eval` to test the inference code and check that you have no shape errors.
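    One way to sanity-check the block implementation is to run a single transformer block from the reference 🤗 Transformers model and your wrapped Petals block side by side on the same input. A rough sketch with hypothetical names (`WrappedMyModelBlock`, the `model.layers` attribute path, and the repo name all depend on your architecture):

    ```python
    import torch
    from transformers import AutoModelForCausalLM

    # Hypothetical import: the block wrapper you added in src/petals/models/<your_model>/block.py
    from petals.models.mymodel.block import WrappedMyModelBlock

    reference = AutoModelForCausalLM.from_pretrained("your-org/your-small-model", torch_dtype=torch.float32)
    ref_block = reference.model.layers[0]  # the attribute path depends on the architecture

    petals_block = WrappedMyModelBlock(reference.config)
    petals_block.load_state_dict(ref_block.state_dict())

    hidden_states = torch.randn(1, 8, reference.config.hidden_size)
    with torch.no_grad():
        # Depending on the architecture, you may also need to pass attention_mask / position_ids here.
        ref_out = ref_block(hidden_states)[0]
        new_out = petals_block(hidden_states)[0]

    assert torch.allclose(ref_out, new_out, atol=1e-5), "Petals block output diverges from the reference"
    ```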
4. Edit `model.py`:
    - Create distributed model wrappers using code from the 🤗 Transformers implementation.
    - Check that you can run a Petals client and get correct results for inference, forward, and backward passes with all model types (the outputs are close to those of a locally hosted model); a client sketch follows this step.
    - Check that `AutoDistributedModel.from_pretrained(...)`, `AutoDistributedModelForCausalLM.from_pretrained(...)`, and similar functions correctly load the model from the Hugging Face Hub.
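    A minimal end-to-end check of the client, comparing it against a locally hosted 🤗 Transformers model (the repo name is a placeholder, and the swarm, e.g. a local server you started, must be hosting this model's blocks):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    MODEL_NAME = "your-org/your-small-model"  # placeholder repo name

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]

    # Locally hosted reference model
    local_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
    with torch.no_grad():
        local_logits = local_model(inputs).logits

    # Petals client: requires at least one server hosting this model
    petals_model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)
    with torch.no_grad():
        petals_logits = petals_model(inputs).logits

    print("max abs diff:", (local_logits - petals_logits).abs().max().item())
    print(tokenizer.decode(petals_model.generate(inputs, max_new_tokens=5)[0]))
    ```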
5. (optional) Share your code by making a pull request to Petals:
    - We'll review your pull request and may add it to the repo if the model is worth maintaining by our team.
    - If appropriate, we may add it to the health.petals.ml and chat.petals.dev services.