Run a custom model with Petals
Starting with Petals 1.2.0, you don't have to convert a model to a special Petals-compatible format and can serve it directly from a Hugging Face Hub repository (e.g., you can host smaller versions of BLOOM and LLaMA off the shelf).
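For example, a small off-the-shelf checkpoint can be used directly by the Petals client, provided that at least one Petals server (public or one you run yourself) is hosting its blocks. A minimal sketch; the repo name below is just an illustration:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Illustrative small BLOOM checkpoint served straight from its Hugging Face Hub repo;
# it must be hosted by at least one Petals server (e.g., one you start yourself).
MODEL_NAME = "bigscience/bloom-560m"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]
print(tokenizer.decode(model.generate(inputs, max_new_tokens=5)[0]))
```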
Still, Petals supports only a predefined set of model architectures defined in the `petals.models` package. If you'd like to support a new architecture, you need to copy the `src/petals/models/bloom` or `src/petals/models/llama` directory and update all files to work with your new model. We recommend doing that in the following order:
1. Prerequisites:
    - Ensure that the model weights are available on the Hugging Face Hub (if necessary, you can use a private repo and the `use_auth_token` argument in both the Petals client and server).
    - Ensure that you have a small version of the model, so you can compare the Petals outputs to the outputs of the model running locally on your GPU (a sketch of loading such a reference model follows this step).
    - If you're stuck, don't hesitate to reach out to us on Discord!
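    A minimal sketch of loading that small reference model locally with 🤗 Transformers, so you have a baseline to compare against later (the repo name is a placeholder for your own checkpoint):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "your-org/your-small-model"  # placeholder: the small version of your model
    # Pass use_auth_token=True here and on the server if the repo is private.

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    reference_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)

    inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]
    with torch.no_grad():
        reference_logits = reference_model(inputs).logits  # baseline for later comparisons
    ```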
2. Edit `config.py` and `__init__.py`:
    - Make sure that the config is correctly loaded from a Hugging Face Hub repo when using `AutoDistributedConfig.from_pretrained(...)`; a quick check is sketched after this step.
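    A quick way to check this is to load the config through the auto class and confirm that it resolves to the class you registered (the repo name is a placeholder):

    ```python
    from petals import AutoDistributedConfig

    # Placeholder repo name: point this at the small version of your model.
    config = AutoDistributedConfig.from_pretrained("your-org/your-small-model")

    # This should print the Distributed...Config class you added in config.py
    # and a sensible number of layers for your checkpoint.
    print(type(config).__name__, config.num_hidden_layers)
    ```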
3. Edit `block.py`:
    - Make sure that you can run a Petals server with your model's blocks.
    - Make sure the server returns correct results for forward and backward passes (the outputs are close to the ones of a locally hosted block); a comparison sketch follows this step.
    - Pay attention to the dimension order in attention caches (both keys and values), since many implementations use different dimension orders (e.g., see the dimension reordering code in llama/block.py).
    - Run the server with `--throughput eval` to test the inference code and check that you have no shape errors.
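    One way to sanity-check the block implementation is to run a single transformer block from the reference 🤗 Transformers model and your wrapped Petals block side by side on the same input. A rough sketch with hypothetical names (`WrappedMyModelBlock`, the `model.layers` attribute path, and the repo name all depend on your architecture):

    ```python
    import torch
    from transformers import AutoModelForCausalLM

    # Hypothetical import: the block wrapper you added in src/petals/models/<your_model>/block.py
    from petals.models.mymodel.block import WrappedMyModelBlock

    reference = AutoModelForCausalLM.from_pretrained("your-org/your-small-model", torch_dtype=torch.float32)
    ref_block = reference.model.layers[0]  # the attribute path depends on the architecture

    petals_block = WrappedMyModelBlock(reference.config)
    petals_block.load_state_dict(ref_block.state_dict())

    hidden_states = torch.randn(1, 8, reference.config.hidden_size)
    with torch.no_grad():
        # Depending on the architecture, you may also need to pass attention_mask / position_ids here.
        ref_out = ref_block(hidden_states)[0]
        new_out = petals_block(hidden_states)[0]

    assert torch.allclose(ref_out, new_out, atol=1e-5), "Petals block output diverges from the reference"
    ```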
4. Edit `model.py`:
    - Create distributed model wrappers using code from the 🤗 Transformers implementation.
    - Check that you can run a Petals client and get correct results for inference, forward, and backward passes with all model types (the outputs are close to those of a locally hosted model); a client sketch follows this step.
    - Check that `AutoDistributedModel.from_pretrained(...)`, `AutoDistributedModelForCausalLM.from_pretrained(...)`, and similar functions correctly load the model from the Hugging Face Hub.
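    A minimal end-to-end check of the client, comparing it against a locally hosted 🤗 Transformers model (the repo name is a placeholder, and the swarm, e.g. a local server you started, must be hosting this model's blocks):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    MODEL_NAME = "your-org/your-small-model"  # placeholder repo name

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]

    # Locally hosted reference model
    local_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
    with torch.no_grad():
        local_logits = local_model(inputs).logits

    # Petals client: requires at least one server hosting this model
    petals_model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)
    with torch.no_grad():
        petals_logits = petals_model(inputs).logits

    print("max abs diff:", (local_logits - petals_logits).abs().max().item())
    print(tokenizer.decode(petals_model.generate(inputs, max_new_tokens=5)[0]))
    ```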
5. (optional) Share your code by making a pull request to Petals:
    - We'll review your pull request and may add it to the repo if the model is worth maintaining by our team.
    - If appropriate, we may add it to the health.petals.ml and chat.petals.dev services.