Run Petals server on Windows
You can use WSL or Docker to run Petals on Windows. In this guide, we will show how to set up Petals on WSL (Windows Subsystem for Linux).
- In a Windows admin console, install WSL:

  ```
  wsl --install
  ```
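  Petals needs WSL 2, since GPU access in WSL requires it. To check which WSL version and distribution you got (standard `wsl.exe` flags on recent Windows builds):

  ```
  wsl --status
  wsl -l -v
  ```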
- Open WSL, check that GPUs are available:

  ```
  nvidia-smi
  ```
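  If the full table is noisy, you can query just the GPU name and memory, which roughly determines how many model blocks your server can host:

  ```
  nvidia-smi --query-gpu=name,memory.total --format=csv
  ```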
- In WSL, install basic Python stuff:

  ```
  sudo apt update
  sudo apt install python3-pip python-is-python3
  ```
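  A quick sanity check that `python` now points at Python 3 and pip works:

  ```
  python --version
  python -m pip --version
  ```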
- Then, install Petals:

  ```
  python -m pip install git+https://github.com/bigscience-workshop/petals
  ```
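  Before starting a server, you can check that the package imports and that PyTorch (installed as a Petals dependency) sees your GPU; this is just a local sanity check, not an official Petals command:

  ```
  python -c "import petals, torch; print(torch.cuda.is_available())"
  ```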
- Run the Petals server:

  ```
  python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b
  ```
  This will host a part of LLaMA-65B with optional Guanaco adapters on your machine. You can also host `meta-llama/Llama-2-70b-hf`, `meta-llama/Llama-2-70b-chat-hf`, `bigscience/bloom`, `bigscience/bloomz`, and other compatible models from the 🤗 Model Hub, or add support for new model architectures.

  🦙 Want to host LLaMA 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then add the `--token YOUR_TOKEN` argument to the commands above.

  If you want to share multiple GPUs, you should run a separate Petals server for each. Open a separate WSL console for each GPU, then run this in the first console:

  ```
  CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b
  ```

  Do the same in the other consoles, replacing `CUDA_VISIBLE_DEVICES=0` with `CUDA_VISIBLE_DEVICES=1`, `CUDA_VISIBLE_DEVICES=2`, etc.
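  If you would rather drive all GPUs from a single console, here is a minimal sketch using a shell loop and background processes (the log file names are just an example):

  ```
  # Launch one Petals server per GPU in the background, logging each separately.
  for GPU in 0 1; do
    CUDA_VISIBLE_DEVICES=$GPU nohup python -m petals.cli.run_server enoch/llama-65b-hf \
      --adapters timdettmers/guanaco-65b > "server_gpu$GPU.log" 2>&1 &
  done
  ```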
- Once all blocks are loaded, check that your server is listed at https://health.petals.dev/

Petals uses NAT traversal via relays by default, but you can make your server directly reachable if your computer has a public IP address. We recommend doing this when possible, since it allows other peers to connect to your server significantly faster. To set this up:
- In WSL, find out the IP address of your WSL container (`172.X.X.X`):

  ```
  sudo apt install net-tools
  ifconfig
  ```
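  Alternatively, you can usually get the address without installing `net-tools`, since default Ubuntu WSL images already ship these tools:

  ```
  # Either command prints the container's IPv4 address (eth0 is the default WSL 2 interface).
  hostname -I
  ip -4 addr show eth0
  ```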
-
Allow traffic to be routed into the WSL container (replace
172.X.X.X
with your actual IP):netsh interface portproxy add v4tov4 listenport=31330 listenaddress=0.0.0.0 connectport=31330 connectaddress=172.X.X.X
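  To verify the rule took effect, or to remove it later, `netsh` also provides:

  ```
  netsh interface portproxy show v4tov4
  netsh interface portproxy delete v4tov4 listenport=31330 listenaddress=0.0.0.0
  ```

  Note that the WSL container's IP address can change after a reboot, so you may need to recreate this rule with the new address.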
- Set up your firewall (e.g., Windows Defender) to allow incoming traffic from the outside world on port 31330/tcp.
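  If you prefer the command line to the Windows Defender UI, an equivalent inbound rule can be added from an admin console (the rule name is arbitrary):

  ```
  netsh advfirewall firewall add rule name="Petals 31330" dir=in action=allow protocol=TCP localport=31330
  ```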
- If you have a router, set it up to forward connections from the outside world (port 31330/tcp) to your computer (port 31330/tcp).
- Run the Petals server with the `--port 31330` parameter:

  ```
  python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b --port 31330
  ```
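  Once the server is up, you can confirm inside WSL that it is listening on the expected port:

  ```
  ss -tlpn | grep 31330
  ```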
- Ensure that the server prints `This server is available directly` (not `via relays`) after startup.
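  As an extra end-to-end check, from a machine outside your network you can test that the port is reachable (replace YOUR_PUBLIC_IP with your address; assumes netcat is installed there):

  ```
  nc -vz YOUR_PUBLIC_IP 31330
  ```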
- I get this error on WSL: `hivemind.dht.protocol.ValidationError: local time must be within 3 seconds of others`. What should I do?

  Petals needs the clocks on all nodes to be synchronized. Please set the date using an NTP server:

  ```
  ntpdate pool.ntp.org
  ```
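  On a stock Ubuntu WSL image, `ntpdate` may not be preinstalled, and setting the clock requires root, so the full sequence is typically:

  ```
  sudo apt install ntpdate
  sudo ntpdate pool.ntp.org
  ```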
- I get this error: `torch.cuda.OutOfMemoryError: CUDA out of memory`. What should I do?

  If you use an Anaconda env, run this before starting the server:

  ```
  export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
  ```

  If you use Docker, add this argument after `--rm` in the Docker command:

  ```
  -e "PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128"
  ```
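  Equivalently, outside Docker you can scope the setting to the server process alone instead of exporting it:

  ```
  PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python -m petals.cli.run_server enoch/llama-65b-hf
  ```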