Run Petals server on Windows
You can use WSL or Docker to run Petals on Windows. In this guide, we will show how to set up Petals on WSL (Windows Subsystem for Linux).
-
This tutorial works on Windows 10-11 and NVIDIA GPUs with driver version >= 495 (this requirement is usually met for fresh installations).
If you have an AMD GPU, follow the steps below until you run into errors, then upgrade to the patched version of Petals mentioned in the AMD GPU tutorial.
-
In a Windows admin console, install WSL 2:
wsl --install
If you previously had WSL 1, please upgrade as explained here.
-
Open WSL, check that GPUs are available:
nvidia-smi
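If you want to compare the driver version against the minimum of 495 mentioned above, a small helper along these lines may be useful. This is only a sketch: the function name `driver_is_supported` is hypothetical (it is not part of Petals), and the usage comment assumes your `nvidia-smi` supports the `--query-gpu=driver_version` query.

```sh
# Hypothetical helper (not part of Petals): succeed if the major part of a
# driver version string such as "535.104.05" is at least 495.
driver_is_supported() {
  major="${1%%.*}"                 # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*) return 1 ;;       # reject empty or non-numeric input
  esac
  [ "$major" -ge 495 ]
}

# Usage inside WSL (assumes nvidia-smi is on PATH):
#   version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   if driver_is_supported "$version"; then
#     echo "Driver $version is recent enough"
#   else
#     echo "Driver $version is older than 495"
#   fi
```

If `nvidia-smi` itself is not found or lists no GPUs, the version check is moot and the driver setup needs attention first.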
-
In WSL, install basic Python stuff:
sudo apt update
sudo apt install python3-pip python-is-python3
-
Then, install Petals:
python -m pip install git+https://github.com/bigscience-workshop/petals
-
Run the Petals server:
python -m petals.cli.run_server stabilityai/StableBeluga2 --torch_dtype float16
This will host a part of Stable Beluga 2 on your machine. You can also host meta-llama/Llama-2-70b-hf, meta-llama/Llama-2-70b-chat-hf, repos with LLaMA-65B, bigscience/bloom, bigscience/bloomz, and other compatible models from 🤗 Model Hub, or add support for new model architectures.
❓ Got an error? Check out the "Troubleshooting" page. Most errors are covered there and are easy to fix, including:
hivemind.dht.protocol.ValidationError: local time must be within 3 seconds of others
Killed
torch.cuda.OutOfMemoryError: CUDA out of memory
- If you have an error about auth_token, see the "Want to host LLaMA 2?" section below.
- If your error is not covered there, let us know in Discord and we'll help!
🦙 Want to host LLaMA 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then use this command:
python -m petals.cli.run_server meta-llama/Llama-2-70b-chat-hf --token YOUR_TOKEN_HERE
💪 Want to share multiple GPUs? In this case, you'd need to run a separate Petals server for each GPU. Open a separate WSL console for each GPU, then run this in the first console:
CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server stabilityai/StableBeluga2 --torch_dtype float16
Do the same for each console, replacing CUDA_VISIBLE_DEVICES=0 with CUDA_VISIBLE_DEVICES=1, CUDA_VISIBLE_DEVICES=2, etc.
-
Once all blocks are loaded, check that your server is available on https://health.petals.dev/
Petals will use NAT traversal via relays by default, but you can make it available directly if your computer has a public IP address. We recommend doing it when possible, since this allows other peers to connect to your server significantly faster.
-
In WSL, find out the IP address of your WSL container (172.X.X.X):
sudo apt install net-tools
ifconfig
-
In a Windows admin console, allow traffic to be routed into the WSL container (replace 172.X.X.X with your actual IP):
netsh interface portproxy add v4tov4 listenport=31330 listenaddress=0.0.0.0 connectport=31330 connectaddress=172.X.X.X
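The two steps above can be combined with a short helper that reads the address inside WSL and prints the matching `netsh` command. This is a sketch under assumptions: the function names `wsl_ip` and `portproxy_cmd` are hypothetical, and it assumes that the first address reported by `hostname -I` is the one you would otherwise read from `ifconfig`. The printed command still has to be run in a Windows admin console, not inside WSL.

```sh
# Hypothetical helpers (not part of Petals).

# Print the first address reported by `hostname -I`
# (assumed here to be the WSL container address).
wsl_ip() {
  hostname -I | awk '{print $1}'
}

# Print the netsh command from this guide for the given IP and port.
# The port defaults to 31330, the one used throughout this guide.
portproxy_cmd() {
  ip="$1"
  port="${2:-31330}"
  printf 'netsh interface portproxy add v4tov4 listenport=%s listenaddress=0.0.0.0 connectport=%s connectaddress=%s\n' \
    "$port" "$port" "$ip"
}

# Usage inside WSL:
#   portproxy_cmd "$(wsl_ip)" 31330
```

Double-check the printed address before using it if your WSL instance has more than one network interface.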
-
Set up your firewall (e.g., Windows Defender) to allow traffic from the outside world to the port 31330/tcp.
-
If you have a router, set it up to allow connections from the outside world (port 31330/tcp) to your computer (port 31330/tcp).
-
Run the Petals server with the parameter --port 31330:
python -m petals.cli.run_server stabilityai/StableBeluga2 --torch_dtype float16 --port 31330
-
Ensure that the server prints "This server is available directly" (not "via relays") after startup.
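If the startup output scrolls by too quickly, one option is to save it to a file and search it for the two phrases quoted above. The following is a minimal sketch: the function name `connection_mode` and the file name `server.log` are hypothetical, and the phrases are copied from this guide, so the exact wording may differ in your Petals version.

```sh
# Hypothetical helper (not part of Petals): classify saved server output as
# "direct", "relay", or "unknown", based on the phrases quoted in this guide.
connection_mode() {
  if grep -q "This server is available directly" "$1"; then
    echo direct
  elif grep -q "via relays" "$1"; then
    echo relay
  else
    echo unknown
  fi
}

# Usage (server.log is a hypothetical file name):
#   python -m petals.cli.run_server stabilityai/StableBeluga2 --torch_dtype float16 --port 31330 2>&1 | tee server.log
#   connection_mode server.log    # run this in a second WSL console
```

A result of "relay" suggests rechecking the port forwarding, firewall, and router steps above.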