Run Petals server on Windows

Alexander Borzunov edited this page Aug 24, 2023 · 27 revisions

You can use WSL or Docker to run Petals on Windows. In this guide, we will show how to set up Petals on WSL (Windows Subsystem for Linux).

Tutorial

  1. This tutorial works on Windows 10-11 and NVIDIA GPUs with driver version >= 495 (this requirement is usually met for fresh installations).

    If you have an AMD GPU, follow the steps below until you hit an error, then switch to the patched version of Petals described in the AMD GPU tutorial.

  2. In a Windows admin console, install WSL 2:

    wsl --install

    If you previously had WSL 1, please upgrade as explained here.

  3. Open WSL, check that GPUs are available:

    nvidia-smi
  4. In WSL, install the basic Python tools:

    sudo apt update
    sudo apt install python3-pip python-is-python3
  5. Then, install Petals:

    python -m pip install git+https://github.com/bigscience-workshop/petals
  6. Run the Petals server:

    python -m petals.cli.run_server petals-team/StableBeluga2

    This will host a part of Stable Beluga 2 on your machine. You can also host meta-llama/Llama-2-70b-hf, meta-llama/Llama-2-70b-chat-hf, repos with LLaMA-65B, bigscience/bloom, bigscience/bloomz, and other compatible models from 🤗 Model Hub, or add support for new model architectures.

    Got an error? Check out the "Troubleshooting" page. Most errors are covered there and are easy to fix, including:

    • hivemind.dht.protocol.ValidationError: local time must be within 3 seconds of others
    • Killed
    • torch.cuda.OutOfMemoryError: CUDA out of memory
    • If you have an error about auth_token, see the "Want to host LLaMA 2?" section below.
    • If your error is not covered there, let us know in Discord and we will help!

    🦙 Want to host LLaMA 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then use this command:

    python -m petals.cli.run_server meta-llama/Llama-2-70b-chat-hf --token YOUR_TOKEN_HERE

    💪 Want to share multiple GPUs? In this case, you need to run a separate Petals server for each GPU. Open a separate WSL console for each GPU, then run this in the first console:

    CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server petals-team/StableBeluga2

    Do the same for each console, replacing CUDA_VISIBLE_DEVICES=0 with CUDA_VISIBLE_DEVICES=1, CUDA_VISIBLE_DEVICES=2, etc.

  7. Once all blocks are loaded, check that your server is listed at https://health.petals.dev. If your server is listed as available through a "Relay", please read the section below.
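The driver requirement from step 1 and the multi-GPU setup from step 6 can be sketched as a small helper script. This is a minimal sketch and not part of Petals: the function names `driver_ok` and `petals_commands` are hypothetical, and the script only prints the per-GPU commands so that you can paste each one into its own WSL console. It assumes `nvidia-smi` reports the driver version in the usual `major.minor` form.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Petals) for the steps above.

# Succeeds if the driver version (e.g. "535.104.05") has major part >= 495.
driver_ok() {
    major="${1%%.*}"
    case "$major" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as unsupported
    esac
    [ "$major" -ge 495 ]
}

# Prints one server command per GPU index; it does not start anything.
# $1 = number of GPUs, $2 = model (defaults to the one used in this tutorial).
petals_commands() {
    gpu_count="$1"
    model="${2:-petals-team/StableBeluga2}"
    i=0
    while [ "$i" -lt "$gpu_count" ]; do
        echo "CUDA_VISIBLE_DEVICES=$i python -m petals.cli.run_server $model"
        i=$((i + 1))
    done
}

# Only query the GPU when nvidia-smi is actually present (i.e. inside WSL).
if command -v nvidia-smi >/dev/null 2>&1; then
    version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if driver_ok "$version"; then
        petals_commands "$(nvidia-smi --list-gpus | grep -c '^GPU')"
    else
        echo "NVIDIA driver $version is older than 495; update it first" >&2
    fi
fi
```

Printing the commands instead of launching them keeps the one-console-per-GPU workflow described in step 6, so each server's logs stay in a separate window.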

Making the server directly available

If you have a NAT or a firewall, Petals will use relays for NAT/firewall traversal by default, which negatively impacts performance. If your computer has a public IP address, we strongly recommend setting up port forwarding to make the server available directly. We explain how to do it below.

  1. In WSL, find out the IP address of your WSL container (172.X.X.X):

    sudo apt install net-tools
    ifconfig
  2. In a Windows admin console, allow traffic to be routed into the WSL container (replace 172.X.X.X with your actual IP):

    netsh interface portproxy add v4tov4 listenport=31330 listenaddress=0.0.0.0 connectport=31330 connectaddress=172.X.X.X
  3. Set up your firewall (e.g., Windows Defender) to allow traffic from the outside world to port 31330/tcp.

  4. If you have a router, set it up to allow connections from the outside world (port 31330/tcp) to your computer (port 31330/tcp).

  5. Run the Petals server with the parameter --port 31330:

    python -m petals.cli.run_server petals-team/StableBeluga2 --port 31330
  6. Ensure that the server prints "This server is available directly (not via relays)" after startup.
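Steps 1 and 2 above can be combined into a small helper that builds the `netsh` command from the WSL address. This is a sketch under assumptions: the function name `portproxy_cmd` is hypothetical, and using `hostname -I` to read the address is an alternative to the `ifconfig` approach from step 1 (it assumes the first address listed is the 172.X.X.X one). The script runs inside WSL and only prints the command, which you then paste into the Windows admin console.

```shell
#!/bin/sh
# Hypothetical helper (not part of Petals): print the netsh port-forwarding
# command for a given WSL IPv4 address and port (default 31330).
portproxy_cmd() {
    wsl_ip="$1"
    port="${2:-31330}"
    # Reject placeholders such as 172.X.X.X and anything else that is not
    # four dot-separated groups of digits.
    if ! printf '%s\n' "$wsl_ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "not an IPv4 address: $wsl_ip" >&2
        return 1
    fi
    echo "netsh interface portproxy add v4tov4 listenport=$port listenaddress=0.0.0.0 connectport=$port connectaddress=$wsl_ip"
}

# Inside WSL, take the first address reported by hostname -I (assumption:
# this is the WSL address). Skipped where hostname -I is unavailable.
if detected_ip="$(hostname -I 2>/dev/null | awk '{print $1}')" && [ -n "$detected_ip" ]; then
    portproxy_cmd "$detected_ip" 31330
fi
```

After running the printed command on Windows, `netsh interface portproxy show v4tov4` should list the new rule. The WSL address may change after a restart, in which case the rule needs to be added again with the new address.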