Quantization

Vladimir Mandic edited this page Jan 7, 2025 · 12 revisions

Quantization serves two purposes:

  • storage-optimization
    reducing the memory footprint of the model by lowering the precision of its parameters
  • compute-optimization
    speeding up inference by providing optimized kernels that execute natively in quantized precision

For storage-only quantization, the model is stored in lower precision but operations are still performed in the original precision. Quantized weights therefore have to be upcast to the original precision before each operation, which adds a performance overhead.
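The storage-only case can be sketched in plain Python, without any engine's real kernels: weights are kept as 8-bit integers plus one scale factor and have to be converted back to floats before every use. The symmetric absmax scheme and the names `quantize_int8` and `dequantize` are assumptions chosen for illustration; actual engines use more elaborate schemes.

```python
from array import array

def quantize_int8(weights):
    """Symmetric absmax quantization: floats -> int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale for all-zero input
    return array("b", (round(w / scale) for w in weights)), scale

def dequantize(quantized, scale):
    """Upcast int8 values back to floats; storage-only quantization pays this on every use."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

float_bytes = array("f", weights).itemsize * len(weights)  # 4 bytes per FP32 weight
int8_bytes = quantized.itemsize * len(quantized)           # 1 byte per int8 weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"fp32: {float_bytes} bytes, int8: {int8_bytes} bytes, max error: {max_error:.5f}")
```

The stored data shrinks to a quarter of its FP32 size, but `dequantize` has to run before the weights can be used, which is the overhead described above.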

Example

Why use quantization?
Compare total memory requirements and end-to-end performance for SD-3.5-Large model in BF16 precision without and with BitsAndBytes on-the-fly quantization!

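| Model       | Quantization                     | Performance | Memory   |
|-------------|----------------------------------|-------------|----------|
| SD3.5-Large | BnB = None                       | 0.18 it/s   | 48.10 GB |
| SD3.5-Large | BnB = Transformer                | 0.28 it/s   | 30.48 GB |
| SD3.5-Large | BnB = Transformer & Text-Encoder | 0.47 it/s   | 18.22 GB |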
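To put the figures above in relative terms, the short sketch below computes speedup and memory savings against the unquantized baseline. The numbers are copied from the table; the helper name `relative_to_baseline` is made up for this example.

```python
# (label, iterations per second, total memory in GB) as listed in the table above
results = [
    ("BnB = None", 0.18, 48.10),
    ("BnB = Transformer", 0.28, 30.48),
    ("BnB = Transformer & Text-Encoder", 0.47, 18.22),
]

def relative_to_baseline(results):
    """Compare every row after the first against the first (unquantized) row."""
    _, base_speed, base_mem = results[0]
    rows = []
    for label, speed, mem in results[1:]:
        speedup = speed / base_speed      # >1 means faster than baseline
        mem_saved = 1 - mem / base_mem    # fraction of memory no longer needed
        rows.append((label, speedup, mem_saved))
    return rows

for label, speedup, saved in relative_to_baseline(results):
    print(f"{label}: {speedup:.2f}x faster, {saved:.0%} less memory")
```

Quantizing only the transformer gives roughly 1.6x the speed with about 37% less memory; quantizing the text encoder as well gives roughly 2.6x with about 62% less.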

Note

Offloading reduces runtime requirements by moving parts of the model between GPU and CPU, but the total memory requirement of the model stays the same
The performance gains shown above come simply from less shuffling of weights between devices

Using Quantized Models

Quantization can be done in multiple ways:

  • on-the-fly during model load
    available by selecting settings -> quantization for some quantization types
    sometimes referred to as pre mode
  • by quantizing immediately after model load
    available by selecting settings -> quantization for all quantization types
    sometimes referred to as post mode
  • by simply loading a pre-quantized model
    quantization type will be auto-determined at the start of the load
  • during model training itself
    out-of-scope for this document

Quantization Engines

Tip

If you're on Windows with a compatible GPU, you may try WSL2 for broader feature compatibility
See WSL Wiki for more details

SD.Next supports multiple quantization engines, each with multiple quantization schemes:

  • TorchAO: 4 int-based and 3 float-based quantization schemes
  • BitsAndBytes: 3 float-based quantization schemes
  • Optimum.Quanto: 3 int-based and 2 float-based quantization schemes
  • GGUF with pre-quantized weights

Important

Not all quantization engines are available on all platforms, see notes below for details!
Using any quantization engine for the first time may result in failure as required libraries are downloaded and installed
Restart SD.Next and try again if you encounter any issues

BitsAndBytes

Models pre-quantized with bitsandbytes typically have names like *nf4.safetensors or *fp8.safetensors

Note

BnB allows usage of balanced offload as well as fast on-the-fly quantization during load, so it is considered the most versatile choice, but it is not available on all platforms.

Limitations:

  • default bitsandbytes package only supports nVidia GPUs
    some quantization types require newer GPU with supported CUDA ops: e.g. nVidia Turing GPUs or newer
  • bitsandbytes relies on triton packages which are not available on Windows unless manually compiled/installed
    without them, performance is significantly reduced
  • for nVidia: automatically installed as needed
  • for AMD/ROCm: link
  • for Intel/IPEX: link

Optimum-Quanto

Models pre-quantized with optimum.quanto typically have names like *qint.safetensors.
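The filename conventions mentioned on this page can be turned into a rough guess at which engine produced a file. This is only an illustrative sketch: `guess_engine` and `NAME_HINTS` are hypothetical names, the `.gguf` extension is an assumption based on common practice, and this is not how SD.Next itself determines the quantization type.

```python
from fnmatch import fnmatchcase

# Filename conventions mentioned on this page; real files may not follow them.
NAME_HINTS = [
    ("*nf4.safetensors", "bitsandbytes"),
    ("*fp8.safetensors", "bitsandbytes"),
    ("*qint.safetensors", "optimum.quanto"),
    ("*.gguf", "gguf"),  # assumption: the usual GGUF file extension
]

def guess_engine(filename):
    """Hypothetical helper: guess the quantization engine from a filename, or None."""
    name = filename.lower()
    for pattern, engine in NAME_HINTS:
        if fnmatchcase(name, pattern):
            return engine
    return None

for example in ("flux-dev-nf4.safetensors", "model-qint.safetensors", "model.safetensors"):
    print(example, "->", guess_engine(example))
```

A name-based guess is only a hint; a file that does not follow these conventions returns `None` here even if it is quantized.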

Note

OQ is highly efficient with its qint8/qint4 quantization types, but it cannot be used with broad offloading methods

Limitations:

  • requires torch>=2.4.0
    if you're running older torch, you can try upgrading it or running sdnext with --reinstall flag
  • not compatible with balanced offload
  • not supported with Zluda since Zluda does not support torch 2.4

TorchAO

TorchAO is available for quantization on-the-fly during model load as well as post-load quantization

Limitations:

  • Requires torch==2.5.0
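The torch version requirements stated for Optimum-Quanto (`torch>=2.4.0`) and TorchAO (`torch==2.5.0`) can be checked up front. The sketch below uses only the standard library; `parse_version` and `supported_engines` are hypothetical helpers, and the requirements are taken from this page rather than queried from the packages themselves.

```python
import re

def parse_version(version):
    """Extract (major, minor, patch) from strings such as '2.5.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part or 0) for part in match.groups())

def supported_engines(torch_version):
    """Return the engines whose torch requirement, as stated on this page, is met."""
    version = parse_version(torch_version)
    engines = []
    if version >= (2, 4, 0):
        engines.append("optimum-quanto")  # requires torch>=2.4.0
    if version == (2, 5, 0):
        engines.append("torchao")         # requires torch==2.5.0
    return engines

print(supported_engines("2.5.0+cu124"))
print(supported_engines("2.3.1"))
```

In a real environment the argument would typically be `torch.__version__`; local build suffixes such as `+cu124` are ignored by the parser.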

GGUF

GGUF is a binary file format used to package pre-quantized models.

GGUF was originally designed by the llama.cpp project and is intended to be used with its GGML execution runtime.
However, without GGML, GGUF provides storage-only quantization: every operation needs to be upcast to the current device precision (typically FP16 or BF16) before execution, which comes with a significant performance overhead.
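Because GGUF is a binary container, a file can be identified by its header rather than by its name. The sketch below reads the first eight bytes, assuming the common little-endian layout in which the 4-byte magic `GGUF` is followed by a 32-bit format version; `read_gguf_version` is a hypothetical helper, not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_version(path):
    """Return the GGUF format version if the file starts with the GGUF magic, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version
```

For example, `read_gguf_version("model.gguf")` returns an integer for a GGUF file and `None` for a safetensors file; the path here is only a placeholder.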

Warning

Right now, all popular T2I inference UIs (SD.Next, Forge, ComfyUI, InvokeAI etc.) are using GGUF as storage-only and as such usage of GGUF is not recommended!

  • gguf supports wide range of quantization types and is not platform or GPU dependent
  • gguf does not provide native GPU kernels which means that gguf is purely a storage optimization
  • gguf reduces model size and memory usage, but it does slow down model inference since all quantized weights are de-quantized on-the-fly

Limitations:

  • gguf is not compatible with model offloading as it would trigger de-quantization
  • the only component supported in gguf binary format is UNET/Transformer
    you cannot load an all-in-one single-file GGUF model

NNCF

NNCF provides full cross-platform storage-only quantization (referred to as model compression)
with optional platform-specific compute-optimization (available only on OpenVINO platform)

Note

Advantage of NNCF is that it does work on any platform: if you're having issues with optimum-quanto or bitsandbytes, try it out!

  • broad platform and GPU support
  • enable in Settings -> Compute -> Compress model weights with NNCF
  • see NNCF Wiki for more details

Errors

Caution

Using incompatible configurations will result in errors during model load:

  • BitsAndBytes nf4 quantization is not compatible with sequential offload

    Error: Blockwise quantization only supports 16/32-bit floats

  • Quanto qint quantization is not compatible with balanced offload

    Error: QBytesTensor.new() missing 5 required positional arguments

  • Quanto qint quantization is not compatible with sequential offload

    Error: Expected all tensors to be on the same device
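The incompatible combinations above can be captured in a small lookup so that a configuration is checked before a model load is attempted. This is an illustrative sketch only: `KNOWN_CONFLICTS` and `check_config` are hypothetical names, and the table contains just the three cases listed in this section.

```python
# Incompatible (engine, quantization type, offload mode) combinations and the
# load-time error each one produces, as listed in this section.
KNOWN_CONFLICTS = {
    ("bitsandbytes", "nf4", "sequential"): "Blockwise quantization only supports 16/32-bit floats",
    ("quanto", "qint", "balanced"): "QBytesTensor.new() missing 5 required positional arguments",
    ("quanto", "qint", "sequential"): "Expected all tensors to be on the same device",
}

def check_config(engine, quant_type, offload):
    """Hypothetical pre-flight check: return the expected error text, or None if no known conflict."""
    return KNOWN_CONFLICTS.get((engine.lower(), quant_type.lower(), offload.lower()))

error = check_config("Quanto", "qint", "balanced")
print(error or "no known conflict")
```

A `None` result only means the combination is not among the cases documented here, not that it is guaranteed to work.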

Triton

Many quantization schemes rely on the Triton compiler for Torch, which is not available on all platforms. If your installation fails, you can try building triton from source or finding pre-built binary wheels.
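A quick way to see whether Triton is present in the current Python environment is to look the module up without importing it. The sketch below uses only the standard library; `has_module` is a hypothetical helper and should be run with the same interpreter that SD.Next uses.

```python
import importlib.util

def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None

if has_module("triton"):
    print("triton is installed")
else:
    print("triton not found: some quantization schemes may be slower or unavailable")
```

This only confirms that a package named `triton` can be located; it does not verify that the installed build works with your GPU or torch version.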

Triton for Windows

A Triton fork is available for Windows and can be installed by running the following PowerShell script from your SD.Next installation folder:

install-triton.ps1

$ErrorActionPreference = "Stop"

# get the environment details
$VENV_DIR = if ($env:VENV_DIR) { $env:VENV_DIR } else { Resolve-Path "venv" }
$PYTHON = "$VENV_DIR\Scripts\python"
$PIP = "$VENV_DIR\Scripts\pip"

# construct the wheel filename
$sys_ver = & $PYTHON -VV
$sys_ver_major, $sys_ver_minor = $sys_ver.Split(" ")[1].Split(".")[0, 1]
$filename = "triton-3.1.0-cp$sys_ver_major$sys_ver_minor-cp$sys_ver_major$sys_ver_minor-win_amd64.whl"
$url = "https://github.com/woct0rdho/triton-windows/releases/latest/download/$filename"

# download and install the wheel
Invoke-WebRequest $url -OutFile $filename
& $PIP install $filename
Remove-Item $filename