Explanation of the 0.18215 factor in textual_inversion? #437

garrett361 · 2022-09-09T01:21:39Z

diffusers/examples/textual_inversion/textual_inversion.py

Line 501 in b2b3b1a

latents = latents * 0.18215

Hi, just a small question about the quoted script above which is bothering me: where does this 0.18215 number come from? What computation is being done? Is it from some paper? I have seen the same factor elsewhere, too, without explanation. Any guidance would be very helpful, thanks!


CodeExplode · 2022-09-09T08:18:53Z

That's the exact same value used in the original textual inversion code for the 'learning rate' setting. https://github.com/rinongal/textual_inversion/blob/main/configs/stable-diffusion/v1-finetune.yaml

Going by Wikipedia, it seems to control how much a weight value can shift on each batch iteration (I suspect the weights are in 0 to 1 or -1 to 1), probably as a scalar applied to the difference between the current weight and the assumed ideal target weight, or something along those lines.

patil-suraj · 2022-09-09T09:02:05Z

Hey @garrett361

That comes from the original Stable Diffusion training; cf. https://github.com/CompVis/stable-diffusion/blob/main/configs/stable-diffusion/v1-inference.yaml#L17

This is scale_factor, which is used to scale the latents produced by the autoencoder before they are fed to the UNet. Maybe @rromb can comment on why the scaling is necessary.

rromb · 2022-09-09T11:00:24Z

Hi @garrett361 @patil-suraj @CodeExplode

We introduced the scale factor in the latent diffusion paper. The goal was to handle different latent spaces (from different autoencoders, which can be scaled quite differently than images) with similar noise schedules. The scale_factor ensures that the initial latent space on which the diffusion model is operating has approximately unit variance. Hope this helps :)
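To make the unit-variance idea concrete, here is a minimal sketch. It assumes synthetic Gaussian values as a stand-in for real autoencoder latents (the array shape and the spread of 5.5 are made up for illustration), and shows a single global scalar bringing them to unit variance, plus the inverse step before decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for VAE latents: synthetic Gaussian values with a std of about 5.5.
# (Real latents would come from an autoencoder; this array is made up.)
raw_latents = rng.normal(loc=0.0, scale=5.5, size=(64, 4, 8, 8))

# One global scalar: the reciprocal of the std taken over every element.
scale_factor = 1.0 / raw_latents.std()

# Scale before handing latents to the diffusion model ...
scaled_latents = raw_latents * scale_factor
# ... and undo the scaling before handing them back to the decoder.
restored_latents = scaled_latents / scale_factor

print(f"scale_factor      = {scale_factor:.4f}")
print(f"std after scaling = {scaled_latents.std():.4f}")
```

Because the factor is a single scalar, the relative magnitudes within and between latents are unchanged; only the overall scale seen by the noise schedule moves.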

garrett361 · 2022-09-09T13:07:09Z

> The scale_factor ensures that the initial latent space on which the diffusion model is operating has approximately unit variance. Hope this helps :)

Perfect @rromb, yes, I was looking for the principle which led to one number versus another. (Sec. 4.3.2 and Appendices D.1 and G, for anyone looking.)

To make sure I'm understanding, it sounds like you arrived at scale_factor = 0.18215 by averaging over a bunch of examples generated by the VAE, in order to ensure they have unit variance, with the variance taken over all dimensions simultaneously? And scale_factor = 1 / std(z), schematically?

And if the above is right, I'm curious if you also tried instead whitening each latent individually, rather than using a single global scale for all latents? Or tried using LayerNorm or similar?

rromb · 2022-09-09T14:23:41Z

@garrett361 Yes, your understanding is correct. We did not play much with other normalization schemes because the simple rescaling worked out of the box.
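For anyone wanting to see the difference between the two schemes discussed above, here is a rough sketch on synthetic data. It assumes "whitening each latent individually" means standardizing each sample by its own mean and std (a simplified reading; full whitening would also decorrelate), and the per-sample spreads are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a batch of latents, shape (B, C, H, W). Each sample is
# given a different spread so the two schemes visibly differ.
per_sample_spread = np.array([2.0, 4.0, 6.0, 8.0]).reshape(4, 1, 1, 1)
latents = rng.normal(size=(4, 4, 16, 16)) * per_sample_spread

# Scheme 1: one global scalar shared by all latents (the scale_factor approach).
global_scaled = latents * (1.0 / latents.std())

# Scheme 2: standardize each latent on its own (per-sample mean and std).
mean = latents.mean(axis=(1, 2, 3), keepdims=True)
std = latents.std(axis=(1, 2, 3), keepdims=True)
per_sample = (latents - mean) / std

print("global scale, per-sample std:", global_scaled.std(axis=(1, 2, 3)).round(3))
print("per-sample,   per-sample std:", per_sample.std(axis=(1, 2, 3)).round(3))
```

The global scalar gives unit variance over the whole batch while keeping each sample's relative spread; per-sample standardization forces every sample to unit variance and so discards that information.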

ezhang7423 · 2022-09-20T02:00:49Z

Hypothetically, if we were to retrain a latent diffusion model with more than one autoencoder, would each autoencoder need its own scaling factor to get approximately unit variance?

fepegar · 2022-12-19T01:13:26Z

In case this is useful for others, I've written some code to replicate the computation of that magic value. It seems to be a reasonable estimation!

from diffusers import AutoencoderKL
import torch
import torchvision
from torchvision.datasets.utils import download_and_extract_archive
from torchvision import transforms


num_workers = 4
batch_size = 12
# From https://github.com/fastai/imagenette
IMAGENETTE_URL = 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz'

torch.manual_seed(0)
torch.set_grad_enabled(False)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

pretrained_model_name_or_path = 'CompVis/stable-diffusion-v1-4'
vae = AutoencoderKL.from_pretrained(
    pretrained_model_name_or_path,
    subfolder='vae',
    revision=None,
)
vae.to(device)

size = 512
image_transform = transforms.Compose([
    transforms.Resize(size),
    transforms.CenterCrop(size),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

root = 'dataset'
download_and_extract_archive(IMAGENETTE_URL, root)

dataset = torchvision.datasets.ImageFolder(root, transform=image_transform)
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
)

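# Encode every image and keep one sample from each posterior.
all_latents = []
for image_data, _ in loader:
    image_data = image_data.to(device)
    latents = vae.encode(image_data).latent_dist.sample()
    all_latents.append(latents.cpu())

# Global std over all samples, channels, and spatial positions at once.
all_latents_tensor = torch.cat(all_latents)
std = all_latents_tensor.std().item()
# The scale factor is the reciprocal of that std.
normalizer = 1 / std
print(f'{normalizer = }')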

Output:

normalizer = 0.19503

wj7486 · 2024-01-06T05:58:28Z

> Hi @garrett361 @patil-suraj @CodeExplode
>
> We introduced the scale factor in the latent diffusion paper. The goal was to handle different latent spaces (from different autoencoders, which can be scaled quite differently than images) with similar noise schedules. The scale_factor ensures that the initial latent space on which the diffusion model is operating has approximately unit variance. Hope this helps :)

Hello, excuse me. I would like to ask about an AutoencoderKL model that I trained myself on the CelebA dataset at 128×128 resolution, with which I am using scale_factor. Is it normal for the scale factor to come out at approximately 0.44? I still cannot reach the FID reported in the paper when training an LDM with this AutoencoderKL.
Looking forward to your reply, thank you.

guomc9 · 2024-06-18T08:48:27Z

From the perspective of latent variables, should we treat B as the number of samples of N latent variables (where N = H×W×C)? When calculating the standard deviation, we should standardize each of the N latent variables. The observed mean and std calculated from these B samples should therefore both have the shape [1, N]. Then, by normalizing the samples using $$\frac{\text{samples} - \text{mean}}{\text{std}}$$, can we better ensure the uniformity and fairness of the scales of all latent variables while finetuning the UNet of the LDM?
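A minimal sketch of the per-dimension normalization proposed above, on synthetic data. The sizes B, C, H, W and the per-dimension offsets and spreads are made up for illustration; it is not a statement about how any existing LDM is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

B, C, H, W = 256, 4, 8, 8
N = C * H * W

# Synthetic latents, flattened to (B, N); each of the N dimensions gets its own
# made-up offset and spread.
offsets = rng.uniform(-2.0, 2.0, size=(1, N))
spreads = rng.uniform(0.5, 6.0, size=(1, N))
samples = offsets + spreads * rng.normal(size=(B, N))

# Statistics over the batch axis only, so both have shape (1, N).
mean = samples.mean(axis=0, keepdims=True)
std = samples.std(axis=0, keepdims=True)

# Per-dimension standardization: (samples - mean) / std.
normalized = (samples - mean) / std

# Inverse transform, which would be needed before decoding.
recovered = normalized * std + mean

print("mean shape:", mean.shape, "std shape:", std.shape)
```

Compared with a single scale_factor, this stores 2N statistics instead of one number, and those statistics have to be saved and reapplied at inference time.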

jxtps · 2025-02-28T21:13:50Z

It appears that:

image = noise_scheduler.step(model_output, t, image, generator=None).prev_sample

effectively clamps the image to be in the [-1, 1] range.

This makes it essential that your VAE produces output in that range, since if it doesn't, then the decoder will be receiving LDM output that's in a different range than your VAE encoder's output.
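Taking the clamping claim above at face value, the following sketch estimates how much of a latent tensor a hard clamp to [-1, 1] would alter. The data is synthetic Gaussian noise with a made-up std of 5.5, and `clipped_fraction` is a hypothetical helper written for this illustration, not a diffusers function.

```python
import numpy as np

rng = np.random.default_rng(0)


def clipped_fraction(x, low=-1.0, high=1.0):
    """Fraction of elements that a clamp to [low, high] would alter."""
    return float(np.mean((x < low) | (x > high)))


# Synthetic stand-ins: "raw" latents with a made-up std of 5.5, and the same
# values rescaled to unit variance.
raw = rng.normal(scale=5.5, size=100_000)
unit = raw / raw.std()

print(f"raw latents altered by clamp:   {clipped_fraction(raw):.1%}")
print(f"unit-variance altered by clamp: {clipped_fraction(unit):.1%}")
```

The wider the latent distribution is relative to the clamp range, the more values are altered, which is the kind of encoder/decoder range mismatch described above.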

garrett361 closed this as completed Sep 9, 2022

pcuenca mentioned this issue Oct 5, 2022

[Community] Move the number "0.18215" from the image2image process to VAE config #726

Closed

harubaru mentioned this issue Nov 3, 2022

Add Stable Diffusion finetuner example coreweave/kubernetes-cloud#97

Merged

harubaru mentioned this issue Dec 13, 2022

Implement Multi-GPU support for Stable Diffusion finetuner coreweave/kubernetes-cloud#117

Merged

wpeebles mentioned this issue Jan 30, 2023

why divide 0.18215 when sampling? facebookresearch/DiT#13

Closed

NathanYanJing mentioned this issue Mar 10, 2023

What's the meaning of 0.18215 facebookresearch/DiT#32

Closed

shanshuo mentioned this issue Oct 5, 2023

How generate image from noise vector with KL-reg autoencoder CompVis/latent-diffusion#187

Open

hongwen-sun mentioned this issue Apr 28, 2024

Is it neccesarry that the latent space has approximately unit variance？ Stability-AI/stable-audio-tools#62

Closed

lthero-big mentioned this issue May 26, 2024

Magic number when recovering latent from the autoencoder lthero-big/A-watermark-for-Diffusion-Models#5

Closed

zheng95z mentioned this issue Nov 21, 2024

About hte magicial scaling factor for each AOV zheng95z/rgbx#8

Closed

sunly92 mentioned this issue Nov 23, 2024

Missing Scaling factor explainingai-code/StableDiffusion-PyTorch#33

Open
