Author:
- Alen Smajic
- Clone this repository.
  git clone https://github.com/alen-smajic/Stable-Diffusion-Latent-Space-Explorer
- Create a virtual environment and activate it.
  python -m venv venv
- Install PyTorch with CUDA (follow the official PyTorch installation instructions).
- Install the diffusers and transformers libraries.
  pip install diffusers["torch"] transformers
- Optional: Install xFormers for efficient attention.
  pip install xformers
If you face any issues, you should try to match the version of the installed packages to these versions:
- torch 2.0.0
- cuda 11.8
- diffusers 0.14.0
- xformers 0.0.16
You can download the model weights using git-lfs.
git lfs install
git clone https://huggingface.co/stabilityai/stable-diffusion-2-1
The command above will create a local folder called ./stable-diffusion-2-1 on your disk.
The code in this repo was tested on:
✔️ Stable Diffusion 2.1 (native resolution 768x768)
✔️ Stable Diffusion 2.1 Base (native resolution 512x512)
✔️ Stable Diffusion 2 Inpainting (native resolution 512x512)
Below is a list of experiments that are currently supported. Each entry is linked to a tutorial of the specified experiment.
In order to run an experiment, you first need to define the experiment parameters using one of the experiment configuration files. Once this is done, you can run an experiment by calling the following script:
python run_sd_experiment.py --exp_config ./configs/experiments/{path to your config}
The script run_sd_experiment.py expects an argument --exp_config, which is the path to an experiment configuration file (e.g. ./configs/experiments/txt2img/single_inference.yaml).
Each experiment run will produce a results folder with a unique name consisting of the date and time of execution, the model identifier, and the experiment identifier (here you can find some examples). The subfolders of the results folder include:
- configs: contains a copy of the configuration file that was used to start the experiment.
- images: stores the images generated by Stable Diffusion.
- embeddings: stores the unique latents for each generated image. You can reference these files to load the prompt embeddings and latent noise.
- gifs: stores the GIFs generated by the experiment (each frame of a GIF is also stored in the images folder).
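To script against earlier runs, it can help to split a results folder name into its parts. The sketch below assumes the layout `<date>_<time>_<model>_<experiment>`, inferred from the example paths used in the tutorials (for instance `2023-03-27_18-43-02_txt2img_single-inference`). The names `RunInfo` and `parse_run_folder` are hypothetical helpers and not part of this repository.

```python
from datetime import datetime
from typing import NamedTuple


class RunInfo(NamedTuple):
    started: datetime
    model_id: str
    experiment_id: str


def parse_run_folder(name: str) -> RunInfo:
    """Split a results folder name into timestamp, model and experiment.

    Assumed layout (inferred from the example paths in this README):
    <YYYY-MM-DD>_<HH-MM-SS>_<model identifier>_<experiment identifier>
    """
    date_part, time_part, model_id, experiment_id = name.split("_", 3)
    started = datetime.strptime(f"{date_part}_{time_part}", "%Y-%m-%d_%H-%M-%S")
    return RunInfo(started, model_id, experiment_id)


info = parse_run_folder("2023-03-27_18-43-02_txt2img_single-inference")
print(info.started, info.model_id, info.experiment_id)
```

A folder name that does not follow the assumed layout raises a ValueError, which is usually what you want when filtering a directory listing.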
ℹ️ You can find more information on schedulers here. Moreover, if you are unfamiliar with any concept from the Model Configurations, you can refer to the diffusers documentation. A great starting point is this Google Colab Notebook from Hugging Face, which introduces some of the basic components of Stable Diffusion within the diffusers library.
ℹ️ As of Stable Diffusion 2.0, you can also specify a negative prompt in addition to the simple (positive) prompt. The negative prompt can be used to suppress unwanted properties, for example blurriness.
ℹ️ Stable Diffusion 2.1 was fine-tuned on images of 768x768 pixels, while Stable Diffusion 2 Inpainting was trained on images of 512x512 pixels. You can define custom height and width values within the config file. It is recommended to use the native image resolution in one dimension and a value larger than that in the other one. Large deviations from the native resolution will result in poor-quality images.
All experiment runs in this tutorial have the same model configuration. Stable Diffusion 2.1 was used for the txt2img and img2img experiments, and Stable Diffusion 2 Inpainting was used for inpaint. The most relevant parameters from the model configuration are listed below.
📄 scheduler: DPMSolverMultistepScheduler
🔄 diffusion_steps: 25
🎛️ guidance_scale: 9.5
In this tutorial we will use the single_inference.yaml configuration files for txt2img, img2img and inpaint.
We will start by configuring txt2img to generate 4 images that adhere to a prompt describing an astronaut riding a horse on the moon (you can find the configuration here).
⌨️ prompt: "A photograph of an astronaut riding a horse on the moon.", negative prompt: "black and white, blurry, painting, drawing"
🌱 rand_seed: 0, height: 768, width: 768, images per prompt: 4
Results |
---|
ℹ️ Note that your results may differ from the ones presented here, even with identical configurations and random seeds. You can find more information on reproducibility here.
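For orientation, the txt2img configuration above corresponds roughly to the following direct use of the diffusers library. This is a minimal sketch and not the code behind run_sd_experiment.py. The helper names `build_txt2img_kwargs` and `generate` are hypothetical, and the sketch assumes the weights were cloned to ./stable-diffusion-2-1 as described earlier and that a CUDA GPU is available.

```python
def build_txt2img_kwargs(prompt, negative_prompt, height=768, width=768,
                         diffusion_steps=25, guidance_scale=9.5,
                         images_per_prompt=4):
    """Map the config values from this tutorial to diffusers pipeline arguments."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "height": height,
        "width": width,
        "num_inference_steps": diffusion_steps,
        "guidance_scale": guidance_scale,
        "num_images_per_prompt": images_per_prompt,
    }


def generate(model_path="./stable-diffusion-2-1", rand_seed=0, **pipe_kwargs):
    """Run Stable Diffusion once and return a list of PIL images."""
    # Imported lazily so the helper above stays usable without a GPU setup.
    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
    # Swap in the scheduler named in the model configuration.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe = pipe.to("cuda")
    generator = torch.Generator(device="cuda").manual_seed(rand_seed)
    return pipe(generator=generator, **pipe_kwargs).images


kwargs = build_txt2img_kwargs(
    prompt="A photograph of an astronaut riding a horse on the moon.",
    negative_prompt="black and white, blurry, painting, drawing",
)
# images = generate(rand_seed=0, **kwargs)  # requires the model weights and a GPU
```

Even with the same seed, images produced this way are not guaranteed to match the results folder of this repository.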
Next, we will configure img2img to generate 4 variations of the image output-3_diffstep-25.png from the previous experiment. The variations will depict an astronaut riding a donkey on the moon (you can find the configuration here).
⌨️ prompt: "A photograph of an astronaut riding a donkey on the moon.", negative prompt: "black and white, blurry, painting, drawing"
🌱 rand_seed: 42, height: 768, width: 768, images per prompt: 4
🖼️ image: "./experiments/2023-03-27_18-43-02_txt2img_single-inference/images/output-3_diffstep-25.png"
🦾 strength: 0.8
Results |
---|
ℹ️ Note that the strength parameter scales the specified number of diffusion steps. Even though we specified 25 diffusion steps in the config file, with a strength of 0.8 Stable Diffusion will actually start from diffusion step 5 and run only the remaining 20 steps.
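The arithmetic behind this note can be sketched in a few lines. The function below mirrors how the diffusers img2img pipeline derives its starting step from strength, as far as we understand it. The name `img2img_schedule` is a hypothetical helper, not something defined in this repository.

```python
from typing import Tuple


def img2img_schedule(diffusion_steps: int, strength: float) -> Tuple[int, int]:
    """Return (start_step, steps_to_run) for a given strength.

    A strength of 1.0 runs every step (the input image is fully re-noised),
    while a strength close to 0.0 runs almost none.
    """
    steps_to_run = min(int(diffusion_steps * strength), diffusion_steps)
    start_step = max(diffusion_steps - steps_to_run, 0)
    return start_step, steps_to_run


start_step, steps_to_run = img2img_schedule(diffusion_steps=25, strength=0.8)
print(start_step, steps_to_run)  # 5 20
```

Lower strength values therefore keep the result closer to the input image and also finish faster, since fewer denoising steps are executed.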
Finally, we use inpaint to generate 4 variations of the image output-0_diffstep-25.png from the previous experiment, which depicts a donkey instead of a horse. The final 4 variations will replace the astronaut with a humanoid robot using a manually created mask (you can find the configuration here).
⌨️ prompt: "A humanoid roboter astronaut.", negative prompt: "black and white, blurry, painting, drawing, watermark"
🌱 rand_seed: 0, height: 768, width: 768, images per prompt: 4
🖼️ image: "./experiments/2023-03-27_21-40-35_img2img_single-inference/images/output-0_diffstep-25.png"
🔲 mask: "./resources/astronaut_mask.png"
Input image | Mask |
---|---|
Results |
---|
In this tutorial we will use the visualize_diffusion.yaml configuration file for inpaint. This experiment decodes the latent noise tensor after each diffusion step and creates a visualization of the whole diffusion process.
We can easily reuse textual embeddings and latent noise tensors to recreate images that were created by a previous experiment run.
ℹ️ Note that when loading prompt embeddings from a local file via the load_prompt_embeds parameter, the parameter prompt will be ignored. When loading a latent noise tensor from a local file via the load_latent_noise parameter, the parameters rand_seed, height, width and images_per_prompt will be ignored.
ℹ️ If an experiment successfully loaded a latent noise tensor or a prompt embedding from a local file, it will print a message with the specified path to the console. This is a good way for you to verify that you specified the correct path.
⌨️ 💾 load_prompt_embeds: "./experiments/2023-03-27_21-46-30_inpaint_single-inference/embeddings/output-1_diffstep-25.pt"
🌱 💾 load_latent_noise: "./experiments/2023-03-27_21-46-30_inpaint_single-inference/embeddings/output-1_diffstep-25.pt"
🖼️ image: "./experiments/2023-03-27_21-40-35_img2img_single-inference/images/output-0_diffstep-25.png"
🔲 mask: "./resources/astronaut_mask.png"
GIF | Results |
---|---|
ℹ️ The results of this experiment are not in this repository due to their size.
In this tutorial we will use the random_walk.yaml configuration file for txt2img. We will create a visualization by performing a random walk within both the textual and image latent space, starting from an initial image depicting a painting of a pirate ship.
To make our visualization more appealing, we will extend the image width from 768 to 1200.
☑️ prompt_rand_walk: True, noise_rand_walk: True
🧪 walk_directions: 3, walk_steps: 50, step_size: 0.0095
⌨️ prompt: "A beautiful painting of a pirate ship.", negative prompt: "low quality, blurry, low resolution"
🌱 rand_seed: 0, height: 768, width: 1200, images per prompt: 1
Untitled.1.mp4
ℹ️ The visualization above depicts the random walk from an initial point in 3 random directions for 50 steps. Each time one such direction has been explored for 50 steps, the visualization walks all the way back to the initial image to explore a new direction.
ℹ️ The results of this experiment are not in this repository due to their size. Only the initial image is included for the sake of reproducibility of the next tutorial.
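The walk pattern described above (out along a random direction, back to the start, then a new direction) can be sketched as follows. This is an illustration under stated assumptions and not the repository's implementation: plain Python lists stand in for the embedding and latent tensors, each direction is assumed to be a fixed Gaussian vector, and every step is assumed to move a constant step_size along it. The function name `random_walk` is hypothetical.

```python
import random


def random_walk(start, walk_directions, walk_steps, step_size, seed=0):
    """Return the sequence of points visited by the walk, beginning at `start`."""
    rng = random.Random(seed)
    path = [list(start)]
    for _ in range(walk_directions):
        # Draw one random direction and keep it for the whole outward leg.
        direction = [rng.gauss(0.0, 1.0) for _ in start]
        outward = [
            [x + k * step_size * d for x, d in zip(start, direction)]
            for k in range(1, walk_steps + 1)
        ]
        path.extend(outward)
        # Walk all the way back so the next direction starts at the initial point.
        path.extend(reversed(outward[:-1]))
        path.append(list(start))
    return path


start_point = [0.0] * 4  # toy stand-in for a flattened latent
path = random_walk(start_point, walk_directions=3, walk_steps=50, step_size=0.0095)
print(len(path))  # 301
```

With the values from the config, the path holds 1 + 3 * 2 * 50 = 301 points, each of which would be decoded into one frame of the visualization.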
In this tutorial we will use the interpolation.yaml configuration file for txt2img. We will create a visualization by interpolating text embeddings and latent noise tensors, which are loaded from a pre-defined list. The list contains 8 prompts and 8 latent noise entries.
ℹ️ Besides listing raw text prompts and random seeds, one can directly reference an embeddings file from a previous experiment. When doing so for latent noise tensors, make sure that the image resolution matches for all items of the inter_noises parameter (random seed entries will use the height and width parameters, which are defined at the bottom of the config file).
As the second entry of the inter_prompts and inter_noises lists, we will link to the embeddings of the initial image from the previous tutorial depicting a painting of a pirate ship. Since the loaded latent noise embeddings are configured for a 768x1200 image resolution (height x width), we will have to set this as the base resolution for the experiment.
🧪 interpolation_steps: 30, interpolation_method: slerp
⌨️ inter_prompts:
- "A photograph of a dog with a funny hat"
- "./experiments/2023-03-30_19-53-28_txt2img_random-walk/embeddings/output-0_direction-0_randwalkstep-start.pt"
- "A digital illustration of a steampunk library with clockwork machines, 4k, detailed, trending on artstation, fantasy vivid colors"
- "A beautiful castle beside a waterfall in the woods, by Josef Thoma, matte painting, trending on artstation HQ"
- "A digital illustration of a medieval town, 4k, detailed, trending on artstation, fantasy"
- "A Hyperrealistic photograph of ancient Paris architectural ruins in a flooded apocalypse landscape of dead skyscrapers, eiffel tower, *"
- "A Hyperrealistic photograph of a landscape with ancient human ruins *"
- "A Hyperrealistic photograph of a futuristic city with cyberpunk skyscrapers *"
ℹ️ * means that the prompt is longer in the config file.
ℹ️ Negative prompts have been omitted from this overview but can be found in the config file.
🌱 inter_noises: [2, "./experiments/2023-03-30_19-53-28_txt2img_random-walk/embeddings/output-0_direction-0_randwalkstep-start.pt", 1, 0, 0, 2, 1, 0], height: 768, width: 1200
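The slerp (spherical linear interpolation) method selected above can be sketched as follows. This is a generic textbook formulation on plain Python lists and not the repository's own code, which operates on tensors. The even spacing of the 30 interpolation points in the usage example is an assumption.

```python
import math


def slerp(v0, v1, t, eps=1e-7):
    """Spherically interpolate between vectors v0 and v1 for t in [0, 1]."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 < eps or norm1 < eps:
        # Degenerate input: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding errors
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Vectors are (anti-)parallel: the arc is undefined, so use lerp.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Toy example: 30 evenly spaced points between two orthogonal vectors.
a, b = [1.0, 0.0], [0.0, 1.0]
interpolation_steps = 30
frames = [slerp(a, b, i / (interpolation_steps - 1)) for i in range(interpolation_steps)]
midpoint = slerp(a, b, 0.5)
```

Compared to linear interpolation, slerp keeps the norm of the intermediate vectors close to that of the endpoints, which is commonly cited as the reason it works better for Gaussian latent noise.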
Untitled.mp4
ℹ️ The results of this experiment are not in this repository due to their size.
This method was first described by @MaxRobinsonTheGreat in this repository. It is an evolutionary algorithm that allows the user to select the most dominant "gene" for the next batch of images that are being generated. The "genes" are represented by the textual embeddings and latent noise. Unlike typical evolutionary algorithms, there is no explicit fitness function. Instead, the user chooses the most preferred image, or redraws the batch if none of the produced images are preferred.
In this tutorial we will use the diffevolution.yaml configuration file for inpaint to transform an image of a tiny spider into an alien-like creature using this mask. We will perform two experiments. For the first, we will actively select the most preferred images over several generations, while for the second experiment we will leave the diffevolution process to randomly select the most dominant genes for 50 generations.
ℹ️ Besides being able to select the most dominant gene, the user can also specify a new prompt to further guide the diffevolution process in a new direction. This option isn't explored in this tutorial, but feel free to try it out.
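The selection loop described above can be sketched roughly as follows. This is a toy illustration and not the repository's implementation: a gene is modeled as a plain list of floats, each candidate is assumed to be the parent plus Gaussian noise scaled by step_size, and a redraw is assumed to consume a generation. The names `mutate` and `evolve` are hypothetical.

```python
import random


def mutate(parent, step_size, rng):
    """Create one candidate gene by perturbing the parent (assumed scheme)."""
    return [x + step_size * rng.gauss(0.0, 1.0) for x in parent]


def evolve(initial_gene, genes_per_generation, step_size, generations, select, seed=0):
    """Run the loop; `select` returns the index of the preferred candidate,
    or None to discard the batch and redraw from the same parent."""
    rng = random.Random(seed)
    parent = list(initial_gene)
    history = [parent]
    for _ in range(generations):
        candidates = [mutate(parent, step_size, rng) for _ in range(genes_per_generation)]
        choice = select(candidates)
        if choice is None:
            continue  # redraw: keep the current parent
        parent = candidates[choice]
        history.append(parent)
    return history


# Automated selection, as in the second experiment: pick a random candidate.
selector_rng = random.Random(1)
history = evolve(
    initial_gene=[0.0] * 8,
    genes_per_generation=5,
    step_size=0.025,
    generations=50,
    select=lambda candidates: selector_rng.randrange(len(candidates)),
)
print(len(history))  # 51
```

For manual selection, `select` would show the decoded candidate images to the user and return the chosen index.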
🧬 genes_per_generation: 5, step_size: 0.025
⌨️ prompt: "A highly detailed alien spider with multiple legs, HD, HQ, 4k, 8k.", negative prompt: "black and white, blurry, painting, drawing, low resolution, watermark"
🌱 rand_seed: 0, height: 768, width: 768
🖼️ image: "./resources/tin_spider.png"
🔲 mask: "./resources/tiny_spider_mask.png"
Image from the internet | Mask | Initial image produced by Stable Diffusion |
---|---|---|
Result of 1st experiment |
---|
ℹ️ The results of this experiment are not in this repository due to their size.
This animation was made by manually selecting the most dominant genes over 21 generations. For the next experiment we will automate the gene selection for 50 generations to explore an alternative evolutionary path starting from the same initial image.
Result of 2nd experiment |
---|
Untitled.2.mp4
ℹ️ The results of this experiment are not in this repository due to their size.
This method starts with an initial image and produces a video by outpainting the area outside the initial image. It is highly customizable and offers custom camera movement options. In this tutorial we will use the outpaint_walk.yaml configuration file for inpaint to extend the content of Vincent van Gogh's The Starry Night painting into a beautiful visual animation.
The outpaint walk will produce a set of keyframes. The translation_factor controls the translation between two keyframes, while the num_filler_frames parameter specifies the number of filler frames that are inserted in between, so that the animation appears smoother. A list of prompts and the corresponding frame duration for each prompt can be specified. To control the camera walk, one can choose between 5 available options: right, down, left, up, backwards. As with the prompts, a frame duration should be specified for each camera action in the list. Make sure that the total number of frames is equal between the frames_per_prompt and frames_per_cam_action parameters, and that the lengths of these two lists are equal to the lengths of the prompts and camera_actions lists, respectively.
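The consistency rules from the paragraph above can be checked before starting a long run. The function below is a small sketch of such a check. `validate_outpaint_walk` is a hypothetical helper and is not part of this repository, which may perform its own validation.

```python
ALLOWED_CAMERA_ACTIONS = {"right", "down", "left", "up", "backwards"}


def validate_outpaint_walk(prompts, frames_per_prompt, camera_actions, frames_per_cam_action):
    """Check the list constraints and return the total number of keyframes."""
    if len(prompts) != len(frames_per_prompt):
        raise ValueError("prompts and frames_per_prompt must have the same length")
    if len(camera_actions) != len(frames_per_cam_action):
        raise ValueError("camera_actions and frames_per_cam_action must have the same length")
    unknown = set(camera_actions) - ALLOWED_CAMERA_ACTIONS
    if unknown:
        raise ValueError(f"unknown camera actions: {sorted(unknown)}")
    if sum(frames_per_prompt) != sum(frames_per_cam_action):
        raise ValueError("frames_per_prompt and frames_per_cam_action must sum to the same total")
    return sum(frames_per_prompt)


# Placeholder prompts; only the list lengths matter for the check.
total_keyframes = validate_outpaint_walk(
    prompts=["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5"],
    frames_per_prompt=[3, 3, 3, 3, 3],
    camera_actions=["right", "down", "left", "up", "backwards"],
    frames_per_cam_action=[3, 3, 3, 3, 3],
)
print(total_keyframes)  # 15
```

Note that the two duration lists only need to agree in their totals, so a prompt may span several camera actions and vice versa.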
📶 translation_factor: 0.4, 🎬 num_filler_frames: 64
⌨️ prompts:
- "A beautiful landscape in the style of the starry night, Vincent van Gogh, painting.|watermark, text"
- "A beautiful village landscape with a river in the style of the starry night, Vincent van Gogh, painting.|watermark, text"
- "A beautiful japanese landscape in the style of the starry night, Vincent van Gogh, painting.|watermark, text"
- "A dense forrest in the style of the starry night, Vincent van Gogh, painting.|watermark, text"
- "Paintings on a wall of a museum.|watermark, text"
🎦 frames_per_prompt:
- 3
- 3
- 3
- 3
- 3
🎥 camera_actions:
- right
- down
- left
- up
- backwards
🎦 frames_per_cam_action:
- 3
- 3
- 3
- 3
- 3
🌱 seed_per_frame: 100, rand_seed: 0, height: 512, width: 648
ℹ️ Under seed_per_frame you can specify a list of seeds that will be used for each frame. Since such a list can get quite long for longer videos, you can also specify just an initial seed (like in this example) and the algorithm will do the rest for you :)
Untitled.Project.mp4
ℹ️ The video is played at double speed.
ℹ️ The results of this experiment are not in this repository due to their size.