Vladimir Mandic edited this page Dec 19, 2024 · 1 revision

Video

Video models are large and resource-intensive
The biggest resource issue is the final decode, since these models are typically designed to decode the entire generated video at once to achieve temporal consistency
To reduce resource requirements, reduce the number of generated frames and/or the resolution
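
As a rough illustration of why fewer frames and lower resolution help, the sketch below computes only the size of the raw decoded video tensor. It assumes 3 color channels and 16-bit values, and the function name `decoded_video_bytes` is made up for this example. Peak memory during an actual decode is higher and depends on the model, so treat the numbers as a lower bound that shows the scaling, not as a measurement.

```python
def decoded_video_bytes(frames: int, width: int, height: int,
                        channels: int = 3, bytes_per_value: int = 2) -> int:
    """Size in bytes of the raw decoded video tensor.

    Assumption: 3 channels and 2 bytes per value (16-bit) by default.
    This ignores intermediate activations inside the decoder.
    """
    return frames * height * width * channels * bytes_per_value


# Example resolution and frame count, then half the resolution and about half the frames
full = decoded_video_bytes(frames=129, width=1280, height=720)
reduced = decoded_video_bytes(frames=65, width=640, height=360)

print(f"full:    {full / 2**20:.0f} MiB")
print(f"reduced: {reduced / 2**20:.0f} MiB")
print(f"ratio:   {full / reduced:.1f}x")
```

The size grows linearly with the frame count and with the pixel count, so halving both width and height cuts it to a quarter before any reduction in frames.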

SD.Next support for video models is relatively basic, with further optimizations pending community interest
Any future optimizations would likely have to go into partial loading and execution instead of offloading inactive parts of the model

Warning

Any use on GPUs with less than 16GB of memory and on systems with less than 48GB of RAM is experimental

Note

The latest video models use LLMs for prompting and, because of that, require very long and descriptive prompts

Tip

You may need to enable sequential offload for maximum GPU memory savings, or use balanced offload with the min/max watermarks reduced as far as possible

Tip

Optionally enable pre-quantization using bnb for additional memory savings

Supported models

All video models are available as individually selectable scripts in either the text or the image interface

  • Stable Video Diffusion
    support for base, xt 1.0 and xt 1.1
  • CogVideoX
    support for 2B and 5B text-to-video and 5B image-to-video
  • Lightricks LTX-Video
    model size: 27.75GB
    support for text-to-video and image-to-video
    reference values: steps 50, width 704, height 512, frames 161, guidance scale 3.0
  • Hunyuan Video
    model size: 40.92GB
    support for text-to-video
    reference values: steps 50, width 1280, height 720, frames 129, guidance scale 6.0
  • Genmo Mochi.1 Preview
    model size: 68.87GB
    support for text-to-video
    reference values: steps 64, width 848, height 480, frames 19, guidance scale 4.5
  • VGen
  • AnimateDiff

Interpolation

For all video models, SD.Next supports adding interpolated frames to the video for smoother output
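
To make the idea of interpolated frames concrete, here is a minimal sketch that inserts linearly blended frames between each pair of neighbors. This naive blend is a stand-in chosen for simplicity; this page does not say which interpolation method SD.Next uses, and the function name and the flat-list frame representation are assumptions for the example.

```python
from typing import List

Frame = List[float]  # a frame flattened to a list of pixel values (illustrative)


def interpolate(frames: List[Frame], count: int = 1) -> List[Frame]:
    """Insert `count` linearly blended frames between each pair of frames."""
    if not frames:
        return []
    out: List[Frame] = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for i in range(1, count + 1):
            t = i / (count + 1)  # blend weight, strictly between 0 and 1
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(frames[-1])
    return out


video = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
smooth = interpolate(video, count=1)
print(len(video), "->", len(smooth), "frames")
```

With `n` input frames and `count` inserted frames per gap, the output has `n + (n - 1) * count` frames, so playback at a proportionally higher frame rate keeps the same duration.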
