
Does it work with Spatiotemporal Skip Guidance? #118

Open
renschni opened this issue Dec 28, 2024 · 4 comments
@renschni

As the title says, combining the speed improvement here with the quality improvement from STG would be great. Has anybody already tested whether the two work together?

@foreverpiano
Collaborator

foreverpiano commented Dec 28, 2024

I think it is compatible with any CFG-enhanced method.
I opened an issue about this earlier: junhahyung/STGuidance#14
They don't have a clean monkey-patch implementation yet; I will integrate the utils after that.
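
For reference, the rough shape of the monkey patch I have in mind (illustrative only, with placeholder names; fastvideo's actual block classes and forward signatures differ):

```python
import torch.nn as nn

def patch_block_for_stg(block: nn.Module) -> nn.Module:
    """Illustrative STG-R-style patch: when block.stg_skip is set,
    the block acts as an identity and passes its input through.
    (Placeholder names; not fastvideo's real API.)"""
    original_forward = block.forward  # bound method of this instance

    def patched_forward(hidden_states, *args, **kwargs):
        if getattr(block, "stg_skip", False):
            return hidden_states  # skip the block entirely for the perturbed pass
        return original_forward(hidden_states, *args, **kwargs)

    block.forward = patched_forward  # instance-level patch; other blocks untouched
    return block
```

The pipeline would then toggle `stg_skip` on the selected blocks between the normal forward pass and the perturbed one.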

@foreverpiano foreverpiano self-assigned this Dec 28, 2024
@renschni
Author

renschni commented Jan 3, 2025

I tried to understand what exactly STG does and where it is applied. My naive approach was to compare the pipeline scripts (pipeline_hunyuan_video.py) of the fastvideo and STGuidance repos. The first mention of STG happens in the signature of `def __call__(`.
In the fastvideo version, the signature ends with:
```python
    ...
    vae_ver: str = "88-4c-sd",
    enable_tiling: bool = False,
    n_tokens: Optional[int] = None,
    embedded_guidance_scale: Optional[float] = None,
    **kwargs,
):
```

whereas the STG version adds three STG-specific arguments:
```python
    ...
    vae_ver: str = "88-4c-sd",
    enable_tiling: bool = False,
    n_tokens: Optional[int] = None,
    embedded_guidance_scale: Optional[float] = None,
    stg_mode: Optional[str] = None,
    stg_block_idx: List[int] = [-1],
    stg_scale: float = 0.0,
    **kwargs,
):
```
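
Presumably (I'm guessing from the signature; I haven't traced the whole fork) these extra arguments just get stored on the pipeline so the property further down and the denoising loop can read them:

```python
# My guess at the wiring (untested): stash the STG arguments on the
# pipeline instance, mirroring how _guidance_scale is usually stored.
self._stg_mode = stg_mode            # perturbation variant, e.g. "STG-A" or "STG-R"
self._stg_block_idx = stg_block_idx  # which transformer blocks to perturb
self._stg_scale = stg_scale          # 0.0 leaves STG disabled
```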

Further down the pipeline, in the `# perform guidance` section, is where I think the magic happens:

```python
            # perform guidance
            if self.do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + self.guidance_scale * (
                    noise_pred_text - noise_pred_uncond
                )
            elif not self.do_classifier_free_guidance and self.do_spatio_temporal_guidance:
                with torch.autocast(
                    device_type="cuda", dtype=target_dtype, enabled=autocast_enabled
                ):
                    noise_pred_perturb = self.transformer(  # For an input image (129, 192, 336) (1, 256, 256)
                        latent_model_input,  # [2, 16, 33, 24, 42]
                        t_expand,  # [2]
                        text_states=prompt_embeds,  # [2, 256, 4096]
                        text_mask=prompt_mask,  # [2, 256]
                        text_states_2=prompt_embeds_2,  # [2, 768]
                        freqs_cos=freqs_cis[0],  # [seqlen, head_dim]
                        freqs_sin=freqs_cis[1],  # [seqlen, head_dim]
                        guidance=guidance_expand,
                        return_dict=[stg_block_idx, stg_mode, True],
                    )["x"]
                noise_pred = noise_pred_perturb + self._stg_scale * (
                    noise_pred - noise_pred_perturb
                )
```
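
Written out as a standalone function (variable names mine), the STG update has exactly the same shape as CFG, just extrapolating away from the block-skipped prediction instead of the unconditional one:

```python
import torch

def apply_stg(noise_pred: torch.Tensor,
              noise_pred_perturb: torch.Tensor,
              stg_scale: float) -> torch.Tensor:
    # Same extrapolation as CFG: start from the "weak" prediction and
    # step stg_scale times the difference toward the full prediction.
    return noise_pred_perturb + stg_scale * (noise_pred - noise_pred_perturb)

# stg_scale == 1.0 reproduces the unperturbed prediction exactly;
# stg_scale > 1.0 pushes the sample away from the degraded one.
full = torch.randn(1, 16, 33, 24, 42)       # latent shape from the comments above
weak = full + 0.1 * torch.randn_like(full)  # stand-in for the block-skipped pass
guided = apply_stg(full, weak, stg_scale=2.0)
```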

I couldn't see any difference in the get_guidance_scale_embedding functions of the two pipeline scripts, so it really does seem to be the math in the `# perform guidance` section that does it.

Of course, this property also got added:

```python
@property
def do_spatio_temporal_guidance(self):
    # return self._guidance_scale > 1 and self.transformer.config.time_cond_proj_dim is None
    return self._stg_scale > 1
```

I can't test it on my machine, but my naive approach would be to simply change the pipeline script, add the STG-related code blocks, and see what happens :)
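
One thing I noticed while reading the branch above: because it's an elif, STG in that fork only kicks in when CFG is disabled. A tiny sanity check of that gating (my reading of the two properties; the CFG one is assumed from the commented-out line above, untested):

```python
class _Gate:
    """Minimal stand-in for the two pipeline properties, just to check
    how the guidance branches interact. Names mirror the snippets above."""
    def __init__(self, guidance_scale: float, stg_scale: float):
        self._guidance_scale = guidance_scale
        self._stg_scale = stg_scale

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1  # assumed, per the commented-out line

    @property
    def do_spatio_temporal_guidance(self):
        return self._stg_scale > 1

g = _Gate(guidance_scale=1.0, stg_scale=2.0)
# With CFG off and stg_scale > 1, only the STG branch fires:
assert not g.do_classifier_free_guidance and g.do_spatio_temporal_guidance
```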

@foreverpiano
Collaborator

Could you provide a few examples of how disabling certain guidance features affects the results? I want to check the differences before adding them to our repository. I'm waiting for their response before making any additions.

@renschni
Author

renschni commented Jan 5, 2025

The project team provides some very good examples on their pages. I find the one below (Mochi 1, though) most impressive. I still don't fully understand how "reducing" the guidance allows for such a jump in quality, but then again, I don't fully understand transformer models. I guess too much micromanagement is never good for any intelligence, be it natural or artificial...
As soon as I have built my new workstation (as I said, I can't test it currently), I will implement it locally and run tests with this repo, STG injected.

https://junhahyung.github.io/STGuidance/assets/circle/mochi/sample2.mp4
