
[Feature] Generalize STA kernel to work for any sequence length #225

Open
alexarmbr opened this issue Feb 28, 2025 · 0 comments
alexarmbr commented Feb 28, 2025

Hi, first of all thank you for the awesome work!!

I have been trying to get STA working with Wan2.1, which requires digging into the code, and I am wondering why the STA kernel only supports a sequence length of 115456 with text and 82994 without. I was looking at the kernel to try to figure out why, but it is not immediately obvious. If I comment out these assertions and run with different sequence lengths, the kernel still runs but seems less accurate with respect to the flex_attention baseline.
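
For reference, this is roughly how I am checking accuracy against the flex_attention baseline. The shapes, window size, and the `sliding_tile_attention` entry point below are placeholders for illustration, not the exact code from this repo:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative shapes only; the real run uses the full video sequence length (82994 / 115456).
B, H, S, D = 1, 24, 4096, 128
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

def window_mask(b, h, q_idx, kv_idx):
    # Stand-in 1D local window for this sketch; the actual STA mask is tile-based.
    return (q_idx - kv_idx).abs() <= 512

block_mask = create_block_mask(window_mask, None, None, S, S, device="cuda")
ref = flex_attention(q, k, v, block_mask=block_mask)

# Placeholder call into the STA kernel with the sequence-length asserts commented out;
# the real import path and signature come from this repo.
out = sliding_tile_attention(q, k, v)  # noqa: F821

print("max abs diff :", (out - ref).abs().max().item())
print("mean abs diff:", (out - ref).abs().mean().item())
```

With the fixed supported sequence lengths the two outputs match closely, but once I run other lengths the difference grows, which is what makes me think the kernel has the supported lengths baked in somewhere.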
