Add Finegrained FP8 #11647
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Just for bookkeeping, relaying stuff from our DM. I had to make the following changes to make this PR work:
diff --git a/src/diffusers/models/modeling_utils.py b/src/diffusers/models/modeling_utils.py
index 638c5fbfb..737525143 100644
--- a/src/diffusers/models/modeling_utils.py
+++ b/src/diffusers/models/modeling_utils.py
@@ -1238,8 +1238,8 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
}
# Dispatch model with hooks on all devices if necessary
- print(model.transformer_blocks[0].attn.to_q.weight)
- print(model.transformer_blocks[0].attn.to_q.weight_scale_inv)
+ # print(model.transformer_blocks[0].attn.to_q.weight)
+ # print(model.transformer_blocks[0].attn.to_q.weight_scale_inv)
if device_map is not None:
device_map_kwargs = {
"device_map": device_map,
diff --git a/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py b/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py
index 5dec8b0b8..7212befcd 100644
--- a/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py
+++ b/src/diffusers/quantizers/finegrained_fp8/finegrained_fp8_quantizer.py
@@ -90,9 +90,9 @@ class FinegrainedFP8Quantizer(DiffusersQuantizer):
Quantizes weights to FP8 format using Block-wise quantization
"""
# print("############ create quantized param ########")
- from accelerate.utils import set_module_tensor_to_device
+ # from accelerate.utils import set_module_tensor_to_device
- set_module_tensor_to_device(model, param_name, target_device, param_value)
+ # set_module_tensor_to_device(model, param_name, target_device, param_value)
module, tensor_name = get_module_from_name(model, param_name)
@@ -131,8 +131,8 @@ class FinegrainedFP8Quantizer(DiffusersQuantizer):
scale = scale.reshape(scale_orig_shape).squeeze().reciprocal()
# Load into the model
- module._buffers[tensor_name] = quantized_param.to(target_device)
- module._buffers["weight_scale_inv"] = scale.to(target_device)
+ module._parameters[tensor_name] = quantized_param.to(target_device)
+ module._parameters["weight_scale_inv"] = scale.to(target_device)
# print("_buffers[0]", module._buffers["weight_scale_inv"])
def check_if_quantized_param(
Inference code:
import torch
from diffusers import FluxPipeline, AutoModel, FinegrainedFP8Config
from diffusers.quantizers.finegrained_fp8.utils import FP8Linear

model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16
quantization_config = FinegrainedFP8Config(
    modules_to_not_convert=["norm", "proj_out", "x_embedder"],  # weight_block_size=(32, 32)
)
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
    device_map="cuda",
)
pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

for name, module in pipe.transformer.named_modules():
    if isinstance(module, FP8Linear) and getattr(module, "weight_scale_inv", None) is not None:
        if module.weight_scale_inv.ndim == 1:
            print(name, module.weight_scale_inv.shape)

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
Thanks for starting this! Would be nice to also have some benchmarks:
- With and without finegrained FP8 quant (with visual outputs)
- With and without torch.compile
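A minimal sketch of how such a comparison could be timed, assuming each variant (quantized or not, compiled or not) is wrapped in a zero-argument callable. `benchmark` and `run_variant` are hypothetical names, not part of diffusers, and the workload is a stand-in. For GPU runs the device would also need to be synchronized before each clock read (for example with `torch.cuda.synchronize()`), which this stdlib-only sketch leaves out.

```python
import statistics
import time


def benchmark(fn, warmup=2, iters=5):
    """Return the median wall-clock time of fn() in seconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-off costs such as compilation
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def run_variant():
    # Stand-in workload; replace with a call to the pipeline variant under test.
    return sum(i * i for i in range(10_000))


results = {"baseline": benchmark(run_variant)}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

Reporting the median rather than the mean keeps a single slow iteration from skewing the comparison.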
class FinegrainedFP8Quantizer(DiffusersQuantizer):
    """
    FP8 quantization implementation supporting both standard and MoE models.
    Supports both e4m3fn formats based on platform.
Can we expand on this a bit? What are both e4m3fn formats? How does that vary depending on the platform?
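One possible reading, stated here as an assumption: PyTorch exposes two 8-bit E4M3 dtypes, `torch.float8_e4m3fn` and `torch.float8_e4m3fnuz`, and the docstring may be referring to these. The sketch below derives the largest finite value of each from its bit layout in plain Python. Which variant the quantizer selects on which platform is not stated in this thread. `max_finite_e4m3` is an illustrative helper name.

```python
def max_finite_e4m3(bias, top_mantissa):
    """Largest finite value of a format with 4 exponent bits and 3 mantissa bits."""
    max_exponent_field = 0b1111
    return 2.0 ** (max_exponent_field - bias) * (1 + top_mantissa / 8)


# e4m3fn: exponent bias 7. The all-ones pattern (exponent 1111, mantissa 111)
# encodes NaN, so the largest usable mantissa at the top exponent is 110.
e4m3fn_max = max_finite_e4m3(bias=7, top_mantissa=0b110)

# e4m3fnuz: exponent bias 8. NaN is encoded as "negative zero", so mantissa 111
# at the top exponent is still a finite number.
e4m3fnuz_max = max_finite_e4m3(bias=8, top_mantissa=0b111)

print(e4m3fn_max, e4m3fnuz_max)
```

The two formats therefore have different dynamic ranges, so a scale computed against one maximum would not be directly reusable with the other.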
# Load into the model
module._parameters[tensor_name] = quantized_param.to(target_device)
module._parameters["weight_scale_inv"] = scale.to(target_device)
Do we have to tackle buffers as well?
What do you mean by that? Aren’t weights usually just parameters?
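For reference, a small sketch of how PyTorch separates the two kinds of module state. `TinyLinear` is a made-up module, not the PR's `FP8Linear`, and it uses float32 tensors for simplicity. It only illustrates the distinction being asked about and does not settle which registry the quantizer should use.

```python
import torch
from torch import nn


class TinyLinear(nn.Module):
    """Made-up module holding one parameter and one buffer."""

    def __init__(self):
        super().__init__()
        # Registered as a parameter: visible to named_parameters() and optimizers.
        self.weight = nn.Parameter(torch.zeros(4, 4), requires_grad=False)
        # Registered as a buffer: saved and moved with the module, but not a parameter.
        self.register_buffer("weight_scale_inv", torch.ones(1, 1))


m = TinyLinear()
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = sorted(m.state_dict().keys())
print(param_names, buffer_names, state_keys)
```

Both kinds of tensor end up in `state_dict()` and follow `.to(device)`, so code that loads or checks quantized weights may need to look at whichever registry the tensors were written to.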
def _check_serialization_expected_slice(self, expected_slice, device):
    quantized_model = self.get_dummy_model(device)

    with tempfile.TemporaryDirectory() as tmp_dir:
        quantized_model.save_pretrained(tmp_dir, safe_serialization=False)
        loaded_quantized_model = FluxTransformer2DModel.from_pretrained(
            tmp_dir, torch_dtype=torch.bfloat16, use_safetensors=False
        ).to(device=torch_device)

    inputs = self.get_dummy_tensor_inputs(torch_device)
    output = loaded_quantized_model(**inputs)[0]

    output_slice = output.flatten()[-9:].detach().float().cpu().numpy()

    self.assertTrue(numpy_cosine_similarity_distance(output_slice, expected_slice) < 1e-3)
I think instead of delegating certain calls to other methods, we can have all of the implementations under this one. This way, everything remains self-contained. Furthermore, since this test class doesn't have other tests, we don't have to modularize too much.
WDYT?
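For context on the assertion in the snippet above, here is a plain-Python sketch of a cosine-distance check on a nine-element slice. It assumes the test helper computes one minus the cosine similarity; `cosine_distance` is an illustrative name, not the diffusers utility, and the numbers are made up.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up reference slice and an "observed" slice that is uniformly 1% larger.
expected = [0.10, -0.20, 0.30, 0.05, -0.15, 0.25, 0.40, -0.35, 0.12]
observed = [value * 1.01 for value in expected]

distance = cosine_distance(observed, expected)
print(f"cosine distance: {distance:.2e}")
```

Because the measure depends only on direction, a uniform rescaling of the output passes the `< 1e-3` threshold; that may or may not be the intended sensitivity for a quantization test.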
text_encoder = CLIPTextModel.from_pretrained(
    model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16, cache_dir=cache_dir
)
text_encoder_2 = T5EncoderModel.from_pretrained(
    model_id, subfolder="text_encoder_2", torch_dtype=torch.bfloat16, cache_dir=cache_dir
)
tokenizer = CLIPTokenizer.from_pretrained(
    model_id, subfolder="tokenizer", cache_dir=cache_dir
)
tokenizer_2 = AutoTokenizer.from_pretrained(
    model_id, subfolder="tokenizer_2", cache_dir=cache_dir
)
vae = AutoencoderKL.from_pretrained(
    model_id, subfolder="vae", torch_dtype=torch.bfloat16, cache_dir=cache_dir
)
scheduler = FlowMatchEulerDiscreteScheduler()
I don't think we need to initialize these components like this.
For example, if we do:
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map=torch_device,
)
pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16).to("cuda")
It should work. It's simpler and I would prefer this method.
Sure, will change that.
# A difference of 0.06 in normalized pixel space (-1 to 1), corresponds to a difference of
# 0.06 / 2 * 255 = 7.65 in pixel space (0 to 255). On our CI runners, the difference is about 0.04,
# on DGX it is 0.06, and on audace it is 0.037. So, we are using a tolerance of 0.06 here.
self.assertTrue(np.allclose(output, loaded_output, atol=0.06))
Can we reduce this tolerance?
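To make the trade-off concrete, the arithmetic from the code comment can be reproduced for each figure it quotes. This is only a worked conversion of the numbers already in the comment; `normalized_to_pixel_diff` is an illustrative helper name.

```python
def normalized_to_pixel_diff(diff, low=-1.0, high=1.0, levels=255):
    """Convert a difference in [low, high] space to 0..levels pixel units."""
    return diff / (high - low) * levels


# Differences quoted in the test comment, in normalized (-1 to 1) space.
observed = {"tolerance / DGX": 0.06, "CI runners": 0.04, "audace": 0.037}

pixel_diffs = {name: normalized_to_pixel_diff(value) for name, value in observed.items()}
for name, value in pixel_diffs.items():
    print(f"{name}: {value:.2f} of 255 levels")
```

Any tolerance below the largest observed difference (0.06 on DGX, per the comment) would fail on that machine, so tightening it presumably depends on where the test is expected to run.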
…izer.py Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
What does this PR do?
Adds finegrained FP8
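As a rough illustration of what block-wise FP8 scaling involves, here is a plain-Python sketch under stated assumptions: each tile gets its own scale so that its largest magnitude maps to the FP8 maximum (448 for e4m3fn), and the reciprocal of that scale is stored, mirroring the `weight_scale_inv` name in the diff. The 2x2 block size, the helper names, and the omission of the actual rounding to 8 bits are all simplifications, not the PR's implementation.

```python
FP8_E4M3FN_MAX = 448.0  # largest finite e4m3fn value (assumed target format)


def quantize_blockwise(weight, block=2):
    """Scale each block x block tile to the FP8 range; return (scaled, scale_inv)."""
    rows, cols = len(weight), len(weight[0])
    scaled = [[0.0] * cols for _ in range(rows)]
    scale_inv = []
    for r0 in range(0, rows, block):
        scale_row = []
        for c0 in range(0, cols, block):
            tile = [
                (r, c)
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            # Fall back to 1.0 for an all-zero tile to avoid dividing by zero.
            amax = max(abs(weight[r][c]) for r, c in tile) or 1.0
            scale = FP8_E4M3FN_MAX / amax
            for r, c in tile:
                scaled[r][c] = weight[r][c] * scale  # rounding to FP8 is omitted
            scale_row.append(1.0 / scale)  # store the inverse scale per tile
        scale_inv.append(scale_row)
    return scaled, scale_inv


def dequantize_blockwise(scaled, scale_inv, block=2):
    """Multiply each element by the inverse scale of the tile it belongs to."""
    return [
        [value * scale_inv[r // block][c // block] for c, value in enumerate(row)]
        for r, row in enumerate(scaled)
    ]


weight = [[0.5, -1.0, 10.0, 2.0], [0.25, 0.75, -4.0, 8.0]]
scaled, scale_inv = quantize_blockwise(weight)
restored = dequantize_blockwise(scaled, scale_inv)
print(scale_inv)
```

Per-tile scales keep a single large value (such as the 10.0 above) from compressing the resolution available to the rest of the matrix, which is the usual motivation for block-wise over per-tensor scaling.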