Add `TextToImage`, `StableDiffusion3Backbone` and `StableDiffusion3TextToImage` #1816

james77777778 · 2024-09-10T08:58:04Z

This is more of a draft, as we may need further discussion regarding the implementation.

Notes for reviewing:

There are several small modifications in this PR to ensure the numerical stability of all modules/layers (e.g., LayerNormalization and Softmax should run in float32).
StableDiffusion3Backbone is a large model that includes all necessary submodules, resulting in a very long init signature. Is this acceptable? How could we refactor it?
I have figured out how to drop T5 (another 5B model): simply skip it and zero-pad the embeddings from CLIP models.
Some ideas were borrowed from https://github.com/huggingface/diffusers , which helped simplify the implementation.
Defining a functional model in StableDiffusion3Backbone or StableDiffusion3TextToImage is challenging for me. It may be unnecessary for pure inference purposes.
I couldn't compile the entire text_to_image function due to unexpected OOM issues. However, when I split it into encode, denoise and decode functions, it worked fine.
I have written a rough but functional script that can convert the weights directly from https://huggingface.co/stabilityai/stable-diffusion-3-medium. Please refer to the colab.
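
As a rough illustration of the float32 note above, the following is a minimal NumPy sketch of a softmax that upcasts half-precision inputs before the reduction and casts back afterwards. It is not the code from this PR; the function name `softmax_in_float32` is hypothetical, and NumPy stands in for the Keras ops used in the actual layers.

```python
import numpy as np


def softmax_in_float32(logits, axis=-1):
    """Softmax computed in float32, returned in the input's dtype.

    Hypothetical helper for illustration only; not part of this PR.
    """
    original_dtype = logits.dtype
    # Upcast so that exp() and the sum are accumulated in float32.
    x = logits.astype(np.float32)
    # Subtract the per-row max so the largest exponent is exp(0) = 1.
    x = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(x)
    probs = exp / exp.sum(axis=axis, keepdims=True)
    # Cast back so downstream layers keep running in the original dtype.
    return probs.astype(original_dtype)


# Usage: half-precision attention logits go in, half-precision weights come out.
logits = np.array([[10.0, 9.0, -5.0], [0.0, 0.0, 0.0]], dtype=np.float16)
probs = softmax_in_float32(logits)
```

The same upcast-then-downcast pattern applies to layer normalization, where the mean and variance would be computed in float32.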

Demo colab:
https://colab.research.google.com/drive/1rrQMs0nlKSEzYNhIJChQwgnrZNiydexS?usp=sharing
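
One plausible reading of the "drop T5" note above is sketched here in NumPy: the two CLIP sequence embeddings are concatenated along the feature axis and zero-padded up to the width the T5 embeddings would have had. The widths (768, 1280, 4096), the sequence length of 77, and the helper name `pad_clip_embeddings` are assumptions for illustration and are not taken from this PR.

```python
import numpy as np

# Assumed feature widths; not stated in the PR description.
CLIP_L_DIM = 768
CLIP_G_DIM = 1280
T5_DIM = 4096


def pad_clip_embeddings(clip_l, clip_g, target_dim=T5_DIM):
    """Concatenate two CLIP embeddings and zero-pad the feature axis.

    Hypothetical helper. Inputs have shape (batch, seq_len, features).
    """
    joint = np.concatenate([clip_l, clip_g], axis=-1)
    pad_width = target_dim - joint.shape[-1]
    if pad_width < 0:
        raise ValueError("target_dim is smaller than the joint CLIP width")
    # Pad only the last (feature) axis, on the right, with zeros.
    return np.pad(joint, [(0, 0), (0, 0), (0, pad_width)])


# Usage with dummy embeddings for one prompt of 77 tokens.
clip_l = np.ones((1, 77, CLIP_L_DIM), dtype=np.float32)
clip_g = np.ones((1, 77, CLIP_G_DIM), dtype=np.float32)
context = pad_clip_embeddings(clip_l, clip_g)
```

With this approach the T5 encoder never needs to be loaded, at the cost of losing whatever the T5 embeddings would have contributed to the conditioning.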

Example prompts: `"a cat holding a sign that says hello world"` and `"cute wallpaper art of a cat"`

TODO:

Rename model folder to stable_diffusion_3
Add docstrings
Add weight conversion script
Add tests

@divyashreepathihalli @mattdangerw @SamanehSaadat

BTW, I will be unavailable from 9/17 to 9/22.

* Agg Vgg16 backbone * update names * update tests * update test * add image classifier * incorporate review comments * Update test case * update backbone test * add image classifier * classifier cleanup * code reformat * add vgg16 image classifier * make vgg generic * update doc string * update docstring * add classifier test * update tests * update docstring * address review comments * code reformat * update the configs * address review comments * fix task saved model test * update init * code reformatted


…Backbone` (keras-team#1769) * Add FeaturePyramidBackbone and update ResNetBackbone * Simplify the implementation * Fix CI * Make ResNetBackbone compatible with timm and add FeaturePyramidBackbone * Add conversion implementation * Update docstrings * Address comments


* add vit det vit_det_backbone * update docstring * code reformat * fix tests * address review comments * bump year on all files * address review comments * rename backbone * fix tests * change back to ViT * address review comments * update image shape


This is a temporary way to test out the keras-hub branch. - Does a global rename of all symbols during package build. - Registers the "old" name on symbol export for saving compat. - Adds a github action to publish every commit to keras-hub as a new package. - Removes our descriptions on PyPI temporarily, until we want to message this more broadly.

* Add `CLIPTokenizer`, `T5XXLTokenizer`, `CLIPTextEncoder` and `T5XXLTextEncoder`. * Make CLIPTextEncoder as Backbone * Add `T5XXLPreprocessor` and remove `T5XXLTokenizer` Add `CLIPPreprocessor` * Use `tf = None` at the top * Replace manual implementation of `CLIPAttention` with `MultiHeadAttention`

* Bounding box utils * - Correct test cases * - Remove hard tensorflow dtype * - fix api gen * - Fix import for test cases - Use setup for converters test case * - fix api_gen issue * - FIx api gen * - Fix api gen error * - Correct test cases as per new api changes

* mobilenet_v3 added in keras-nlp * minor bug fixed in mobilenet_v3_backbone * formatting corrected * refactoring backbone * correct_pad_downsample method added * refactoring backbone * parameters updated * Testcaseupdated, expected output shape corrected * code formatted with black * testcase updated * refactoring and description added * comments updated * added mobilenet v1 and v2 * merge conflict resolved * version arg removed, and config options added * input_shape changed to image_shape in arg * config updated * input shape corrected * comments resolved * activation function format changed * minor bug fixed * minor bug fixed * added vision_backbone_test * channel_first bug resolved * channel_first cases working * comments resolved * formatting fixed * refactoring --------- Co-authored-by: ushareng <usha.rengaraju@gmail.com>

* migrating efficientnet models to keras-hub * merging changes from other sources * autoformatting pass * initial consolidation of efficientnet_backbone * most updates and removing separate implementation * cleanup, autoformatting, keras generalization * removed layer examples outside of effiicient net * many, mainly documentation changes, small test fixes


* - Add formats, iou, utils for bounding box * - Add `AnchorGenerator`, `BoxMatcher` and `NonMaxSupression` layers * - Remove scope_name not required. * use default keras name scope * - Correct format error * - Remove layers as of now and keep them at model level till keras core supports them * - Correct api_gen


james77777778 · 2024-09-11T00:10:32Z

The commit history seems chaotic. I will try to rebase it today.

james77777778 · 2024-09-11T02:29:05Z

Since the keras-hub branch was force-pushed, I was unable to rebase it. Therefore, I submitted a new PR for SD3:
#1820

divyashreepathihalli and others added 25 commits August 12, 2024 17:17

Add ResNetBackbone and ResNetImageClassifier (keras-team#1765)

73b7bad

* Add ResNetV1 and ResNetV2 * Address comments

Add CSP DarkNet backbone and classifier (keras-team#1774)

26afc7e

* Add CSP DarkNet * Add CSP DarkNet * snake_case function names * change use_depthwise to block_type

Add DenseNet (keras-team#1775)

9860756

* Add DenseNet * fix testcase * address comments * nit * fix lint errors * move description

Merge remote-tracking branch 'upstream/master' into keras-hub

ececd14

Add Mix transformer (keras-team#1780)

fc485d6

* Add MixTransformer * fix testcase * test changes and comments * lint fix * update config list * modify testcase for 2 layers

update input_image_shape -> image_shape (keras-team#1785)

2797851

* update input_image_shape -> image_shape * update docstring example * code reformat * update tests

Create __init__.py (keras-team#1788)

18f8880

add missing __init__ file to vit_det

Add the ResNet_vd backbone (keras-team#1766)

be8888d

* Add ResNet_vd to ResNet backbone * Addressed requested parameter changes * Fixed tests and updated comments * Added new parameters to docstring

Add VAEImageDecoder for StableDiffusionV3 (keras-team#1796)

536474a

* Add `VAEImageDecoder` for StableDiffusionV3 * Use `keras.Model` for `VAEImageDecoder` and follows the coding style in `VAEAttention`

Replace Backbone with keras.Model in CLIPTextEncoder and `T5XXL…

0fbd84b

…TextEncoder` (keras-team#1802)

Add pyramid output for densenet, cspDarknet (keras-team#1801)

9143468

* add pyramid outputs * fix testcase * format fix * make common testcase for pyramid outputs * change default shape * simplify testcase * test case change and add channel axis

Add MMDiT for StableDiffusionV3 (keras-team#1806)

791d7f6

* Add `MMDiT` * Update * Update * Update implementation

fix timm conversion script

439f0a5

Revert "fix timm conversion script"

dbaeabd

This reverts commit 439f0a5.

Fix timm conversion for rersnet (keras-team#1814)

49def20

Add TextToImage and StableDiffusion3TextToImage

52e90a5

james77777778 force-pushed the add-sdv3 branch from 5975c2a to 52e90a5 Compare September 10, 2024 09:22

divyashreepathihalli added the kokoro:force-run Runs Tests on GPU label Sep 10, 2024

kokoro-team removed the kokoro:force-run Runs Tests on GPU label Sep 10, 2024

mattdangerw force-pushed the keras-hub branch 2 times, most recently from 1826dce to 753047d Compare September 11, 2024 00:01

james77777778 closed this Sep 11, 2024

james77777778 deleted the add-sdv3 branch October 3, 2024 04:29
