Emu3: add model #33770
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
@add_start_docstrings(
    "The Emu3 Text Model which consists of transformer with self attention layers.",
    EMU3_START_DOCSTRING,
)
class Emu3TextModel(Emu3PreTrainedModel):
    config_class = Emu3TextConfig
```
Adding `LlamaModel` to the bases messes up the auto-generated modeling file: it adds new classes like `Emu3TextAttention` and so on, while we already have `Emu3Attention`.
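A hypothetical fragment of what the modular file would look like with that base, to illustrate the clash (class names follow the discussion above; the exact file contents are an assumption, not the merged code):

```python
# modular_emu3.py (hypothetical fragment)
from transformers.models.llama.modeling_llama import LlamaModel


class Emu3TextModel(LlamaModel):
    # The modular converter expands this inheritance into generated,
    # renamed copies of Llama's internals, e.g. Emu3TextAttention,
    # which would clash with the hand-written Emu3Attention.
    pass
```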
I think this should be solved by #34487!
To try again!
I think this is ready for review. @ArthurZucker, will you be reviewing, or is there anyone I can tag for initial review? Btw, the repo consistency tests will fail because the modular doesn't import
You can tag @Cyrilvallez!
Thanks a lot, great work! With the new modular version (#34487) I think we can still improve a bit! It should be merged very soon, but this is already very nice imo if you don't want to wait 🤗
Waiting for the updates regarding @Cyrilvallez's PR, will review again once updated
heh, is something wrong with code owners?
Thanks :)
Yeah, seems like it automatically tags all code owners depending on files touched/created... @ArthurZucker, would be nice to not tag that many people at once
Nice, thanks for iterating! My only comment is that I have not personally looked enough at the MIMI or the VQVAE from Chameleon; you would know better, but the more standard the better! A few nits but good to go IMO.
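For readers following along, a toy sketch of the vector-quantization step at the core of a Chameleon-style VQ-VAE (an illustration, not Emu3's actual quantizer): encoder features are snapped to their nearest codebook entries, and the resulting indices are the discrete image tokens the text model consumes.

```python
import torch
import torch.nn as nn


class MiniVectorQuantizer(nn.Module):
    """Toy VQ bottleneck: maps encoder features to nearest codebook vectors."""

    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z: (batch, seq, dim); distance from each feature to every code
        dists = torch.cdist(z, self.codebook.weight)  # (batch, seq, num_codes)
        indices = dists.argmin(dim=-1)                # discrete image tokens
        quantized = self.codebook(indices)            # nearest codebook vectors
        # straight-through estimator so gradients flow back to the encoder
        quantized = z + (quantized - z).detach()
        return quantized, indices
```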
```python
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```
nice to have some expected outputs!
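For example, doc snippets in transformers often pin the decoded text as a trailing comment; the string below is a placeholder, not a captured Emu3 output:

```python
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
# Expected output (placeholder): "USER: What is shown in this image? ASSISTANT: ..."
```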
Let's merge 🚀
Squashed commit history:

* model can convert to HF and be loaded back
* nit
* works in single batch generation but hallucinates
* use the image tokens
* add image generation
* now it works
* add tests
* update
* add modular but it doesn't work for porting docstring :(
* skip some tests
* add slow tests
* modular removed the import?
* guess this works
* update
* update
* fix copies
* fix test
* fix copies
* update
* docs
* fix tests
* last fix tests?
* pls
* repo consistency
* more style
* style
* remove file
* address comments
* tiny bits
* update after the new modular
* fix tests
* add one more cond in check attributes
* decompose down/up/mid blocks
* allow static cache generation in VLMs
* nit
* fix copies
* Update docs/source/en/model_doc/emu3.md (8 commits, co-authored by Steven Liu)
* fix VAE upsampling
* Update src/transformers/models/emu3/modular_emu3.py (co-authored by Arthur)
* address comments
* state overwritten stuff explicitly
* fix copies
* add the flag for flex attn

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
What does this PR do?
As per title. The code works for generating text in single-batch scenarios, but the generated text doesn't match the input image. For batched generation, it seems the original implementation doesn't support it either, mostly because image features from the processor are returned with different shapes (a smart resize that preserves as much of the original image size as possible). We can try padding similar to llava-next, but I am not sure it will just work; I'll contact the authors.
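For reference, a minimal sketch (an assumption for illustration, not this PR's code) of llava-next-style padding: pad each image's feature map to the batch-wide maximum spatial size so the tensors can be stacked.

```python
import torch
import torch.nn.functional as F


def pad_and_stack(features: list[torch.Tensor], pad_value: float = 0.0) -> torch.Tensor:
    """Pad (C, H, W) feature maps of different sizes to a common shape."""
    max_h = max(f.shape[-2] for f in features)
    max_w = max(f.shape[-1] for f in features)
    padded = []
    for f in features:
        pad_h, pad_w = max_h - f.shape[-2], max_w - f.shape[-1]
        # F.pad pads the last dim first: (left, right, top, bottom)
        padded.append(F.pad(f, (0, pad_w, 0, pad_h), value=pad_value))
    return torch.stack(padded)
```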
TODO:
- change the `extra-0` placeholder token to something like `<image>`

And for image generation:
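A minimal sketch of how that token change could look, assuming the standard tokenizers API; the checkpoint path is a placeholder and the exact hook in the conversion script is an assumption:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint id; the real conversion step is an assumption.
tokenizer = AutoTokenizer.from_pretrained("path/to/emu3-checkpoint")

# Register "<image>" as a special token so the processor can emit it
# where the original checkpoint used the "extra-0" placeholder.
tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
```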