Merged
61 commits
253fd71
model can convert to HF and be loaded back
zucchini-nlp Sep 27, 2024
bfce946
nit
zucchini-nlp Sep 27, 2024
9f04cd9
works in single batch generation but hallucinates
zucchini-nlp Sep 28, 2024
6bfc608
use the image tokens
zucchini-nlp Oct 14, 2024
5486574
add image generation
zucchini-nlp Oct 22, 2024
7050c96
now it works
zucchini-nlp Oct 23, 2024
510ad04
add tests
zucchini-nlp Oct 24, 2024
f10f1e8
Merge remote-tracking branch 'upstream/main' into emu3
zucchini-nlp Oct 24, 2024
f25113e
update
zucchini-nlp Oct 25, 2024
dbe6b37
add modulare but it doesn't work for porting docstring :(
zucchini-nlp Oct 25, 2024
65436f1
skip some tests
zucchini-nlp Oct 25, 2024
17c5d93
Merge remote-tracking branch 'upstream/main' into emu3
zucchini-nlp Oct 25, 2024
0b26b80
add slow tests
zucchini-nlp Oct 25, 2024
9c966ac
Merge remote-tracking branch 'upstream/main' into emu3
zucchini-nlp Oct 25, 2024
2fd840c
modular removed the import?
zucchini-nlp Oct 25, 2024
468c7cb
guess this works
zucchini-nlp Oct 28, 2024
69ebfdd
Merge remote-tracking branch 'upstream/main' into emu3
zucchini-nlp Oct 28, 2024
62625ca
update
zucchini-nlp Oct 28, 2024
0c3ca61
Merge branch 'main' into emu3
zucchini-nlp Oct 29, 2024
51112c9
merge main
zucchini-nlp Nov 19, 2024
79295b8
update
zucchini-nlp Nov 19, 2024
3f7ac3b
Merge remote-tracking branch 'upstream/main' into emu3
zucchini-nlp Nov 19, 2024
e9357be
fix copies
zucchini-nlp Nov 19, 2024
ff1a353
fix test
zucchini-nlp Nov 19, 2024
75fa981
fix copies
zucchini-nlp Nov 20, 2024
378b797
update
zucchini-nlp Nov 20, 2024
6aeb36d
docs
zucchini-nlp Nov 20, 2024
c6c53ad
fix tests
zucchini-nlp Nov 20, 2024
bbe3d4c
last fix tests?
zucchini-nlp Nov 20, 2024
e3d1503
pls
zucchini-nlp Nov 20, 2024
c02587d
repo consistency
zucchini-nlp Nov 20, 2024
c341aa9
more style
zucchini-nlp Nov 20, 2024
e597f00
style
zucchini-nlp Nov 20, 2024
f35319a
remove file
zucchini-nlp Nov 20, 2024
31fc8f7
Merge branch 'main' into emu3
zucchini-nlp Nov 20, 2024
620e82b
address comments
zucchini-nlp Nov 20, 2024
4d9cff5
tiny bits
zucchini-nlp Jan 6, 2025
7440095
merge main
zucchini-nlp Jan 6, 2025
1bc1f3b
update after the new modular
zucchini-nlp Jan 7, 2025
4f13ae4
fix tests
zucchini-nlp Jan 7, 2025
80bc940
add one more cond in check attributes
zucchini-nlp Jan 7, 2025
25e387c
decompose down/up/mid blocks
zucchini-nlp Jan 8, 2025
094e754
allow static cache generation in VLMs
zucchini-nlp Jan 8, 2025
5050db4
nit
zucchini-nlp Jan 8, 2025
081a8c5
fix copies
zucchini-nlp Jan 8, 2025
783f274
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
f0c1275
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
1885532
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
d5a30b2
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
6ac924d
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
2aaab17
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
097be9c
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
d4af7c3
Update docs/source/en/model_doc/emu3.md
zucchini-nlp Jan 9, 2025
a782d0d
fix VAE upsampling
zucchini-nlp Jan 9, 2025
5821cd2
Update src/transformers/models/emu3/modular_emu3.py
zucchini-nlp Jan 10, 2025
21e0f38
address comments
zucchini-nlp Jan 10, 2025
69440ba
state overwritten stuff explicitly
zucchini-nlp Jan 10, 2025
3812687
Merge branch 'main' into emu3
zucchini-nlp Jan 10, 2025
6f57070
fix copies
zucchini-nlp Jan 10, 2025
d4bb4e4
Merge branch 'main' into emu3
zucchini-nlp Jan 10, 2025
7e42a1f
add the flag for flex attn
zucchini-nlp Jan 10, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -860,6 +860,8 @@
title: DePlot
- local: model_doc/donut
title: Donut
- local: model_doc/emu3
title: Emu3
- local: model_doc/flava
title: FLAVA
- local: model_doc/git
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -137,6 +137,7 @@ Flax), PyTorch, and/or TensorFlow.
| [EfficientFormer](model_doc/efficientformer) | ✅ | ✅ | ❌ |
| [EfficientNet](model_doc/efficientnet) | ✅ | ❌ | ❌ |
| [ELECTRA](model_doc/electra) | ✅ | ✅ | ✅ |
| [Emu3](model_doc/emu3) | ✅ | ❌ | ❌ |
| [EnCodec](model_doc/encodec) | ✅ | ❌ | ❌ |
| [Encoder decoder](model_doc/encoder-decoder) | ✅ | ✅ | ✅ |
| [ERNIE](model_doc/ernie) | ✅ | ❌ | ❌ |
179 changes: 179 additions & 0 deletions docs/source/en/model_doc/emu3.md
@@ -0,0 +1,179 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Emu3

## Overview

The Emu3 model was proposed in [Emu3: Next-Token Prediction is All You Need](https://arxiv.org/abs/2409.18869) by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.

Emu3 is a multimodal LLM that uses vector quantization to tokenize images into discrete tokens. The discretized image tokens are then fused with text token ids for image and text generation, and the model can also generate images by predicting image token ids.

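As a rough conceptual sketch of this idea (not the actual `Emu3VQVAE` API; all shapes and names below are illustrative), vector quantization replaces each latent image patch with the index of its nearest codebook entry:

```python
import torch

# Hypothetical sizes: 1024 discrete codes of dimension 64, and a 32x32 grid of latent patches.
codebook = torch.randn(1024, 64)
latents = torch.randn(32 * 32, 64)

# Assign each patch the id of its nearest codebook entry, turning the image
# into a sequence of discrete "image tokens" that an LLM can consume.
distances = torch.cdist(latents, codebook)
image_token_ids = distances.argmin(dim=-1)  # shape: (1024,)
```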

The abstract from the paper is the following:

*While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.*

Tips:

- We advise users to set `processor.tokenizer.padding_side = "left"` before batched generation, as it leads to more accurate results (see the sketch after these tips).

- Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts (see the sketch after these tips).

- Emu3 has two different checkpoints, one for image generation and one for text generation; make sure to use the correct checkpoint when loading the model. To generate an image, it is advised to use `prefix_allowed_tokens_fn` so that the generated tokens are sampled only from valid image tokens. See the usage examples below.

> [!TIP]
> The Emu3 implementation in Transformers uses a special image token to indicate where to merge image embeddings. The special image token isn't new and reuses one of the reserved tokens: `<|extra_0|>`. For correct generation, you have to add `<image>` to your prompt at the place where the image should be embedded.

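A minimal sketch of the first two tips, assuming the `Emu3-community/Emu3-Chat-hf` checkpoint used in the examples below (the exact conversation schema is defined by the checkpoint's chat template, so treat the message format here as illustrative):

```python
from transformers import Emu3Processor

processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")

# Tip 1: left-pad so each generated continuation directly follows its prompt
processor.tokenizer.padding_side = "left"

# Tip 2: let the chat template produce the prompt format the model was trained on
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
```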

This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/baaivision/Emu3).


## Usage example

### Text generation inference

Here's how to load the model and perform inference in `torch.bfloat16` precision to generate text from text-only or text-and-image inputs:

```python
from transformers import Emu3Processor, Emu3ForConditionalGeneration
import torch
from PIL import Image
import requests

processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
model = Emu3ForConditionalGeneration.from_pretrained("Emu3-community/Emu3-Chat-hf", torch_dtype=torch.bfloat16, device_map="cuda")

# prepare image and text prompt
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```

> **Reviewer comment (Collaborator):** nice to have some expected outputs!

### Image generation inference

Emu3 can also generate images from textual input. Here is how you can do it:

```python
import torch
from transformers import Emu3Processor, Emu3ForConditionalGeneration

processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Gen-hf")
model = Emu3ForConditionalGeneration.from_pretrained(
    "Emu3-community/Emu3-Gen-hf",
    torch_dtype="bfloat16",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

inputs = processor(
    text=["a portrait of young girl. masterpiece, film grained, best quality.", "a dog running under the rain"],
    padding=True,
    return_tensors="pt",
    return_for_image_generation=True,
)
inputs = inputs.to(device="cuda:0", dtype=torch.bfloat16)

neg_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."
neg_inputs = processor(text=[neg_prompt] * 2, return_tensors="pt").to(device="cuda:0")

image_sizes = inputs.pop("image_sizes")
HEIGHT, WIDTH = image_sizes[0]
VISUAL_TOKENS = model.vocabulary_mapping.image_tokens

def prefix_allowed_tokens_fn(batch_id, input_ids):
    height, width = HEIGHT, WIDTH
    visual_tokens = VISUAL_TOKENS
    image_wrapper_token_id = torch.tensor([processor.tokenizer.image_wrapper_token_id], device=model.device)
    eoi_token_id = torch.tensor([processor.tokenizer.eoi_token_id], device=model.device)
    eos_token_id = torch.tensor([processor.tokenizer.eos_token_id], device=model.device)
    pad_token_id = torch.tensor([processor.tokenizer.pad_token_id], device=model.device)
    eof_token_id = torch.tensor([processor.tokenizer.eof_token_id], device=model.device)
    eol_token_id = processor.tokenizer.encode("<|extra_200|>", return_tensors="pt")[0]

    # Everything after the image-start wrapper token is image content.
    position = torch.nonzero(input_ids == image_wrapper_token_id, as_tuple=True)[0][0]
    offset = input_ids.shape[0] - position
    if offset % (width + 1) == 0:
        # End of a row of `width` visual tokens: force the end-of-line token.
        return (eol_token_id,)
    elif offset == (width + 1) * height + 1:
        return (eof_token_id,)
    elif offset == (width + 1) * height + 2:
        return (eoi_token_id,)
    elif offset == (width + 1) * height + 3:
        return (eos_token_id,)
    elif offset > (width + 1) * height + 3:
        return (pad_token_id,)
    else:
        # Inside the image grid: sample only from the visual-token vocabulary.
        return visual_tokens

out = model.generate(
    **inputs,
    max_new_tokens=50_000,  # make sure to have enough tokens for one image
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    return_dict_in_generate=True,
    negative_prompt_ids=neg_inputs.input_ids,  # indicate for classifier-free guidance
    negative_prompt_attention_mask=neg_inputs.attention_mask,
)

image = model.decode_image_tokens(out.sequences[:, inputs.input_ids.shape[1]:], height=HEIGHT, width=WIDTH)
# Convert to float first: postprocessing goes through numpy, which doesn't support bfloat16.
images = processor.postprocess(list(image.float()), return_tensors="PIL.Image.Image")
for i, image in enumerate(images["pixel_values"]):
    image.save(f"result{i}.png")
```


## Emu3Config

[[autodoc]] Emu3Config

## Emu3VQVAEConfig

[[autodoc]] Emu3VQVAEConfig

## Emu3TextConfig

[[autodoc]] Emu3TextConfig

## Emu3Processor

[[autodoc]] Emu3Processor

## Emu3ImageProcessor

[[autodoc]] Emu3ImageProcessor
- preprocess

## Emu3VQVAE

[[autodoc]] Emu3VQVAE
- forward

## Emu3TextModel

[[autodoc]] Emu3TextModel
- forward

## Emu3ForCausalLM

[[autodoc]] Emu3ForCausalLM
- forward

## Emu3ForConditionalGeneration

[[autodoc]] Emu3ForConditionalGeneration
- forward
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -49,6 +49,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
* [DiffLlama](https://huggingface.co/docs/transformers/model_doc/diffllama#transformers.DiffLlamaModel)
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
* [Emu3](https://huggingface.co/docs/transformers/model_doc/emu3)
* [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
* [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model)
* [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)
@@ -245,6 +246,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
* [Dpr](https://huggingface.co/docs/transformers/model_doc/dpr#transformers.DprReader)
* [EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder_decoder#transformers.EncoderDecoderModel)
* [Emu3](https://huggingface.co/docs/transformers/model_doc/emu3)
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
* [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model)
30 changes: 30 additions & 0 deletions src/transformers/__init__.py
@@ -428,6 +428,12 @@
"ElectraConfig",
"ElectraTokenizer",
],
"models.emu3": [
"Emu3Config",
"Emu3Processor",
"Emu3TextConfig",
"Emu3VQVAEConfig",
],
"models.encodec": [
"EncodecConfig",
"EncodecFeatureExtractor",
@@ -1222,6 +1228,7 @@
_import_structure["models.donut"].extend(["DonutFeatureExtractor", "DonutImageProcessor"])
_import_structure["models.dpt"].extend(["DPTFeatureExtractor", "DPTImageProcessor"])
_import_structure["models.efficientnet"].append("EfficientNetImageProcessor")
_import_structure["models.emu3"].append("Emu3ImageProcessor")
_import_structure["models.flava"].extend(["FlavaFeatureExtractor", "FlavaImageProcessor", "FlavaProcessor"])
_import_structure["models.fuyu"].extend(["FuyuImageProcessor", "FuyuProcessor"])
_import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"])
@@ -2243,6 +2250,15 @@
"load_tf_weights_in_electra",
]
)
_import_structure["models.emu3"].extend(
[
"Emu3ForCausalLM",
"Emu3ForConditionalGeneration",
"Emu3PreTrainedModel",
"Emu3TextModel",
"Emu3VQVAE",
]
)
_import_structure["models.encodec"].extend(
[
"EncodecModel",
@@ -5440,6 +5456,12 @@
ElectraConfig,
ElectraTokenizer,
)
from .models.emu3 import (
Emu3Config,
Emu3Processor,
Emu3TextConfig,
Emu3VQVAEConfig,
)
from .models.encodec import (
EncodecConfig,
EncodecFeatureExtractor,
@@ -6270,6 +6292,7 @@
from .models.donut import DonutFeatureExtractor, DonutImageProcessor
from .models.dpt import DPTFeatureExtractor, DPTImageProcessor
from .models.efficientnet import EfficientNetImageProcessor
from .models.emu3 import Emu3ImageProcessor
from .models.flava import (
FlavaFeatureExtractor,
FlavaImageProcessor,
@@ -7139,6 +7162,13 @@
ElectraPreTrainedModel,
load_tf_weights_in_electra,
)
from .models.emu3 import (
Emu3ForCausalLM,
Emu3ForConditionalGeneration,
Emu3PreTrainedModel,
Emu3TextModel,
Emu3VQVAE,
)
from .models.encodec import (
EncodecModel,
EncodecPreTrainedModel,
7 changes: 4 additions & 3 deletions src/transformers/generation/utils.py
@@ -1634,17 +1634,18 @@ def _get_cache(
             cache_dtype = self.get_output_embeddings().weight.dtype
 
         def get_layer_device_map(execution_device_map: Optional[dict] = None):
+            num_hidden_layers = self.config.get_text_config().num_hidden_layers
             if execution_device_map is None:
                 return None
             elif len(execution_device_map) == 1 and "" in execution_device_map:
-                return {idx: execution_device_map[""] for idx in range(self.config.num_hidden_layers)}
+                return {idx: execution_device_map[""] for idx in range(num_hidden_layers)}
             layer_device_map = {}
             for layer in execution_device_map:
-                for idx in range(self.config.num_hidden_layers):
+                for idx in range(num_hidden_layers):
                     if f".{idx}." in f"{layer}.":
                         layer_device_map[idx] = execution_device_map[layer]
                         break
-            for idx in range(self.config.num_hidden_layers):
+            for idx in range(num_hidden_layers):
                 if idx not in layer_device_map:
                     raise RuntimeError(f"layer {idx} has not been mapped to a device.")
             return layer_device_map
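For context, the change above reads the layer count from `self.config.get_text_config().num_hidden_layers` instead of `self.config.num_hidden_layers`, so multimodal composite configs such as Emu3's (where the decoder hyperparameters live on a nested text config) resolve correctly. A minimal sketch of the distinction, assuming default config values:

```python
from transformers import Emu3Config

config = Emu3Config()
# The top-level composite config may not expose num_hidden_layers itself;
# get_text_config() returns the nested Emu3TextConfig, which does.
num_hidden_layers = config.get_text_config().num_hidden_layers
```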
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -86,6 +86,7 @@
dpt,
efficientnet,
electra,
emu3,
encodec,
encoder_decoder,
ernie,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -103,6 +103,7 @@
("efficientformer", "EfficientFormerConfig"),
("efficientnet", "EfficientNetConfig"),
("electra", "ElectraConfig"),
("emu3", "Emu3Config"),
("encodec", "EncodecConfig"),
("encoder-decoder", "EncoderDecoderConfig"),
("ernie", "ErnieConfig"),
@@ -420,6 +421,7 @@
("efficientformer", "EfficientFormer"),
("efficientnet", "EfficientNet"),
("electra", "ELECTRA"),
("emu3", "Emu3"),
("encodec", "EnCodec"),
("encoder-decoder", "Encoder decoder"),
("ernie", "ERNIE"),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -499,6 +499,7 @@
("dbrx", "DbrxForCausalLM"),
("diffllama", "DiffLlamaForCausalLM"),
("electra", "ElectraForCausalLM"),
("emu3", "Emu3ForCausalLM"),
("ernie", "ErnieForCausalLM"),
("falcon", "FalconForCausalLM"),
("falcon_mamba", "FalconMambaForCausalLM"),
@@ -800,6 +801,7 @@
("blip", "BlipForConditionalGeneration"),
("blip-2", "Blip2ForConditionalGeneration"),
("chameleon", "ChameleonForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
("fuyu", "FuyuForCausalLM"),
("git", "GitForCausalLM"),
("idefics", "IdeficsForVisionText2Text"),
@@ -1428,6 +1430,7 @@
("deberta-v2", "DebertaV2Model"),
("distilbert", "DistilBertModel"),
("electra", "ElectraModel"),
("emu3", "Emu3TextModel"),
("flaubert", "FlaubertModel"),
("ibert", "IBertModel"),
("longformer", "LongformerModel"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -59,6 +59,7 @@
("clipseg", "CLIPSegProcessor"),
("clvp", "ClvpProcessor"),
("colpali", "ColPaliProcessor"),
("emu3", "Emu3Processor"),
("flava", "FlavaProcessor"),
("fuyu", "FuyuProcessor"),
("git", "GitProcessor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -186,6 +186,7 @@
),
),
("electra", ("ElectraTokenizer", "ElectraTokenizerFast" if is_tokenizers_available() else None)),
("emu3", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("ernie", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("ernie_m", ("ErnieMTokenizer" if is_sentencepiece_available() else None, None)),
("esm", ("EsmTokenizer", None)),
1 change: 1 addition & 0 deletions src/transformers/models/chameleon/processing_chameleon.py
@@ -62,6 +62,7 @@ class ChameleonProcessor(ProcessorMixin):

attributes = ["image_processor", "tokenizer"]
tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
valid_kwargs = ["image_seq_length", "image_token"]
image_processor_class = "ChameleonImageProcessor"

def __init__(self, image_processor, tokenizer, image_seq_length: int = 1024, image_token: str = "<image>"):