environment issue #44

Open

ParkSungHin opened this issue Dec 3, 2024 · 3 comments

@ParkSungHin
I'm using a 3090 GPU, and everything else installed fine except for the Detectron2 package. Is Detectron2 a must-have dependency?

@haoningwu3639
Owner

Sorry for the confusion.
When exporting the environment, I included all the libraries I commonly use.
However, for the StoryGen project, Detectron2 is not a necessary dependency and can be excluded.
In fact, you only need to focus on the key libraries and their versions: torch, diffusers, accelerate, xformers, and transformers.
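A quick way to verify those key libraries, sketched here as a minimal check (not part of the StoryGen repo), is to print the installed versions and compare them against the exported environment file:

```python
# Minimal environment sanity check (illustrative, not from the repo):
# print installed versions of the key libraries so they can be compared
# against the versions pinned in the exported environment file.
import importlib.metadata as md

for pkg in ["torch", "diffusers", "accelerate", "xformers", "transformers"]:
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```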

@ParkSungHin
Author

ParkSungHin commented Dec 8, 2024

Thank you for your kind response. Thanks to your support, I was able to run the experiment successfully. However, I have a few questions about the experimental results that I hope you can help with.

To reproduce your work, I trained the model using only the StorySalon dataset (ebook) available on Hugging Face, on eight NVIDIA 3090 GPUs, following the exact code provided on GitHub. The inference.py script was also run unmodified.

From the results, it seems that the model somewhat recognizes "cat," but it struggles to recognize "The black-haired man."

[Attached images: boy2, whitecat2, infer_result]

I am curious whether the root cause of this issue is:

  1. a lack of sufficient training data;
  2. the warning shown below, which occurred during training; or
  3. the modifications I made to the stage2_config file to enable multi-GPU training.

Could you help identify the most likely cause?

The warning:

```
2024-12-07 09:46:46,212 model.pipeline [WARNING] - The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['dere banunii " suggests they are in a remote or isolated location, possibly in a desert or mountainous area. the phrase " aamtya a lee kaapi j 1 1 nana joree " could refer to a specific event or situation that led to their current situation.']
```


stage2_config.yml:

```yaml
validation_sample_logger:
  num_inference_steps: 20
  guidance_scale: 5
gradient_accumulation_steps: 24
train_steps: 50000
train_batch_size: 4
validation_steps: 500
checkpointing_steps: 1000
seed: 6666
mixed_precision: 'fp16'
learning_rate: 1e-5
scale_lr: false
lr_scheduler: cosine
lr_warmup_steps: 500
use_8bit_adam: true
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1.0e-08
max_grad_norm: 0.5
```
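For reference, a back-of-the-envelope sketch of the effective batch size implied by this config, assuming eight GPUs as described above and the usual Accelerate semantics where data parallelism and gradient accumulation both multiply the per-GPU batch (the exact behavior depends on the training script):

```python
# Rough effective-batch-size arithmetic for the config above (illustrative).
train_batch_size = 4               # per-GPU batch size, from stage2_config.yml
gradient_accumulation_steps = 24   # from stage2_config.yml
num_gpus = 8                       # assumption: eight 3090s, as described above

effective_batch = train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 768 samples per optimizer step
```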

@haoningwu3639
Owner

Sorry for the late reply.
For the questions you raised, I have the following suggestions:

  1. First of all, introducing more and higher-quality data will improve generation quality. Because this project is relatively old, the data quality of our StorySalon is not very competitive with current related work, but our proposed data-processing pipeline can help scale up high-quality data.
  2. The warning means your text prompt is too long: CLIP's text encoder supports inputs of at most 77 tokens, and anything beyond that is truncated, which may also introduce semantic problems in the text embedding (see the tokenizer sketch after this list).
  3. The changes you made to the config will not impact quality.
  4. The model we proposed is better at single-object-driven generation. It has shown some ability to generate multi-object combinations, but due to the lack of dedicated training, multi-object compositional generation will be worse than single-object generation.
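A minimal sketch of how to check a prompt against CLIP's 77-token limit before inference, assuming the standard CLIP tokenizer from transformers (the checkpoint name here is illustrative, not necessarily the one StoryGen loads):

```python
# Count CLIP tokens for a prompt; anything beyond the 77-token context
# window (including special tokens) is truncated, producing the warning above.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a long story caption that may exceed the context window ..."
token_ids = tokenizer(prompt)["input_ids"]
print(len(token_ids), tokenizer.model_max_length)  # model_max_length is 77 for CLIP
```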
