environment issue #44

Open

ParkSungHin opened this issue Dec 3, 2024 · 3 comments

@ParkSungHin
I'm using a 3090 GPU, and everything else installed fine except for the Detectron2 package. Is Detectron2 a must-have dependency?

@haoningwu3639
Owner

Sorry for the confusion.
When exporting the environment, I included all the libraries I commonly use.
However, for the StoryGen project, Detectron2 is not a necessary dependency and can be excluded.
In fact, you only need to focus on the key libraries and their versions: torch, diffusers, accelerate, xformers, and transformers.
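A quick way to verify those key libraries, sketched here as a minimal check (not part of the StoryGen repo), is to print the installed versions and compare them against the exported environment file:

```python
# Minimal environment sanity check (illustrative, not from the repo):
# print installed versions of the key libraries so they can be compared
# against the versions pinned in the exported environment file.
import importlib.metadata as md

for pkg in ["torch", "diffusers", "accelerate", "xformers", "transformers"]:
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```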

@ParkSungHin
Author

ParkSungHin commented Dec 8, 2024

Thank you for your kind response. Thanks to your support, I was able to run the experiment successfully. However, I have a few questions about the experimental results that I hope you can help with.

To reproduce your work, I trained the model using only the StorySalon dataset (ebook) available on Hugging Face, on eight NVIDIA 3090 GPUs, following the exact code provided on GitHub. The inference.py script was also run unmodified.

From the results, it seems that the model somewhat recognizes "cat," but it struggles to recognize "The black-haired man."

[Attached images: boy2, whitecat2, infer_result]

I am curious whether the root cause of this issue is:

  1. a lack of sufficient training data;
  2. the warning shown below, which occurred during training; or
  3. the modifications I made to the stage2_config file to enable multi-GPU training.

Could you help identify the most likely cause?

The warning:

```
2024-12-07 09:46:46,212 model.pipeline [WARNING] - The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['dere banunii " suggests they are in a remote or isolated location, possibly in a desert or mountainous area. the phrase " aamtya a lee kaapi j 1 1 nana joree " could refer to a specific event or situation that led to their current situation.']
```


stage2_config.yml:

```yaml
validation_sample_logger:
  num_inference_steps: 20
  guidance_scale: 5
gradient_accumulation_steps: 24
train_steps: 50000
train_batch_size: 4
validation_steps: 500
checkpointing_steps: 1000
seed: 6666
mixed_precision: 'fp16'
learning_rate: 1e-5
scale_lr: false
lr_scheduler: cosine
lr_warmup_steps: 500
use_8bit_adam: true
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1.0e-08
max_grad_norm: 0.5
```
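For reference, a back-of-the-envelope sketch of the effective batch size implied by this config, assuming eight GPUs as described above and the usual Accelerate semantics where data parallelism and gradient accumulation both multiply the per-GPU batch (the exact behavior depends on the training script):

```python
# Rough effective-batch-size arithmetic for the config above (illustrative).
train_batch_size = 4               # per-GPU batch size, from stage2_config.yml
gradient_accumulation_steps = 24   # from stage2_config.yml
num_gpus = 8                       # assumption: eight 3090s, as described above

effective_batch = train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 768 samples per optimizer step
```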

@haoningwu3639
Owner

Sorry for the late reply.
For the questions you raised, I have the following suggestions:

  1. First of all, introducing more and higher-quality data will improve generation quality. Because this project is relatively old, the data quality of our StorySalon is not very competitive with current related work, but our proposed data-processing pipeline can help scale up high-quality data.
  2. The warning means your text prompt is too long: CLIP's text encoder supports inputs of at most 77 tokens, and anything beyond that is truncated, which may also introduce semantic problems in the text embedding (see the tokenizer sketch after this list).
  3. The changes you made to the config will not impact quality.
  4. The model we proposed is better at single-object-driven generation. It has shown some ability to generate multi-object combinations, but due to the lack of dedicated training, multi-object compositional generation will be worse than single-object generation.
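A minimal sketch of how to check a prompt against CLIP's 77-token limit before inference, assuming the standard CLIP tokenizer from transformers (the checkpoint name here is illustrative, not necessarily the one StoryGen loads):

```python
# Count CLIP tokens for a prompt; anything beyond the 77-token context
# window (including special tokens) is truncated, producing the warning above.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a long story caption that may exceed the context window ..."
token_ids = tokenizer(prompt)["input_ids"]
print(len(token_ids), tokenizer.model_max_length)  # model_max_length is 77 for CLIP
```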
