
GPT4Video

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

Zhanyu Wang, Longyue Wang*, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou*, Shuming Shi, Zhaopeng Tu

Tencent AI Lab, University of Sydney (*Correspondence)

✨ Demo video

11.24.1.mp4

Framework

Video Encoding stage: The video encoding module employs a frozen ViT-L/14 model to capture raw video features, while the video abstraction module uses a transformer-based cross-attention layer with two novel learnable tokens designed to condense information along the temporal and spatial axes.
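
The sketch below is a minimal, unofficial illustration of this abstraction step: a cross-attention layer with learnable temporal and spatial query tokens summarizing frozen ViT-L/14 patch features. The dimensions, token counts, and module names are assumptions, not the released implementation.

```python
# Unofficial sketch of the video abstraction idea: learnable query tokens
# cross-attend to frozen ViT patch features along the temporal and spatial axes.
import torch
import torch.nn as nn


class VideoAbstractor(nn.Module):
    def __init__(self, dim=1024, n_temporal=32, n_spatial=32, n_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the video along each axis (assumed sizes).
        self.temporal_tokens = nn.Parameter(torch.randn(n_temporal, dim) * 0.02)
        self.spatial_tokens = nn.Parameter(torch.randn(n_spatial, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, P, D) patch features from the frozen ViT-L/14,
        # with T frames and P patches per frame.
        B, T, P, D = frame_feats.shape

        # Temporal summary: average over patches, attend over frames.
        temporal_kv = frame_feats.mean(dim=2)                       # (B, T, D)
        q_t = self.temporal_tokens.unsqueeze(0).expand(B, -1, -1)   # (B, Nt, D)
        t_summary, _ = self.cross_attn(q_t, temporal_kv, temporal_kv)

        # Spatial summary: average over frames, attend over patches.
        spatial_kv = frame_feats.mean(dim=1)                        # (B, P, D)
        q_s = self.spatial_tokens.unsqueeze(0).expand(B, -1, -1)    # (B, Ns, D)
        s_summary, _ = self.cross_attn(q_s, spatial_kv, spatial_kv)

        # Concatenated video tokens, to be projected into the LLM embedding space.
        return self.norm(torch.cat([t_summary, s_summary], dim=1))


if __name__ == "__main__":
    feats = torch.randn(2, 8, 256, 1024)   # 2 videos, 8 frames, 256 patches each
    print(VideoAbstractor()(feats).shape)   # torch.Size([2, 64, 1024])
```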

LLM reasoning: The core of GPT4Video is a frozen LLaMA model, efficiently fine-tuned via LoRA. The LLM is trained on custom video-centric and safety-aligned data, enabling it to comprehend videos and generate appropriate video prompts (indicated by underlined text).
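
As a rough illustration, the following assumes the Hugging Face transformers and peft libraries to show how a frozen LLaMA backbone can be trained only through LoRA adapters; the checkpoint name and LoRA hyperparameters are placeholders, not the paper's exact configuration.

```python
# Minimal sketch (not the official training code): freeze LLaMA, train LoRA adapters only.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16  # placeholder checkpoint
)

# Freeze the base LLM; only the LoRA adapters added below receive gradients.
for p in model.parameters():
    p.requires_grad = False

lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in LLaMA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction is trainable
```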

Video Generation: The prompts generated by the LLM are then used as text inputs to the models in the Text-to-Video Model Gallery to create videos. We use ZeroScope as the video generation model in this work.
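
For reference, a prompt produced by the LLM could be handed to ZeroScope through the diffusers text-to-video pipeline roughly as follows; the checkpoint cerspense/zeroscope_v2_576w and the sampling parameters are common defaults, not necessarily the settings used in the paper.

```python
# Sketch of passing an LLM-generated prompt to ZeroScope via diffusers.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
).to("cuda")

# Prompt as it would be emitted by the LLM (the underlined text in the figures).
prompt = "a golden retriever surfing a wave at sunset, cinematic lighting"

# Recent diffusers versions return frames per video under .frames[0].
frames = pipe(prompt, num_frames=24, num_inference_steps=40).frames[0]
export_to_video(frames, "output.mp4")
```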

Training

First, install the requirements:

   pip install -r requirements.txt

Then train the model with two GPUs for 10 epochs:

    python train.py --devices 2 --max_epochs 10
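
If train.py wraps a PyTorch Lightning Trainer (which the --devices and --max_epochs flags suggest, though this is an assumption), the entry point would look roughly like this:

```python
# Hypothetical mapping of the CLI flags above onto a PyTorch Lightning Trainer.
import argparse
import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--devices", type=int, default=1)
parser.add_argument("--max_epochs", type=int, default=10)
args = parser.parse_args()

trainer = pl.Trainer(devices=args.devices, max_epochs=args.max_epochs)
# model = GPT4VideoModule(...)  # hypothetical LightningModule wrapping the LoRA-tuned LLM
# trainer.fit(model)
```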

Citation

@article{wang2023gpt4video,
  title={GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation},
  author={Zhanyu Wang and Longyue Wang and Minghao Wu and Zhen Zhao and Chenyang Lyu and Huayang Li and Deng Cai and Luping Zhou and Shuming Shi and Zhaopeng Tu},
  journal={CoRR},
  year={2023}
}
