Several questions for this model #37

Closed

sherlock666 opened this issue Sep 23, 2024 · 6 comments
Comments

@sherlock666

  1. Does it support a group of images as input (say, 50 preprocessed images), with the model then outputting saliency scores?

  2. I'm quite interested in how the program processes the video.
     What I understand (assuming a 30 fps video): the video is split into 2-second clips (n2 clips, n2 <= 150).
     Then, is every frame of each 2-second clip, i.e. all 60 frames, used as input? (If not, what do you do here?)

  3. Is it possible to adjust the 2-second parameter?

  4. Why are the "Retrieved moments" in the Hugging Face Space demo sometimes longer than 2 seconds? (Which is longer than the clips we just got.)

  5. For the "Highlighted frames", the output is sometimes all negative scores, yet it seems to capture the right things. Is that reasonable? And is it possible to get more frames (e.g. 5 -> 10)?

Thanks!!!

@awkrail
Contributor

awkrail commented Sep 23, 2024

@sherlock666 Thank you for your interest.

Does it support a group of images as input (say, 50 preprocessed images), with the model then outputting saliency scores?

Do you mean sequential images by "a group of images"? This is not currently supported, but I want to support it in a future version.
If you want to apply the current inference API to a group of images, please set self.video_feats to the encoded visual vectors.

def encode_video(

Please see this method for details.
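For illustration, a minimal sketch of that workaround, assuming `model` is the predictor object built in inference.py. Only the `video_feats` attribute comes from the reply above; the mask attribute, feature shape, and prediction call are assumptions, not the documented API:

```python
import torch

# Hypothetical sketch: skip encode_video() and assign precomputed features
# for a group of images directly. Everything except video_feats is assumed.
n_images, feat_dim = 50, 512                       # e.g. 50 images, CLIP ViT-B/32 features

# Encode each image with the same CLIP visual encoder the model was trained on,
# then stack the vectors into a (1, n_images, feat_dim) tensor.
image_feats = torch.randn(1, n_images, feat_dim)   # stand-in for real CLIP features

model.video_feats = image_feats                    # attribute named in the reply above
model.video_mask = torch.ones(1, n_images)         # assumed companion mask
prediction = model.predict('a person cooking')     # assumed prediction call
```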

I'm quite interested in how the program processes the video.
What I understand (assuming a 30 fps video): the video is split into 2-second clips (n2 clips, n2 <= 150).
Then, is every frame of each 2-second clip, i.e. all 60 frames, used as input? (If not, what do you do here?)

The current method does not process every frame, only frames sampled at 2 fps. Hence, if the video is 150 s, the number of frames the model processes is 75. This is because videos are redundant, and processing every frame is computationally heavy.

Is it possible to adjust the 2-second parameter?
Why are the "Retrieved moments" in the Hugging Face Space demo sometimes longer than 2 seconds? (Which is longer than the clips we just got.)

Hmm, what do you mean?

for "Highlighted frames" some time it output all minus score , but seems it do capture the right things, is it reasonable? and possible to get more frames? (ex 5-->10)

Yes, this is expected. If you want to get more frames (in the demo), change the TOPK_HIGHLIGHT variable (see the sketch below).

TOPK_HIGHLIGHT = 5
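For reference, a toy sketch of the top-k selection this variable controls; only TOPK_HIGHLIGHT itself comes from the demo code, the rest is illustrative:

```python
import numpy as np

TOPK_HIGHLIGHT = 10  # raise from 5 to 10 to display more highlighted frames

# One saliency score per 2-second clip (made-up numbers).
saliency_scores = np.array([0.4, -0.1, 0.2, -0.8, 0.1, -0.3])
topk = np.argsort(saliency_scores)[::-1][:TOPK_HIGHLIGHT]
print(topk)  # clip indices shown as "Highlighted frames", best first
```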

@sherlock666
Author

sherlock666 commented Sep 23, 2024

Thanks for the reply.

What I meant by:

Is it possible to adjust the 2-second parameter?
Why are the "Retrieved moments" in the Hugging Face Space demo sometimes longer than 2 seconds? (Which is longer than the clips we just got.)

Hmm, what do you mean?

1. Mmm... I had seen somewhere that the video is separated into 2-second clips (e.g. your demo video is 150 s, so it generates 75 clips), which matches the inference code (though I'm not sure about the 2 fps you mentioned). I just hope to know whether the 2 seconds or the 2 fps can be adjusted or not.

  2. The Hugging Face part, as I mentioned: the "Retrieved moments".
     Taking moment 1 as an example, how does the 55~85 come about?

[Screenshot: the demo's "Retrieved moments" panel]

3. (Sorry, a new question.)
For inference, I keep getting the error below if I use CUDA, while if I use CPU, inference works (using the latest code, downloaded today).

/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/frame_loaders/slowfast_loader.py:71: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
video_tensor = torch.from_numpy(video)
Traceback (most recent call last):
File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/inference.py", line 15, in
model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/models.py", line 233, in encode_video
video_feats, video_mask = self._vision_encoder.encode(video_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/feature_extractor/vision_encoder.py", line 101, in encode
visual_features = [encoder(frames) for encoder, frames in zip(self._visual_encoders, frame_inputs)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/feature_extractor/vision_encoder.py", line 101, in
visual_features = [encoder(frames) for encoder, frames in zip(self._visual_encoders, frame_inputs)]
^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/media/user/ch_2024_8T/project_202409_trial-lighthouse/lighthouse/feature_extractor/vision_encoders/slowfast.py", line 96, in call
features = torch.HalfTensor(n_chunk, self.SLOWFAST_FEATURE_DIM,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: legacy constructor expects device type: cpu but device type: cuda was passed

@awkrail
Contributor

awkrail commented Sep 23, 2024

@sherlock666

1. Mmm... I had seen somewhere that the video is separated into 2-second clips (e.g. your demo video is 150 s, so it generates 75 clips), which matches the inference code (though I'm not sure about the 2 fps you mentioned). I just hope to know whether the 2 seconds or the 2 fps can be adjusted or not.

Sorry, not 2 fps but 1 frame per 2 seconds (so 0.5 fps, to be precise). This fps is fixed because the model was trained on 0.5 fps videos. If you want to change it, you need to extract frames at the new rate, convert them into frame-level CLIP features, and train the model again. You can feed videos sampled at a different fps into the model trained on 0.5 fps, but I am not sure what would happen.
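The sampling arithmetic, spelled out:

```python
# 1 frame every 2 seconds = 0.5 fps, fixed by training.
video_length_s = 150
sample_fps = 0.5
num_frames = int(video_length_s * sample_fps)
print(num_frames)  # 75 frames (one per 2-second clip) for a 150 s video
```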

  2. The Hugging Face part, as I mentioned: the "Retrieved moments".
     Taking moment 1 as an example, how does the 55~85 come about?

Sorry, I could not understand what you are getting at. In this case, the model predicts the moment 55s~85s based on the input video and text query. Could you describe your question in more detail? :)

3. (Sorry, a new question.) For inference, I keep getting the error below if I use CUDA, while if I use CPU, inference works (using the latest code, downloaded today).

Thank you for reporting the issue. We will fix it next week.
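For reference, the error comes from the legacy torch.HalfTensor constructor, which can only build CPU tensors. A minimal sketch of the kind of change that avoids it (not necessarily the actual fix that was committed):

```python
import torch

n_chunk, feature_dim = 4, 2304   # illustrative sizes
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Before (raises "legacy constructor expects device type: cpu" when device is cuda):
# features = torch.HalfTensor(n_chunk, feature_dim, device=device)

# After: allocate with an explicit dtype and device instead.
features = torch.zeros(n_chunk, feature_dim, dtype=torch.float16, device=device)
```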

@sherlock666
Author

Thank you for your patience.

What I mean is, I know that:

the "Highlighted frames" (the bottom-right part of the demo) come from the 2-second clips sorted by saliency score, right?

But how do the "Retrieved moments" work, and how are they predicted? (Which is my question: how does the 55s~85s come about? That span is 30 seconds.)

My assumptions:

  1. Another model does this work?
  2. Those 2-second clips are gathered with some logic (e.g. clips whose saliency score passes some threshold, say -0.1, are concatenated into one "Retrieved moment")?

  2. The Hugging Face part, as I mentioned: the "Retrieved moments".
     Taking moment 1 as an example, how does the 55~85 come about?

Sorry, I could not understand what you are getting at. In this case, the model predicts the moment 55s~85s based on the input video and text query. Could you describe your question in more detail? :)

@awkrail
Contributor

awkrail commented Sep 24, 2024

@sherlock666
I got it.
I think you have misunderstood how the model makes its predictions. Please read this paper.
The moments and highlights (saliency scores) are predicted separately. See Section 4 for details.
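A toy illustration of that separation (not the lighthouse implementation): moment spans and per-clip saliency scores come from two different prediction heads over shared video-text features, so a retrieved moment is not simply a concatenation of top highlight clips:

```python
import torch
import torch.nn as nn

class TwoHeadDemo(nn.Module):
    """Toy model with separate heads for moments and saliency."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.span_head = nn.Linear(hidden_dim, 2)      # (center, width) per moment query
        self.saliency_head = nn.Linear(hidden_dim, 1)  # one score per 2-second clip

    def forward(self, query_embs, clip_embs):
        spans = self.span_head(query_embs).sigmoid()          # normalized moment spans
        saliency = self.saliency_head(clip_embs).squeeze(-1)  # per-clip highlight scores
        return spans, saliency

demo = TwoHeadDemo()
# 10 moment queries, 75 two-second clips (as for a 150 s video).
spans, saliency = demo(torch.randn(1, 10, 256), torch.randn(1, 75, 256))
print(spans.shape, saliency.shape)  # torch.Size([1, 10, 2]) torch.Size([1, 75])
```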

awkrail added a commit that referenced this issue Sep 24, 2024
@awkrail
Contributor

awkrail commented Sep 24, 2024

@sherlock666 I fixed the bug you reported. If you have any questions, please re-open the issue. Thanks.

@awkrail awkrail closed this as completed Sep 24, 2024