audio file
Audio file in wav format to use as the model's input. Default file name is input.wav (source: https://freesound.org/people/InspectorJ/sounds/456440/)
text file
A text file containing sentences separated by new lines. Default file name is captions.txt
Cosine similarities
Cosine similarity between the input audio and the sentences in the text file.
Internet connection is required when running the script for the first time, as the model files will be automatically downloaded.
Running the script will compute the cosine similarities between the audio and the captions, using audio and language encoder models train by contrastive training.
You can switch the versions of the encoder model's weight (2022 or 2023) by specifying the version using the argument -v
or --version
.
For more information on arguments, try running python3 msclap.py --help
$ python3 msclap.py -t captions.txt -a input.wav -v 2023
INFO arg_utils.py (13) : Start!
INFO arg_utils.py (158) : env_id updated to 0
INFO arg_utils.py (163) : env_id: 0
INFO arg_utils.py (166) : CPU
INFO msclap.py (167) : input_text: ['Dog barking.', 'Birds whistling.', 'Car passing by.', 'Wind blowing.', 'Water flowing.', 'People talking.']
INFO msclap.py (170) : inference has started...
Similarity:
Birds whistling.: 0.41247469186782837
Wind blowing.: 0.2643369734287262
Water flowing.: 0.23884761333465576
Car passing by.: 0.22803542017936707
People talking.: 0.17387858033180237
Dog barking.: 0.11309497803449631
INFO msclap.py (192) : Script finished successfully.
Pytorch
ONNX opset=11
caption_model_2023.onnx.prototxt
audio_model_2023.onnx.prototxt