
Audio-Grounded Contrastive Learning (WACV’24)

Official PyTorch implementation of our paper:

Can CLIP Help Sound Source Localization?

Sooyoung Park*, Arda Senocak*, Joon Son Chung (* Equal Contribution)

WACV 2024

Introduction

This repo is the PyTorch implementation of Audio-Grounded Contrastive Learning (ACL). The code is intentionally simple and easy to follow.

Parts of this code are based on AudioToken, BEATs, and TCL.

Demo: Hugging Face Spaces

Required packages

  • Python = 3.10.8
  • PyTorch = 1.13.0
  • transformers = 4.25.1

Installation

$ conda install -c nvidia cudatoolkit=11.7
$ conda install -c conda-forge cudnn
$ conda install python=3.10
$ pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
$ pip install tensorboard
$ pip install transformers==4.25.1
$ pip install opencv-python
$ pip install tqdm
$ pip install scikit-learn
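
After installing, a quick sanity check can confirm that the environment matches the versions above. This is a minimal sketch, not part of the repo:

import torch
import transformers

print(torch.__version__)          # expected: 1.13.0+cu117
print(transformers.__version__)   # expected: 4.25.1
print(torch.cuda.is_available())  # True if the CUDA 11.7 runtime is visible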

Data preparation

Important Note: All audio samples must be converted to 16 kHz. For detailed instructions, refer to the README in each dataset-specific directory.
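
For example, a single clip can be resampled to 16 kHz with torchaudio. This is a minimal sketch (file names are placeholders); the dataset-specific READMEs remain the authoritative instructions:

import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load("sample.wav")  # placeholder path
if sr != 16000:
    wav = F.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("sample_16k.wav", wav, sample_rate=16000)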

Model preparation

Download the pretrained model (audio backbone) and place it in the pretrain folder.

Training

  • Ensure that you check the .sh files and set $ export CUDA_VISIBLE_DEVICES="**" according to your hardware setup.
  • Make sure that --model_name corresponds to the configuration file located at ./config/model/{--model_name}.yaml (see the sketch after the commands below).
  • Model files (.pth) will be saved in the directory {--save_path}/Train_record/{--model_name}_{--exp_name}/.
  • Review the configuration settings in ./config/train/{--train_config}.yaml to ensure they match your training requirements.
  • Choose one of the following methods to initiate training:
$ sh SingleGPU_Experiment.sh # For single-GPU setup
$ sh Distributed_Experiment.sh # For multi-GPU setup (DDP)
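
For reference, the sketch below shows how a --model_name value resolves to its YAML configuration file. The model name is a hypothetical placeholder and PyYAML is assumed; the actual loading code lives in the repo:

import yaml  # PyYAML, assumed available

model_name = "ACL_ViT16"  # hypothetical; must match a file in ./config/model/
with open(f"./config/model/{model_name}.yaml") as f:
    model_cfg = yaml.safe_load(f)
print(model_cfg)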

Test

  • Before testing, please review the .sh file and set the $ export CUDA_VISIBLE_DEVICES="**" environment variable according to your hardware configuration.
  • Ensure that the --model_name parameter corresponds to the configuration file located at ./config/model/{--model_name}.yaml.
  • Model files (.pth) located at {--save_path}/{--model_name}_{--exp_name}/Param_{--epochs}.pth will be used for testing.
  • The --epochs parameter accepts either an integer or a list of integers (e.g., 1, 2, 3).
  • If --epochs is left unspecified (null), the default model file {--save_path}/Train_record/{--model_name}_{--exp_name}/Param_best.pth will be used for testing (see the sketch after the command below).
$ sh Test_PTModels.sh
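
The sketch below mirrors the checkpoint lookup described above; all values are hypothetical placeholders:

# Resolve checkpoint paths from --save_path, --model_name, --exp_name, --epochs.
save_path, model_name, exp_name = "./results", "ACL_ViT16", "run1"  # hypothetical
epochs = [1, 2, 3]  # an int, a list of ints, or None (unspecified)

if epochs is None:
    ckpts = [f"{save_path}/Train_record/{model_name}_{exp_name}/Param_best.pth"]
else:
    epochs = [epochs] if isinstance(epochs, int) else epochs
    ckpts = [f"{save_path}/{model_name}_{exp_name}/Param_{e}.pth" for e in epochs]
print(ckpts)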

Pretrained models

Important Note: After downloading the Param_best.pth file, move it to the directory {--save_path}/{--model_name}_{--exp_name}/ before use (a sketch follows the list below).

  • VGG-Sound 144k trained model: [Link]
    • This model was trained using a 2-GPU setup.
    • Performance varies across random seeds; the reported numbers are the best results, and the linked .pth file is the checkpoint that produced them.
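
As a minimal sketch of the placement step (all paths and names are hypothetical placeholders):

import os, shutil

save_path, model_name, exp_name = "./results", "ACL_ViT16", "run1"  # hypothetical
dst = os.path.join(save_path, f"{model_name}_{exp_name}")
os.makedirs(dst, exist_ok=True)
shutil.move("Param_best.pth", os.path.join(dst, "Param_best.pth"))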

Citation

If you use this project, please cite our paper:

@article{park2023clip,
  title={Can CLIP Help Sound Source Localization?},
  author={Sooyoung Park and Arda Senocak and Joon Son Chung},
  journal={arXiv preprint arXiv:2311.04066},
  year={2023}
}
