JamendoMaxCaps is a large-scale dataset of over 200,000 instrumental tracks sourced from the Jamendo platform, accompanied by generated music captions and enhanced, imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs; a local large language model (LLM) then uses these retrieved entries to fill in missing metadata. The dataset supports research in music-language understanding, retrieval, representation learning, and AI-generated music.
✅ 200,000+ Instrumental Tracks from Jamendo
✅ Music Captions generated with a state-of-the-art captioning model
✅ Metadata Imputation using a retrieval-enhanced LLM (Llama-2)
✅ Comprehensive Musical and Metadata Features:
- 🎵 MERT-based audio embeddings
- 📝 Flan-T5 metadata embeddings
- 🔍 Imputed metadata fields (genre, tempo, mood, instrumentation)
git clone https://github.com/AMAAI-Lab/JamendoMaxCaps.git
cd JamendoMaxCaps
conda create -n jamendomaxcaps python=3.10
conda activate jamendomaxcaps
pip install -r requirements.txt
python extract_mert.py
Ensure input and output folders are correctly configured.
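The script writes one MERT-based audio embedding per track. As a minimal sketch of what a track-level embedding looks like downstream, the following pools frame-level MERT hidden states into a single L2-normalised vector (the 768-d hidden size and mean pooling are assumptions for illustration, not necessarily the script's exact implementation):

```python
import numpy as np

def pool_track_embedding(hidden_states: np.ndarray) -> np.ndarray:
    """Average frame-level MERT hidden states (frames x dim) into one
    track-level vector, then L2-normalise it for cosine retrieval."""
    vec = hidden_states.mean(axis=0)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Example: 500 frames of 768-d hidden states (hypothetical shapes).
frames = np.random.default_rng(0).normal(size=(500, 768))
track_vec = pool_track_embedding(frames)
print(track_vec.shape)  # (768,)
```

A unit-norm vector makes the later retrieval step a simple dot product.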
python process_metadata.py
Adjust input and output folder paths accordingly.
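Metadata processing feeds the Flan-T5 text encoder, which expects a single string per track. A hedged sketch of flattening the metadata fields into such a string (the field names mirror the imputed fields listed above; the exact schema and formatting are assumptions):

```python
def metadata_to_text(meta: dict) -> str:
    """Flatten selected Jamendo metadata fields into one string suitable
    for a text encoder such as Flan-T5. Missing or empty fields are
    skipped; field names are illustrative, not the script's exact schema."""
    fields = ["genre", "tempo", "mood", "instrumentation"]
    parts = [f"{k}: {meta[k]}" for k in fields if meta.get(k)]
    return "; ".join(parts) if parts else "unknown"

print(metadata_to_text({"genre": "ambient", "tempo": "90 bpm", "mood": None}))
# genre: ambient; tempo: 90 bpm
```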
python build_retrival_system.py --weight_audio <weight_audio> --weight_metadata <weight_metadata>
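The `--weight_audio` and `--weight_metadata` flags control how the two embedding spaces are combined. A minimal sketch of that weighted fusion, assuming all embeddings are L2-normalised so dot products equal cosine similarities (the dimensions and late-fusion scoring are illustrative assumptions):

```python
import numpy as np

def fused_scores(q_audio, q_meta, db_audio, db_meta,
                 weight_audio=0.5, weight_metadata=0.5):
    """Score every database track against a query by mixing cosine
    similarity in the audio (MERT) space and the metadata (Flan-T5)
    space with the given weights."""
    sim_audio = db_audio @ q_audio
    sim_meta = db_meta @ q_meta
    return weight_audio * sim_audio + weight_metadata * sim_meta

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy database of 100 tracks with hypothetical embedding sizes.
rng = np.random.default_rng(1)
db_audio, db_meta = unit(rng.normal(size=(100, 768))), unit(rng.normal(size=(100, 512)))
q_audio, q_meta = unit(rng.normal(size=768)), unit(rng.normal(size=512))

scores = fused_scores(q_audio, q_meta, db_audio, db_meta, 0.7, 0.3)
top5 = np.argsort(scores)[::-1][:5]  # indices of the 5 most similar tracks
```

Setting one weight to zero degenerates to purely audio-based or purely metadata-based retrieval.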
python retrieve_similar_entries.py --config <config_file_path>
python metadata_imputation.py
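Imputation asks a local LLM (Llama-2) to predict a missing field given the track's known metadata and the metadata of its retrieved neighbours. A sketch of assembling such a prompt; the wording and field layout are illustrative assumptions, not the script's exact template:

```python
def build_imputation_prompt(track_meta: dict, neighbours: list,
                            missing_field: str) -> str:
    """Build a prompt asking an LLM to impute one missing metadata field
    from the track's known fields plus the metadata of similar tracks
    returned by the retrieval system."""
    known = ", ".join(f"{k}={v}" for k, v in track_meta.items() if v)
    context = "\n".join(
        f"- similar track {i + 1}: " + ", ".join(f"{k}={v}" for k, v in n.items())
        for i, n in enumerate(neighbours)
    )
    return (
        f"Known metadata: {known}\n"
        f"Metadata of similar tracks:\n{context}\n"
        f"Predict the missing '{missing_field}' value. Answer with the value only."
    )

prompt = build_imputation_prompt(
    {"genre": "ambient", "tempo": None},
    [{"genre": "ambient", "tempo": "85 bpm"},
     {"genre": "downtempo", "tempo": "92 bpm"}],
    "tempo",
)
print(prompt)
```

The returned string would then be passed to the local model; constraining the answer to the value alone keeps the response easy to parse back into the metadata record.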
If you use JamendoMaxCaps, please cite:
@article{royjamendomaxcaps2025,
  author  = {Roy, Abhinaba and Liu, Renhang and Lu, Tongyu and Herremans, Dorien},
  title   = {JamendoMaxCaps: A Large-Scale Music-Caption Dataset with Imputed Metadata},
  year    = {2025},
  journal = {arXiv preprint arXiv:xxxxx}
}
JamendoMaxCaps is built upon Creative Commons-licensed music from the Jamendo platform and leverages advanced AI models, including MERT, Flan-T5, and Llama-2. Special thanks to the research community for their invaluable contributions to open-source AI development!