This research project aims to develop a system that takes a mixed audio signal as input and outputs separate tracks for individual sources such as drums, piano, bass, and vocals. With the proposed system, sound separation is possible without professional knowledge. The aim is to save human effort and open up a potential market.
Audio source separation is a well-studied problem, also known as the cocktail party problem: separating a particular audio source of interest in an environment full of auditory stimuli and noise. The problem has attracted the attention of researchers over the past few years, and a handful of solutions have been proposed that solve it in a number of very special cases.
Nonetheless, there are several scenarios where the proposed methods fail, so the problem remains unsolved in general. Recently, the widespread success of online music providers that stream audio content to millions of users has renewed interest in audio source separation as a way to provide more interactive audio content to subscribers.
The objective of this project is to develop and evaluate the Wave-U-Net algorithm for audio source separation. We divided the project into three major objectives to work on incrementally:
- Develop a model that separates vocals from a song.
- Improve the model so that it can also separate all musical instruments from a given song.
- Deploy the model on a website that can store previously generated outputs and let the user access them.
The proposed system implements the Wave-U-Net architecture with some improvements over the existing design. The system uses a spectral loss function instead of mean squared error (MSE), and the model is trained on a larger dataset so that it has more samples to learn from.
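As a rough illustration of how a spectral loss differs from MSE, the sketch below compares a waveform MSE with a loss computed on magnitude spectrograms, using NumPy only. The function names, frame length, hop size, and the L1 distance are assumptions made for this example and are not taken from the project code, so the actual training loss may be defined differently.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude STFT of a 1-D signal, shape (frames, frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

def mse_loss(estimate, target):
    """Mean squared error computed directly on the waveform samples."""
    return float(np.mean((estimate - target) ** 2))

def spectral_loss(estimate, target, frame_len=256, hop=128):
    """Mean absolute difference between magnitude spectrograms (assumed form)."""
    est_mag = magnitude_spectrogram(estimate, frame_len, hop)
    tgt_mag = magnitude_spectrogram(target, frame_len, hop)
    return float(np.mean(np.abs(est_mag - tgt_mag)))

# Toy signal: one second of a 440 Hz sine at an assumed 8 kHz sample rate.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
target = np.sin(2 * np.pi * 440 * t)

# A polarity-inverted copy has the same magnitude spectrum as the target.
flipped = -target

print("MSE on waveform:      ", mse_loss(flipped, target))
print("Loss on spectrograms: ", spectral_loss(flipped, target))
```

In this toy case the waveform MSE is large while the spectrogram-based loss is essentially zero, because the magnitude spectrogram discards phase. Whether that behaviour helps depends on the task and on how the loss is combined with other terms.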
- Encoder: The encoder is the first part of the architecture. It takes the input signal and maps it to a series of higher-level representations at different scales or resolutions. This is achieved using a series of 1D convolutional layers, with each layer reducing the temporal resolution of the signal while increasing the number of feature channels. The resulting signal is downsampled after each layer so that later layers operate on a coarser time scale and capture the most important features.
- Skip connections: One of the key innovations of the Wave-U-Net is the use of skip connections between the encoder and decoder. These connections allow the decoder to access information from earlier stages of the encoding process, which can help to preserve low-level features and improve separation performance.
- Decoder: The decoder is the second part of the Wave-U-Net architecture. It takes the representations generated by the encoder and maps them back to the original temporal resolution, while simultaneously refining them to better separate the different sources in the signal. The decoder is also composed of a series of convolutional layers, but with each layer increasing the temporal resolution of the signal while decreasing the number of feature channels.
- Output: The output of the decoder is written as multiple audio files (.wav), one for each separated instrument and one for the vocals.
- Loss function: The Wave-U-Net is trained using a custom loss function that encourages the network to separate the different sources in the input signal. This loss function is typically based on the magnitude spectrogram of the signal and may incorporate additional constraints or regularization terms.
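The components above can be sketched as a small forward pass in NumPy. This is only a shape-level illustration under stated assumptions: the weights are random and untrained, and the number of levels, channel counts, kernel sizes, nearest-neighbour upsampling, and the names `conv1d_same` and `wave_u_net_forward` are hypothetical choices for this example rather than the project's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, out_channels, kernel_size=5, activation="leaky_relu"):
    """1-D convolution with random weights and 'same' padding.

    x has shape (time, channels); the result has shape (time, out_channels).
    """
    time, in_channels = x.shape
    weights = rng.standard_normal((kernel_size, in_channels, out_channels)) * 0.1
    pad = kernel_size // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    # Stack kernel_size shifted views of the input: (kernel, time, in_channels).
    windows = np.stack([padded[k : k + time] for k in range(kernel_size)])
    y = np.einsum("kti,kio->to", windows, weights)
    if activation == "tanh":
        return np.tanh(y)
    return np.maximum(y, 0.2 * y)  # leaky ReLU

def wave_u_net_forward(mixture, n_levels=3, base_channels=4, n_sources=4):
    """Run an untrained encoder/decoder pass with skip connections."""
    x = mixture
    skips = []
    # Encoder: convolve, remember the features, then halve the time resolution.
    for level in range(n_levels):
        x = conv1d_same(x, base_channels * (level + 1))
        skips.append(x)
        x = x[::2]
    # Bottleneck at the coarsest resolution.
    x = conv1d_same(x, base_channels * (n_levels + 1))
    # Decoder: double the time resolution, concatenate the skip, convolve.
    for level in reversed(range(n_levels)):
        x = np.repeat(x, 2, axis=0)  # nearest-neighbour upsampling (assumption)
        x = np.concatenate([x, skips[level]], axis=1)
        x = conv1d_same(x, base_channels * (level + 1))
    # Output layer: one channel per separated source, values in [-1, 1].
    return conv1d_same(x, n_sources, kernel_size=1, activation="tanh")

# Mono mixture of 64 samples; the length must be divisible by 2 ** n_levels.
mixture = rng.standard_normal((64, 1))
sources = wave_u_net_forward(mixture)
print(sources.shape)
```

Each column of `sources` would correspond to one separated source (for example drums, piano, bass, and vocals) at the original time resolution. A trained model would learn the convolution weights instead of drawing them at random.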
The project requires the following software:
- TensorFlow v2.10.1
- Python 3.7 to 3.10 (the versions supported by TensorFlow 2.10)
- VS Code
- Google Colab/Jupyter Notebook
All required packages can be installed by running `pip install -r requirements.txt`.
To run the project on your local machine, type:
`python manage.py runserver`