Monaural audio source separation in the music domain using deep convolutional neural networks.
Given a piece of music as input, we want to separate the vocals from the accompanying instrumentation. There are many use cases for this - in music production, a producer may want to extract the vocals from a sample for creative purposes, or a VoIP (voice over IP) application may use it to enhance audio clarity.
We initially considered a WaveNet-based approach [1], but since those models are large and potentially difficult to train, we decided it would be best to work with images by converting the input audio to spectrograms via the Short-Time Fourier Transform (STFT). DeepConvSep is a state-of-the-art spectrogram-based model that applies an encoder-decoder strategy to the input spectrogram, using a combination of vertical and horizontal convolutions to capture timbral and temporal attributes. It generates a soft mask of the same shape as the input spectrogram, which is applied to the input magnitude spectrogram to produce the isolated spectrograms; the audio is then recovered via the inverse STFT [2]. However, we found the separation produced by these methods to be lacking and looked to improve on it. Audio comparisons between Wave-U-Net, DeepConvSep, and our model (Vocal-Mask) can be found in the Results section.
Our first attempt used Mel-scale spectrograms, which apply a transformation matrix to the STFT to group and weight frequency bins according to the Mel scale. This evens out the spectrogram magnitudes across the frequency range; without it, the lower frequencies tend to have much higher magnitudes than the higher frequencies, especially in recorded music. However, it is difficult to recover audio from a Mel-scale spectrogram, so this was abandoned in favor of the same representation used by DeepConvSep, with additional preprocessing steps.
The input stereo signal is converted to mono by averaging the left and right channels. We apply Mel perceptual weighting to the input spectrograms to even out the magnitudes across the frequency range, then apply a power factor of 2 to the spectrogram to further enhance the signal-to-noise ratio. After the model generates a mask, the mask is applied to the unweighted spectrogram and the audio is recovered via the inverse STFT. Using the full STFT preserves the phase information of the original audio and is necessary for producing high-quality target waveform estimations. However, the phase information does not get passed into the model, which is one advantage Wave-U-Net gains by operating directly in the waveform domain.
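A minimal sketch of this preprocessing chain using librosa is shown below. The sample rate, hop length, and exact weighting curve are assumptions (librosa's A-weighting curve stands in for the Mel perceptual weighting described above); the project's `preprocess.py` is the authoritative implementation.

```python
import numpy as np
import librosa

SAMPLE_RATE = 22050   # assumed; at 22.05 kHz a 256-sample hop is ~11.6 ms, matching the stride below
N_FFT = 1024          # yields 513 frequency bins, matching the mask size described below
HOP_LENGTH = 256      # assumed hop length (~11.6 ms per frame)

def preprocess(path):
    # Load and downmix to mono (librosa averages the channels when mono=True)
    y, sr = librosa.load(path, sr=SAMPLE_RATE, mono=True)

    # Full complex STFT: the magnitude feeds the model, the phase is kept for the inverse STFT
    stft = librosa.stft(y, n_fft=N_FFT, hop_length=HOP_LENGTH)
    mag, phase = np.abs(stft), np.angle(stft)

    # Frequency-dependent perceptual weighting to even out magnitudes across the
    # frequency range, followed by a power factor of 2 to raise the signal-to-noise ratio
    freqs = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)
    weights = librosa.db_to_amplitude(librosa.A_weighting(freqs))[:, None]
    weighted = (mag * weights) ** 2

    # 'weighted' is the model input; 'mag' and 'phase' stay unweighted so the
    # predicted mask can be applied to them at inference time
    return weighted, mag, phase
```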
For the vocal spectrogram, only the center column of the image is kept. This is converted to a binary mask and used as the target label, with size (513,). We wanted to increase the receptive-to-target field ratio, as this seemed to work well for Wave-U-Net. The mixture spectrograms pass through the convolutional neural network, which ends in a 513-way fully-connected layer with a sigmoid to constrain the output to the 0-1 range.
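As a concrete illustration, the target for one training window might be built roughly as follows; the binarization threshold is an assumption, not the project's actual value.

```python
import numpy as np

def make_target(vocal_window, threshold=0.5):
    """vocal_window: (513, stft_frames) magnitude slice of the vocal spectrogram."""
    center = vocal_window[:, vocal_window.shape[1] // 2]  # keep only the center column
    return (center > threshold).astype(np.float32)        # binary target mask, shape (513,)
```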
Model | Receptive Field | Target Field |
---|---|---|
DeepConvSep | 290ms | 290ms |
Wave-U-Net | 9.21s | 1.02s |
Vocal-Mask | 290ms | 11.6ms |
The receptive field is the length of the input waveform that the model processes at a time, while the target field is the length of the output generated by the model. Different receptive field window sizes for the input spectrogram were tested. Intuitively, larger windows give the network more context for making the prediction. We found that larger window sizes did indeed produce smaller losses, but at a diminishing rate. Larger window sizes also increase GPU memory consumption, so we decided to stay consistent with the window size used by DeepConvSep (25 STFT frames, ~290 ms).
At inference time, the input waveform is sliced into overlapping 290 ms windows with an 11.6 ms stride. Each window is converted to a Mel-weighted spectrogram and passed through the network to generate a binary mask. The masks are then concatenated and applied to the unweighted spectrogram (which preserves the original magnitude and phase information) to produce the isolated vocal spectrogram. The audio is then recovered via the inverse STFT.
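A condensed sketch of this inference loop is shown below, reusing the outputs of the preprocessing sketch above. The model's expected input shape, the edge padding, and the frame-by-frame batching are assumptions.

```python
import numpy as np
import torch
import librosa

def separate(model, weighted, mag, phase, stft_frames=25, hop_length=256):
    """weighted/mag/phase: (513, T) arrays from the preprocessing sketch."""
    model.eval()
    half = stft_frames // 2
    padded = np.pad(weighted, ((0, 0), (half, half)), mode='constant')
    masks = []
    with torch.no_grad():
        for t in range(mag.shape[1]):                         # one mask per frame (~11.6 ms stride)
            window = padded[:, t:t + stft_frames]             # 25-frame (~290 ms) context
            x = torch.from_numpy(window).float()[None, None]  # assumed (batch, channel, freq, time) input
            masks.append(model(x).squeeze().cpu().numpy())    # (513,) mask for the center frame
    mask = np.stack(masks, axis=1)                            # (513, T)
    vocal_stft = (mag * mask) * np.exp(1j * phase)            # apply the mask to the unweighted STFT
    return librosa.istft(vocal_stft, hop_length=hop_length)
```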
We used the MUSDB18 dataset [3] for this project. The dataset must be decoded using the SigSep Stems Decoder. The `preprocess.py` script downsamples the input audio to `hparams.sample_rate` and converts the downsampled audio to spectrograms. Run the preprocess script to generate the spectrograms: `python preprocess.py <musdb root dir> <output dir>`
The window size and stride for the slices are controlled by `hparams.stft_frames` and `hparams.stft_stride`, respectively.
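For reference, the relevant entries presumably look something like the following; `stft_frames = 25` follows the text above, and 22050 Hz with a 256-sample hop would make each frame ~11.6 ms, but all values should be checked against the actual `hparams.py`.

```python
# Illustrative hparams values (assumptions; check hparams.py for the real ones)
class hparams:
    sample_rate = 22050   # target sample rate after downsampling
    stft_frames = 25      # slice width: 25 STFT frames ~ 290 ms
    stft_stride = 1       # hop between slices, in STFT frames (~11.6 ms)
```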
To train the model, run `python train.py <data dir> --checkpoint=<path to checkpoint file (*.pth)>`. The first argument should be the same as the output directory of `preprocess.py`.
A pretrained model for `hparams.model_type = 'resnet18'` can be downloaded here: resnet18_step000007686.pth. The window size can be modified with `hparams.stft_frames`. Larger window sizes will require more GPU memory.
To generate the isolated vocals for a mixture, run `python generate.py <path to checkpoint file (*.pth)> <path to mixture wav>`. This will generate a vocal wav file in the `generated` directory. Below are the parameters in `hparams.py` that control how the mask is applied during inference; see the sketch after this list for how they plausibly combine.

- `hparams.mask_at_eval` - If `True`, the model output will be converted to a binary mask. If `False`, it will be left as a soft mask with values in the range (0,1).
- `hparams.eval_mask_threshold` - Range (0,1). Values below this threshold will be set to 0. Lower thresholds allow more vocal audio through, but may also let some instrumentation through.
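The sketch below shows how these two parameters plausibly interact when the mask is applied; the actual logic lives in the project's inference code.

```python
import numpy as np

def apply_eval_mask(soft_mask, mask_at_eval, eval_mask_threshold):
    """soft_mask: model output in (0,1), shape (513, T)."""
    if mask_at_eval:
        # Hard binary mask: bins at or above the threshold pass, the rest are zeroed
        return (soft_mask >= eval_mask_threshold).astype(np.float32)
    # Soft mask with a noise gate: bins below the threshold are silenced,
    # the rest keep their fractional weighting
    return np.where(soft_mask < eval_mask_threshold, 0.0, soft_mask)
```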
A one-cycle learning rate scheduler [5] was used to train the model, and a learning rate finder was used to find reasonable learning rate boundaries [4]. Based on the plot below, the learning rate was selected to ramp from 1e-4 to 3e-3 over iterations 0 to 3458, back down to 1e-4 over iterations 3458 to 6917, then trail down to 1e-6 until iteration 7686.
The model was trained using the AdamW optimizer [6] with `beta1` 0.9, `beta2` 0.99, weight decay 0.3, and a batch size of 256.
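In PyTorch terms, this setup could be reproduced roughly as follows. `LambdaLR` with a piecewise-linear function is just one way to realize the one-cycle schedule described above, and `model` stands in for the Vocal-Mask network built in `train.py`.

```python
import torch

# `model` is assumed to be the Vocal-Mask network constructed elsewhere
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3,
                              betas=(0.9, 0.99), weight_decay=0.3)

def one_cycle(step, lr_min=1e-4, lr_max=3e-3, lr_final=1e-6,
              up=3458, down=6917, end=7686):
    # Piecewise-linear one-cycle schedule using the iteration counts from the text
    if step <= up:                                   # 1e-4 -> 3e-3
        lr = lr_min + (lr_max - lr_min) * step / up
    elif step <= down:                               # 3e-3 -> 1e-4
        lr = lr_max - (lr_max - lr_min) * (step - up) / (down - up)
    else:                                            # 1e-4 -> 1e-6
        lr = max(lr_final, lr_min - (lr_min - lr_final) * (step - down) / (end - down))
    return lr / lr_max                               # LambdaLR multiplies the base lr (3e-3)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=one_cycle)
# each training step: optimizer.step(); scheduler.step()
```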
Below are example soft masks generated by the model and the corresponding masked-mixture spectrograms.
Audio examples were taken from here for comparison purposes.
Mixture | Wave-U-Net | DeepConvSep | Vocal-Mask | Ground Truth |
---|---|---|---|---|
Sample 1 | Sample 1 | Sample 1 | Sample 1 | Sample 1 |
Sample 2 | Sample 2 | Sample 2 | Sample 2 | Sample 2 |
Sample 3 | Sample 3 | Sample 3 | Sample 3 | Sample 3 |
Sample 4 | Sample 4 | Sample 4 | Sample 4 | Sample 4 |
Sample 5 | Sample 5 | Sample 5 | Sample 5 | Sample 5 |
Note: Estimates and Ground Truth samples were downsampled to 16kHz to make a fair comparison.
Below are quantitative evaluations of the signal and separation quality based on BSS Eval metrics. Shown are the median SDR (Source-to-Distortion Ratio), SIR (Source-to-Interference Ratio), and SAR (Source-to-Artifacts Ratio) values when applying the model to the MUSDB18 test set. Higher values indicate better separation and signal quality. For the Vocal-Mask model, two methods were evaluated for applying the mask. The first was a hard binary mask with 0.5 as the cutoff threshold for vocal content. The second was a soft mask with a noise gate at 0.1, such that all values below 0.1 were considered silent.
Model | SDR | SIR | SAR |
---|---|---|---|
DeepConvSep | 2.37 | 4.65 | 8.04 |
Wave-U-Net | 4.60 | 14.30 | 5.54 |
Vocal-Mask (Hard Mask) | 3.46 | 13.88 | 4.34 |
Vocal-Mask (Soft Mask) | 3.66 | 11.90 | 5.18 |
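These numbers come from BSS Eval; the sketch below shows one way comparable per-track metrics could be computed with the `mir_eval` package (the project's actual evaluation may use a different BSS Eval implementation).

```python
import numpy as np
import mir_eval

def vocal_metrics(ref_vocals, ref_accomp, est_vocals, est_accomp):
    """All inputs are mono waveforms of equal length."""
    references = np.stack([ref_vocals, ref_accomp])
    estimates = np.stack([est_vocals, est_accomp])
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(references, estimates)
    return sdr[0], sir[0], sar[0]   # metrics for the vocal source
```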
The model obtained better BSS Eval scores with the soft mask than with the hard binary mask. Compared to DeepConvSep, which is also a spectrogram-based method, we achieved significantly improved BSS Eval scores. Unfortunately, we fell short of Wave-U-Net's scores. Qualitatively, we found that our model works best on acoustic or pop music where the vocals are prominent in the mix. Fuller mixes have less variation between peaks and troughs in the spectrogram, which made it harder for the model to pick out vocal features and resulted in less separation. Similarly, the model also has trouble with whispered vocals, as can be heard near the beginning of Sample 5, likely because the vocal signal is softer relative to the background. This is one area where Wave-U-Net succeeded and our model did not: Wave-U-Net was able to successfully reproduce the whispered audio.
One challenge we faced during training was that the model overfit very quickly with fixed or monotonically decreasing learning rate schedulers. Switching to a one-cycle learning rate policy with high learning rates and weight decay regularization kept the validation loss hugging the training loss curve more closely without diverging. We also experimented with data augmentation in the pitch domain by pitch-shifting the vocals before regenerating the mixtures (see the sketch below), but this did not improve the validation loss. Other augmentation procedures may still help, however.
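The pitch augmentation took roughly the following form; this is a sketch, and the semitone range is an assumption.

```python
import numpy as np
import librosa

def augment_pitch(vocals, accompaniment, sr, max_steps=2.0):
    """Shift the vocal stem by a random number of semitones, then rebuild the mixture."""
    steps = np.random.uniform(-max_steps, max_steps)
    shifted = librosa.effects.pitch_shift(vocals, sr=sr, n_steps=steps)
    return shifted + accompaniment, shifted   # augmented mixture and its matching vocal target
```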
Overall, we made significant improvements over previous spectrogram-based methods. Going forward, we would like to explore how to incorporate phase information into our method.
[1] Lluis, F., Pons, J., and Serra, X. End-to-end music source separation: is it possible in the waveform domain? 2018. https://arxiv.org/abs/1810.12187
[2] Chandna, P., Miron, M., Janer, J., and Gómez, E. Monoaural Audio Source Separation Using Deep Convolutional Neural Networks. 2017. http://mtg.upf.edu/node/3680
[3] Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. The MUSDB18 Corpus for Music Separation. 2017. https://sigsep.github.io/datasets/musdb.html
[4] Smith, L. N. Cyclical Learning Rates for Training Neural Networks. 2015. https://arxiv.org/abs/1506.01186
[5] Smith, L. N. A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 -- Learning Rate, Batch Size, Momentum, and Weight Decay. 2018. https://arxiv.org/abs/1803.09820
[6] Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. 2017. https://arxiv.org/abs/1711.05101