Code to reproduce our results on the private test set, along with our learning takeaways from participating in the challenge. Place the dataset in a folder named `data` inside the repository.
Here's a non-exhaustive list of methods we tried, none of which improved the final EER:
- Adding `WavAugment` or `SpecAugment` to the signal during the training phase (a masking sketch follows this list)
- Using a trainable `SincConv` on the raw signal as an approximation of learnable filters before passing the output to the ECAPA (see the sketch below)
- Using alternative spectral representations, like `LogSpectrograms` instead of `MelSpectrograms`, and inverse-Mel filterbanks, which have been shown in the literature to be effective for modelling high-pitched speakers like children (see the sketch below)
- Feeding the audio to large-scale Wav2Vec/Transformer models (like UniSpeech-SAT and WavLM) followed by a TDNN head (see the sketch below). Reference
  - Pretrained UniSpeech-SAT models performed better than pretrained ECAPA models. Fine-tuned models didn't perform well due to a lack of training data - a direction worth exploring.
- Large-margin finetuning (see the sketch below)
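For reference, here is a minimal sketch of SpecAugment-style time/frequency masking using torchaudio's built-in transforms; the mask parameters are illustrative, not values we tuned:

```python
# Minimal SpecAugment-style masking sketch using torchaudio's built-in
# transforms. Mask parameters are illustrative, not tuned values.
import torch
import torchaudio.transforms as T

mel = T.MelSpectrogram(sample_rate=16000, n_mels=80)
freq_mask = T.FrequencyMasking(freq_mask_param=15)  # mask up to 15 mel bins
time_mask = T.TimeMasking(time_mask_param=35)       # mask up to 35 frames

waveform = torch.randn(1, 16000)            # 1 s of dummy audio at 16 kHz
spec = mel(waveform)                         # (1, 80, frames)
augmented = time_mask(freq_mask(spec))       # apply only during training
```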
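A sketch of the SincConv front-end idea, assuming SpeechBrain's `SincConv` implementation; the channel count and kernel size here are illustrative:

```python
# Sketch: trainable SincConv front-end on raw audio before the ECAPA
# encoder, assuming SpeechBrain's implementation. Sizes are illustrative.
import torch
from speechbrain.nnet.CNN import SincConv

sinc = SincConv(
    out_channels=80,            # stand-in for 80 learnable band-pass filters
    kernel_size=251,
    input_shape=[None, 16000],  # (batch, time) raw waveform input
    sample_rate=16000,
)

waveform = torch.randn(4, 16000)               # (batch, time) raw audio
features = sinc(waveform)                      # (batch, time, 80)
log_feats = torch.log(features.abs() + 1e-6)   # compress like a log filterbank
# log_feats can then replace the Fbank features fed to the ECAPA encoder
```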
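A sketch of the alternative spectral representations: a log (linear-frequency) spectrogram, and an inverse-Mel filterbank built by flipping torchaudio's Mel filterbank so that the narrow filters sit at high frequencies. The flip is one simple construction, not necessarily the exact one used in the literature:

```python
# Sketch of alternative front-ends: a log spectrogram, and an inverse-Mel
# filterbank that concentrates resolution at high frequencies (relevant
# for high-pitched infant speech). Shapes and values are illustrative.
import torch
import torchaudio.transforms as T
import torchaudio.functional as F

waveform = torch.randn(1, 16000)

# 1) Log (linear-frequency) spectrogram instead of a Mel spectrogram
spec = T.Spectrogram(n_fft=400, hop_length=160)(waveform)  # (1, 201, frames)
log_spec = torch.log(spec + 1e-6)

# 2) Inverse-Mel filterbank: flip the Mel filters along the frequency
#    axis so the narrow (high-resolution) filters land at high frequencies
mel_fb = F.melscale_fbanks(
    n_freqs=201, f_min=0.0, f_max=8000.0, n_mels=80, sample_rate=16000
)                                                          # (201, 80)
inv_mel_fb = torch.flip(mel_fb, dims=[0])
inv_mel_feats = torch.log(spec.transpose(1, 2) @ inv_mel_fb + 1e-6)
```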
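A sketch of the SSL-feature pipeline, assuming Hugging Face `transformers` checkpoints; a simple statistics-pooling head stands in for the TDNN head to keep the example short:

```python
# Sketch: frozen UniSpeech-SAT features with a small pooling head. The
# actual setup used a TDNN head; a stats-pooling + linear layer stands in
# here for brevity. Model name and dimensions are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel

ssl = AutoModel.from_pretrained("microsoft/unispeech-sat-base-plus")
ssl.eval()  # pretrained, not finetuned (finetuning hurt with little data)

class StatsPoolHead(nn.Module):
    def __init__(self, in_dim=768, emb_dim=192):
        super().__init__()
        self.proj = nn.Linear(2 * in_dim, emb_dim)  # mean+std -> embedding

    def forward(self, hidden):           # hidden: (batch, frames, in_dim)
        mean, std = hidden.mean(dim=1), hidden.std(dim=1)
        return self.proj(torch.cat([mean, std], dim=-1))

head = StatsPoolHead()
wav = torch.randn(2, 16000)  # raw 16 kHz audio; a Wav2Vec2FeatureExtractor
                             # would normally normalize this first
with torch.no_grad():
    frames = ssl(wav).last_hidden_state  # (batch, frames, 768)
emb = head(frames)                       # (batch, 192) speaker embedding
```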
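A sketch of large-margin finetuning with a self-contained additive-angular-margin (AAM) softmax; in practice this is usually combined with longer training crops, and the margin/scale values here are illustrative:

```python
# Sketch of large-margin finetuning: an additive-angular-margin (AAM)
# softmax whose margin m is raised (e.g. 0.2 -> 0.5) for the final
# finetuning epochs. Margin and scale values are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim, n_classes, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # cosine similarity between normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # add the margin m to the target-class angle only
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)

loss_fn = AAMSoftmax(emb_dim=192, n_classes=100, margin=0.2)
# ... after converging, raise the margin for large-margin finetuning:
loss_fn.margin = 0.5
```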
Our best-performing model was an ECAPA-TDNN with the same configuration and pipeline as the baseline. We modified the baseline in two ways:
- Performing test-time augmentation (TTA) during evaluation, as suggested by dienhoa (a sketch follows this list).
- Expanding the training data to include the entire train set as well as the birth recordings in the dev set.
  - The evaluation metric is EER on trial pairs with one birth and one discharge recording; since the dev discharge recordings are never seen in training, dev EER still retains its usefulness as a metric to judge a model's performance.
  - In fact, we have a smaller gap between dev and test EER (20% vs 25.7%) than the baseline (22% vs 30%).
  - To our surprise, including the entire dev set as training data led to a lower test EER.
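We don't reproduce dienhoa's exact TTA recipe here; as a rough sketch, one common variant averages embeddings over several random crops of the same utterance (`embed`, the crop length, and the crop count below are hypothetical stand-ins):

```python
# Sketch of test-time augmentation (TTA) for embedding extraction: average
# the embeddings of several random crops of the same utterance instead of
# doing a single forward pass. This is a generic variant, not necessarily
# the exact recipe suggested by dienhoa.
import torch

def tta_embedding(wav: torch.Tensor, embed, n_crops=5, crop_len=32000):
    """wav: (samples,) mono audio; embed: model mapping (1, samples) -> (1, d)."""
    crops = []
    for _ in range(n_crops):
        # falls back to the whole utterance when it is shorter than crop_len
        start = torch.randint(0, max(1, wav.numel() - crop_len), (1,)).item()
        crops.append(wav[start:start + crop_len])
    embs = torch.stack([embed(c.unsqueeze(0)).squeeze(0) for c in crops])
    return embs.mean(dim=0)  # averaged embedding used for cosine scoring
```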
Siddhant Rai Viksit [email protected]
Vinayak Abrol [email protected]