Automatic spoken language identification (LID) using deep learning.
The goal is to classify the spoken language in an audio file, a task that usually serves as the first step of an NLP or speech transcription pipeline.
We implemented two deep learning approaches, using the TensorFlow and Caffe frameworks with different model configurations.
- /data
- Scripts to download training data from VoxForge and YouTube. For usage details see below.
- /Evaluation
- Prediction scripts for single audio files or lists of files using Caffe
- /Preprocessing
- All scripts to create/extract the audio features: converts WAV audio files into spectrogram and mel-filterbank spectrogram images using a Spark pipeline. A minimal sketch of the core conversion step follows this directory overview.
- To convert a directory of WAV audio files using the Spark pipeline, run:

  ./run.sh --inputPath {input_path} --outputPath {output_path} | tee sparkline.log
- /models
- All our Caffe models: Berlin_net, Topcoder, VGG_M
- Berlin_net: 3 convolutional layers with batch normalization and 2 fully connected layers (shallow architecture)
- Topcoder_net: deep architecture inspired by Topcoder's spoken language identification challenge
- Fine-tuning of VGG_M
- /tensorflow
- All the code for setting up and training various models with TensorFlow.
- Includes training and prediction scripts; see train.py and predict.py.
- Configure your learning parameters in config.yaml (a hypothetical example of reading these parameters follows this directory overview).
- Add or change networks under /tensorflow/networks/instances/.
- /tools
- Some handy scripts to clean filenames, normalize audio files, and more.
- /webserver
- A web demo to upload audio files for prediction.
- See the included README
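As referenced in the /Preprocessing notes above, here is a minimal, hedged sketch of the core WAV-to-spectrogram-image conversion. It uses scipy and Pillow instead of the repo's Spark pipeline, and the file names, window size, overlap, and log scaling are illustrative assumptions, not the project's exact parameters.

```python
# Illustrative only: converts one WAV file to a grayscale spectrogram image.
# The repo's Spark pipeline performs this step at scale; the parameters here
# (window size, overlap, log scaling) are assumptions for this sketch.
import numpy as np
from PIL import Image
from scipy.io import wavfile
from scipy.signal import spectrogram

def wav_to_spectrogram_image(wav_path, png_path, nperseg=400, noverlap=240):
    rate, samples = wavfile.read(wav_path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)          # mix stereo down to mono
    _, _, spec = spectrogram(samples.astype(np.float32), fs=rate,
                             nperseg=nperseg, noverlap=noverlap)
    spec = np.log1p(spec)                       # compress the dynamic range
    spec = (255.0 * spec / (spec.max() or 1.0)).astype(np.uint8)
    # Flip so low frequencies end up at the bottom of the image
    Image.fromarray(np.ascontiguousarray(np.flipud(spec))).save(png_path)

wav_to_spectrogram_image("sample.wav", "sample.png")
```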
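And a hypothetical illustration of reading the learning parameters from config.yaml, as mentioned in the /tensorflow notes. The actual schema is defined by this project's train.py and may differ; the defaults below come from the training setup described later in this README.

```python
# Hypothetical keys: the real config.yaml schema is defined by train.py.
import yaml  # PyYAML

with open("config.yaml") as f:
    config = yaml.safe_load(f)

learning_rate = config.get("learning_rate", 0.001)
batch_size = config.get("batch_size", 64)
```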
Requirements:
- Caffe
- TensorFlow
- Spark
- Python 2.7
- OpenCV 2.4
- youtube_dl
# Install additional Python requirements
pip install -r requirements.txt
pip install youtube_dl
Downloads training data / audio samples from various sources.
- Downloads the audio samples from www.voxforge.org for several languages:

  /data/voxforge/download-data.sh
  /data/voxforge/extract_tgz.sh {path_to_german.tgz} german
- Downloads various news channels from YouTube; a minimal youtube_dl sketch follows below.
- Configure channels/sources in youtube/sources.yml, then run:

  python /data/youtube/download.py
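For reference, a minimal sketch of what the YouTube audio download could look like with youtube_dl. The channel URL and output template are placeholders, and the real logic (including reading youtube/sources.yml) lives in /data/youtube/download.py.

```python
# Sketch only: downloads the audio track of a channel's videos as WAV.
# Requires ffmpeg for the audio extraction post-processing step.
import youtube_dl

options = {
    "format": "bestaudio/best",              # prefer audio-only streams
    "outtmpl": "downloads/%(id)s.%(ext)s",   # placeholder output template
    "postprocessors": [{
        "key": "FFmpegExtractAudio",
        "preferredcodec": "wav",
    }],
}

with youtube_dl.YoutubeDL(options) as ydl:
    # Placeholder URL; the real channel list comes from youtube/sources.yml
    ydl.download(["https://www.youtube.com/user/SOME_NEWS_CHANNEL"])
```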
We trained models for two and for four languages (English, German, French, Spanish).
The top-scoring networks were trained with 15,000 images per language, a batch size of 64, and a learning rate of 0.001 that was decayed to 0.0001 after 7,000 iterations.
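That decay schedule corresponds to a piecewise-constant learning rate. In TensorFlow 1.x style it could be expressed as below; the optimizer choice is an assumption for illustration, not necessarily what train.py uses.

```python
# Piecewise-constant schedule: 0.001 until step 7,000, then 0.0001.
import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.piecewise_constant(global_step,
                                            boundaries=[7000],
                                            values=[0.001, 0.0001])
# Illustrative optimizer; the project's training script may use a different one.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
```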
# Caffe:
/models/{model_name}/training.sh

# TensorFlow:
python /tensorflow/train.py
The label mappings are:

- 0: English
- 1: German
- 2: French
- 3: Spanish
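To illustrate how these labels come out of a trained model, here is a hedged single-image prediction sketch using the Caffe Python API. The deploy/weights file names and the "prob" output blob name are assumptions; the scripts in /Evaluation are the canonical way to run predictions.

```python
# Assumptions: grayscale spectrogram input, a softmax output blob named
# "prob", and placeholder model file names.
import caffe

LABELS = ["English", "German", "French", "Spanish"]  # indices 0-3 as above

caffe.set_mode_cpu()
net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)

image = caffe.io.load_image("sample.png", color=False)  # HxWx1, floats in [0,1]
image = image.transpose(2, 0, 1)                        # to CxHxW

net.blobs["data"].reshape(1, *image.shape)
net.blobs["data"].data[...] = image
probabilities = net.forward()["prob"][0]
print(LABELS[int(probabilities.argmax())])
```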
For training we used both the public VoxForge dataset and newsreel videos downloaded from YouTube. Check out the /data directory for the download scripts.