Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.
We provide state-of-the-art training recipes for the following speech datasets:
- September 2019: We are working to isolate Espresso from fairseq, resulting in a standalone package that can be installed directly with pip install.
- PyTorch version >= 1.2.0
- Python version >= 3.5
- For training new models, you'll also need an NVIDIA GPU and NCCL
- For faster training, install NVIDIA's apex library with the --cuda_ext option
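The version requirements above can be checked before installing. The following is a minimal sketch, not part of Espresso: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and it only assumes that PyTorch exposes its version as `torch.__version__`. Versions are compared numerically, since a plain string comparison would rank "1.10.0" below "1.2.0".

```python
import re
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 5)
MIN_TORCH = (1, 2, 0)


def parse_version(text):
    """Return the leading numeric part of a version string as a tuple of ints.

    Local suffixes such as "+cu92" are ignored, and short versions such as
    "1.2" are padded with zeros to three components.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no numeric version found in %r" % text)
    parts = [int(part) for part in match.group(0).split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version, minimum):
    """Compare numerically so that "1.10.0" counts as newer than "1.2.0"."""
    return parse_version(version) >= tuple(minimum)


def check_environment():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.5 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        found = str(torch.__version__)
        if not meets_minimum(found, MIN_TORCH):
            problems.append("PyTorch >= 1.2.0 is required, found " + found)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

This sketch does not check for a GPU, NCCL, or apex; those are only needed for training and faster training, respectively.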
Currently, Espresso only supports installation from source.
To install Espresso from source and develop locally:
git clone https://github.com/freewym/espresso
cd espresso
pip install --editable .
pip install kaldi_io
pip install sentencepiece
cd speech_tools; make KALDI=<path/to/a/compiled/kaldi/directory>
Add your Python path to the PATH variable in examples/asr_<dataset>/path.sh; the current default is ~/anaconda3/bin.
kaldi_io is required for reading Kaldi scp files. sentencepiece is required for training subword models and encoding text into subword pieces. Kaldi is required for data preparation, feature extraction and scoring for some datasets (e.g., Switchboard).
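For readers unfamiliar with Kaldi scp files, the sketch below illustrates the format that kaldi_io is used to read. It is not kaldi_io's API and does not load any feature data: the function names and the file paths are hypothetical. It assumes the common form in which each line holds an utterance key followed by a file location, optionally ending in `:<byte offset>` into an archive.

```python
def parse_scp_line(line):
    """Split one scp line into (key, path, offset).

    The key is everything before the first run of whitespace. If the rest
    ends in ":<digits>", that number is returned as a byte offset; otherwise
    the location is returned unchanged with an offset of None.
    """
    key, location = line.strip().split(None, 1)
    path, sep, offset = location.rpartition(":")
    if sep and offset.isdigit():
        return key, path, int(offset)
    return key, location, None


def read_scp(lines):
    """Parse an iterable of scp lines, skipping blank ones."""
    entries = []
    for line in lines:
        if line.strip():
            entries.append(parse_scp_line(line))
    return entries


if __name__ == "__main__":
    # Hypothetical entries; real scp files are produced during data preparation.
    example = [
        "utt1 /data/feats.ark:13\n",
        "utt2 /data/feats.ark:2048\n",
    ]
    for key, path, offset in read_scp(example):
        print(key, path, offset)
```

Locations that are not a plain `path:offset` pair, such as piped commands or entries with row ranges, are passed through untouched rather than interpreted.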
Espresso is MIT-licensed.
Please cite Espresso as:
@inproceedings{wang2019espresso,
title = {Espresso: A Fast End-to-end Neural Speech Recognition Toolkit},
author = {Yiming Wang and Tongfei Chen and Hainan Xu
and Shuoyang Ding and Hang Lv and Yiwen Shao
and Nanyun Peng and Lei Xie and Shinji Watanabe
and Sanjeev Khudanpur},
booktitle = {2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
year = {2019},
}