An exmpale of the data structure required by ESPnet. The required format can be found in data
|-- data
| |-- test
| | |-- spk2utt # Mapping a speaker-ID to a list of utterance-IDs
| | |-- text # Mapping a utterance-ID to a text
| | |-- utt2spk # Mappinng a utterance-ID to a speaker-ID
| | `-- wav.scp # Mappinng a utterance-ID to a path of audio file
| |-- train
| | |-- spk2utt
| | |-- text
| | |-- utt2spk
| | `-- wav.scp
| `-- valid
| |-- spk2utt
| |-- text
| |-- utt2spk
| `-- wav.scp
ESPnet follows the data strcutre developed by Kaldi-asr: A data-directory must contain some texts, wav.scp
, text
, and etc. which have common format to describe DNN corpus. The format is space
separated and must be two columns. The first column is taken as ID
and the second is some value.
<ID> <value>
<ID> <value>
...
e.g. wav.scp
shows <Sample ID> <Wave file path>
.
-
The directory name in
data
is arbitrary:train
,valid
, andtest
can betr
,cv
, andeval
for example. -
The path for
wav.scp
can be both absolute path or relative path from the base directory (egs2/<corpus-name>/<task-name>
, e.g.egs2/an4/asr1
).uttidA /absolute/path/uttidA.wav uttidB ./relative/path/uttidB.wav
- Maybe, we assume
monaural
and16bit-signed-integer-pcm
audio file. Any sampling rates are okay.- Please check your audio format in advance using
sox
orffmpeg
for example.soxi youraudio.wav
- Please check your audio format in advance using
- (THIS IS EXTENSION BY ESPNET, NOT KALDI FORMAT) Not only
wav
, but alsoflac
can be used (Supported audio format can be extended)uttidA a.flac uttidB b.flac
wavs
directory exists at the same level ofdata
in this example, but this is not requirement.
- Maybe, we assume
-
If you don't have speaker information, it can be dummy data because actually most recipes don't use speaker information.
uttidA dummy uttidB dummy uttidC dummy ...
-
spk2utt
can be gernated fromutt2spk
using utils/utt2spk_to_spk2utt.pl, andutt2spk
can be gernated from fromspk2utt
using utils/spk2utt_to_utt2spk.pl.utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt utils/spk2utt_to_utt2spk.pl data/train/spk2utt > data/train/utt2spk
-
To check and force your directory to satisfy expected format, use utils/validate_data_dir.sh and utils/fix_data_dir.sh
# check utils/validate_data_dir.sh --no-feats data/train # Force format (This is irreversible change) utils/fix_data_dir.sh data/train
-
It's okay to contains the other files (
foo.txt
in the following example) in the data directory. They are not referred.data `-- test |-- foo.txt |-- spk2utt |-- text |-- utt2spk `-- wav.scp
|-- segments
|-- spk2utt
|-- text
|-- utt2spk
`-- wav.scp
If your audio data is long recording and each audio file includes multiple utterances, you need to put segments
file to specify the start time and end time of each utterance. The format is <utterance_id> <wav_id> <start_time> <end_time>
(in seconds).
sw02001-A_000098-001156 sw02001-A 0.98 11.56
...
Note that if using segments
, wav.scp
has <wav_id>
instead of utterance_id
.
sw02001-A /path/to/sw02001-A.wav
...
wav.scp
provides a feature to describe the file format of wavfiles converted by an arbitral command without saving the files actually. The usage is as following:
foo_id some_command /path/to/foo.wav |
Note that in our style, the environment variable for scripts are set by path.sh
, so please check that the command exists in the ${PATH}
.
If a line ends with |
, it indicates using this pipeline mechanism and our Python script derives the output data from the command via pipeline (We are using https://github.com/nttcslab-sp/kaldiio).
e.g. Change sampling rate, encoding, bits, etc.
sox input.wav -b16 -e unsigned-integer -r 16000 -t wav - |
e.g. Channel selection
sox stereo.wav -c 1 -t wav - |
sox stereo.wav -c 2 -t wav - |