This is a PyTorch implementation of FFTNet described here. Work in progress.
- Install requirements
pip install -r requirements.txt
- Download the CMU_ARCTIC dataset.
- Train the model and save it. The default parameters are pretty much the same as in the original paper. Pass the flag --preprocess the first time you run it.
python train.py \
--preprocess \
--wav_dir your_downloaded_wav_dir \
--data_dir preprocessed_feature_dir \
--model_file saved_model_name
- Use the trained model to decode/reconstruct a wav file from the mcc feature.
python decode.py \
--infile wav_file \
--outfile reconstruct_file_name \
--data_dir preprocessed_feature_dir \
--model_file saved_model_name
FFTNet_generator and FFTNet_vocoder are two files I used to check that the model works, using the torchaudio yesno dataset.
There are some decoded files in the samples folder.
- Window size: 400. This depends on minimum_f0, because I use pyworld to extract f0 and mcc.
- Zero padding.
- Injected noise.
- Voiced/unvoiced conditional sampling.
- Post-synthesis denoising.
- I combine two 1x1 convolution kernels into one 1x2 dilated kernel. This removes a redundant bias parameter and speeds up the model overall.
- The author said that the channel size in the middle layers is 128, not 256.
- My model gets stuck at the beginning (loss around 4.x) for thousands of steps, then drops very quickly to 2.6 ~ 3.0. Using a smaller learning rate helps a little.
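The kernel merge mentioned in the notes above can be checked with a small pure-Python sketch. It is not the code from this repo: the helper names (`two_pointwise_convs`, `dilated_kernel2_conv`) and the tiny shapes are made up for illustration. It only shows that summing two 1x1 convolutions on `x[t]` and `x[t + shift]` gives the same result as one kernel-size-2 dilated convolution whose single bias is the sum of the two original biases.

```python
import random


def matvec(weight, vec):
    """Multiply an (out x in) weight matrix by an input channel vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def two_pointwise_convs(x, w_left, b_left, w_right, b_right, shift):
    """Sum of two 1x1 convolutions applied to x[t] and x[t + shift]."""
    out = []
    for t in range(len(x) - shift):
        left = matvec(w_left, x[t])
        right = matvec(w_right, x[t + shift])
        out.append([l + bl + r + br
                    for l, bl, r, br in zip(left, b_left, right, b_right)])
    return out


def dilated_kernel2_conv(x, w_left, w_right, bias, dilation):
    """One kernel-size-2 convolution with the given dilation and one bias."""
    out = []
    for t in range(len(x) - dilation):
        left = matvec(w_left, x[t])
        right = matvec(w_right, x[t + dilation])
        out.append([l + r + b for l, r, b in zip(left, right, bias)])
    return out


def random_matrix(rows, cols, rng):
    """Matrix of uniform random values in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes, chosen only to keep the check fast.
rng = random.Random(0)
in_ch, out_ch, steps, shift = 3, 4, 8, 2
x = random_matrix(steps, in_ch, rng)  # x[t] is the channel vector at time t
w_left = random_matrix(out_ch, in_ch, rng)
w_right = random_matrix(out_ch, in_ch, rng)
b_left = [rng.uniform(-1, 1) for _ in range(out_ch)]
b_right = [rng.uniform(-1, 1) for _ in range(out_ch)]

separate = two_pointwise_convs(x, w_left, b_left, w_right, b_right, shift)

# The two biases collapse into one, which is the redundancy being removed.
merged_bias = [bl + br for bl, br in zip(b_left, b_right)]
merged = dilated_kernel2_conv(x, w_left, w_right, merged_bias, shift)

max_diff = max(abs(a - b)
               for row_a, row_b in zip(separate, merged)
               for a, b in zip(row_a, row_b))
print(len(merged), max_diff < 1e-9)
```

In PyTorch the merged form would presumably be a single `torch.nn.Conv1d(in_ch, out_ch, kernel_size=2, dilation=shift)`, which replaces two layers and two bias vectors with one of each.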
Use the flag --radixs to specify each layer's radix.
# a radix-4 FFTNet with a receptive field of 1024
python train.py --radixs 4 4 4 4 4
The original FFTNet uses a radix-2 structure. In my experiments, a radix-4 network, and even a radix-8 one, can still achieve similar results, and because it needs fewer layers, it runs faster.
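As a rough illustration of the layer-count trade-off, the receptive field is the product of the per-layer radixes, which is how five radix-4 layers reach 1024. The helper names below (`receptive_field`, `layers_needed`) are hypothetical and not part of this repo.

```python
from math import prod


def receptive_field(radixs):
    """Each layer with radix r multiplies the receptive field by r."""
    return prod(radixs)


def layers_needed(radix, target):
    """Smallest number of same-radix layers whose receptive field reaches target."""
    layers, field = 0, 1
    while field < target:
        field *= radix
        layers += 1
    return layers


print(receptive_field([4, 4, 4, 4, 4]))  # five radix-4 layers
print(receptive_field([2] * 10))         # ten radix-2 layers
print(layers_needed(8, 1024))            # radix-8 layers needed to cover 1024
```

A radix-4 network covers the same 1024 samples with half as many layers as a radix-2 one, which is where the speed-up comes from.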
Fig. 2 in the paper can be redrawn as a dilated structure with kernel size 2 (which also means radix 2).
If we draw all the lines,
and transpose the graph so that the arrows go backward, you'll find a WaveNet dilated structure.
Add the flag --transpose to get a simplified version of WaveNet.
# a WaveNet-like model without gated/residual/skip units.
python train.py --transpose
In my experiments, the transposed models are easier to train and have a slightly lower training loss compared to FFTNet.
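The transposition argument can be sketched by listing the per-layer dilation (the shift between the inputs a layer combines) from the first layer to the last. This is only an illustration of the idea, with hypothetical function names. It assumes each FFTNet layer splits its input into `radix` equal parts, and it does not show how --transpose is actually implemented in train.py.

```python
def fftnet_dilations(radixs):
    """Dilation of each layer, input side first.

    A layer's dilation is the product of the radixes of all later layers,
    so the first layer has the widest shift and the last layer has shift 1.
    """
    remaining = 1
    for r in radixs:
        remaining *= r
    dilations = []
    for r in radixs:
        remaining //= r
        dilations.append(remaining)
    return dilations


def transposed_dilations(radixs):
    """Reversing the graph reverses the layer order, WaveNet style."""
    return fftnet_dilations(radixs)[::-1]


print(fftnet_dilations([2, 2, 2, 2]))      # widest shift first
print(transposed_dilations([2, 2, 2, 2]))  # shift 1 first, doubling per layer
print(fftnet_dilations([4, 4, 4, 4, 4]))
```

With radix 2 the transposed schedule is the familiar 1, 2, 4, 8 pattern of a WaveNet dilation stack, just without the gated, residual, and skip units.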