An Attention-Based, Open-Source, End-to-End Speech Synthesis Framework: No CNN, No RNN, No MFCC!
Before Deep Expression, no pipeline in the speech synthesis area was a truly end-to-end solution: not Deep Voice from Baidu, not Tacotron from Google, and so forth.
This is because:
- None of them gave up traditional audio domain knowledge such as voice framing, STFT, and MFCC. But I always wondered, for example: why can't we learn the kernels of MFCC through backpropagation? (See the sketch below.)
- All pipelines in the speech synthesis area before Deep Expression need lots of preprocessing: text encoding, speech preprocessing, and post-processing.
For instance, WaveNet (Aaron van den Oord et al., 2016) requires significant domain expertise to produce, involving elaborate text-analysis systems as well as a robust lexicon (Jonathan Shen et al., 2017). Tacotron (Yuxuan Wang et al., 2017), Tacotron 2 (Jonathan Shen et al., 2017), and Deep Voice 3 (Wei Ping et al., 2017) all use a vocoder (Griffin-Lim, WORLD, or WaveNet) for final audio synthesis.
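As a rough illustration of the first point, here is a minimal TensorFlow 1.x sketch, not code from this repository, of replacing fixed MFCC kernels with kernels learned by backpropagation (`frame_size` and `n_kernels` are assumed values):

```python
import tensorflow as tf

frame_size = 400  # samples per frame; assumed, e.g. 25 ms at 16 kHz
n_kernels = 40    # number of learned analysis filters; assumed

# Framed raw audio instead of hand-crafted features.
frames = tf.placeholder(tf.float32, [None, frame_size])

# Instead of fixed DCT/mel-filterbank kernels, make the kernels trainable
# so backpropagation can learn them end to end.
kernels = tf.Variable(
    tf.truncated_normal((frame_size, n_kernels), stddev=0.1), name='kernels')

# These features play the role that MFCCs would in a classical pipeline.
features = tf.matmul(frames, kernels)
```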
Therefore, I want to open up the Deep Expression framework, which synthesizes audio signals from text directly.
In previous frameworks (Aaron van den Oord et al., 2016; Jonathan Shen et al., 2017; Yuxuan Wang et al., 2017; Wei Ping et al., 2017), audio data tended to be normalized, which may eventually lose the sound rhythm. Even though those systems claimed to synthesize natural human voice, there is still a gap between their synthesized audio and real vocals. In Deep Expression, the model is trained on 16-bit integer signals directly, to synthesize amazingly real human voice.
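For concreteness, here is a minimal sketch of what training on raw 16-bit integer signals means, using the SciPy version listed in the requirements below (the file name is hypothetical):

```python
import numpy as np
from scipy.io import wavfile

# Read the waveform as raw 16-bit integers: no mel/MFCC extraction and
# no amplitude normalization, so the original dynamics are preserved.
sample_rate, signal = wavfile.read('sample.wav')  # hypothetical file
assert signal.dtype == np.int16

# Cast to float for the loss computation only; the values themselves
# stay on the original 16-bit integer scale.
targets = signal.astype(np.float32)
```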
```
python == 3.6.1
numpy == 1.12.1
tensorflow == 1.3.0
scipy == 0.19.0
```
```bash
python preprocess.py
python train.py
```
It works well! This project is under revision, and the pipeline is still at the demo stage.
- First to use a pure DNN-based model to synthesize speech.
- First to use a chars-to-signals end-to-end method, and the first speech synthesis pipeline that does not use MFCC.
- First to synthesize very good speech without hurting the sound rhythm.
- First end-to-end speech synthesis framework that needs no post-processing.
- So far, Deep Expression is the fastest end-to-end model in the speech synthesis area.
- A new algorithm, the weight-share DNN (`w = tf.tile(tf.truncated_normal((step_size, D), mean=0.0, stddev=1, dtype=tf.float32, seed=None), [int(D/step_size), 1], name='w')`), is introduced in this project; see the sketch after this list.
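A minimal sketch of the weight-share idea, assuming the tiled matrix feeds a dense layer; the `step_size` and `D` values are illustrative, and the `tf.Variable` wrapper is my addition so the shared block actually receives gradients:

```python
import tensorflow as tf  # tensorflow == 1.3.0, as listed above

step_size = 64  # height of the shared weight block; illustrative value
D = 512         # layer width; must be divisible by step_size

# One small trainable block of weights...
w_block = tf.Variable(
    tf.truncated_normal((step_size, D), mean=0.0, stddev=1.0, dtype=tf.float32),
    name='w_block')

# ...tiled D/step_size times along the first axis, so every tile shares
# the same parameters: a (D, D) weight matrix with only step_size * D
# trainable entries.
w = tf.tile(w_block, [int(D / step_size), 1], name='w')

x = tf.placeholder(tf.float32, [None, D])
y = tf.matmul(x, w)  # weight-share dense layer
```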
If you publish work based on Deep Expression, please cite:
https://github.com/ttsunion/Deep-Expression
The layer-normalization and positional-encoding functions were copied directly from Kyubyong (https://github.com/Kyubyong/transformer). All the remaining code was written by me.
- Aaron van den Oord et al., 2016. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf
- Jonathan Shen et al., 2017. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. https://arxiv.org/pdf/1712.05884.pdf
- Yuxuan Wang et al., 2017. Tacotron: Towards End-to-End Speech Synthesis. https://arxiv.org/pdf/1703.10135.pdf
- Wei Ping et al., 2017. Deep Voice 3: 2000-Speaker Neural Text-to-Speech. https://arxiv.org/pdf/1710.07654.pdf