Talk:WaveNet


Suggested edits


I work for DeepMind, the company that invented this technology. I respect Wikipedia's editorial guidelines, so I do not want to edit this page directly. However, the stub is currently inaccurate and incomplete.

WaveNet was invented by DeepMind, not Google, so this stub should be titled DeepMind WaveNet or simply WaveNet[1]. The stub incorrectly says the technology is competing with a yet-to-be-released Adobe audio editing product; WaveNet is not a product and the Adobe product has not been released, so this is a false comparison. It also makes unreferenced claims about the technology being able to steal someone's voice, and it does not explain how the technology works or how it might be applied. More detail on WaveNet can be found in the arXiv paper[2] or in many of the press articles from the time of its release in September 2016.

A possible template for the article, including full references, follows in the next section:

Suggested edits to this page


WaveNet is a deep neural network for generating raw audio created by researchers at London-based artificial intelligence firm DeepMind. The technique, outlined in a paper in September 2016,[3] is able to generate realistic-sounding human voices by sampling real human speech and directly modelling waveforms. Tests with US English and Mandarin reportedly showed that the system outperforms the best existing Text-to-Speech systems from Google, although it is still less convincing than actual human speech.[4] WaveNet’s ability to generate raw waveforms means that it can model any kind of audio, including music[5]. Canada-based start-up Lyrebird offers similar technology, but it is based on a different deep learning model[6].

History

Generating speech from text is becoming an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft’s Cortana, Amazon’s Alexa, and the Google Assistant[7].

Most of today’s systems use a variation of a technique that involves stitching sound fragments together to form recognisable sounds and words[8]. The most common of these is called concatenative TTS[9]. It consists of a large library of speech fragments, recorded from a single speaker, that are then combined - or concatenated - to produce complete words and sounds. The technique can often sound unnatural, with an unconvincing cadence and tone[10]. The reliance on a recorded library also makes it difficult to modify or change the voice[11].
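For illustration only, a minimal Python sketch of the concatenation step, assuming a made-up fragment library and crossfade length (random noise stands in for real recordings):

 import numpy as np

 # Hypothetical library of recorded speech fragments from a single speaker
 # (random noise stands in for the real recordings).
 fragment_library = {
     "HH": np.random.randn(800),
     "EH": np.random.randn(1200),
     "L":  np.random.randn(900),
     "OW": np.random.randn(1500),
 }

 def concatenate(units, crossfade=80):
     # Join the stored fragments with a short linear crossfade so the seams
     # between them are less audible.
     out = fragment_library[units[0]].copy()
     ramp = np.linspace(0.0, 1.0, crossfade)
     for unit in units[1:]:
         nxt = fragment_library[unit]
         out[-crossfade:] = out[-crossfade:] * (1 - ramp) + nxt[:crossfade] * ramp
         out = np.concatenate([out, nxt[crossfade:]])
     return out

 waveform = concatenate(["HH", "EH", "L", "OW"])  # roughly "hello"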

Another technique, known as parametric TTS[12], uses mathematical models to recreate known sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech itself is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural-sounding audio.
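To illustrate the idea (a toy sketch in Python, not a real vocoder; the frame length, pitch and loudness values are invented), per-frame parameters predicted by a model can drive a simple sinusoidal synthesiser:

 import numpy as np

 SAMPLE_RATE = 16000   # output samples per second
 FRAME_LEN = 400       # 25 ms frames at 16 kHz

 def synthesise(frames):
     # frames: list of (fundamental_frequency_hz, amplitude) pairs, one per
     # frame; these stand in for the parameters a statistical model predicts.
     phase = 0.0
     out = []
     for f0, amp in frames:
         t = np.arange(FRAME_LEN) / SAMPLE_RATE
         out.append(amp * np.sin(2 * np.pi * f0 * t + phase))
         phase += 2 * np.pi * f0 * FRAME_LEN / SAMPLE_RATE
     return np.concatenate(out)

 # A steadily rising pitch at constant loudness.
 waveform = synthesise([(120 + 2 * i, 0.3) for i in range(50)])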

WaveNet

WaveNet is a type of feed-forward artificial neural network known as a deep convolutional neural network. These consist of layers of interconnected nodes similar to the brain’s neurons. The CNN takes a raw signal as an input and synthesises an output one sample at a time[13].
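A rough sketch of such a stack of dilated causal convolutions, assuming PyTorch (the channel count, number of layers and simplified residual connection are illustrative choices, not the published configuration):

 import torch
 import torch.nn as nn
 import torch.nn.functional as F

 class CausalConv1d(nn.Conv1d):
     # A 1-D convolution that only looks at past samples, so the prediction
     # for a sample never depends on future audio.
     def __init__(self, channels, kernel_size, dilation):
         super().__init__(channels, channels, kernel_size, dilation=dilation)
         self.left_pad = (kernel_size - 1) * dilation

     def forward(self, x):
         x = F.pad(x, (self.left_pad, 0))  # pad the left (past) side only
         return super().forward(x)

 class TinyWaveNetStack(nn.Module):
     def __init__(self, channels=32, layers=8):
         super().__init__()
         # The dilation doubles at every layer (1, 2, 4, ...), which grows
         # the receptive field exponentially with depth.
         self.convs = nn.ModuleList(
             CausalConv1d(channels, kernel_size=2, dilation=2 ** i)
             for i in range(layers)
         )
         self.out = nn.Conv1d(channels, 256, 1)  # scores over 256 possible 8-bit sample values

     def forward(self, x):                  # x: (batch, channels, time)
         for conv in self.convs:
             x = torch.tanh(conv(x)) + x    # simplified residual connection
         return self.out(x)                 # per-timestep scores for the next sample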

In the 2016 paper, the network was fed real waveforms of English and Mandarin speech. As these pass through the network, it learns a set of rules to describe how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms from scratch at 16,000 samples per second. These waveforms include realistic breaths and lip smacks - but do not conform to any language[14].
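The sample-by-sample generation loop can be pictured like this (Python sketch; predict_next_distribution is only a stand-in for a trained WaveNet, not a real API):

 import numpy as np

 SAMPLE_RATE = 16000  # the paper's model generates 16,000 samples per second
 rng = np.random.default_rng(0)

 def predict_next_distribution(history):
     # Stand-in for the trained network: it would return probabilities over
     # the 256 possible 8-bit (mu-law quantised) values of the next sample,
     # conditioned on every sample generated so far.
     logits = rng.standard_normal(256)
     probs = np.exp(logits - logits.max())
     return probs / probs.sum()

 def generate(seconds=0.01):
     samples = []
     for _ in range(int(seconds * SAMPLE_RATE)):
         probs = predict_next_distribution(samples)
         samples.append(rng.choice(256, p=probs))  # draw the next sample, feed it back
     return np.array(samples)

 audio = generate()  # 160 quantised samples (random here; speech-like once trained)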

WaveNet is able to accurately model different voices, with the accent and tone of the input correlating with the output. For example, if it is trained on German, it produces German speech[15]. This ability to clone voices has raised ethical concerns about WaveNet's ability to mimic anyone's voice.

This capability also means that if WaveNet is fed other inputs - such as music - its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce classical-sounding music[16].

Applications

At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real-world applications.[17]

BeeRightHere (talk) 13:45, 19 June 2017 (UTC)

References

Thank you Brookie (Whisper...) - I really appreciate your help with this. One last question: is it possible to edit the title of a page? The page is currently called Google WaveNet, but the technology was invented by DeepMind[1]. If it is possible, I would suggest calling it just "WaveNet" or, if it needs to be more specific, "DeepMind WaveNet". Any thoughts or help gratefully received. I should also point out that I added a missing "from" in the text. I thought I was doing this on the talk page but realised I had done it on the live article. I am pointing this out for full transparency. Thank you again for your help. BeeRightHere (talk) 21:12, 20 June 2017 (UTC)

As I have said before on this thread, I work for DeepMind and respect the Wikipedia process. I am grateful to the community for the work they have done to improve this page. I wondered if I could ask for your help with one last thing: could someone move this page to make it clear that the technology was invented by DeepMind[2]? Currently the title says Google WaveNet, but it should say either just WaveNet or DeepMind WaveNet[3]. Many thanks BeeRightHere (talk) 08:55, 21 June 2017 (UTC)

"Stealing somebody's voice": STS instead of TTS


With a contributor speaking of "stealing somebody's voice by means of sampling" above, I think what's missing in the article is whether WaveNet is not only capable of better Text-to-Speech by means of sampling somebody's voice, but also whether Speech-to-Speech would be possible with it. Basically, a modern voice actor reads a text into WaveNet, and the program replaces his voice with a desired, previously sampled one (such as a dead voice actor whose voice one wishes to resurrect), where the program would automatically identify all the speech parameters (such as intonation, pitch, modulation, timing, vibrato, speed, etc.) required for good acting and apply them to the sampled voice one wishes to use. That would not only speed up the process enormously by no longer requiring a lot of parameters to be set manually (think of actually, intuitively playing synthesizer keys, rather than having to arduously enter a lot of abstract notes by means of mouse clicks), it would also lead to a much more natural-sounding result. --2003:EF:13C6:FE20:39B0:9DA1:4D14:38BC (talk) 16:01, 30 April 2019 (UTC)

Okay, I've been probing the DeepMind research paper section some. This June 2018 paper, Disentangled Sequential Autoencoder by Li & Mandt, reads pretty much like a description of a method for such STS (instead of TTS), calling it "content swapping" and evaluating its empirical results, which means they've already achieved it in practice: a preferably two-digit number of hours of recordings of both the source and the target voice needs to be sampled first, after which any random text recorded in one voice can be converted into the other voice while keeping the text and other features of the source recording after the conversion. (This is notably different from what the French Ircam program was already able to do back in 2015, which was simply stretching or shortening simulated vocal cords and the other sex's resonant head size in order to create merely a sex-changed version of that exact same voice at high quality while maintaining the recording's original text and all.) Relevant quotes from the paper:
  • Page 1: "For audio, this allows us to convert a male speaker into a female speaker and vice versa [...]"
  • Page 5: "We also experiment on audio sequence data. Our disentangled representation allows us to convert speaker identities into each other while conditioning on the content of the speech."
  • Page 6: "We perform voice conversion experiments to demonstrate the disentanglement of the learned representation."
  • Page 8: "Our approach allows us to perform [...] feature swapping, such as voice conversion [...]. [...] An advantage of the model is that it separates dynamical from static features [...]." As far as I can tell from the paper, the dynamical features are those that WaveNet is required to convert, i.e. "voice identity", while the static features are those WaveNet is to recognize and apply to the target voice, i.e. the read-out text, the present mood, basically all the parameters required for acting (either that, or it's the other way around); a toy sketch of this static/dynamic split follows after this list.
  • This January 2019 follow-up paper, Unsupervised speech representation learning using WaveNet autoencoders by Chorowski et al., details a method to enhance the automatic recognition of, and discrimination between, dynamical and static features for "content swapping", notably including swapping voices, in order to make it more reliable.
  • Also relevant may be this 2014 paper, Towards End-to-End Speech Recognition with Recurrent Neural Networks by Graves and Jaitly, about highly efficient, automatic, speaker-independent speech-to-text recognition with a low error rate (think of the public Adobe Voco demonstration in November 2016 of how a feature like that can be used to change the spoken text of an existing recording by means of simple text editing, with an absolutely natural, authentic, and genuine-sounding result; STS voice conversion combined with such simple text editing of existing recordings could make an enormously powerful tool). Adobe Premiere Pro used to temporarily have such a highly efficient, low-error, speaker-independent feature built in during the latter 2000s for purposes of fast automatic keyword tagging of interview content, before it was removed around the early 2010s without a given reason.
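To make that static/dynamic split concrete, here is a toy Python sketch of the content-swapping idea; the encoders and decoder are crude stand-ins of my own invention, not the paper's actual architecture:

 import numpy as np

 def encode_static(waveform):
     # Stand-in "speaker" encoder: a single time-invariant code per recording.
     return np.array([waveform.mean(), waveform.std()])

 def encode_dynamic(waveform, frame=400):
     # Stand-in "content" encoder: one code per frame, varying over time.
     frames = waveform[: len(waveform) // frame * frame].reshape(-1, frame)
     return frames.mean(axis=1)

 def decode(static_code, dynamic_codes, frame=400):
     # Stand-in decoder: rebuild a waveform from the two code streams.
     return np.repeat(dynamic_codes, frame) * static_code[1] + static_code[0]

 rng = np.random.default_rng(0)
 source = rng.standard_normal(16000)  # recording of the source speaker
 target = rng.standard_normal(16000)  # recording of the target speaker

 # "Voice conversion": the target speaker's static code is recombined with
 # the content codes extracted from the source recording.
 converted = decode(encode_static(target), encode_dynamic(source))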
I guess that counts for the source requested by User:BeeRightHere above (or them pretty much questioning that any such source would exist) back in June 2017 on whether DeepMind's technology can be used to "steal somebody's voice", and the answer seems positive, particularly by means of such a convenient method as STS instead of TTS. --2003:EF:13C6:FE20:39B0:9DA1:4D14:38BC (talk) 17:08, 30 April 2019 (UTC)
It may be of further interest to the article that DeepMind is also working on methods (to be included in WaveNet as a feature/plugin called "LipNet"?) for efficient, automatic lip-reading, whether or not audio is present, claiming they're already outdoing professional human lip-readers: [1], [2]. Also, according to this September 2018 paper, Sample Efficient Adaptive Text-to-Speech by Chen et al. (latest January 2019 revision of the same paper), DeepMind is working hard on drastically reducing the amount of real-life recordings required to sample an existing voice via WaveNet while maintaining high-quality results. --2003:EF:13C6:FE20:39B0:9DA1:4D14:38BC (talk) 18:48, 30 April 2019 (UTC)