The multi-speaker model and the speaker encoder model were trained on 84 VCTK speakers (48 kHz sampling rate); voice cloning was performed on other VCTK speakers (48 kHz sampling rate). The average duration of a cloning sample is 3.7 seconds. Boldface indicates the best results.
Speaker 0 (female)
Original speech
Cloned speech (speaker embedding adaptation with 1 sample)
Cloned speech (speaker embedding adaptation with 5 samples)
Cloned speech (speaker embedding adaptation with 10 samples)
Cloned speech (speaker embedding adaptation with 20 samples)
Cloned speech (speaker embedding adaptation with 50 samples)
Cloned speech (speaker embedding adaptation with 100 samples)
Cloned speech (whole model adaptation with 1 sample)
Cloned speech (whole model adaptation with 5 samples)
Cloned speech (whole model adaptation with 10 samples)
Cloned speech (whole model adaptation with 20 samples)
Cloned speech (whole model adaptation with 50 samples)
Cloned speech (whole model adaptation with 100 samples)