Text2Video: Text-driven Talking-head Video Synthesis with Phoneme-Pose Dictionary

Sibo Zhang, Jiahong Yuan, Miao Liao, Liangjun Zhang. (ICASSP 2022)

Abstract: With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It only needs a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.

Index Terms: talking head video generation, text driven, multimodal synthesis, phoneme-pose dictionary
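
The dictionary-and-interpolation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each pose is a flat list of keypoint coordinates, that every phoneme comes with a timestamp at which its key pose is reached, and that poses are blended linearly. The names `Pose`, `lerp`, and `interpolate_poses`, the example dictionary, and the frame rate are all hypothetical. The GAN that renders video frames from the interpolated poses is not shown.

```python
from typing import Dict, List, Tuple

# Assumption: a pose is a flat list of keypoint coordinates.
Pose = List[float]


def lerp(a: Pose, b: Pose, w: float) -> Pose:
    """Linearly blend pose a into pose b with weight w in [0, 1]."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def interpolate_poses(
    timed_phonemes: List[Tuple[str, float]],
    dictionary: Dict[str, Pose],
    fps: float = 25.0,
) -> List[Pose]:
    """Return one pose per video frame.

    timed_phonemes holds (phoneme, time_in_seconds) pairs sorted by time;
    each time is when that phoneme's key pose is reached (an assumption
    made for this sketch).
    """
    if not timed_phonemes:
        return []
    # Look up the key pose for every phoneme in the dictionary.
    keys = [(t, dictionary[p]) for p, t in timed_phonemes]
    if len(keys) == 1:
        return [list(keys[0][1])]

    start, end = keys[0][0], keys[-1][0]
    n_frames = int(round((end - start) * fps)) + 1
    frames: List[Pose] = []
    seg = 0  # index of the key pose that starts the current segment
    for i in range(n_frames):
        t = min(start + i / fps, end)
        # Advance to the segment that contains time t.
        while seg < len(keys) - 2 and t >= keys[seg + 1][0]:
            seg += 1
        t0, p0 = keys[seg]
        t1, p1 = keys[seg + 1]
        w = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
        frames.append(lerp(p0, p1, w))
    return frames


# Hypothetical two-entry dictionary with 2-D poses, for illustration only.
example_dictionary = {"AA": [0.0, 0.0], "M": [1.0, 2.0]}
example_frames = interpolate_poses(
    [("AA", 0.0), ("M", 0.5)], example_dictionary, fps=4.0
)
```

In this sketch, `example_frames` holds three poses: the "AA" key pose, the midpoint between the two key poses, and the "M" key pose. A real system would likely use many more keypoints per pose and may use a smoother interpolation scheme.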

Result Video

12 min Presentation

Publication

Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary.

Sibo Zhang, Jiahong Yuan, Miao Liao, Liangjun Zhang.

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022).

[PDF] [Project Page] [Demo Video] [Github]

Please use the following BibTeX entry to cite:

@INPROCEEDINGS{9747380,
  author={Zhang, Sibo and Yuan, Jiahong and Liao, Miao and Zhang, Liangjun},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Text2video: Text-Driven Talking-Head Video Synthesis with Personalized Phoneme - Pose Dictionary},
  year={2022},
  volume={},
  number={},
  pages={2659-2663},
  doi={10.1109/ICASSP43922.2022.9747380}
}