Abstract
Facial expression generation from purely textual descriptions has wide applications in human-computer interaction, computer-aided design, and assisted education. The task is challenging, however, owing to the intricate structure of the human face and the complex mapping between text and images. Existing methods struggle to generate high-resolution images or to capture diverse facial expressions. In this study, we propose a novel generation approach, named FaceCLIP, to tackle these problems. FaceCLIP uses a CLIP-based multi-stage generative adversarial model to produce vivid, high-resolution facial expressions. Guided by strong semantic priors from multi-modal textual and visual cues, it effectively disentangles facial attributes, enabling attribute editing and semantic reasoning. To facilitate text-to-expression generation, we also build a new dataset, the FET dataset, which contains facial expression images paired with textual descriptions. Experiments on this dataset demonstrate improved image quality and semantic consistency over state-of-the-art methods.
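The abstract does not spell out FaceCLIP's architecture, but the pattern it names, a CLIP-conditioned multi-stage GAN, can be illustrated. Below is a minimal sketch assuming a frozen CLIP-style text encoder (represented by a placeholder module) whose embedding conditions a base generator and each upsampling refinement stage. All class names, layer sizes, stage counts, and the 512-dimensional embedding are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch only: FaceCLIP's actual architecture is not given in
# the abstract. This shows the general pattern of a text-conditioned
# multi-stage GAN generator. All names and sizes below are illustrative.

class TextEncoder(nn.Module):
    # Stand-in for a frozen CLIP-style text encoder (assumed 512-d output).
    def __init__(self, vocab_size=49408, text_dim=512):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, text_dim)

    def forward(self, token_ids):              # token_ids: (B, L)
        return self.embed(token_ids)           # -> (B, 512)

class Stage(nn.Module):
    # One refinement stage: double the resolution and re-inject the text cue.
    def __init__(self, in_ch, out_ch, text_dim=512):
        super().__init__()
        self.proj = nn.Linear(text_dim, in_ch)
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat, txt):
        cond = self.proj(txt)[:, :, None, None]  # broadcast over H and W
        return self.block(feat + cond)

class MultiStageGenerator(nn.Module):
    # Noise + text embedding -> 8x8 base features -> three stages -> 64x64 RGB.
    def __init__(self, z_dim=128, text_dim=512):
        super().__init__()
        self.fc = nn.Linear(z_dim + text_dim, 256 * 8 * 8)
        self.stages = nn.ModuleList(
            [Stage(256, 128), Stage(128, 64), Stage(64, 32)]
        )
        self.to_rgb = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, z, txt):
        feat = self.fc(torch.cat([z, txt], dim=1)).view(-1, 256, 8, 8)
        for stage in self.stages:
            feat = stage(feat, txt)              # text conditions every stage
        return torch.tanh(self.to_rgb(feat))     # (B, 3, 64, 64) in [-1, 1]

tokens = torch.randint(0, 49408, (2, 16))        # toy token ids
txt = TextEncoder()(tokens)
img = MultiStageGenerator()(torch.randn(2, 128), txt)
print(img.shape)                                 # torch.Size([2, 3, 64, 64])
```

Re-injecting the same text embedding at every stage, rather than only at the input, is a common way to keep higher-resolution outputs semantically tied to the description; whether FaceCLIP conditions its stages this way is not stated in the abstract.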
Citation
Fu, W. W., Gong, W. J., Yu, C. Y., Wang, W., & Gonzàlez, J. (2025). Facial Expression Generation from Text with FaceCLIP. Journal of Computer Science and Technology, 40(2), 359–377. https://doi.org/10.1007/s11390-024-3661-z