Speaker Extraction With Co-Speech Gestures Cue

Zexu Pan; Xinyuan Qian; Haizhou Li

Journal Article, Open Access

IEEE Signal Processing Letters (2022) 29 1467-1471

DOI: 10.1109/LSP.2022.3175130

31 Citations

9 Readers

Abstract

Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker speech mixture. Previous studies have used a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures, which are naturally timed with speech, also contribute to speech perception. In this work, we explore the use of co-speech gesture sequences, e.g., hand and body movements, as the speaker cue for speaker extraction. Such sequences can be easily obtained from low-resolution video recordings and are thus more widely available than face recordings. We propose two networks that use the co-speech gestures cue to perform attentive listening on the target speaker: one implicitly fuses the co-speech gestures cue into the speaker extraction process; the other performs speech separation first and then explicitly uses the co-speech gestures cue to associate one of the separated speech streams with the target speaker. The experimental results show that the co-speech gestures cue is informative for associating speech with the target speaker.
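The second, "separate first, then associate" design can be illustrated with a toy sketch. This is not the authors' network: the abstract does not say how the association is computed, so the sketch swaps in a plain Pearson correlation between each separated stream's frame-level energy envelope and a gesture motion-magnitude sequence. The names `pearson`, `associate_stream`, `stream_envelopes`, and `gesture_motion` are hypothetical, and the inputs are assumed to be already frame-aligned.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and equal in length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def associate_stream(
    stream_envelopes: Sequence[Sequence[float]],
    gesture_motion: Sequence[float],
) -> int:
    """Return the index of the separated stream that best matches the gesture cue.

    Assumption: correlation is a hand-crafted stand-in for the learned
    association described in the abstract.
    """
    scores = [pearson(env, gesture_motion) for env in stream_envelopes]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical frame-level values: motion peaks where stream 1 is loud.
    gesture = [0.0, 1.0, 0.2, 0.9, 0.1]
    streams = [
        [1.0, 0.1, 0.9, 0.2, 1.0],  # energetic when the target is still
        [0.1, 0.9, 0.3, 1.0, 0.0],  # energetic when the target moves
    ]
    print(associate_stream(streams, gesture))
```

In a learned system the correlation would presumably be replaced by a trained scoring model over gesture and speech embeddings; the selection step (score each separated stream, keep the best) is the part this sketch is meant to show.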

Cite (APA)

Pan, Z., Qian, X., & Li, H. (2022). Speaker Extraction With Co-Speech Gestures Cue. IEEE Signal Processing Letters, 29, 1467–1471. https://doi.org/10.1109/LSP.2022.3175130
