FastFoley: Non-autoregressive Foley Sound Generation Based on Visual Semantics

Abstract

Foley sound in movies and TV episodes is of great importance in bringing a more realistic feeling to the audience. Traditionally, foley artists must use their expertise to create foley sound synchronized with the content occurring in the video, which is laborious and time-consuming. In this paper, we present FastFoley, a Transformer-based non-autoregressive deep-learning method that synthesizes a foley audio track from a silent video clip. Existing cross-modal generation methods are still based on autoregressive models such as the long short-term memory (LSTM) recurrent neural network. FastFoley offers a new non-autoregressive framework for this audio-visual task. Given a video, FastFoley synthesizes the associated audio, outperforming LSTM-based methods in time synchronization, sound quality, and sense of reality. We have also created a dataset called the Audio-Visual Foley Dataset (AVFD) for related foley work and made it open source; it can be downloaded at https://github.com/thuhcsi/icassp2022-FastFoley.
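
The key contrast the abstract draws is between autoregressive LSTM decoding and parallel, non-autoregressive prediction. The PyTorch sketch below illustrates that idea only: the module names, dimensions, and the fixed 4x video-to-mel upsampling are illustrative assumptions, not the authors' architecture (see the repository above for the actual implementation).

# Hedged sketch of a non-autoregressive video-to-audio model in the spirit
# of FastFoley. All names, dimensions, and the upsampling scheme here are
# illustrative assumptions, not the published implementation.
import torch
import torch.nn as nn

class NonAutoregressiveFoley(nn.Module):
    """Maps a sequence of per-frame visual features to a mel-spectrogram.

    Unlike an LSTM decoder, every output frame is predicted in parallel:
    a Transformer encoder contextualizes the visual features, and a linear
    head emits all mel frames in a single forward pass.
    """

    def __init__(self, visual_dim=512, d_model=256, n_mels=80,
                 frames_per_video_frame=4):
        super().__init__()
        self.proj = nn.Linear(visual_dim, d_model)   # visual -> model dim
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Length regulation: audio runs at a higher frame rate than video,
        # so each visual step is expanded to several mel frames (assumed 4x).
        self.upsample = frames_per_video_frame
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, visual_feats):                  # (B, T_video, visual_dim)
        x = self.proj(visual_feats)
        x = self.encoder(x)                           # parallel, no recurrence
        x = x.repeat_interleave(self.upsample, dim=1) # (B, T_video*4, d_model)
        return self.mel_head(x)                       # (B, T_mel, n_mels)

# Usage: 30 video frames of 512-d features -> 120 mel frames in one pass.
model = NonAutoregressiveFoley()
mel = model(torch.randn(2, 30, 512))
print(mel.shape)  # torch.Size([2, 120, 80])

Because the encoder has no recurrence, all mel frames come out of a single forward pass, which is what makes non-autoregressive synthesis fast and easier to keep time-aligned with the video; a separate vocoder (not shown) would then convert the mel-spectrogram to a waveform.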

Citation (APA)

Li, S., Zhang, L., Dong, C., Xue, H., Wu, Z., Sun, L., … Meng, H. (2023). FastFoley: Non-autoregressive Foley Sound Generation Based on Visual Semantics. In Communications in Computer and Information Science (Vol. 1765 CCIS, pp. 252–263). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-99-2401-1_23
