TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection

Abstract

With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material has become increasingly simple. Current technology enables the creation of videos where both the visual and audio contents are falsified. The multimedia forensics community has begun to address this threat by developing fake media detectors; however, the vast majority of existing forensic techniques analyze only one modality at a time. This is an important limitation when authenticating manipulated videos, because sophisticated forgeries may be difficult to detect without exploiting cross-modal inconsistencies (e.g., across the audio and visual tracks). One important reason for the lack of multimodal detectors is a corresponding lack of research datasets containing multimodal forgeries. Existing datasets typically contain only one falsified modality, such as deepfaked videos with authentic audio tracks, or synthetic audio with no associated video. Datasets are therefore needed that can be used to develop, train, and test multimodal forensic algorithms. In this paper, we propose a new audio-visual deepfake dataset containing multimodal video forgeries. We present a general pipeline for synthesizing deepfake speech content from a given video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. We use this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset generated with cutting-edge methods from the TTS field. It can be used as a standalone audio dataset, or combined with the DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. The results highlight the need for multimodal forensic detectors and for more multimodal deepfake data.
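The abstract mentions using DTW to time-align a synthesized speech track to a reference video. The paper's actual pipeline is not reproduced here, but the following minimal NumPy sketch illustrates the generic DTW idea: the feature sequences, dimensions, and Euclidean frame cost are placeholder assumptions, not the authors' implementation.

```python
import numpy as np

def dtw_align(ref, synth):
    """Minimal textbook DTW: returns the warping path between two
    feature sequences of shape (frames, feature_dim). This is a
    generic illustration, not the TIMIT-TTS pipeline itself."""
    n, m = len(ref), len(synth)
    # Pairwise Euclidean distances between frames.
    cost = np.linalg.norm(ref[:, None, :] - synth[None, :, :], axis=2)
    # Accumulated cost with the standard step pattern.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from (n, m) to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: align a time-stretched copy of a hypothetical feature track
# (e.g., MFCC-like frames extracted from the video's reference speech).
rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 13))
synth = np.repeat(ref, 2, axis=0)[:80]  # a "TTS" track at a different rate
path = dtw_align(ref, synth)
print(f"path length: {len(path)}, ends at {path[-1]}")
```

In a real pipeline, the warping path would then drive a time-scale modification of the synthetic audio so its phoneme timing matches the reference video.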

Cite (APA)

Salvi, D., Hosler, B., Bestagini, P., Stamm, M. C., & Tubaro, S. (2023). TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection. IEEE Access, 11, 50851–50866. https://doi.org/10.1109/ACCESS.2023.3276480
