The large-scale emergence of photo-realistic deepfakes has become a significant societal concern and has attracted considerable attention from the research community. Several recent studies have identified the critical issue of "temporal inconsistency" resulting from the frame reassembling process of deepfake generation techniques. However, because current methods lack task-specific design, their spatio-temporal modeling remains insufficient in three critical aspects: 1) subtle temporal changes are easily overshadowed by abundant spatial cues; 2) small inconsistent regions are often concealed by motions of greater amplitude during downsampling; 3) capturing both transient inconsistencies and persistent motions simultaneously remains a significant challenge. In this paper, we propose a novel Dual-Modality Co-Learning framework tailored to these characteristics, which achieves more effective deepfake detection by using complementary information from the RGB and optical flow modalities. In particular, we design a Multi-Scale Motion Regularization module that encourages the network to give equal priority to the prominent spatial cues and the subtle temporal facial motion cues. Additionally, we develop a Multi-Span Cross-Attention module that integrates the information from the RGB and optical flow modalities and improves detection accuracy with multi-span predictions. Extensive experiments validate the effectiveness of our ideas and demonstrate the superior performance of our approach.
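The abstract does not specify the internals of the Multi-Span Cross-Attention module, so the following is only a generic illustration of how cross-attention can let features from one modality attend to features from another. It is a minimal sketch of single-head scaled dot-product cross-attention, assuming toy two-dimensional tokens and no learned projections or multi-span prediction; the names `cross_attention`, `rgb_tokens`, and `flow_tokens` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic single-head scaled dot-product cross-attention.

    queries come from one modality (here: RGB tokens); keys and values
    come from the other (here: optical flow tokens). This is NOT the
    paper's module, only an assumed, simplified illustration.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value tokens.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy features: 3 RGB tokens and 2 optical flow tokens.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flow_tokens = [[0.5, 0.1], [0.2, 0.9]]

# Each RGB token gathers information from the flow tokens.
fused = cross_attention(rgb_tokens, flow_tokens, flow_tokens)
```

Each row of `fused` is a weighted average of the flow tokens, with weights determined by similarity to the corresponding RGB token. A practical model would add learned query, key, and value projections and multiple heads.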
Guan, J., Zhou, H., Guo, Z., Hu, T., Deng, L., Quan, C., … Zhao, Y. (2023). Dual-Modality Co-Learning for Unveiling Deepfake in Spatio-Temporal Space. In ICMR 2023 - Proceedings of the 2023 ACM International Conference on Multimedia Retrieval (pp. 85–94). Association for Computing Machinery, Inc. https://doi.org/10.1145/3591106.3592284