Abstract
Recognizing and localizing forgery events in videos is a socially impactful task that is attracting active attention. Since forgery events occur in both the auditory and visual modalities, fine-grained multimodal perception is essential for accurate temporal forgery localization (TFL). In most fake videos, only a small segment of the content is forged, so the large disparity between the proportions of fake and real content creates a class imbalance problem. Existing methods suffer significant performance degradation because they take little account of this imbalance. To address this issue, we present a multimodal Mixed Attentional network with Balanced Class (MABC-Net) for temporal forgery localization. Specifically, we first propose a mixed-attentive feature learning (MAFL) module, which captures audio-visual temporal features through a mixed learning strategy built on two self-attention blocks and two cross-attention blocks. We further design a fusion-balanced localization (FBL) module that alleviates the class imbalance problem through a combination of focal and boundary matching loss functions. Extensive experiments on TFL show that MABC-Net outperforms state-of-the-art methods and localizes segment boundaries more precisely. Code is available at https://github.com/Tea7374/MABC-Net.
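To make the two modules concrete, here is a minimal sketch (not the authors' code) of the mixed learning strategy in MAFL: each modality is refined by a self-attention block and then conditioned on the other modality by a cross-attention block, matching the "two self-attention, two cross-attention" structure the abstract describes. The feature dimension, head count, and the use of PyTorch's `nn.MultiheadAttention` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Illustrative mixed attention: two self-attention and two
    cross-attention blocks over audio/visual temporal features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # two self-attention blocks (one per modality)
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        # two cross-attention blocks (audio->visual and visual->audio)
        self.cross_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, v):
        # a, v: (batch, time, dim) audio / visual temporal features
        a = a + self.self_a(a, a, a)[0]    # intra-modal audio context
        v = v + self.self_v(v, v, v)[0]    # intra-modal visual context
        a = a + self.cross_a(a, v, v)[0]   # audio queries attend to visual
        v = v + self.cross_v(v, a, a)[0]   # visual queries attend to audio
        return a, v

# usage: 64 temporal steps of 256-d features per modality
a, v = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
fused_a, fused_v = MixedAttention()(a, v)
```

Likewise, the FBL objective can be read as a weighted sum of a focal term, which down-weights the abundant real frames, and a boundary matching (BM) term supervising segment proposals. The sketch below assumes per-frame binary fake/real logits; the BM loss input and the weight `lam` are hypothetical stand-ins for details the abstract does not specify.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # binary focal loss over per-frame fake/real predictions
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def fbl_loss(frame_logits, frame_targets, bm_loss, lam=1.0):
    # total localization loss: focal term + weighted BM term (lam is
    # a hypothetical balancing parameter, not taken from the paper)
    return focal_loss(frame_logits, frame_targets) + lam * bm_loss
```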
Acknowledgement
This work was supported in part by the Natural Science Foundation of China under Grants 62201524, 62271455, and 61971383; in part by the Fundamental Research Funds for the Central Universities under Grant CUC23GZ016; and in part by the Horizontal Research Project under Grant HG23002. It was also supported by the Public Computing Cloud, CUC.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Cheng, H., Yu, H., Fang, L., Ye, L. (2024). MABC-Net: Multimodal Mixed Attentional Network with Balanced Class for Temporal Forgery Localization. In: Zhai, G., Zhou, J., Ye, L., Yang, H., An, P., Yang, X. (eds) Digital Multimedia Communications. IFTC 2023. Communications in Computer and Information Science, vol 2067. Springer, Singapore. https://doi.org/10.1007/978-981-97-3626-3_20
DOI: https://doi.org/10.1007/978-981-97-3626-3_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-3625-6
Online ISBN: 978-981-97-3626-3
eBook Packages: Behavioral Science and Psychology; Behavioral Science and Psychology (R0); Springer Nature Proceedings excluding Computer Science