
MABC-Net: Multimodal Mixed Attentional Network with Balanced Class for Temporal Forgery Localization

  • Conference paper
  • First Online:
Digital Multimedia Communications (IFTC 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2067)


Abstract

Recognizing and localizing forgery events in videos is a socially impactful task that is attracting active attention. Since forgery events can occur in both the auditory and visual modalities, fine-grained multimodal perception is essential for accurate temporal forgery localization (TFL). In most fake videos, only a small segment of the content is actually forged, so the large disparity between the proportions of fake and real content leads to a class imbalance problem. Existing methods suffer significant performance degradation because they take little account of this imbalance. To address this issue, we present a multimodal Mixed Attentional network with Balanced Class (MABC-Net) for temporal forgery localization. Specifically, we first propose the mixed-attentive feature learning (MAFL) module, which captures audio-visual temporal features via a mixed learning strategy that leverages two self-attention blocks and two cross-attention blocks. Moreover, a fusion-balanced localization (FBL) module is designed to alleviate the influence of the class imbalance problem, benefiting from an elegant combination of focal and boundary matching loss functions. Extensive experiments on TFL show that MABC-Net is superior to state-of-the-art methods and localizes segment boundaries more precisely. Code is available at https://github.com/Tea7374/MABC-Net.
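To make the mixed learning strategy concrete, the following is a minimal PyTorch sketch of the attention pattern the abstract describes: two self-attention blocks (one per modality) for intra-modal temporal modelling, plus two cross-attention blocks for audio-visual interaction. The class name MixedAttention, the feature dimensions, and the residual/LayerNorm wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the mixed attention pattern (assumed, not the paper's code).
import torch
import torch.nn as nn


class MixedAttention(nn.Module):
    """Two self-attention blocks (one per modality) plus two cross-attention
    blocks (audio attends to visual, visual attends to audio)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_av = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Intra-modal temporal modelling via self-attention.
        a, _ = self.self_attn_a(audio, audio, audio)
        v, _ = self.self_attn_v(visual, visual, visual)
        # Inter-modal interaction: each stream queries the other modality.
        a_cross, _ = self.cross_attn_av(a, v, v)  # audio queries visual
        v_cross, _ = self.cross_attn_va(v, a, a)  # visual queries audio
        # Residual fusion followed by normalisation (an assumed design choice).
        return self.norm_a(a + a_cross), self.norm_v(v + v_cross)


if __name__ == "__main__":
    # Toy check: batch of 2 clips, 64 time steps, 256-dim features per modality.
    audio, visual = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
    a_out, v_out = MixedAttention()(audio, visual)
    print(a_out.shape, v_out.shape)  # torch.Size([2, 64, 256]) twice
```

Keeping separate attention blocks per direction lets each modality model its own temporal structure before being conditioned on the other, which matches the abstract's two-self-plus-two-cross layout.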
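The balancing idea behind the FBL module can likewise be sketched. Below, per-frame real/fake classification uses a standard binary focal loss, which down-weights the abundant easy (real) frames, combined with a boundary-confidence regression term as a stand-in for a BMN-style boundary matching loss. The helper names focal_loss and fbl_loss, the lambda_bm weight, and the MSE regression term are assumptions; the paper's exact combination may differ.

```python
# Hedged sketch of a focal + boundary-matching style objective (assumed form).
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    """Binary focal loss: scales cross-entropy by (1 - p_t)^gamma so that
    easy majority-class (real) frames contribute little gradient."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()


def fbl_loss(frame_logits, frame_labels, bm_pred, bm_iou, lambda_bm: float = 1.0):
    """Combined objective: focal frame classification + boundary-map regression.

    frame_logits, frame_labels: (B, T) per-frame fake logits and 0/1 float labels.
    bm_pred, bm_iou: (B, D, T) predicted boundary-confidence map and IoU targets,
    where D indexes candidate segment durations (as in BMN-style maps).
    """
    cls_term = focal_loss(frame_logits, frame_labels)
    reg_term = F.mse_loss(bm_pred, bm_iou)             # stand-in boundary term
    return cls_term + lambda_bm * reg_term
```

The focal term addresses the fake/real frame imbalance directly, while the boundary term rewards confidence maps whose peaks align with true segment boundaries; summing them with a weight is one simple way to realise the "elegant combination" the abstract mentions.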




Acknowledgement

This work was supported in part by the Natural Science Foundation of China under Grants 62201524, 62271455, and 61971383; in part by the Fundamental Research Funds for the Central Universities under Grant CUC23GZ016; and in part by the Horizontal Research Project under Grant No. HG23002. It was also supported by the Public Computing Cloud, CUC.

Author information


Corresponding author

Correspondence to Long Ye.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Cheng, H., Yu, H., Fang, L., Ye, L. (2024). MABC-Net: Multimodal Mixed Attentional Network with Balanced Class for Temporal Forgery Localization. In: Zhai, G., Zhou, J., Ye, L., Yang, H., An, P., Yang, X. (eds) Digital Multimedia Communications. IFTC 2023. Communications in Computer and Information Science, vol 2067. Springer, Singapore. https://doi.org/10.1007/978-981-97-3626-3_20

