
RSET: Remapping-Based Sorting Method for Emotion Transfer Speech Synthesis

  • Conference paper
  • First Online:
Web and Big Data (APWeb-WAIM 2024)

Abstract

Although current Text-To-Speech (TTS) models can generate high-quality speech samples, building TTS systems with controllable emotion intensity remains challenging. Most existing TTS models control emotion intensity by extracting intensity information from reference speech. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and by the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage. In this paper, we propose an emotion transfer TTS model that defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speech with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.

H. Shi and J. Wang—Equal Contributions.



Acknowledgement

This work was supported by the Key Research and Development Program of Guangdong Province under grant No. 2021B0101400003.

Author information


Corresponding author

Correspondence to Xulong Zhang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Shi, H., Wang, J., Zhang, X., Cheng, N., Yu, J., Xiao, J. (2024). RSET: Remapping-Based Sorting Method for Emotion Transfer Speech Synthesis. In: Zhang, W., Tung, A., Zheng, Z., Yang, Z., Wang, X., Guo, H. (eds) Web and Big Data. APWeb-WAIM 2024. Lecture Notes in Computer Science, vol 14961. Springer, Singapore. https://doi.org/10.1007/978-981-97-7232-2_7

