A Multi-Modal Approach to Emotion-Aware Automatic Speech Recognition Using Dynamic Emotion Trajectories and Global Style Tokens
DOI: https://doi.org/10.37376/ljst.v15i2.7647

Keywords: Emotion-aware ASR, Global Style Tokens, Speech Emotion Recognition, Emotion Trajectory, Multimodal Fusion

Abstract
This paper presents a novel approach to emotion-aware Automatic Speech Recognition (ASR) that integrates emotion detection with dynamic emotion trajectory modelling. Our system combines acoustic features and physiological signals to achieve more accurate and contextually aware emotion recognition. Global Style Tokens (GSTs) enhance the system's ability to detect nuanced emotional transitions in speech, allowing it to outperform existing state-of-the-art models. We evaluate the system on the IEMOCAP and RAVDESS datasets, achieving emotion classification accuracies of 88.1% and 85.6%, respectively, F1-scores of 0.87 and 0.85, Area Under the Curve (AUC) scores of 0.90 and 0.88, and Root Mean Squared Error (RMSE) values of 0.13 and 0.14. We also propose future enhancements, including expanded multimodal inputs, improved real-time scalability, handling of mixed emotions, and adaptation of the model to cross-lingual and low-power environments. Our findings contribute to the development of emotionally intelligent ASR systems capable of improving human-computer interaction across a range of applications.
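To make the GST mechanism concrete, the sketch below shows a minimal Global Style Token layer in PyTorch, following the general GST design: a learned bank of style tokens queried by an utterance-level reference embedding through multi-head attention. All names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class GlobalStyleTokens(nn.Module):
    """Learned token bank queried by a reference embedding via
    multi-head attention; returns a style embedding plus the
    per-token attention weights."""

    def __init__(self, ref_dim=128, num_tokens=10, token_dim=256, num_heads=4):
        super().__init__()
        # Token bank shared across all utterances (learned, not per-input).
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(ref_dim, token_dim)
        self.attn = nn.MultiheadAttention(embed_dim=token_dim,
                                          num_heads=num_heads,
                                          batch_first=True)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim), e.g. the final state of a
        # reference encoder run over the utterance's mel spectrogram.
        batch = ref_embedding.size(0)
        query = self.query_proj(ref_embedding).unsqueeze(1)      # (batch, 1, token_dim)
        keys = torch.tanh(self.tokens).unsqueeze(0).expand(batch, -1, -1)
        style, weights = self.attn(query, keys, keys)            # attend over the token bank
        return style.squeeze(1), weights.squeeze(1)              # (batch, token_dim), (batch, num_tokens)

Feeding frame-level reference embeddings instead of a single utterance summary yields a time series of token-attention weights, which is one way to realise the dynamic emotion trajectory the abstract describes.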
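The reported metrics (accuracy, F1, AUC, RMSE) can be computed from model outputs with standard scikit-learn calls. The snippet below is a hedged sketch with dummy data; the averaging choices (macro F1, one-vs-rest AUC) and the use of continuous valence/arousal targets for RMSE are assumptions, not details taken from the paper.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, mean_squared_error

# Dummy gold labels and class posteriors for a 3-class emotion task.
y_true = np.array([0, 1, 2, 1, 0])
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.6, 0.1],
                   [0.6, 0.3, 0.1]])
y_pred = y_prob.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)                    # classification accuracy
f1  = f1_score(y_true, y_pred, average='macro')         # macro-averaged F1
auc = roc_auc_score(y_true, y_prob, multi_class='ovr')  # one-vs-rest AUC

# RMSE is typically reported over continuous emotion dimensions
# (e.g. valence/arousal); dummy regression targets shown here.
dim_true = np.array([0.2, 0.8, 0.5, 0.6, 0.3])
dim_pred = np.array([0.25, 0.70, 0.55, 0.50, 0.35])
rmse = np.sqrt(mean_squared_error(dim_true, dim_pred))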
References
Anagnostopoulos, C., Iliou, T., Giannoukos, I. (2015) ‘Features and classifiers for emotion recognition from speech: A survey’, Engineering Applications of Artificial Intelligence, 43, pp. 369–377.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Narayanan, S. S. (2008) ‘IEMOCAP: Interactive emotional dyadic motion capture database’, Language Resources and Evaluation, 42(4), pp. 335–359.
Cowen, A. S., Keltner, D. (2021) ‘Semantic space theory of emotion’, Trends in Cognitive Sciences, 25(2), pp. 124–136.
Huang, Z., Zhang, P., Wu, Z. (2022) ‘Transformer-Based Speech Emotion Recognition with Emotional Context Learning’, IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2022.3148267
Ivanko, D., Kim, J., Woo, S. (2023) ‘Multimodal lip-reading and emotional speech analysis’, IEEE Transactions on Affective Computing, 14(1), pp. 22–35.
Kim, S., Park, J., Kang, S. (2022) ‘Deep Emotion Networks for Speech Emotion Recognition’, Speech Communication, 132, pp. –54. https://doi.org/10.1016/j.specom.2022.04.002
Kyung, T., Lee, J., Park, S. (2023) ‘Prosodic representation using Global Style Tokens for emotional speech recognition’, Speech Communication, 145, pp. 13–24.
Li, T., Zhao, J., Ren, Y. (2021) ‘Hybrid Emotion-Aware ASR with CNN-RNN Architecture for Emotional Speech Recognition’, IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2021.3068827
Livingstone, S. R., Russo, F. A. (2018) ‘The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)’, PLOS ONE, 13(5), e0196391.
Martin, O., Valente, F., Schuller, B. (2021) ‘Feature engineering for emotional speech recognition using MFCC and modulation spectral features’, Computer Speech & Language, 68, 101179.
Qu, Y., Liu, Z., Cao, P. (2022) ‘Emotional speech data augmentation using generative adversarial networks’, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, pp. 1545–.
Sahu, S., Gupta, V., Rao, S. (2019) ‘Multimodal emotion recognition with static acoustic features and deep neural networks’, Neural Networks, 116, pp. 184–194.
Tripathi, S., Singh, N., Verma, R. (2022) ‘Sequential modelling of dynamic emotion patterns in conversational speech’, Pattern Recognition Letters, 157, pp. 123–130.
Wu, W., Zhang, C., Woodland, P. (2023a) ‘Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations’, Interspeech 2023. https://doi.org/10.21437/Interspeech.2023-293
Wu, L., Zhang, Q., Wang, Y. (2023b) ‘Joint ASR–AER modelling using multi-task learning with shared acoustic encoding’, IEEE Signal Processing Letters, 30, pp. 291–295.
Zhang, Z., Wu, D., Zhao, X. (2020a) ‘Multimodal affective computing: Models, datasets, and challenges’, Information Fusion, 61, pp. –119.
Zhang, S., Han, X., Xu, C. (2020b) ‘CNN-LSTM Hybrid Model for Speech Emotion Recognition’, ICASSP 2020, pp. 2872–2876. https://doi.org/10.1109/ICASSP.2020.9054557
Zhao, X., Yang, W., Wang, H. (2021) ‘Recurrent Attention Networks for Emotion Recognition’, Pattern Recognition Letters, 144.
License
Copyright (c) 2026 Libyan Journal of Science & Technology

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.





