A Multi-Modal Approach to Emotion-Aware Automatic Speech Recognition Using Dynamic Emotion Trajectories and Global Style Tokens

Authors

  • Abeer A. Aoun
  • Karim B. Dabbabi

DOI:

https://doi.org/10.37376/ljst.v15i2.7647

Keywords:

Emotion-aware ASR, Global Style Tokens, Speech Emotion Recognition, Emotion Trajectory, Multi-modal Fusion.

Abstract

This paper presents a novel approach to emotion-aware Automatic Speech Recognition (ASR) by integrating emotion detection and dynamic emotion trajectory modelling. Our system combines acoustic features and physiological signals to achieve more accurate and contextually aware emotion recognition. The use of Global Style Tokens (GSTs) enhances the system's ability to detect nuanced emotional transitions in speech, outperforming existing state-of-the-art models. We evaluate the system using the IEMOCAP and RAVDESS datasets, achieving emotion classification accuracies of 88.1% and 85.6%, respectively. The system also achieves an F1-score of 0.87 on IEMOCAP and 0.85 on RAVDESS, with Area Under the Curve (AUC) scores of 0.90 and 0.88, and Root Mean Squared Error (RMSE) values of 0.13 and 0.14. Additionally, we propose future enhancements, including expanding multimodal inputs, improving real-time scalability, handling mixed emotions, and adapting the model for cross-lingual and low-power environments. Our findings contribute to the development of emotionally intelligent ASR systems capable of improving human-computer interactions in various applications.
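
To make the architecture described above concrete, the sketch below shows a simplified, single-head GST-style attention layer that summarises an utterance as a weighted mixture of learned style tokens, fused by concatenation with a physiological-feature embedding before emotion classification. It is written in PyTorch; all dimensions, module names, and the fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of GST-style attention with multimodal fusion.
    # Dimensions, names, and concatenation fusion are assumptions for
    # illustration; they do not reproduce the paper's implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalStyleTokens(nn.Module):
        """Summarise an utterance as an attention-weighted mix of learned tokens."""
        def __init__(self, ref_dim=128, num_tokens=10, token_dim=256):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
            self.query = nn.Linear(ref_dim, token_dim)

        def forward(self, ref_embedding):               # (batch, ref_dim)
            q = self.query(ref_embedding)               # (batch, token_dim)
            scores = q @ self.tokens.t() / self.tokens.size(1) ** 0.5
            weights = F.softmax(scores, dim=-1)         # attention over tokens
            return weights @ self.tokens                # (batch, token_dim)

    class EmotionClassifier(nn.Module):
        """Fuse the acoustic style vector with physiological features."""
        def __init__(self, acoustic_dim=128, physio_dim=32, num_emotions=4):
            super().__init__()
            self.gst = GlobalStyleTokens(ref_dim=acoustic_dim)
            self.physio_proj = nn.Linear(physio_dim, 64)
            self.head = nn.Linear(256 + 64, num_emotions)

        def forward(self, acoustic, physio):
            style = self.gst(acoustic)                  # acoustic style summary
            physio_emb = torch.relu(self.physio_proj(physio))
            fused = torch.cat([style, physio_emb], dim=-1)
            return self.head(fused)                     # emotion logits

    model = EmotionClassifier()
    logits = model(torch.randn(8, 128), torch.randn(8, 32))  # toy batch
    print(logits.shape)                                 # torch.Size([8, 4])

The reported evaluation figures are standard classification and regression metrics. As a hedged illustration, they could be computed with scikit-learn along the following lines; the toy data, macro averaging, and one-vs-rest AUC are assumptions, and the RMSE would apply to continuous emotion-trajectory targets (such as valence/arousal curves) rather than to class labels.

    # Sketch of how accuracy, F1, AUC, and RMSE could be computed.
    # Toy data and averaging choices are assumptions, not the paper's setup.
    import numpy as np
    from sklearn.metrics import (accuracy_score, f1_score,
                                 roc_auc_score, mean_squared_error)

    y_true = np.array([0, 1, 2, 1])              # gold emotion labels
    y_prob = np.array([[0.8, 0.1, 0.1],          # per-class probabilities
                       [0.2, 0.7, 0.1],
                       [0.1, 0.2, 0.7],
                       [0.3, 0.6, 0.1]])
    y_pred = y_prob.argmax(axis=1)

    acc = accuracy_score(y_true, y_pred)                    # cf. 88.1% (IEMOCAP)
    f1 = f1_score(y_true, y_pred, average='macro')          # cf. 0.87
    auc = roc_auc_score(y_true, y_prob, multi_class='ovr')  # cf. 0.90
    # RMSE on hypothetical continuous trajectory targets:
    rmse = np.sqrt(mean_squared_error([0.2, 0.5], [0.3, 0.4]))  # cf. 0.13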

Author Biographies

Abeer A. Aoun

Oil Libya Company, Bashier Sadawi Street, P.O. Box 2655, Tripoli, Libya.

Karim B. Dabbabi

Research Unit of Analysis and Processing of Electrical and Energetic Systems, Faculty of Sciences, El-Manar University, 2092, Tunis, Tunisia.

References

Anagnostopoulos, C., Iliou, T., Giannoukos, I. (2015) ‘Features and classifiers for emotion recognition from speech: A survey’, Engineering Applications of Artificial Intelligence, 43, pp. 369–377.

Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Narayanan, S. S. (2008) ‘IEMOCAP: Interactive emotional dyadic motion capture database’, Language Resources and Evaluation, 42(4), pp. 335–359.

Cowen, A. S., Keltner, D. (2021) ‘Semantic space theory of emotion’, Trends in Cognitive Sciences, 25(2), pp. 124–136.

Huang, Z., Zhang, P., Wu, Z. (2022) ‘Transformer-Based Speech Emotion Recognition with Emotional Context Learning’, IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2022.3148267

Ivanko, D., Kim, J., Woo, S. (2023) ‘Multimodal lip-reading and emotional speech analysis’, IEEE Transactions on Affective Computing, 14(1), pp. 22–35.

Kim, S., Park, J., Kang, S. (2022) ‘Deep Emotion Networks for Speech Emotion Recognition’, Speech Communication, 132, pp. –54. https://doi.org/10.1016/j.specom.2022.04.002

Kyung, T., Lee, J., Park, S. (2023) ‘Prosodic representation using Global Style Tokens for emotional speech recognition’, Speech Communication, 145, pp. 13–24.

Li, T., Zhao, J., Ren, Y. (2021) ‘Hybrid Emotion-Aware ASR with CNN-RNN Architecture for Emotional Speech Recognition’, IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2021.3068827

Livingstone, S. R., Russo, F. A. (2018) ‘The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)’, PLOS ONE, 13(5), e0196391.

Martin, O., Valente, F., Schuller, B. (2021) ‘Feature engineering for emotional speech recognition using MFCC and modulation spectral features’, Computer Speech & Language, 68, 101179.

Qu, Y., Liu, Z., Cao, P. (2022) ‘Emotional speech data augmentation using generative adversarial networks’, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, pp. 1545–.

Sahu, S., Gupta, V., Rao, S. (2019) ‘Multimodal emotion recognition with static acoustic features and deep neural networks’, Neural Networks, 116, pp. 184–194.

Tripathi, S., Singh, N., Verma, R. (2022) ‘Sequential modelling of dynamic emotion patterns in conversational speech’, Pattern Recognition Letters, 157, pp. 123–130.

Wu, W., Zhang, C., Woodland, P. (2023a) ‘Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations’, Interspeech. https://doi.org/10.21437/Interspeech.2023-293

Wu, L., Zhang, Q., Wang, Y. (2023b) ‘Joint ASR–AER modelling using multi-task learning with shared acoustic encoding’, IEEE Signal Processing Letters, 30, pp. 291–295.

Zhang, Z., Wu, D., Zhao, X. (2020a) ‘Multimodal affective computing: Models, datasets, and challenges’, Information Fusion, 61, pp. –119.

Zhang, S., Han, X., Xu, C. (2020b) ‘CNN-LSTM Hybrid Model for Speech Emotion Recognition’, ICASSP 2020, pp. 2872–2876. https://doi.org/10.1109/ICASSP.2020.9054557

Zhao, X., Yang, W., Wang, H. (2021) ‘Recurrent Attention Networks for Emotion Recognition’, Pattern Recognition Letters, 144, pp. –52. https://doi.org/10.1016/j.patrec.2021.01.004

Published

2026-04-13

How to Cite

Aoun, A. A., & Dabbabi, K. B. (2026). A Multi-Modal Approach to Emotion-Aware Automatic Speech Recognition Using Dynamic Emotion Trajectories and Global Style Tokens. Libyan Journal of Science & Technology, 15(2), 254–258. https://doi.org/10.37376/ljst.v15i2.7647

Issue

Section

Articles