End-to-End Multimodal Emotion Recognition with Deep Temporal and Cross-Modal Feature Integration

Rexcharles Enyinna Donatus; Oludele Awodele; Osondu E. Oguike; Amina Sambo-Magaji

doi:10.48185/jaai.v7i1.1939

Authors

Rexcharles Enyinna Donatus
charlly4eyims@yahoo.com
Africa Centre of Excellence on Technology Enhanced Learning, National Open University of Nigeria, Abuja and 900108, Nigeria
Oludele Awodele Department of Computer Science, Babcock University, Ilishan-Remo and 121103, Ogun, Nigeria.
Osondu E. Oguike Department of Computer Science, University of Nigeria, Nsukka and 410101, Enugu, Nigeria.
Amina Sambo-Magaji Digital Literacy & Capacity Development Department, National Information Technology Development Agency, Abuja and 900104, Nigeria.

Keywords:

Audio–Visual Fusion, Bidirectional LSTM, Cross-Modal Feature Integration, Emotion Recognition, Temporal Modeling.

Abstract

Emotion recognition has advanced significantly with the adoption of deep learning methods, yet reliable inference of affective states remains challenging under real-world conditions characterized by noise, occlusion, and expressive ambiguity. These limitations are particularly evident in unimodal systems that rely on a single source of affective information. To address this challenge, this study proposes a novel end-to-end multimodal framework for temporal emotion recognition that jointly models facial and vocal cues within a unified deep learning architecture. The proposed framework integrates deep residual networks for spatial and spectral feature encoding with bidirectional long short-term memory networks for sequence-level temporal modeling. Audio signals are represented using Mel-Frequency Cepstral Coefficients, while facial information is extracted from video frames, with both modalities processed using a shared ResNet-50 backbone to ensure consistent high-level representations. To enhance multimodal interaction, the framework incorporates attention mechanisms that refine modality-specific temporal features and explicitly align audio and visual representations prior to classification. The model is evaluated on the RAVDESS and CREMA-D benchmark datasets using strict subject-disjoint cross-validation to ensure unbiased assessment of generalization performance. Experimental results demonstrate that the proposed approach achieves classification accuracies of 91.22 percent on RAVDESS and 87.32 percent on CREMA-D, outperforming recent multimodal methods evaluated under comparable conditions. Confusion-matrix-based analyses further indicate improved discrimination among emotionally overlapping categories. These results demonstrate that jointly modeling deep spatial representations, temporal dynamics, and adaptive cross-modal interaction yields robust emotion recognition under realistic variability. The proposed framework provides a transparent and extensible foundation for future research in multimodal affective computing and emotion-aware intelligent systems.
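
The subject-disjoint cross-validation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `subject_disjoint_folds`, the fold count, the balancing heuristic, and the toy speaker labels are all assumptions made for illustration. The only property taken from the abstract is that no speaker contributes clips to both the training and the test side of a split.

```python
from collections import defaultdict


def subject_disjoint_folds(subject_ids, n_folds=3):
    """Yield (train_idx, test_idx) pairs in which no subject appears on both sides.

    subject_ids[i] is the speaker/actor label of sample i (hypothetical input).
    """
    # Group sample indices by the subject that produced them.
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    if n_folds > len(by_subject):
        raise ValueError("n_folds cannot exceed the number of subjects")

    # Greedy balancing (an assumption, not from the paper): place subjects
    # with the most samples first, each into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_subject, key=lambda s: (-len(by_subject[s]), str(s)))
    for sid in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_subject[sid])

    # Each fold serves once as the held-out test set.
    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_folds) if j != k for i in folds[j])
        yield train_idx, test_idx


# Toy data: 6 hypothetical speakers with 4 clips each.
subject_ids = [f"s{n}" for n in range(1, 7) for _ in range(4)]
splits = list(subject_disjoint_folds(subject_ids, n_folds=3))

for train_idx, test_idx in splits:
    train_subjects = {subject_ids[i] for i in train_idx}
    test_subjects = {subject_ids[i] for i in test_idx}
    # The defining property: held-out speakers are never seen in training.
    assert train_subjects.isdisjoint(test_subjects)
```

Splitting by speaker rather than by clip prevents a model from scoring well merely by recognizing a familiar voice or face, which is the kind of leakage a subject-disjoint protocol is meant to rule out.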
