End-to-End Multimodal Emotion Recognition with Deep Temporal and Cross-Modal Feature Integration
Keywords:
Audio–Visual Fusion, Bidirectional LSTM, Cross-Modal Feature Integration, Emotion Recognition, Temporal Modeling
Abstract: Emotion recognition has advanced significantly with the adoption of deep learning, yet reliably inferring affective states remains difficult under real-world conditions marked by noise, occlusion, and expressive ambiguity. These limitations are most evident in unimodal systems, which rely on a single source of affective information. To address this challenge, this study proposes an end-to-end multimodal framework for temporal emotion recognition that jointly models facial and vocal cues within a unified deep learning architecture. The framework combines deep residual networks for spatial and spectral feature encoding with bidirectional long short-term memory (BiLSTM) networks for sequence-level temporal modeling. Audio signals are represented as Mel-Frequency Cepstral Coefficients (MFCCs), and facial information is extracted from video frames; both modalities are processed with a shared ResNet-50 backbone to obtain consistent high-level representations. To strengthen multimodal interaction, attention mechanisms refine the modality-specific temporal features and explicitly align the audio and visual representations before classification. The model is evaluated on the RAVDESS and CREMA-D benchmark datasets using strict subject-disjoint cross-validation, so that no speaker appears in both the training and test data of a fold and generalization is assessed on unseen speakers. The proposed approach achieves classification accuracies of 91.22% on RAVDESS and 87.32% on CREMA-D, outperforming recent multimodal methods evaluated under comparable conditions. Confusion-matrix analyses further indicate improved discrimination among emotionally overlapping categories. These results demonstrate that jointly modeling deep spatial representations, temporal dynamics, and adaptive cross-modal interaction yields robust emotion recognition under realistic variability. The proposed framework provides a transparent and extensible foundation for future research in multimodal affective computing and emotion-aware intelligent systems.
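The subject-disjoint evaluation protocol mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the number of folds or how speakers are assigned to them, so the function name `subject_disjoint_folds`, the round-robin assignment, and the toy actor IDs below are assumptions made for illustration only, not the authors' procedure.

```python
from collections import defaultdict


def subject_disjoint_folds(subject_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no subject spans both sets.

    subject_ids[i] is the speaker/actor ID of sample i (hypothetical input format).
    """
    # Group sample indices by subject.
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        by_subject[subject].append(idx)

    subjects = sorted(by_subject)
    if n_folds < 2 or n_folds > len(subjects):
        raise ValueError("n_folds must be between 2 and the number of subjects")

    # Deal subjects round-robin into folds (an assumed assignment scheme),
    # so fold sizes differ by at most one subject.
    fold_subjects = [subjects[k::n_folds] for k in range(n_folds)]

    for held_out in fold_subjects:
        held_out_set = set(held_out)
        test_idx = [i for s in held_out for i in by_subject[s]]
        train_idx = [
            i for s in subjects if s not in held_out_set for i in by_subject[s]
        ]
        yield train_idx, test_idx


# Toy data: 6 hypothetical actors with 4 clips each (not the real dataset layout).
subject_ids = [f"actor_{a:02d}" for a in range(1, 7) for _ in range(4)]
folds = list(subject_disjoint_folds(subject_ids, n_folds=3))
```

Splitting by speaker rather than by clip keeps the test accuracy from being inflated by a model that has memorized a particular face or voice during training.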
Copyright (c) 2026 REXCHARLES DONATUS

This work is licensed under a Creative Commons Attribution 4.0 International License.
