Human speech carries not only verbal content but also emotion. Over the past decade, research on speech data has grown considerably, and it is especially important for human-computer interaction as well as healthcare, security, and entertainment. This paper proposes TLEFuzzyNet, a three-stage pipeline for emotion recognition from speech. In the first stage, the speech signals are augmented and Mel spectrograms are extracted from them. In the second stage, the spectrograms are passed to three pretrained transfer learning CNN models, namely ResNet18, Inception_v3, and GoogleNet, whose prediction scores are fed to the third stage. In the final stage, fuzzy ranks are assigned using a modified Gompertz function, which produces the final prediction scores by combining the individual scores from the three CNN models. We evaluate TLEFuzzyNet on the Surrey Audio-Visual Expressed Emotion (SAVEE), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Berlin Database of Emotional Speech (EmoDB) datasets, where it achieves state-of-the-art performance, making it a dependable framework for speech emotion recognition (SER). The code is available on GitHub: https://github.com/KaramSahoo/SpeechEmotionRecognitionFuzzy
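The final fusion stage can be illustrated with a minimal sketch. The abstract does not give the exact form of the modified Gompertz function, so the re-parameterised curve `1 - exp(-exp(-2x))` used here is an assumption, as are the function names `gompertz_rank` and `fuse` and the example scores; the linked repository is the place to check the authors' actual implementation. The idea shown is only the general one: each model's class confidence is mapped to a fuzzy rank (lower is better), the ranks are summed across models, and the class with the smallest fused rank is predicted.

```python
import math


def gompertz_rank(score):
    """Map a confidence score in [0, 1] to a fuzzy rank (assumed form).

    The curve is monotonically decreasing: a higher confidence
    yields a lower, i.e. better, rank.
    """
    return 1.0 - math.exp(-math.exp(-2.0 * score))


def fuse(score_lists):
    """Fuse per-class scores from several models (hypothetical helper).

    score_lists holds one list of class probabilities per model.
    Returns the index of the predicted class and the fused ranks.
    """
    n_classes = len(score_lists[0])
    fused = [0.0] * n_classes
    for scores in score_lists:
        for cls, score in enumerate(scores):
            fused[cls] += gompertz_rank(score)
    # The class with the smallest summed rank wins.
    best = min(range(n_classes), key=fused.__getitem__)
    return best, fused


# Illustrative softmax outputs for three emotion classes from three models.
resnet18_scores = [0.8, 0.1, 0.1]
inception_scores = [0.7, 0.2, 0.1]
googlenet_scores = [0.6, 0.3, 0.1]

predicted, fused_ranks = fuse(
    [resnet18_scores, inception_scores, googlenet_scores]
)
```

Because the rank is a nonlinear function of the confidence, this kind of fusion weights a model's scores differently from a plain average of the probabilities.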
CITATION STYLE
Sahoo, K. K., Dutta, I., Ijaz, M. F., Wozniak, M., & Singh, P. K. (2021). TLEFuzzyNet: Fuzzy Rank-Based Ensemble of Transfer Learning Models for Emotion Recognition from Human Speeches. IEEE Access, 9, 166518–166530. https://doi.org/10.1109/ACCESS.2021.3135658