Latent factor analysis for synthesized speech quality-of-experience assessment

  • Gupta R
  • Falk T

Abstract

Text-to-speech (TTS) systems are evolving and making their way into numerous commercial systems, such as smartphones and assistive technologies. Nevertheless, their user-perceived quality-of-experience (QoE) is still low compared to natural speech, with distortions arising across numerous perceptual dimensions, such as voice pleasantness, comprehension, and appropriateness of intonation, to name a few. Unfortunately, the effects of such perceptual dimensions on overall perceived QoE are still unknown, particularly across listeners of different genders, thus making it difficult for TTS developers to further improve system quality. To overcome this limitation, this study makes use of exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and model invariance tests to shed light on factors responsible for QoE perception across natural and synthesized speech, as well as male and female listeners. Experimental EFA/CFA results on a publicly available database of commercial TTS systems showed the emergence of two key perceptual dimensions responsible for TTS QoE, namely ‘listening pleasure’ and ‘prosody’. Model invariance tests validated the reliability of the model across male and female listeners, as well as across natural and synthetic voices.

Cite

Gupta, R., & Falk, T. H. (2017). Latent factor analysis for synthesized speech quality-of-experience assessment. Quality and User Experience, 2(1). https://doi.org/10.1007/s41233-017-0005-6
