Speech quality estimation with deep lattice networks

  • Chinen M
  • Skoglund J
  • Hines A

This article is free to access.

Abstract

Intrusive estimation of subjective speech quality, expressed as a mean opinion score (MOS), often involves mapping a raw similarity score, extracted from differences between the clean and degraded utterances, onto MOS with a fitted mapping function. More recent models such as support vector regression (SVR) or deep neural networks use multidimensional input, which allows for more accurate prediction than one-dimensional (1-D) mappings but does not provide the monotonic relationship that is expected between similarity and quality. We investigate a multidimensional mapping function using deep lattice networks (DLNs) to provide monotonic constraints, with input features provided by ViSQOL. The DLN reduced the mean-square error of the speech mapping to 0.24 on a mixture of datasets that include voice over IP and codec degradations, outperforming the 1-D fitted functions and SVR as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well calibrated and a useful measure of uncertainty. The quantile function provides an improved mapping of data-driven similarity representations to human-interpretable scales, such as quantile intervals for predictions instead of point estimates.
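To make the idea of a learned quantile function concrete, the following is a minimal sketch of the pinball (quantile) loss, a standard objective for quantile regression. The abstract does not state which loss the authors used, so this is an assumption for illustration only; the function names, the MOS values, and the search grid are all hypothetical. The sketch shows that minimizing the loss over a constant prediction recovers the corresponding empirical quantile, which is what allows a model trained this way to report quantile intervals rather than a single point estimate.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction (diff > 0) is weighted by q,
    # over-prediction (diff < 0) is weighted by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


def best_constant_prediction(y_true, q, grid):
    """Brute-force search for the constant in `grid` with the lowest pinball loss."""
    n = len(y_true)
    losses = [pinball_loss(y_true, np.full(n, c), q) for c in grid]
    return float(grid[int(np.argmin(losses))])


# Hypothetical MOS ratings on a 1-5 scale (illustrative values, not from the paper).
mos = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
grid = np.linspace(1.0, 5.0, 41)  # candidate predictions in steps of 0.1

# For q = 0.5 the minimizer is the median of the ratings.
median_estimate = best_constant_prediction(mos, 0.5, grid)
print(median_estimate)
```

In a regression setting, the constant would be replaced by a model output conditioned on the similarity features and on q; a monotonic constraint on q would then keep the predicted quantiles from crossing.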

Cite

Chinen, M., Skoglund, J., & Hines, A. (2021). Speech quality estimation with deep lattice networks. The Journal of the Acoustical Society of America, 149(6), 3851–3861. https://doi.org/10.1121/10.0005130
