Speech quality estimation with deep lattice networks

Michael Chinen; Jan Skoglund; Andrew Hines

Journal Article (Open Access)

The Journal of the Acoustical Society of America (2021) 149(6) 3851-3861

DOI: 10.1121/10.0005130

Abstract

Intrusive subjective speech quality estimation of mean opinion score (MOS) often involves mapping a raw similarity score extracted from differences between the clean and degraded utterances onto MOS with a fitted mapping function. More recent models such as support vector regression (SVR) or deep neural networks use multidimensional input, which allows for a more accurate prediction than one-dimensional (1-D) mappings but does not provide the monotonic property that is expected between similarity and quality. We investigate a multidimensional mapping function using deep lattice networks (DLNs) to provide monotonic constraints with input features provided by ViSQOL. The DLN improved the speech mapping to 0.24 mean-square error on a mixture of datasets that include voice over IP and codec degradations, outperforming the 1-D fitted functions and SVR as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well-calibrated and a useful measure of uncertainty. The quantile function provides an improved mapping of data-driven similarity representations to human-interpretable scales, such as quantile intervals for predictions instead of point estimates.
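To make the idea of a learned quantile function more concrete, the following is a minimal sketch of the pinball (quantile) loss, a standard objective for quantile regression. The abstract does not state which loss the authors used, so this is an assumption for illustration only. The function names, the MOS values, and the candidate grid are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: pinball (quantile) loss, a standard objective for
# quantile regression. The abstract does not say which loss the authors used.

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau in (0, 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant_quantile(y_true, tau, candidates):
    """Return the candidate constant prediction with the lowest pinball loss."""
    return min(
        candidates,
        key=lambda q: pinball_loss(y_true, [q] * len(y_true), tau),
    )


# Hypothetical MOS ratings on a 1-5 scale (made-up numbers, not from the paper).
mos = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
grid = [1.0 + 0.5 * i for i in range(9)]  # 1.0, 1.5, ..., 5.0

median_estimate = best_constant_quantile(mos, 0.5, grid)
upper_estimate = best_constant_quantile(mos, 0.9, grid)
```

Minimizing this loss at several levels of tau yields estimates of the corresponding quantiles, which is how a pair of levels can be turned into a prediction interval rather than a single point estimate. A model such as a DLN would replace the constant prediction here with a function of the input features.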

Cite

Chinen, M., Skoglund, J., & Hines, A. (2021). Speech quality estimation with deep lattice networks. The Journal of the Acoustical Society of America, 149(6), 3851–3861. https://doi.org/10.1121/10.0005130
