This work proposes a method for improving automatic phonetic alignment of speech data. A deep convolutional neural network (CNN), trained on a combination of acoustic features extracted from labeled data, fine-tunes the position of each boundary within a fixed-size window around its original position. The method is robust to speaker identity: a system trained with enough labeled data can be used to fine-tune the alignment of any speech file, regardless of who the speaker is. With an absolute gain of between 20% and 33% in the cross-speaker scenario, our results demonstrate the applicability of deep learning to this task.
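The windowed refinement step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the one-value-per-frame feature representation, and the scoring function are all assumptions made for illustration. In particular, `spectral_change` is a toy stand-in for the trained CNN, which in the paper would score candidate positions from acoustic features.

```python
from typing import Callable, Sequence


def refine_boundary(
    features: Sequence[float],
    boundary: int,
    half_window: int,
    score_fn: Callable[[Sequence[float], int], float],
) -> int:
    """Move a boundary to the best-scoring frame inside a fixed-size window.

    `score_fn` is a hypothetical placeholder for a trained model that rates
    how likely a frame is to be a true phone boundary.
    """
    # Clip the search window to the valid frame range.
    lo = max(0, boundary - half_window)
    hi = min(len(features) - 1, boundary + half_window)
    # Highest score wins; ties go to the frame closest to the original boundary.
    return max(
        range(lo, hi + 1),
        key=lambda i: (score_fn(features, i), -abs(i - boundary)),
    )


def spectral_change(features: Sequence[float], i: int) -> float:
    """Toy scorer (assumption, not a CNN): change between adjacent frames."""
    if i == 0:
        return 0.0
    return abs(features[i] - features[i - 1])


# One scalar per frame, with an abrupt change between frames 3 and 4.
frames = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
refined = refine_boundary(frames, boundary=2, half_window=3, score_fn=spectral_change)
```

Here the initial boundary at frame 2 is moved to frame 4, where the toy scorer peaks. Because the search is confined to the window around the original boundary, the refinement can only correct the initial alignment locally, and replacing `spectral_change` with a model trained on labeled data would not change the search logic.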
CITATION STYLE
Cuozzo, L. G. D., Silva, D. A., Neto, M. U., Simões, F. O., & Nagle, E. J. (2018). CNN-Based Phonetic Segmentation Refinement with a Cross-Speaker Setup. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11122 LNAI, pp. 448–456). Springer Verlag. https://doi.org/10.1007/978-3-319-99722-3_45