Gaussian segmentation and tokenization for low cost language identification

Ana Montalvo; José Ramón Calvo De Lara; Gabriel Hernández-Sierra

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013) 8258 LNCS(PART 1) 551-558

DOI: 10.1007/978-3-642-41822-8_69

Abstract

Most common approaches to phonotactic language recognition use phone decoders as tokenizers. However, units that are not tied to phonetic definitions can be more universal, and therefore conceptually easier to adopt. It is assumed that the overall sound characteristics of all spoken languages can be covered by a broad collection of acoustic units, which can be characterized by acoustic segments. In this paper, such acoustic units, highly desirable for a more general language characterization, are delimited and clustered using a Gaussian Mixture Model. A new method for segmenting speech into acoustic units is proposed for later Gaussian modelling, with the aim of replacing the phonetic recognizer. This tokenizer is trained on untranscribed data, and it precedes the statistical language modeling phase. © Springer-Verlag 2013.
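
As a rough illustration of the pipeline the abstract outlines (Gaussian tokenization followed by statistical language modeling), the sketch below labels each feature frame with its most likely Gaussian component, merges runs of identical labels into segments, and counts token bigrams. This is not the segmentation method proposed in the paper, whose details are not given here; the frame-level argmax rule, the run-merging step, the diagonal covariances, the function names, and the toy component parameters are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def tokenize(frames, components):
    """Label each frame with its most likely component, then merge runs.

    components: list of (weight, mean, var) tuples. These are hypothetical,
    already-trained mixture parameters; no training is done here.
    Returns a list of (token, start_frame, end_frame_exclusive) segments.
    """
    labels = [
        max(
            range(len(components)),
            key=lambda k: math.log(components[k][0])
            + log_gauss_diag(x, components[k][1], components[k][2]),
        )
        for x in frames
    ]
    segments = []
    for i, lab in enumerate(labels):
        if segments and segments[-1][0] == lab:
            # Same token as the previous frame: extend the open segment.
            tok, start, _ = segments[-1]
            segments[-1] = (tok, start, i + 1)
        else:
            # Token changed: open a new segment at this frame.
            segments.append((lab, i, i + 1))
    return segments


def bigram_counts(segments):
    """Count adjacent token pairs, the raw input to a bigram language model."""
    tokens = [tok for tok, _, _ in segments]
    return Counter(zip(tokens, tokens[1:]))


# Toy two-component mixture and five 2-D frames (made-up values).
components = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.5, [5.0, 5.0], [1.0, 1.0]),
]
frames = [[0.1, -0.2], [0.3, 0.1], [4.8, 5.1], [5.2, 4.9], [0.0, 0.2]]

segments = tokenize(frames, components)
counts = bigram_counts(segments)
print(segments)
print(counts)
```

In a full system the mixture would first be trained on untranscribed speech features, and the bigram counts would feed one language model per target language; both steps are outside the scope of this sketch.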

Cite

Montalvo, A., Calvo De Lara, J. R., & Hernández-Sierra, G. (2013). Gaussian segmentation and tokenization for low cost language identification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8258 LNCS, pp. 551–558). https://doi.org/10.1007/978-3-642-41822-8_69