Diversification of Serbian-French-English-Spanish Parallel Corpus ParCoLab with Spoken Language Data

Dušica Terzić; Saša Marjanović; Dejan Stosic; Aleksandra Miletic

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2020), vol. 12284 LNAI, pp. 61-70

DOI: 10.1007/978-3-030-58323-1_6

Abstract

In this paper, we present our efforts to diversify the Serbian-French-English-Spanish corpus ParCoLab. ParCoLab is a project led by the CLLE research unit (UMR 5263 CNRS) at the University of Toulouse, France, and the Romance Department at the University of Belgrade, Serbia. The main goal of the project is to create a freely searchable and widely applicable multilingual resource with Serbian as the pivot language. Initially, most of the corpus texts represented written language. Since diversity of text types contributes to the usefulness and applicability of a parallel corpus, considerable effort has gone into including spoken language data in the ParCoLab database. Transcripts and translations of TED talks, films and cartoons have been included so far, along with transcripts of original Serbian films. As a result, the 17.6M-word database of mainly literary texts has been extended with spoken language data and now contains 32.9M words.

Cite

Terzić, D., Marjanović, S., Stosic, D., & Miletic, A. (2020). Diversification of Serbian-French-English-Spanish parallel corpus ParCoLab with spoken language data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12284 LNAI, pp. 61–70). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58323-1_6
