We present the development of a tool for compiling specialized comparable corpora whose quality approaches that of a manually compiled corpus. Comparability is defined at three levels: domain, topic, and type of discourse. Domain and topic can be filtered using the keywords of a web search, but detecting the type of discourse requires a broader linguistic analysis. The first step of our work is to automate the detection of the types of discourse found in scientific domains (scientific and popular science) in French and Japanese. First, a contrastive stylistic analysis of the two types of discourse is carried out in both languages. This analysis leads to a typology that is reusable, generic, and robust. Machine learning algorithms are then applied to the typology, using shallow parsing. We obtain good results, with an average precision of 80% and an average recall of 70%, which demonstrate the efficiency of this typology. The classification tool is then integrated into a corpus compilation tool, a text-collection processing chain implemented with the IBM UIMA system. Starting from two collections of specialized web documents in French and Japanese, this tool creates the corresponding corpus.
CITATION STYLE
Goeuriot, L., Morin, E., & Daille, B. (2009). Compilation of Specialized Comparable Corpora in French and Japanese. In BUCC 2009 - 2nd Workshop on Building and Using Comparable Corpora: From Parallel to Non-Parallel Corpora at the ACL-IJCNLP 2009 - Proceedings (pp. 55–63). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1690339.1690353