Linguistic resources available in the public domain, such as lemmatisers, part-of-speech taggers and parsers, can be used in the development of MT systems, either as separate processing modules or as annotation tools for the training corpus. For SMT, this annotation is used to train factored models; for rule-based systems, a linguistically annotated corpus is the basis for creating analysis, generation and transfer dictionaries. However, in many cases the annotation is insufficient for rule-based MT, especially for generation tasks. In this paper we analyze a specific case in which the part-of-speech tagger does not provide the de/het gender of Dutch nouns, which is needed by our rule-based MT systems translating into Dutch. We show that this information can be derived from large annotated monolingual corpora using a set of context-checking rules based on the co-occurrence of nouns and determiners in certain morphosyntactic configurations. As not all contexts are sufficient for disambiguation, we evaluate the coverage and accuracy of our method for different frequency thresholds in news corpora. We further discuss possible generalizations of our method and its use for automatically deriving other types of linguistic information needed for rule-based MT: syntactic subcategorization frames, feature agreement rules and contextually appropriate collocates.
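The general idea of deriving de/het gender from determiner-noun co-occurrence can be illustrated with a minimal sketch. It is not the authors' rule set, which the abstract does not spell out. It assumes a hypothetical tagset (`DET`, `N_SG`, `N_PL`), a single context rule (a definite determiner immediately followed by a singular noun), and an illustrative function name `derive_gender` with a `min_count` frequency threshold. Plural nouns are skipped because plurals take "de" regardless of gender, so that context does not disambiguate.

```python
from collections import Counter, defaultdict


def derive_gender(tagged_sentences, min_count=2):
    """Guess de/het gender per noun from determiner + singular-noun bigrams.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    The tags "DET" and "N_SG" are assumptions made for this sketch.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        # Look at each adjacent pair of tokens in the sentence.
        for (word, tag), (next_word, next_tag) in zip(sentence, sentence[1:]):
            det = word.lower()
            # Only a singular noun directly after "de"/"het" counts as evidence.
            if tag == "DET" and det in ("de", "het") and next_tag == "N_SG":
                counts[next_word.lower()][det] += 1

    genders = {}
    for noun, c in counts.items():
        if c["de"] == c["het"]:
            continue  # evidence is tied, leave the noun unclassified
        det, freq = c.most_common(1)[0]
        if freq >= min_count:  # frequency threshold on the winning determiner
            genders[noun] = det
    return genders


# Toy corpus, invented for illustration only.
example_sentences = [
    [("De", "DET"), ("man", "N_SG"), ("leest", "V"), ("het", "DET"), ("boek", "N_SG")],
    [("Het", "DET"), ("boek", "N_SG"), ("is", "V"), ("van", "PREP"), ("de", "DET"), ("man", "N_SG")],
    [("De", "DET"), ("boeken", "N_PL"), ("zijn", "V"), ("oud", "ADJ")],
    [("De", "DET"), ("fiets", "N_SG"), ("staat", "V"), ("buiten", "ADV")],
]
example_result = derive_gender(example_sentences, min_count=2)
```

In this toy corpus, "man" and "boek" each reach the threshold, while "fiets" is seen only once and "boeken" occurs only in a plural context, so neither is classified. Raising `min_count` trades coverage for accuracy, which is the kind of trade-off the evaluation over frequency thresholds examines.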
Babych, B., Geiger, J., Rosell, M. G., & Eberle, K. (2014). Deriving de/het gender classification for Dutch nouns for rule-based MT generation tasks. In Proceedings of the 3rd Workshop on Hybrid Approaches to Translation, HyTra 2014 at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014 (pp. 75–81). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-1014