A comparative study of dictionaries and corpora as methods for language resource addition

Abstract

In this paper, we investigate the relative effect of two strategies for language resource addition for Japanese morphological analysis, a joint task of word segmentation and part-of-speech tagging. The first strategy is adding entries to the dictionary, and the second is adding annotated sentences to the training corpus. The experimental results showed that adding annotated sentences to the training corpus works better than adding entries to the dictionary. Adding annotated sentences is especially efficient when new words are added together with the contexts of several real occurrences as partially annotated sentences, i.e. sentences in which only some words are annotated with word boundary information. Based on this finding, we performed real annotation experiments on invention disclosure texts and observed the resulting word segmentation accuracy. Finally, we investigated various language resource addition cases and introduced the notions of non-maleficence, asymmetricity, and additivity of language resources for a task. In the word segmentation (WS) case, we found that language resource addition is non-maleficent (adding new resources causes no harm in other domains) and sometimes additive (adding new resources helps other domains). We conclude that it is reasonable for us, as NLP tool providers, to distribute only one general-domain model trained from all the language resources we have.
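
To make the idea of a partially annotated sentence concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation or data format: the function name annotate_word and the three-valued labels (True for a word boundary, False for no boundary, None for unannotated) are assumptions introduced for this example. Each label sits on the gap between two adjacent characters, and only the gaps around and inside one target word are filled in.

```python
from typing import List, Optional


def annotate_word(sentence: str, word: str) -> List[Optional[bool]]:
    """Partially annotate the first occurrence of `word` in `sentence`.

    Hypothetical helper for illustration. Returns one label per gap
    between adjacent characters:
    True = word boundary, False = no boundary, None = unannotated.
    """
    # Gap i lies between sentence[i] and sentence[i + 1].
    labels: List[Optional[bool]] = [None] * max(len(sentence) - 1, 0)
    start = sentence.find(word)
    if start < 0 or not word:
        return labels  # word absent: everything stays unannotated
    end = start + len(word)
    if start > 0:
        labels[start - 1] = True   # boundary just before the word
    for gap in range(start, end - 1):
        labels[gap] = False        # no boundary inside the word
    if end < len(sentence):
        labels[end - 1] = True     # boundary just after the word
    return labels


if __name__ == "__main__":
    # Only the gaps touching "bcd" are labelled; the rest stay None.
    print(annotate_word("abcdef", "bcd"))
```

Under this assumed representation, a fully annotated sentence is one with no None labels, and a new dictionary entry corresponds to a word with no surrounding context at all. A learner that accepts such input would only use the labelled gaps as training signal.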

Cite

Mori, S., & Neubig, G. (2016). A comparative study of dictionaries and corpora as methods for language resource addition. Language Resources and Evaluation, 50(2), 245–261. https://doi.org/10.1007/s10579-016-9354-7
