All human language technology demands substantial quantities of data for system training and development, plus stable benchmark data to measure ongoing progress. While the creation of high-quality linguistic resources is both costly and time-consuming, such data has the potential to profoundly impact not just a single evaluation program but language technology research in general. GALE’s challenging performance targets demand linguistic data of a scale and complexity never before encountered. Resources cover multiple languages (Arabic, Chinese, and English) and multiple genres, both structured (newswire and broadcast news) and unstructured (web text, including blogs and newsgroups, and broadcast conversation). These resources include significant volumes of monolingual text and speech, parallel text, and transcribed audio, combined with multiple layers of linguistic annotation ranging from word-aligned parallel text and treebanks to rich semantic annotation.
Strassel, S., Christianson, C., McCary, J., Staderman, W., & Olive, J. (2011). Data Acquisition and Linguistic Resources. In Handbook of Natural Language Processing and Machine Translation (pp. 1–131). Springer New York. https://doi.org/10.1007/978-1-4419-7713-7_1