Creating Chinese-English comparable corpora

Degen Huang; Shanshan Wang; Fuji Ren

Journal Article, Open Access

IEICE Transactions on Information and Systems (2013) E96-D(8) 1853-1861

DOI: 10.1587/transinf.E96.D.1853

3 Citations

6 Readers

Abstract

Comparable corpora are valuable resources for many NLP applications, and information mining based on comparable corpora has been researched extensively in recent years. Because few large-scale comparable corpora are publicly available, this paper presents a bi-directional method, based on cross-language information retrieval (CLIR), for creating comparable corpora from two independent news collections in different languages. The original Chinese and English document collections are each crawled from XinHuaNet and formatted in a consistent manner. For each document in either collection, the best query keywords are extracted to represent the document's essential content and are then translated into the language of the other collection. The translated queries are run against the other collection to retrieve candidate documents, and the candidates are aligned with the source document based on publication dates and similarity scores. Results show that our approach significantly outperforms previous approaches to the construction of Chinese-English comparable corpora. Copyright © 2013 The Institute of Electronics, Information and Communication Engineers.
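The pipeline summarized in the abstract (keyword extraction, query translation, retrieval in the other collection, alignment by date and similarity) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not specify it: TF-IDF as the keyword ranker, a toy bilingual lexicon as the translator, query-term overlap as the similarity score, a fixed date window, and the rule that a pair is kept only when each document retrieves the other. The names `Doc`, `top_keywords`, `translate`, `best_match`, and `align` are hypothetical, as is the toy data.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    doc_id: str
    published: date
    tokens: tuple  # pre-tokenized text (assumed to be done upstream)


def top_keywords(doc, collection, k=3):
    """Rank the document's terms by TF-IDF (an assumed stand-in ranker)."""
    n = len(collection)
    df = Counter(t for d in collection for t in set(d.tokens))
    tf = Counter(doc.tokens)
    score = {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf}
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def translate(keywords, lexicon):
    """Translate keywords with a toy lexicon, dropping unknown words."""
    return [lexicon[w] for w in keywords if w in lexicon]


def best_match(query, source_date, targets, max_days=1):
    """Return (doc_id, similarity) of the best candidate inside the date window."""
    best = None
    for d in targets:
        if abs((d.published - source_date).days) > max_days:
            continue  # publication dates too far apart
        sim = len(set(query) & set(d.tokens)) / len(query) if query else 0.0
        if sim > 0 and (best is None or sim > best[1]):
            best = (d.doc_id, sim)
    return best


def align(src, tgt, src2tgt, tgt2src, k=3, max_days=1):
    """Keep pairs where each document retrieves the other (assumed rule)."""
    def one_way(docs, others, lexicon):
        hits = {}
        for d in docs:
            query = translate(top_keywords(d, docs, k), lexicon)
            hit = best_match(query, d.published, others, max_days)
            if hit is not None:
                hits[d.doc_id] = hit[0]
        return hits

    forward = one_way(src, tgt, src2tgt)
    backward = one_way(tgt, src, tgt2src)
    return [(s, t) for s, t in forward.items() if backward.get(t) == s]


# Toy data, invented for illustration only.
zh_docs = [
    Doc("zh1", date(2012, 5, 1), ("地震", "救援", "地震", "灾区")),
    Doc("zh2", date(2012, 5, 1), ("股市", "上涨", "股市", "银行")),
]
en_docs = [
    Doc("en1", date(2012, 5, 2), ("earthquake", "rescue", "earthquake", "zone")),
    Doc("en2", date(2012, 5, 1), ("stocks", "rise", "stocks", "bank")),
    Doc("en3", date(2012, 6, 1), ("earthquake", "rescue", "drill")),
]
zh2en = {"地震": "earthquake", "救援": "rescue", "灾区": "zone",
         "股市": "stocks", "上涨": "rise", "银行": "bank"}
en2zh = {en: zh for zh, en in zh2en.items()}

pairs = align(zh_docs, en_docs, zh2en, en2zh)
print(pairs)
```

In this toy run, `en3` shares vocabulary with `zh1` but falls outside the date window, so it is never considered as a candidate. A real system would replace the lexicon and the overlap score with a proper translation resource and a retrieval engine's ranking function.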

Citation (APA)

Huang, D., Wang, S., & Ren, F. (2013). Creating Chinese-English comparable corpora. IEICE Transactions on Information and Systems, E96-D(8), 1853–1861. https://doi.org/10.1587/transinf.E96.D.1853
