Scaling up word clustering

Abstract

Word clusters improve performance in many NLP tasks, including the training of neural network language models, but current growth in dataset sizes is outpacing the ability of word clusterers to handle them. In this paper we present a novel bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm and introduce ClusterCat, a clusterer based on this algorithm. We show that ClusterCat is 3–85 times faster than four other well-known clusterers, while also improving upon the predictive exchange algorithm's perplexity by up to 18%. Notably, ClusterCat clusters a 2.5 billion token English News Crawl corpus in 3 hours. We also evaluate in a machine translation setting, where shorter training times achieve the same translation quality as measured by BLEU scores. ClusterCat is portable and freely available.
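The exchange family of algorithms that BIRA builds on works by greedily moving each vocabulary word to the class that most improves the likelihood of a class-based language model, sweeping over the vocabulary until assignments stabilize. A minimal, illustrative sketch of that idea is below; this is not ClusterCat's implementation, and the naive full-recount scoring and function names are simplifications for clarity (real clusterers update counts incrementally rather than rescoring the whole corpus per move):

```python
from collections import Counter
import math

def class_bigram_ll(tokens, assign):
    """Log-likelihood of the corpus under a class bigram model:
    P(w_i | w_{i-1}) ~= P(c_i | c_{i-1}) * P(w_i | c_i)."""
    word_counts = Counter(tokens)
    class_counts = Counter(assign[w] for w in tokens)
    bigram_counts = Counter(
        (assign[a], assign[b]) for a, b in zip(tokens, tokens[1:])
    )
    ll = 0.0
    for (c1, c2), n in bigram_counts.items():       # class transition term
        ll += n * math.log(n / class_counts[c1])
    for w, n in word_counts.items():                # word emission term
        ll += n * math.log(n / class_counts[assign[w]])
    return ll

def exchange_cluster(tokens, k, sweeps=3):
    """Greedy exchange clustering: try moving each word to every other
    class, keeping the move only if it improves the log-likelihood."""
    vocab = [w for w, _ in Counter(tokens).most_common()]
    assign = {w: i % k for i, w in enumerate(vocab)}  # frequency-based init
    for _ in range(sweeps):
        for w in vocab:
            best_c = assign[w]
            best_ll = class_bigram_ll(tokens, assign)
            for c in range(k):
                if c == best_c:
                    continue
                assign[w] = c
                ll = class_bigram_ll(tokens, assign)  # naive full rescore
                if ll > best_ll:
                    best_c, best_ll = c, ll
            assign[w] = best_c
    return assign
```

Because a move is accepted only when it strictly increases the log-likelihood, each sweep is monotonically non-decreasing in model likelihood; BIRA's contributions (bidirectional context, interpolation, refining, alternating) improve on this basic scheme in both speed and perplexity.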

Citation (APA)

Dehdari, J., Tan, L., & Van Genabith, J. (2016). Scaling up word clustering. In NAACL-HLT 2016 - 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Demonstrations Session (pp. 42–46). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/n16-3009
