Abstract
Word clusters improve performance in many NLP tasks, including training neural network language models, but current growth in dataset sizes is outpacing the ability of word clusterers to handle them. In this paper we present a novel bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm and introduce ClusterCat, a clusterer based on this algorithm. We show that ClusterCat is 3 to 85 times faster than four other well-known clusterers, while also improving upon the predictive exchange algorithm's perplexity by up to 18%. Notably, ClusterCat clusters a 2.5 billion token English News Crawl corpus in 3 hours. We also evaluate in a machine translation setting, where ClusterCat yields shorter training times while achieving the same translation quality as measured by BLEU scores. ClusterCat is portable and freely available.
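The BIRA extensions themselves are detailed in the paper; as background, the classic exchange-style word clustering that the predictive exchange algorithm builds on can be sketched as follows. This is a toy illustration under a standard class-bigram objective, not the authors' implementation, and all function names here are hypothetical:

```python
from collections import Counter
from math import log

def bigram_counts(tokens):
    """Counts of adjacent word pairs in the token stream."""
    return Counter(zip(tokens, tokens[1:]))

def class_ll(bigrams, word_counts, assign):
    """Class-bigram log-likelihood (up to a constant):
    sum N(c1,c2) log N(c1,c2) - 2 sum N(c) log N(c) + sum N(w) log N(w)."""
    cc = Counter()
    for (a, b), n in bigrams.items():
        cc[(assign[a], assign[b])] += n
    c_count = Counter()
    for w, n in word_counts.items():
        c_count[assign[w]] += n
    ll = sum(n * log(n) for n in cc.values())
    ll -= 2 * sum(n * log(n) for n in c_count.values())
    ll += sum(n * log(n) for n in word_counts.values())
    return ll

def exchange_cluster(tokens, k, iters=5):
    """Greedy exchange: move each word to the cluster that most
    improves the objective; stop when no word moves."""
    bigrams = bigram_counts(tokens)
    word_counts = Counter(tokens)
    vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    assign = {w: i % k for i, w in enumerate(vocab)}  # frequency-rank init
    for _ in range(iters):
        moved = False
        for w in vocab:
            orig = assign[w]
            best_c, best_ll = orig, class_ll(bigrams, word_counts, assign)
            for c in range(k):
                if c == orig:
                    continue
                assign[w] = c
                ll = class_ll(bigrams, word_counts, assign)
                if ll > best_ll + 1e-12:
                    best_c, best_ll = c, ll
            assign[w] = best_c
            moved = moved or best_c != orig
        if not moved:
            break
    return assign
```

A real clusterer computes each candidate move's score incrementally rather than re-evaluating the full objective, which is the kind of engineering (along with the BIRA refinements) that makes the reported speedups possible; this naive version is only meant to show the objective and the exchange loop.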
Citation
Dehdari, J., Tan, L., & Van Genabith, J. (2016). Scaling up word clustering. In NAACL-HLT 2016 - 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Demonstrations Session (pp. 42–46). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/n16-3009