Scaling up word clustering

Abstract

Word clusters improve performance in many NLP tasks, including the training of neural network language models, but current growth in dataset sizes is outpacing the ability of word clusterers to handle them. In this paper we present a novel bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm and introduce ClusterCat, a clusterer based on this algorithm. We show that ClusterCat is 3–85 times faster than four other well-known clusterers, while also improving upon the predictive exchange algorithm's perplexity by up to 18%. Notably, ClusterCat clusters a 2.5 billion token English News Crawl corpus in 3 hours. We also evaluate in a machine translation setting, where shorter training times achieve the same translation quality as measured by BLEU scores. ClusterCat is portable and freely available.
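The exchange family of algorithms that BIRA builds on works by greedily moving each vocabulary word to the class that most improves the likelihood of a class-based language model, sweeping over the vocabulary until assignments stabilize. A minimal, illustrative sketch of that idea is below; this is not ClusterCat's implementation, and the naive full-recount scoring and function names are simplifications for clarity (real clusterers update counts incrementally rather than rescoring the whole corpus per move):

```python
from collections import Counter
import math

def class_bigram_ll(tokens, assign):
    """Log-likelihood of the corpus under a class bigram model:
    P(w_i | w_{i-1}) ~= P(c_i | c_{i-1}) * P(w_i | c_i)."""
    word_counts = Counter(tokens)
    class_counts = Counter(assign[w] for w in tokens)
    bigram_counts = Counter(
        (assign[a], assign[b]) for a, b in zip(tokens, tokens[1:])
    )
    ll = 0.0
    for (c1, c2), n in bigram_counts.items():       # class transition term
        ll += n * math.log(n / class_counts[c1])
    for w, n in word_counts.items():                # word emission term
        ll += n * math.log(n / class_counts[assign[w]])
    return ll

def exchange_cluster(tokens, k, sweeps=3):
    """Greedy exchange clustering: try moving each word to every other
    class, keeping the move only if it improves the log-likelihood."""
    vocab = [w for w, _ in Counter(tokens).most_common()]
    assign = {w: i % k for i, w in enumerate(vocab)}  # frequency-based init
    for _ in range(sweeps):
        for w in vocab:
            best_c = assign[w]
            best_ll = class_bigram_ll(tokens, assign)
            for c in range(k):
                if c == best_c:
                    continue
                assign[w] = c
                ll = class_bigram_ll(tokens, assign)  # naive full rescore
                if ll > best_ll:
                    best_c, best_ll = c, ll
            assign[w] = best_c
    return assign
```

Because a move is accepted only when it strictly increases the log-likelihood, each sweep is monotonically non-decreasing in model likelihood; BIRA's contributions (bidirectional context, interpolation, refining, alternating) improve on this basic scheme in both speed and perplexity.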

Citation (APA)

Dehdari, J., Tan, L., & Van Genabith, J. (2016). Scaling up word clustering. In NAACL-HLT 2016 - 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Demonstrations Session (pp. 42–46). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/n16-3009
