An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)

Pavel Rychlý; Adam Kilgarriff

Conference Proceedings

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2007) 41-44

DOI: 10.3115/1557769.1557783

Abstract

Gorman and Curran (2006) argue that thesaurus generation for billion+-word corpora is problematic because the full computation takes many days. We present an algorithm with which the computation takes under two hours. We have created, and made publicly available, thesauruses based on large corpora for (at time of writing) seven major world languages. The development is implemented in the Sketch Engine (Kilgarriff et al., 2004). Another innovative development in the same tool is the presentation of the grammatical behaviour of a word against the background of how all other words of the same word class behave. For example, the English noun constraint occurs in the plural 75% of the time. Is this a salient lexical fact? To form a judgement, we need to know the distribution for all nouns. We use histograms to present the distribution in a way that is easy to grasp.
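The comparison described in the abstract, one word's plural share set against the distribution for all nouns, can be illustrated with a minimal Python sketch. This is not the Sketch Engine implementation. The function names and every figure except the 75% for constraint are hypothetical, chosen only to show the idea of binning plural shares into a histogram and locating one noun within it.

```python
from bisect import bisect_right


def plural_share_histogram(shares, n_bins=10):
    """Count how many nouns fall into each equal-width bin of plural share (0.0 to 1.0)."""
    counts = [0] * n_bins
    for share in shares:
        # A share of exactly 1.0 is placed in the last bin.
        index = min(int(share * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def percentile_rank(shares, value):
    """Percentage of nouns whose plural share is less than or equal to `value`."""
    ordered = sorted(shares)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical plural shares; only the 0.75 for "constraint" comes from the abstract.
noun_plural_share = {
    "water": 0.02,
    "advice": 0.15,
    "house": 0.25,
    "idea": 0.35,
    "problem": 0.45,
    "child": 0.55,
    "constraint": 0.75,
    "scissors": 0.92,
}

shares = list(noun_plural_share.values())
histogram = plural_share_histogram(shares)

# Print a simple text histogram, one row per 10% band of plural share.
for i, count in enumerate(histogram):
    print(f"{i * 10:3d}-{(i + 1) * 10:3d}% {'#' * count}")

rank = percentile_rank(shares, noun_plural_share["constraint"])
print(f"constraint is at or above {rank:.1f}% of nouns in this toy sample")
```

With real corpus counts in place of the toy dictionary, a noun that sits far out in the tail of the histogram would be a candidate for a salient lexical fact, while one near the bulk of the distribution would not.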

Cite

Rychlý, P., & Kilgarriff, A. (2007). An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments). In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 41–44). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1557769.1557783
