Word embeddings are widely used in many Natural Language Processing (NLP) applications. They are coordinates associated with each word in a dictionary, inferred from statistical properties of those words in a large corpus. In this paper we introduce the notion of a "concept" as a list of words that share semantic content. We use this notion to analyse the learnability of certain concepts, defined as the capability of a classifier to recognise unseen members of a concept after training on a random subset of it. We first use this method to measure the learnability of concepts on pretrained word embeddings. We then develop a statistical analysis of concept learnability, based on hypothesis testing and ROC curves, in order to compare the relative merits of various embedding algorithms using a fixed corpus and fixed hyperparameters. We find that all embedding methods capture the semantic content of these word lists, but that fastText performs better than the others.
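The following is a minimal sketch of the learnability measurement the abstract describes: train a classifier on a random subset of a concept's word list and score how well it recognises the held-out members via ROC AUC. It assumes pretrained GloVe vectors loaded through gensim and a logistic-regression classifier; the paper's actual classifier, concept lists, and negative-sampling scheme are not given here, so those choices are illustrative.

```python
# Sketch of concept learnability, assuming GloVe embeddings via gensim
# and a logistic-regression classifier (illustrative choices, not the
# paper's exact setup).
import random
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model = api.load("glove-wiki-gigaword-100")  # pretrained word embeddings

# Hypothetical concept: a word list with shared semantic content.
concept = ["red", "green", "blue", "yellow", "purple", "orange",
           "pink", "brown", "black", "white", "grey", "violet"]
concept = [w for w in concept if w in model.key_to_index]

# Random split: train on half the concept, test on the unseen half.
rng = random.Random(0)
rng.shuffle(concept)
half = len(concept) // 2
train_pos, test_pos = concept[:half], concept[half:]

# Negative examples: random vocabulary words outside the concept.
vocab = [w for w in model.index_to_key[:20000] if w not in concept]
train_neg = rng.sample(vocab, len(train_pos))
test_neg = rng.sample(vocab, len(test_pos))

def embed(words):
    return np.vstack([model[w] for w in words])

X_train = np.vstack([embed(train_pos), embed(train_neg)])
y_train = np.array([1] * len(train_pos) + [0] * len(train_neg))
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Learnability: can the classifier recognise unseen concept members?
X_test = np.vstack([embed(test_pos), embed(test_neg)])
y_test = np.array([1] * len(test_pos) + [0] * len(test_neg))
scores = clf.predict_proba(X_test)[:, 1]
print("ROC AUC on unseen members:", roc_auc_score(y_test, scores))
```

Repeating this over many random splits (and over different concepts or embedding algorithms trained on the same corpus) yields the distribution of scores on which the paper's hypothesis tests and ROC-curve comparisons are presumably built.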
Citation:
Sutton, A., & Cristianini, N. (2020). On the Learnability of Concepts: With Applications to Comparing Word Embedding Algorithms. In IFIP Advances in Information and Communication Technology (Vol. 584 IFIP, pp. 420–432). Springer. https://doi.org/10.1007/978-3-030-49186-4_35