Can We quantify domainhood? Exploring measures to assess domain-specificity in web corpora

Abstract

Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have assessed the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures (namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness) to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
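The abstract does not say how burstiness is computed. As a rough illustration only, one common formulation takes a word's total corpus frequency divided by the number of documents that contain it, so that words concentrated in a few documents score higher than words spread evenly. The function name `burstiness` and the toy documents below are assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter

def burstiness(docs):
    """Map each word to corpus frequency / document frequency.

    Assumed formulation for illustration; the paper may define it differently.
    `docs` is a list of tokenized documents (lists of strings).
    """
    cf = Counter()  # total occurrences of each word across all documents
    df = Counter()  # number of documents in which each word appears
    for tokens in docs:
        cf.update(tokens)
        df.update(set(tokens))
    return {word: cf[word] / df[word] for word in cf}

# Toy corpus (hypothetical): "asthma" clusters in documents, "the" does not.
docs = [
    "the patient has asthma and asthma medication".split(),
    "the weather is nice".split(),
    "the asthma clinic is open".split(),
]
scores = burstiness(docs)
```

Under this formulation a score near 1 means the word occurs about once in each document that contains it, while higher scores mean it tends to recur within documents, which is the behaviour one would hope to see for domain-specific terms.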

Santini, M., Strandqvist, W., Nyström, M., Alirezai, M., & Jönsson, A. (2018). Can We quantify domainhood? Exploring measures to assess domain-specificity in web corpora. In Communications in Computer and Information Science (Vol. 903, pp. 207–217). Springer Verlag. https://doi.org/10.1007/978-3-319-99133-7_17
