Web pages, text types, and linguistic features: Some issues

Abstract

From a textual point of view, the web is a huge reservoir of documents. On the web, virtually everything can be seen as a ‘document’ or, better, a ‘web page’. The sheer amount of text available is overwhelming. Furthermore, the web is mainly wild and uncontrolled. This becomes clear if we compare a ‘tamed’ resource of the paper world, like the British National Library, with the ‘untamed’ English Web. In this empirical study, I investigated text typologies in a random sample of raw web pages, rather than in a corpus of preselected and pre-processed documents. I realized that the textuality of web pages might differ from the textuality of linear documents (whether paper or electronic). This new textuality makes automatic feature extraction and the application of NLP tools more troublesome. I also realized that the text typologies already available in the literature might not cover all web page types. The issues pointed out in this study do not have an easy solution. For the time being, my suggestion is to keep them in mind when assessing results from any automatic approach to web pages.

Santini, M. (2019). Web pages, text types, and linguistic features: Some issues. Zanry Reci, 21(1), 22–33. https://doi.org/10.18500/2311-0740-2019-1-21-22-33
