Web pages, text types, and linguistic features: Some issues

Marina Santini

Journal Article, Open Access

Santini M

Zanry Reci (2019) 21(1) 22-33

DOI: 10.18500/2311-0740-2019-1-21-22-33

Abstract

From a textual point of view, the web is a huge reservoir of documents. On the web, virtually everything can be seen as a ‘document’ or, better, a ‘web page’. The sheer amount of text available is overwhelming. Furthermore, the web is largely wild and uncontrolled. This becomes clear if we compare a ‘tamed’ resource of the paper world, like the British National Library, with the ‘untamed’ English Web. In this empirical study, I investigated text typologies in a random sample of raw web pages rather than in a corpus of preselected and pre-processed documents. I realized that the textuality of web pages might differ from the textuality of linear documents (whether paper or electronic). This new textuality makes automatic feature extraction and the application of NLP tools more troublesome. I also realized that the text typologies already available in the literature might not cover all web page types. The issues pointed out in this study do not have an easy solution. For the time being, my suggestion is to keep them in mind when assessing results from any automatic approach to web pages.

Cite

Santini, M. (2019). Web pages, text types, and linguistic features: Some issues. Zanry Reci, 21(1), 22–33. https://doi.org/10.18500/2311-0740-2019-1-21-22-33
