The web has been a subject of research since its beginning, but analyzing the whole web is difficult if not impossible, even if a database of all URLs were freely accessible. Hundreds of studies have used commercial lists of top websites as a shortcut, in particular the Alexa One Million Top Sites list. However, apart from the fact that Amazon decided to terminate Alexa, we question the usefulness of such lists for research, as they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer little value for language-specific research. We present a heuristic-driven alternative based on the Common Crawl host-level web graph that also takes language-specific requirements into account.
Alby, T., & Jäschke, R. (2022). Analyzing the Web: Are Top Websites Lists a Good Choice for Research? In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13541 LNCS, pp. 11–25). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-16802-4_2