Comparison of text-similarity metrics for the purpose of identifying identical web pages during automated web application testing

Marek Zachara; Dariusz Pałka

Conference Proceedings

Advances in Intelligent Systems and Computing (2016) 430 25-35

DOI: 10.1007/978-3-319-28561-0_3

Abstract

This paper evaluates the effectiveness of several algorithms used to assess text similarity, with the aim of determining the best methods for comparing and identifying near-identical web pages. Such comparison is a prerequisite for building new automated testing tools and security scanners. The goal is to build scanners that can automatically test web application behavior across a large range of supplied parameters (a technique known as fuzzing). Such testing requires generating and processing a massive number of requests, which in turn requires fast page comparison methods. The similarity comparison is performed on shortened, tokenized versions of web pages, using a test set of pages downloaded from popular websites. A methodology for evaluating similarity metrics is proposed, together with a quality metric for the intended task. Several tokenization strategies are also tested, and their impact on the final result is assessed.
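
The idea of comparing shortened, tokenized pages can be illustrated with a minimal sketch. The abstract does not name the specific tokenization strategies or similarity metrics the authors evaluated, so everything below is an assumption made for illustration only: the tokenizer keeps just the sequence of opening tag names, and the metric is the Jaccard similarity of the resulting token sets. The names `TagTokenizer`, `tokenize`, and `jaccard` are hypothetical.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Hypothetical tokenizer: keeps opening-tag names, drops text and attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Each opening tag becomes one token; the page content is ignored.
        self.tokens.append(tag)


def tokenize(html):
    """Reduce an HTML page to a short list of tag-name tokens."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def jaccard(a, b):
    """Jaccard similarity of two token lists, treated as sets (assumed metric)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty pages are considered identical
    return len(set_a & set_b) / len(set_a | set_b)


# Two pages with the same structure but different text, and one different page.
page_a = "<html><body><h1>Hi</h1><p>one</p></body></html>"
page_b = "<html><body><h1>Hello</h1><p>two</p></body></html>"
page_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"

sim_ab = jaccard(tokenize(page_a), tokenize(page_b))  # same structure
sim_ac = jaccard(tokenize(page_a), tokenize(page_c))  # different structure
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

In this sketch, pages that differ only in their text collapse to the same token list and score as identical, while a structurally different page scores much lower. A set-based metric ignores token order and repetition; an order-sensitive metric (for example an edit distance over the token sequence) would trade speed for sensitivity, which is the kind of trade-off the evaluation described above is meant to quantify.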

Cite

Zachara, M., & Pałka, D. (2016). Comparison of text-similarity metrics for the purpose of identifying identical web pages during automated web application testing. In Advances in Intelligent Systems and Computing (Vol. 430, pp. 25–35). Springer Verlag. https://doi.org/10.1007/978-3-319-28561-0_3
