This paper evaluates the effectiveness of several algorithms for assessing text similarity, with the aim of determining the best methods for comparing web pages and identifying near-identical ones. Such comparison is a prerequisite for building new automated testing tools and security scanners that can automatically test a web application's behavior over a wide range of supplied parameters (a technique known as fuzzing). This kind of testing requires generating and processing requests on a massive scale, which in turn requires fast page comparison methods. The similarity comparison is performed on a shortened, tokenized version of each web page, using a test set of pages downloaded from popular websites. A methodology for evaluating similarity metrics is proposed, together with a quality metric for the intended task. Several tokenization strategies are also tested, and their impact on the final result is assessed.
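As a rough illustration of comparing pages on a shortened, tokenized form, the sketch below reduces each page to its sequence of HTML tag names and scores two pages with two generic measures. The abstract does not name the tokenization strategies or similarity metrics that were evaluated, so the tag-only tokenizer, the Jaccard index, and the `difflib` sequence ratio used here are assumptions chosen for illustration, and the names `TagTokenizer`, `tokenize`, `jaccard`, and `sequence_ratio` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Reduce a page to a short sequence of structural tokens (tag names only).

    Assumption: text content and attributes are discarded, so pages that
    differ only in their text yield identical token sequences.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")


def tokenize(html):
    """Return the list of tag tokens found in an HTML string."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def jaccard(a, b):
    """Jaccard index of the two token sets (ignores order and repetition)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty pages are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def sequence_ratio(a, b):
    """Order-sensitive similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a, b).ratio()
```

Under these assumptions, two pages with the same markup but different text, such as `<p>Hello</p>` and `<p>Bye</p>`, tokenize identically and score 1.0 on both measures, while pages with different structure score lower. A set-based measure is cheaper to compute but blind to token order, which is the kind of speed versus accuracy trade-off a fuzzing scanner has to weigh.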
Zachara, M., & Pałka, D. (2016). Comparison of text-similarity metrics for the purpose of identifying identical web pages during automated web application testing. In Advances in Intelligent Systems and Computing (Vol. 430, pp. 25–35). Springer Verlag. https://doi.org/10.1007/978-3-319-28561-0_3