The advent of the web has resulted in a plethora of information and data. However, its volume, heterogeneity, and unstructured organization make information retrieval difficult. To the existing practice, where website categorization is largely based on style rather than text, the addition of an extra dimension in the form of genre is expected to significantly improve search outcomes. With this in view, we attempt to build a novel classification model that categorizes websites into genres using thresholds of the web metrics. Statistical measures of central tendency are assumed to render a value that distinguishes websites from a sample space containing News, Travel and Tourism, Entertainment, and Social Media. Through statistical analysis of the data, we find that the distributions of all metrics that constitute the website properties are highly skewed. Hence, conventional analysis based on normal-distribution statistics does not apply. Adopting a systematic empirical approach, we find that the classification performance, measured by the Area Under the Curve, is maximized around a threshold value that is twice the median absolute deviation of the web metrics.
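As a rough illustration of the thresholding idea described above, the following minimal sketch computes the median absolute deviation (MAD) of a metric, takes twice that value as a threshold, and scores the resulting binary split with a pairwise Area Under the Curve. The function names, the sample values, the labels, and the choice to flag values strictly above the threshold are all assumptions made for this example; they are not taken from the paper.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the sample median."""
    m = median(values)
    return median([abs(v - m) for v in values])


def mad_threshold(values, k=2.0):
    """Threshold at k times the MAD (k=2 mirrors the abstract's finding)."""
    return k * median_absolute_deviation(values)


def flag_above(values, threshold):
    """Assumed rule: 1 if the metric exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical, skewed metric values for five websites and made-up genre labels.
metric = [1, 2, 3, 4, 100]
labels = [0, 0, 1, 1, 1]

threshold = mad_threshold(metric)          # 2 * MAD of the metric
predicted = flag_above(metric, threshold)  # binary split at the threshold
score = auc(labels, predicted)             # AUC of that split
```

The MAD is used here because, unlike the standard deviation, it is not dominated by the single extreme value in a skewed sample, which is consistent with the abstract's remark that normal-distribution statistics do not apply.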
Malhotra, R., & Sharma, A. (2019). An Empirical Study to Classify Website Using Thresholds from Data Characteristics. In Advances in Intelligent Systems and Computing (Vol. 904, pp. 433–446). Springer Verlag. https://doi.org/10.1007/978-981-13-5934-7_39