Optimized focused Web Crawler with Natural Language Processing based relevance measure in bioinformatics web sources

S. R. Mani Sekhar; G. M. Siddesh; Sunilkumar S. Manvi; K. G. Srinivasa

Journal Article (Open Access)

Cybernetics and Information Technologies (2019) 19(2) 146-158

DOI: 10.2478/cait-2019-0021

24 Citations

26 Readers

Abstract

With the rapid growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the vast amount of data available on the internet. Web crawlers face an indeterminate latency problem caused by differences in response time. The proposed work attempts to optimize the design and implementation of focused web crawlers using a master-slave architecture for bioinformatics web sources. Ideally, a focused crawler should crawl only relevant pages, but the relevance of a page can be estimated only after the genomics page has been crawled. The paper proposes a solution for predicting page relevance based on Natural Language Processing. The frequency of keywords in the top-ranked sentences of a page determines the relevance of that page within genomics sources. The proposed solution uses the TextRank algorithm to rank sentences while also ensuring correct classification of bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first search web crawler. The comparison shows a significant reduction in run time for the same harvest rate.
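
The relevance measure described in the abstract (rank sentences with TextRank, then count keyword occurrences in the top-ranked sentences) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption for illustration: the function names (`textrank`, `page_relevance`), the word-overlap sentence similarity, the damping factor of 0.85, the `top_k` cutoff, and the normalization of the keyword count by the number of words in the selected sentences. It is not the authors' implementation.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase alphanumeric tokens; no stemming or stop-word removal.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(a, b):
    # Word-overlap similarity in the style of TextRank (assumed variant):
    # shared distinct words divided by log-sizes of both sentences.
    sa, sb = set(a), set(b)
    if len(sa) < 2 or len(sb) < 2:
        return 0.0  # avoids log(1) + log(1) = 0 in the denominator
    return len(sa & sb) / (math.log(len(sa)) + math.log(len(sb)))


def textrank(sentences, damping=0.85, iterations=50):
    # Weighted PageRank over the sentence-similarity graph.
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            updated.append((1 - damping) + damping * rank)
        scores = updated
    return scores


def page_relevance(text, keywords, top_k=3):
    # Fraction of words in the top_k ranked sentences that are keywords
    # (hypothetical normalization; the paper may define this differently).
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    words = [w for i in top for w in tokenize(sentences[i])]
    if not words:
        return 0.0
    keyword_set = {k.lower() for k in keywords}
    return sum(1 for w in words if w in keyword_set) / len(words)
```

A crawler could call `page_relevance(page_text, ["gene", "genome", "protein"])` and follow a page's outgoing links only when the score exceeds a chosen threshold; both the keyword list and the threshold would need tuning for real genomics sources.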


Citation (APA)

Mani Sekhar, S. R., Siddesh, G. M., Manvi, S. S., & Srinivasa, K. G. (2019). Optimized focused Web Crawler with Natural Language Processing based relevance measure in bioinformatics web sources. Cybernetics and Information Technologies, 19(2), 146–158. https://doi.org/10.2478/cait-2019-0021

Readers' Seniority

PhD / Post grad / Masters / Doc: 6 (60%)

Lecturer / Post doc: 3 (30%)

Researcher: 1 (10%)

Readers' Discipline

Computer Science: 8 (80%)

Agricultural and Biological Sciences: 1 (10%)

Linguistics: 1 (10%)
