Optimized Focused Web Crawler with Natural Language Processing Based Relevance Measure in Bioinformatics Web Sources

Published:2019-06-01 Issue:2 Volume:19 Page:146-158
ISSN:1314-4081
Container-title:Cybernetics and Information Technologies
language:en

Author:

Mani Sekhar S. R.¹, Siddesh G. M.², Manvi Sunilkumar S.¹, Srinivasa K. G.³

Affiliation:

1. School of C & IT, Reva University, Bangalore, India

2. Department of Information Science & Engineering, Ramaiah Institute of Technology, Bangalore, India

3. National Institute of Technical Teachers Training and Research, Chandigarh, India

Abstract

In the fast-growing world of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the boundless data available on the internet. Web crawlers face an indeterminate latency problem due to differences in their response times. The proposed work attempts to optimize the design and implementation of focused web crawlers using a master-slave architecture for bioinformatics web sources. Focused crawlers should ideally crawl only relevant pages, but the relevance of a page can only be estimated after crawling the genomics pages. The paper proposes a solution for predicting page relevance that is based on Natural Language Processing. The frequency of the keywords in the top-ranked sentences of a page determines the relevance of pages within genomics sources. The proposed solution uses the TextRank algorithm to rank the sentences and to ensure the correct classification of bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first search web crawler. The comparison shows a significant reduction in run time for the same harvest rate.
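The relevance measure described in the abstract (rank the sentences with TextRank, then count keyword occurrences in the top-ranked sentences) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the naive sentence splitter, the `top_k` and `damping` parameters, and the normalisation of the keyword count by the number of tokens are all assumptions. The sentence similarity is the word-overlap measure from the TextRank paper by Mihalcea and Tarau.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(a, b):
    """Word overlap normalised by log sentence lengths (TextRank paper)."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(tokens)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def relevance(text, keywords, top_k=3):
    """Fraction of tokens in the top_k ranked sentences that are keywords.

    The normalisation is an assumption; the abstract only says that keyword
    frequency in the top-ranked sentences determines relevance.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    scores = textrank(sentences)
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top_tokens = [t for i in order[:top_k] for t in tokenize(sentences[i])]
    if not top_tokens:
        return 0.0
    keyset = {k.lower() for k in keywords}
    return sum(1 for t in top_tokens if t in keyset) / len(top_tokens)


if __name__ == "__main__":
    page = (
        "The genome sequence encodes every gene in the organism. "
        "Each gene in the genome is transcribed into a protein sequence."
    )
    print(relevance(page, ["gene", "genome", "protein", "sequence"]))
```

A crawler built on this sketch would compare the returned score against a threshold of its own choosing before following a page's links; the abstract does not state what threshold or keyword list the paper uses.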

Publisher

Walter de Gruyter GmbH

Subject

General Computer Science

Link

https://www.sciendo.com/pdf/10.2478/cait-2019-0021

References (17 articles)

1. Mihalcea, R., P. Tarau. TextRank: Bringing Order into Texts. University of North Texas, UNT Digital Library, 2004.

2. Wan, Y., H. Tong. URL Assignment Algorithm of Crawler in Distributed System Based on Hash. – IEEE International Conference on Networking, Sensing and Control (ICNSC’2008), 2008, pp. 1632-1635. DOI: 10.1109/ICNSC.2008.4525482

3. Jalilian, O., H. Khotanlou. A New Fuzzy-Based Method to Weigh the Related Concepts in Semantic Focused Web Crawlers. – IEEE, 2011, pp. 23-27. DOI: 10.1109/ICCRD.2011.5764237

4. Mejdl, S., A. Althagafi., Dunren. Improving Relevance Prediction for Focused Web Crawlers. – In: 11th International Conference on Computer and Information Science, IEEE/ACIS, 2012, pp. 161-166.

5. Pavani, K., G. P. Sajeev. A Novel Web Crawling Method for Vertical Search Engines. IEEE, 2017, pp. 1488-1493. DOI: 10.1109/ICACCI.2017.8126051

Cited by 18 articles.

1. Weakly supervised learning for an effective focused web crawler;Engineering Applications of Artificial Intelligence;2024-06

2. Developing a RBFN-Based Enhanced Web Crawler for Tamil Text Categorization;2024 5th International Conference for Emerging Technology (INCET);2024-05-24

3. A hunger-based scheduling strategy for distributed crawler;Expert Systems with Applications;2023-07

4. Crowd Control, Planning, and Prediction Using Sentiment Analysis: An Alert System for City Authorities;Applied Sciences;2023-01-26

5. A Topic-Specific Web Crawler using Deep Convolutional Networks;The International Arab Journal of Information Technology;2023