Affiliation:
1. Department of CSE & IT, Jaypee Institute of Information Technology, Noida, India
Abstract
Users routinely rely on search engines to find health information on websites. Our research considers content-rich health websites, because wrong information on these websites can threaten life. A search engine returns a list of URLs related to the search keyword, and users generally follow the top websites it displays. Newly constructed websites do not yet have ratings, hit counts, or reviews, so the search engine does not display them among its top results. As a result, a newly constructed website with the same content as the website displayed at the top of the results loses the user's trust. Another problem is that the Google search engine also displays phishing website URLs, which appear similar to those of genuine websites. To solve these problems and enhance users' trust in health websites that are not at the top of the search results, we propose an approach that extracts all URLs returned for a keyword and identifies the legitimate URLs among them using a machine learning classifier. Address-bar features, domain-name features, and HTML and JavaScript features were identified to build the dataset used for detecting legitimate URLs. Three classifiers (Decision Tree, Random Forest, and Support Vector Machine) were trained and evaluated. Decision Tree has the highest training accuracy (94.125), testing accuracy (92.75), and precision score (96.97). The cross-validation score of all three models is almost 93. Therefore, Decision Tree is used to identify legitimate websites. After the list of legitimate URLs is obtained, the content of each legitimate website is extracted. The semantic similarity between the content of the top-ranked legitimate website and the content of the other legitimate websites is computed using natural language processing techniques. The websites are then ranked by similarity, and a trust value is assigned, from highly trustable to less trustable.
We have compared and correlated our results with the Web of Trust, a reputation tool for trust analysis, and have achieved a positive correlation. Thus, our approach removes phishing websites and enhances the trust in other websites that are not at the top of the search engine.
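The similarity-based ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the semantic similarity technique used, so plain bag-of-words cosine similarity stands in for it here, and the function names, URLs, and page texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared_terms = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[term] * freq_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page has no measurable similarity
    return dot / (norm_a * norm_b)


def rank_by_similarity(reference_text, candidates):
    """Rank candidate pages by similarity to the reference page, highest first.

    `candidates` maps a URL to that page's extracted text.
    """
    scored = [
        (url, cosine_similarity(reference_text, text))
        for url, text in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: placeholder URLs and toy page text, not from the paper.
reference = "Diabetes symptoms include increased thirst and frequent urination."
pages = {
    "https://new-health.example/diabetes":
        "Symptoms of diabetes include frequent urination and increased thirst.",
    "https://other.example/fitness":
        "Regular exercise improves cardiovascular fitness.",
}
ranking = rank_by_similarity(reference, pages)
for url, score in ranking:
    print(f"{score:.2f}  {url}")
```

In this sketch, the page whose wording overlaps most with the top-ranked reference page is listed first, mirroring the ordering from highly trustable to less trustable. A real system would likely replace the word-count vectors with a representation that captures meaning rather than exact word overlap.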
Subject
Computational Theory and Mathematics, Computer Networks and Communications, Computer Science Applications, Theoretical Computer Science, Software
Cited by
1 article.