Abstract
This paper discusses the tool for the main text and image extraction (extracting and parsing the important data) from a web document. This paper describes our proposed algorithm based on the Document Object Model (DOM) and natural language processing (NLP) techniques and other approaches for extracting information from web pages using various classification techniques such as support vector machine, decision tree techniques, naive Bayes, and K-nearest neighbor. The main aim of the developed algorithm was to identify and extract the main block of a web document that contains the text of the article and the relevant images. The algorithm on a sample of 45 web documents of different types was applied. In addition, the issue of web pages, from the structure of the document to the use of the Document Object Model (DOM) for their processing, was analyzed. The Document Object Model was used to load and navigation of the document. It also plays an important role in the correct identification of the main block of web documents. The paper also discusses the levels of natural language. These methods of automatic natural language processing help to identify the main block of the web document. In this way, the all-textual parts and images from the main content of the web document were extracted. The experimental results show that our method achieved a final classification accuracy of 88.18%.
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science