Text and metadata extraction from scanned Arabic documents using support vector machines

Published:2020-10-15 Page:016555152096125
ISSN:0165-5515
Container-title:Journal of Information Science
language:en

Author:

Qin Wenda¹, Elanwar Randa², Betke Margrit¹

Affiliation:

1. Boston University, USA

2. Electronics Research Institute, Egypt

Abstract

Text information in scanned documents becomes accessible only when extracted and interpreted by a text recognizer. For a recognizer to work successfully, it must have detailed location information about the regions of the document images that it is asked to analyse. It needs to focus on page regions with text, skipping non-text regions that include illustrations or photographs. However, text recognizers do not work as logical analyzers. Logical layout analysis automatically determines the function of a document text region, that is, it labels each region as a title, paragraph, caption, and so on, and thus is an essential part of a document understanding system. In the past, rule-based algorithms have been used to conduct logical layout analysis, using data sets of limited size. Here, we instead focus on supervised learning methods for logical layout analysis. We describe LABA, a system based on multiple support vector machines to perform logical Layout Analysis of scanned Book pages in Arabic. The system detects the function of a text region based on the analysis of various image features and a voting mechanism. For a baseline comparison, we implemented an older but state-of-the-art neural network method. We evaluated LABA using a data set of scanned pages from illustrated Arabic books and obtained high recall and precision values. We also found that the F-measure of LABA is higher than that of the state-of-the-art method for five of the six tested classes.
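
The abstract mentions a voting mechanism over several classifiers and an evaluation by precision, recall, and F-measure per class. The following is a minimal sketch of both ideas, not the authors' implementation: it assumes a simple majority vote (the abstract does not say which voting rule LABA uses), and the function names, the three-classifier setup, and the toy labels are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label listed first.

    Assumption: plain majority voting. The voting rule used by LABA is
    not described in the abstract.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def per_class_scores(true_labels, predicted_labels):
    """Compute (precision, recall, F-measure) for every class."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        scores[cls] = (precision, recall, f_measure)
    return scores


# Hypothetical outputs of three classifiers for four text regions.
votes_per_region = [
    ["title", "title", "paragraph"],
    ["paragraph", "paragraph", "paragraph"],
    ["caption", "paragraph", "caption"],
    ["paragraph", "caption", "title"],  # three-way tie, first label wins
]
true_labels = ["title", "paragraph", "caption", "caption"]

predicted = [majority_vote(votes) for votes in votes_per_region]
scores = per_class_scores(true_labels, predicted)

for cls, (precision, recall, f_measure) in scores.items():
    print(f"{cls}: P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

In this toy data the fourth region is a three-way tie and falls back to the first vote, which lowers recall for the caption class and precision for the paragraph class. A real system would more likely break ties using classifier confidence scores.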

Funder

National Science Foundation

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems

Link

http://journals.sagepub.com/doi/pdf/10.1177/0165551520961256

Cited by 2 articles.

1. Semantic Document Layout Analysis of Handwritten Manuscripts;Computers, Materials & Continua;2023

2. A Review of Arabic Document Analysis Methods;2022 4th International Conference on Pattern Analysis and Intelligent Systems (PAIS);2022-10-12