
Finite State Automata on Multi-Word Units for Efficient Text-Mining

Published: 2024-02-06
Volume: 12, Issue: 4, Page: 506
ISSN: 2227-7390
Container-title: Mathematics
Language: en
Short-container-title: Mathematics

Author:

Postiglione Alberto¹

Affiliation:

1. Department of Business Science and Management & Innovation Systems, University of Salerno, Via San Giovanni Paolo II, 84084 Fisciano, Italy

Abstract

Text mining is crucial for analyzing unstructured and semi-structured textual documents. This paper introduces a fast and precise text mining method based on a finite automaton to extract knowledge domains. Unlike simple words, multi-word units (such as credit card) are emphasized for their efficiency in identifying specific semantic areas due to their predominantly monosemic nature, their limited number and their distinctiveness. The method focuses on identifying multi-word units within terminological ontologies, where each multi-word unit is associated with a sub-domain of ontology knowledge. The algorithm, designed to handle the challenges posed by very long multi-word units composed of a variable number of simple words, integrates user-selected ontologies into a single finite automaton during a fast pre-processing step. At runtime, the automaton reads input text character by character, efficiently locating multi-word units even if they overlap. This approach is efficient for both short and long documents, requiring no prior training. Ontologies can be updated without additional computational costs. An early system prototype, tested on 100 short and medium-length documents, recognized the knowledge domains for the vast majority of texts (over 90%) analyzed. The authors suggest that this method could be a valuable semantic-based knowledge domain extraction technique in unstructured documents.
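The abstract describes the general shape of the approach: ontologies are compiled into one automaton, which then scans the text character by character and reports multi-word units even when they overlap. The following is a minimal sketch of that idea, not the paper's algorithm. It assumes a classic Aho-Corasick construction, and the names `build_automaton` and `find_units`, the word-boundary check, and the toy ontologies are all hypothetical illustrations.

```python
from collections import deque

def build_automaton(ontologies):
    """Compile {sub_domain: [multi-word units]} into one Aho-Corasick automaton.

    Assumption: this is a textbook construction used for illustration only.
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # out[state] -> list of (unit, sub_domain) ending here
    for domain, units in ontologies.items():
        for unit in units:
            unit = unit.lower()
            state = 0
            for ch in unit:
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append((unit, domain))
    # Breadth-first pass: compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_units(text, automaton):
    """Scan text once, returning (start, unit, sub_domain) for every match.

    Overlapping matches are all reported. Matches must sit on word
    boundaries (a hypothetical rule added here, not taken from the paper).
    """
    goto, fail, out = automaton
    lowered = text.lower()  # assumes lowercasing keeps string length (true for ASCII)
    state = 0
    hits = []
    for i, ch in enumerate(lowered):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for unit, domain in out[state]:
            start = i - len(unit) + 1
            left_ok = start == 0 or not lowered[start - 1].isalnum()
            right_ok = i + 1 == len(lowered) or not lowered[i + 1].isalnum()
            if left_ok and right_ok:
                hits.append((start, unit, domain))
    return hits

# Toy ontologies (hypothetical): each unit is tied to one sub-domain.
ontologies = {
    "banking": ["credit card", "card holder"],
    "security": ["credit card fraud"],
}
automaton = build_automaton(ontologies)
hits = find_units("The credit card holder reported credit card fraud.", automaton)
for start, unit, domain in hits:
    print(start, unit, domain)
```

In this sketch the pre-processing cost is paid once in `build_automaton`, and the scan is a single left-to-right pass over the text. Counting the sub-domains in `hits` gives one rough way to guess a document's knowledge domain.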

Publisher

MDPI AG

Link

https://www.mdpi.com/2227-7390/12/4/506/pdf


Cited by 1 article.

1. Predictive Maintenance with Linguistic Text Mining;Mathematics;2024-04-04