Abstract
Historical topic modeling and semantic concept exploration in a large corpus of unstructured text remain a hard, open problem. Despite advances in natural language processing tools, statistical linguistic models, graph theory, and visualization, no framework combines these separate tools under one roof. We designed and built the Semantic Network Analysis Pipeline (SNAP), available as an open-source web service, which implements the workflow a data scientist needs to explore historical semantic concepts in a text corpus. We define a graph-theoretic notion of a semantic concept as a flow of closely related tokens through the corpus. The modular workflow pipeline processes text with natural language processing tools, narrows the content statistically, builds semantic networks by lexical token chaining, performs social network analysis of the token networks, and creates a 3D visualization of the semantic concept flows through the corpus for interactive concept exploration. Finally, we illustrate the framework’s utility by extracting information from three text corpora: Herman Melville’s novel Moby Dick, the transcript of the 2015–2016 United States (U.S.) Senate Hearings on Environment and Public Works, and the Australian Broadcasting Corporation’s short news articles on rural and science topics.
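The abstract does not say how token chaining or the network analysis is implemented, so the following is only a minimal Python sketch of the general idea, not SNAP's code. It assumes that tokens co-occurring within a small sliding window are chained into weighted edges, that weighted degree serves as a simple stand-in for the social network analysis step, and that a per-segment token count gives a crude trace of a concept's flow through the corpus. The stopword list, the window size, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "near", "without"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_token_network(segments, window=2):
    """Chain each token to the next `window` tokens in the same segment.

    Returns a Counter mapping an alphabetically sorted token pair to its
    co-occurrence count, i.e. an undirected weighted edge list.
    """
    edges = Counter()
    for segment in segments:
        tokens = tokenize(segment)
        for i, first in enumerate(tokens):
            for second in tokens[i + 1 : i + 1 + window]:
                if first != second:
                    edges[tuple(sorted((first, second)))] += 1
    return edges


def weighted_degree(edges):
    """Sum the edge weights incident to each token (a basic centrality score)."""
    degree = Counter()
    for (first, second), weight in edges.items():
        degree[first] += weight
        degree[second] += weight
    return degree


def concept_flow(segments, token):
    """Count a token in each successive segment to trace it through the corpus."""
    return [tokenize(segment).count(token) for segment in segments]


if __name__ == "__main__":
    corpus = [
        "The whale surfaced near the ship.",
        "The captain chased the whale.",
        "The ship returned without the captain.",
    ]
    network = build_token_network(corpus)
    print(weighted_degree(network).most_common(3))
    print(concept_flow(corpus, "whale"))
```

In this sketch the segments stand in for time-ordered slices of a corpus, such as chapters or hearing dates; a fuller treatment would use a proper NLP toolkit for tokenization and a graph library for centrality and community measures.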
Funder
National Science Foundation
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science