SystemT

Published: 2009-03-20 Issue: 4 Volume: 37 Page: 7-13
ISSN: 0163-5808
Container-title: ACM SIGMOD Record
Language: en
Short-container-title: SIGMOD Rec.

Author:

Krishnamurthy Rajasekar¹, Li Yunyao¹, Raghavan Sriram¹, Reiss Frederick¹, Vaithyanathan Shivakumar¹, Zhu Huaiyu¹

Affiliation:

1. IBM Almaden Research Center

Abstract

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in the area of information extraction (IE) -- the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to: (i) scale to large data sets, and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders of magnitude reduction in the running time of complex extraction tasks.

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/1519103.1519105

References: 13 articles.

1. The Common Pattern Specification Language

2. Managing information extraction

Cited by 72 articles.

1. Aligning Data with the Goals of an Organization and Its Workers: Designing Data Labeling for Social Service Case Notes;Proceedings of the CHI Conference on Human Factors in Computing Systems;2024-05-11

2. Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC;Proceedings of the ACM on Management of Data;2024-05-10

3. The Complexity of Aggregates over Extractions by Regular Expressions;Logical Methods in Computer Science;2023-08-09

4. Integrated Data Mapping Engine (DaME) for Financial Services;2022 IEEE International Conference on Big Data (Big Data);2022-12-17

5. Mapping of Financial Services datasets using Human-in-the-Loop;3rd ACM International Conference on AI in Finance;2022-10-26