SystemT

Authors:

Krishnamurthy Rajasekar (1), Li Yunyao (1), Raghavan Sriram (1), Reiss Frederick (1), Vaithyanathan Shivakumar (1), Zhu Huaiyu (1)

Affiliation:

1. IBM Almaden Research Center

Abstract

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in the area of information extraction (IE) -- the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to: (i) scale to large data sets, and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders-of-magnitude reductions in the running time of complex extraction tasks.

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems, Software

Cited by 72 articles.

1. Aligning Data with the Goals of an Organization and Its Workers: Designing Data Labeling for Social Service Case Notes; Proceedings of the CHI Conference on Human Factors in Computing Systems; 2024-05-11

2. Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC; Proceedings of the ACM on Management of Data; 2024-05-10

3. The Complexity of Aggregates over Extractions by Regular Expressions; Logical Methods in Computer Science; 2023-08-09

4. Integrated Data Mapping Engine (DaME) for Financial Services; 2022 IEEE International Conference on Big Data (Big Data); 2022-12-17

5. Mapping of Financial Services datasets using Human-in-the-Loop; 3rd ACM International Conference on AI in Finance; 2022-10-26
