Affiliation:
1. University of California at Davis
2. Microsoft Research Redmond
Abstract
Various document types that combine model and view (e.g., text files, webpages, spreadsheets) make it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present FlashExtract, a general framework for extracting relevant data from semi-structured documents using examples. It includes (a) an interaction model that allows end users to give examples to extract various fields and to relate them in a hierarchical organization using structure and sequence constructs, and (b) an inductive synthesis algorithm to synthesize the intended program from a few examples in any underlying domain-specific language for data extraction that has been built using our specified algebra of a few core operators (map, filter, merge, and pair). We describe instantiations of our framework for three different domains: text files, webpages, and spreadsheets. On our benchmark comprising 75 documents, FlashExtract is able to extract the intended data using an average of 2.36 examples in 0.84 seconds per field.
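To make the four core operators named in the abstract more concrete, the following is a minimal Python sketch of how map, filter, merge, and pair might compose into a hand-written extraction program over a text file. It is an illustration only: the function names (`map_op`, `filter_op`, `merge_op`, `pair_op`), the helper `value_of`, and the sample document are all assumptions made here, not the paper's DSL or API, and the sketch shows no synthesis from examples.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_op(f: Callable[[T], U], items: Iterable[T]) -> List[U]:
    """Apply f to every element of a sequence."""
    return [f(x) for x in items]


def filter_op(pred: Callable[[T], bool], items: Iterable[T]) -> List[T]:
    """Keep only the elements that satisfy pred."""
    return [x for x in items if pred(x)]


def merge_op(*seqs: Iterable[int]) -> List[int]:
    """Combine several sequences of positions into one ordered sequence."""
    return sorted({x for s in seqs for x in s})


def pair_op(a: T, b: U) -> Tuple[T, U]:
    """Group two extracted values into one record."""
    return (a, b)


# Hypothetical semi-structured text document.
document = "Name: Alice\nAge: 30\nName: Bob\nAge: 25"
lines = document.split("\n")


def value_of(line: str) -> str:
    """Return the text after the first ': ' separator (illustrative helper)."""
    return line.split(": ", 1)[1]


# filter: select the line indices that hold each kind of field.
name_idx = filter_op(lambda i: lines[i].startswith("Name:"), range(len(lines)))
age_idx = filter_op(lambda i: lines[i].startswith("Age:"), range(len(lines)))

# map: turn each selected line into the field value it contains.
names = map_op(lambda i: value_of(lines[i]), name_idx)

# merge: union of both position sequences, in document order.
field_idx = merge_op(name_idx, age_idx)

# map + pair: relate each name to the age on the following line.
records = map_op(
    lambda i: pair_op(value_of(lines[i]), value_of(lines[i + 1])), name_idx
)

print(names)
print(field_idx)
print(records)
```

In this sketch the extraction program is written by hand; in the framework described above, a program of this general shape would instead be synthesized from a few user-provided examples.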
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
48 articles.
1. Survey of intelligent program synthesis techniques;International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023);2023-12-07
2. Programming by Example Made Easy;ACM Transactions on Software Engineering and Methodology;2023-11-24
3. FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language;Proceedings of the VLDB Endowment;2023-11
4. DataRinse: Semantic Transforms for Data Preparation Based on Code Mining;Proceedings of the VLDB Endowment;2023-08
5. Cornet: Learning Spreadsheet Formatting Rules by Example;Proceedings of the VLDB Endowment;2023-08