Affiliation:
1. University of Michigan
2. University of Amsterdam
3. IIT Delhi
4. Instabase
Abstract
Blueprint is a declarative domain-specific language for document extraction. Users describe document layout using spatial, textual, semantic, and numerical fuzzy constraints, and the language runtime extracts the field-value mappings that best satisfy the constraints in a given document.
We used Blueprint to develop several document extraction solutions in a commercial setting. This approach to the extraction problem proved powerful: concise Blueprint programs achieved good accuracy on a broad set of use cases. However, a major goal of our work was to build a system that non-experts, and in particular non-engineers, could use effectively, and we found that writing declarative fuzzy constraint-based extraction programs was not intuitive for many users: a large up-front learning investment was required to be effective, and debugging was often challenging.
To address these issues, we developed a no-code IDE for Blueprint, called Studio, as well as program synthesis functionality for automatically generating Blueprint programs from training data, which could be created by labeling document samples in our IDE. Overall, the IDE significantly improved the Blueprint development experience and the results users were able to achieve.
In this paper, we discuss the design, implementation, and deployment of Blueprint and Studio. We compare our system with a state-of-the-art deep-learning-based extraction tool and show that our system can achieve comparable accuracy, with comparable development time, for appropriately chosen use cases, while providing better interpretability and debuggability.
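To make the idea of fuzzy constraint-based extraction concrete, the following is a minimal Python sketch. It is not Blueprint syntax and does not use the Blueprint API: the `Word` type, the scoring functions, and `extract_total` are hypothetical names introduced only for illustration. It assumes each candidate field-value assignment is scored by multiplying per-constraint scores in [0, 1], and that the highest-scoring assignment is returned.

```python
import re
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Word:
    """A token on the page with a (hypothetical) top-left position."""
    text: str
    x: float
    y: float


def is_total_label(word: Word) -> float:
    """Textual constraint: 1.0 if the word is the label 'Total', else 0.0."""
    return 1.0 if word.text.lower() == "total" else 0.0


def looks_like_amount(word: Word) -> float:
    """Numerical constraint: 1.0 if the word looks like a currency amount."""
    return 1.0 if re.fullmatch(r"\$?\d+(\.\d{2})?", word.text) else 0.0


def right_of(label: Word, value: Word, tolerance: float = 10.0) -> float:
    """Fuzzy spatial constraint: the value lies to the right of the label.

    The score decays linearly with vertical misalignment and reaches 0.0
    once the offset is at least `tolerance`.
    """
    if value.x <= label.x:
        return 0.0
    return max(0.0, 1.0 - abs(value.y - label.y) / tolerance)


def extract_total(words: list[Word]) -> Optional[tuple[str, float]]:
    """Return (value text, score) for the best assignment, or None."""
    best: Optional[tuple[str, float]] = None
    # Exhaustive search over (label, value) pairs; fine for a toy document.
    for label, value in product(words, repeat=2):
        score = (
            is_total_label(label)
            * looks_like_amount(value)
            * right_of(label, value)
        )
        if score > 0.0 and (best is None or score > best[1]):
            best = (value.text, score)
    return best


# Toy document: the amount next to "Total" is one unit off the label's line.
WORDS = [
    Word("Subtotal", 10, 80), Word("$40.00", 80, 80),
    Word("Total", 10, 100), Word("$42.00", 80, 101),
]
print(extract_total(WORDS))
```

Because the spatial constraint is fuzzy rather than exact, the slightly misaligned amount next to "Total" is still selected, while the amount on the "Subtotal" line scores zero. A real runtime would need a far more efficient search than this exhaustive pairing.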
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development