Affiliation:
1. University of Wisconsin-Madison
Abstract
Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, jointly with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: ecosystems of interoperable tools for multiple execution environments, such as on-premise, cloud, and mobile. This paper describes Magellan, focusing on the system aspects. We argue that EM can be viewed as a special class of DS problems and thus can benefit from system-building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher and CloudMatcher, which are, respectively, sophisticated on-premise tools for power users and self-service cloud tools for lay users. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. We discuss the lessons learned and explore applying the Magellan template to other tasks in data exploration, cleaning, and integration.
Publisher
Association for Computing Machinery (ACM)
Cited by
25 articles.