Affiliation:
1. University of Texas at Austin, USA
2. Microsoft Research, USA
Abstract
Data filtering in spreadsheets is a common problem faced by millions of end-users. The task of data filtering requires a computational model that can separate intended positive and negative string instances. We present a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string examples.
Our approach rests on two key ideas. First, we design an expressive DSL to represent the disjunctive filter expressions needed for several real-world data filtering tasks. Second, we develop an efficient synthesis algorithm that incrementally learns consistent filter expressions in the DSL from very few positive and negative examples. A DAG-based data structure succinctly represents a large number of filter expressions, and two corresponding operators, intersection and subtraction, are defined for handling positive and negative examples algorithmically. FIDEX learns data filters for 452 out of 460 real-world data filtering tasks in real time (0.22s), using only 2.2 positive string instances and 2.7 negative string instances on average.
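The role of the two operators can be illustrated with a minimal sketch. It is not the FIDEX algorithm or its DSL: it assumes a toy filter language with only three atomic predicates (`startswith`, `endswith`, `contains`), it enumerates candidates as an explicit Python set instead of a DAG, and it learns no disjunctions. It also assumes that intersection is the operator applied for positive examples and subtraction the one applied for negative examples. All function names are hypothetical.

```python
def candidates(s):
    """Return every predicate of the toy language that string s satisfies."""
    preds = set()
    n = len(s)
    for i in range(1, n + 1):
        preds.add(("startswith", s[:i]))     # every non-empty prefix
        preds.add(("endswith", s[n - i:]))   # every non-empty suffix
    for i in range(n):
        for j in range(i + 1, n + 1):
            preds.add(("contains", s[i:j]))  # every non-empty substring
    return preds


def matches(pred, s):
    """Evaluate one toy predicate on a string."""
    kind, token = pred
    if kind == "startswith":
        return s.startswith(token)
    if kind == "endswith":
        return s.endswith(token)
    return token in s


def learn(positives, negatives):
    """Return all toy predicates consistent with the examples.

    Assumes at least one positive example.
    """
    consistent = candidates(positives[0])
    # Intersection: keep only predicates that hold on every positive example.
    for p in positives[1:]:
        consistent &= candidates(p)
    # Subtraction: discard predicates that also hold on a negative example.
    for n in negatives:
        consistent -= candidates(n)
    return consistent


if __name__ == "__main__":
    filters = learn(["INV-001", "INV-207"], ["ORD-001"])
    print(sorted(filters))
```

Each new example only shrinks the candidate set, so the learner in this sketch is incremental in the same spirit as the abstract describes. A compact shared representation such as the DAG mentioned above would matter in practice because the explicit set grows quadratically with string length even in this small language.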
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
24 articles.
1. Data-Driven Insight Synthesis for Multi-Dimensional Data;Proceedings of the VLDB Endowment;2024-01
2. Search-Based Regular Expression Inference on a GPU;Proceedings of the ACM on Programming Languages;2023-06-06
3. Trace-Guided Inductive Synthesis of Recursive Functional Programs;Proceedings of the ACM on Programming Languages;2023-06-06
4. INTENT: Interactive Tensor Transformation Synthesis;The 35th Annual ACM Symposium on User Interface Software and Technology;2022-10-28
5. Spine: Scaling up Programming-by-Negative-Example for String Filtering and Transformation;Proceedings of the 2022 International Conference on Management of Data;2022-06-10