Abstract
AbstractType inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contains missing data and anomalies, which are found commonly in real-world data sets. In this paper, we propose ptype, a probabilistic robust type inference method that allows us to detect such entries, and infer data types. We further show that the proposed method outperforms existing methods.
Funder
Engineering and Physical Sciences Research Council
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications,Computer Science Applications,Information Systems
Reference36 articles.
1. Bahl L, Brown P, De Souza P, Mercer R (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: IEEE ICASSP’86, IEEE, vol 11, pp 49–52
2. Brown PF (1987) The acoustic-modeling problem in automatic speech recognition. Ph.D. Dissertation, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
3. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv (CSUR) 41(3):15
4. Dasu T, Johnson T (2003) Exploratory data mining and data cleaning: an overview. In: Exploratory data mining and data cleaning, vol 479, 1st edn, John Wiley & Sons, Inc., New York, NY, USA, chap 1, pp 1–16
5. Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
Cited by
5 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. AI Assistants: A Framework for Semi-Automated Data Wrangling;IEEE Transactions on Knowledge and Data Engineering;2022
2. Automating Data Science;Metalearning;2022
3. Spread2RML;Proceedings of the 11th on Knowledge Capture Conference;2021-12-02
4. Semi-automatic Column Type Inference for CSV Table Understanding;SOFSEM 2021: Theory and Practice of Computer Science;2021
5. Muppets: Multipurpose Table Segmentation;Advances in Intelligent Data Analysis XIX;2021