Eleven quick tips for data cleaning and feature engineering

Published: 2022-12-15 Issue: 12 Volume: 18 Page: e1010718
ISSN: 1553-7358
Container-title: PLOS Computational Biology
Language: en
Short-container-title: PLoS Comput Biol

Author:

Chicco Davide, Oneto Luca, Tavazzi Erica

Abstract

Applying computational statistics or machine learning methods to data is a key component of many scientific studies in any field, but on its own it might not be sufficient to generate robust and reliable results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call “feature” a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Although pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering, which explain how to carry out these important preprocessing steps correctly while avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore target these guidelines to any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.

Publisher

Public Library of Science (PLoS)

Subject

Computational Theory and Mathematics; Cellular and Molecular Neuroscience; Genetics; Molecular Biology; Ecology; Modeling and Simulation; Ecology, Evolution, Behavior and Systematics

References: 225 articles.

1. A few useful things to know about machine learning; P. Domingos; Commun ACM, 2012

2. Data cleaning: detecting, diagnosing, and editing data abnormalities; J Van den Broeck; PLoS Med, 2005

3. Best practices in data cleaning: a complete guide to everything you need to do before and after collecting your data; JW Osborne; Sage, 2013

Cited by 16 articles.

1. Machine Learning-Based Bleeding Risk Predictions in Atrial Fibrillation Patients on Direct Oral Anticoagulants;2024-05-27

2. Efficient management of pulmonary embolism diagnosis using a two-step interconnected machine learning model based on electronic health records data;Health Information Science and Systems;2024-03-06

3. Role of Artificial Intelligence in Multinomial Decisions and Preventative Nutrition in Alzheimer's Disease;Molecular Nutrition & Food Research;2024-01-04

4. CrossAAD: Cross-Chain Abnormal Account Detection;Lecture Notes in Computer Science;2024

5. A Theoretical framework for Harnessing Machine Learning for Digital Forensics in Online Social Networks;Lecture Notes in Networks and Systems;2024