Abstract
The first step for a data scientist addressing a business question is to identify the data type, because not all data types can be used in data mining analyses. The data scientist must therefore select a data type that matches the intended data mining technique and classify the data as categorical or continuous, regardless of its source. Quality control is also a significant concern, particularly when data collection was poorly designed or administered and produced problems such as missing values. Once a relevant dataset has been acquired, the data scientist should inspect each feature for outliers to confirm that the data is suitable for analysis. Outliers are commonly inspected through data visualizations such as scatter plots, and because the appropriate visualization depends on whether a feature is categorical or continuous, this practice underscores the importance of determining the data type.
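The workflow described above (classify each feature, check for missing values, then inspect outliers) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a method taken from the work itself: the toy dataset, the helper names `classify_feature`, `count_missing`, and `iqr_outliers`, and the choice of the 1.5 × IQR rule for flagging outliers are all hypothetical. Missing values are assumed to be encoded as `None`.

```python
from statistics import quantiles


def classify_feature(values):
    """Label a feature 'continuous' if all non-missing values are numeric,
    otherwise 'categorical' (a deliberately simple, assumed heuristic)."""
    present = [v for v in values if v is not None]
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "continuous" if present and is_numeric else "categorical"


def count_missing(values):
    """Count entries encoded as None (assumed missing-value marker)."""
    return sum(v is None for v in values)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR].
    The IQR rule is one common convention, chosen here for illustration."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


# Hypothetical toy dataset: one numeric feature, one label feature.
dataset = {
    "age": [23, 25, 27, 29, 31, None, 33, 95],
    "segment": ["a", "b", "a", None, "c", "b", "a", "c"],
}

report = {}
for name, values in dataset.items():
    kind = classify_feature(values)
    report[name] = {
        "type": kind,
        "missing": count_missing(values),
        # Outlier screening only makes sense for continuous features here.
        "outliers": iqr_outliers(values) if kind == "continuous" else [],
    }

for name, summary in report.items():
    print(name, summary)
```

In this sketch the data type is decided first because it determines which later checks apply: the IQR screen is run only on features classified as continuous, while categorical features are checked for missing values alone.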