Affiliation:
1. University of the District of Columbia, USA
Abstract
Data science and big data analytics remain at the center of computer science and information technology. Students and researchers outside computer science often find it difficult to perform real data analytics in programming languages such as Python and Scala, especially when they attempt to use Apache Spark in cloud computing environments through Spark Scala and PySpark. At the same time, students in information technology may find it difficult to deal with the mathematical background of data science algorithms. To overcome these difficulties, this chapter provides practical guidance for these different groups of users. The authors cover the main algorithms for data science and machine learning, including principal component analysis (PCA), support vector machines (SVM), k-means, k-nearest neighbors (kNN), regression, neural networks, and decision trees. Each algorithm is briefly described, and the related code is selected to fit both simple and real data sets. Some visualization methods, including 2D and 3D displays, are also presented in this chapter.