Introduction to data science. A Python approach to concepts, techniques and applications. With contributions from Jordi Vitrià, Eloi Puertas, Petia Radeva, Oriol Pujol, Sergio Escalera, Francesc Dantí and Lluís Garrido (Q523732)
scientific article
Statements
Introduction to data science. A Python approach to concepts, techniques and applications. With contributions from Jordi Vitrià, Eloi Puertas, Petia Radeva, Oriol Pujol, Sergio Escalera, Francesc Dantí and Lluís Garrido (English)
21 April 2017
The book ``Introduction to Data Science'' is a starter presentation of the concepts, techniques and approaches that form a first contact with data science for scientists from many fields (computer scientists, statisticians, mathematicians, biologists and bioinformaticians, journalists and sociologists). Following an interlinked set of steps (data collection, noise detection, identification of hypotheses, visualization, etc.) applied to practical examples and datasets, the authors guide the reader through the common tasks of data analysis. The book is structured in eleven chapters. It opens with an overview of its contents and a brief description of how they fit the four strategies for exploring data: probing reality, pattern discovery, predicting future events, and understanding people and the world. The second chapter introduces a set of toolboxes and basic commands in Python. The authors briefly present the libraries for numeric and scientific computation, the machine learning library scikit-learn and pandas (the Python Data Analysis Library). Approaches for handling data in Python (reading, selecting, plotting, etc.) are also discussed. The third chapter presents descriptive statistics, using as an example a dataset describing financial parameters of the US population. The topics include exploratory data analysis (summarization of data, analysis of distributions) and estimation of the mean, variance and standard scores. The fourth chapter presents statistical inference and hypothesis testing, using the frequentist approach. The fifth chapter focuses on supervised learning, with an emphasis on the two most frequently used models, support vector machines (SVMs) and random forests (RFs). The sixth chapter discusses regression analysis. The authors present both linear regression (simple, multiple and polynomial) and logistic regression.
Next, the seventh chapter introduces unsupervised learning and discusses the role of similarity measures and distances. The authors also examine a set of objective metrics for evaluating the quality of a clustering. The eighth chapter presents graph-based network analysis. The practical case of a Facebook dataset is discussed; the problem of drawing centrality in graphs is also presented, with PageRank used as an example. The ninth chapter introduces the concept of filtering using recommender systems (content-based, collaborative or hybrid). The tenth chapter introduces statistical natural language processing for sentiment analysis. Data cleaning and approaches for the representation of text are illustrated with examples. The book concludes with a chapter on parallel computing, focusing on the IPython parallel computing architecture, multicore programming and distributed computing. The concepts are illustrated with a New York taxi trip example. The style of the book makes it suitable for both undergraduates and postgraduates, and the concluding remarks and references provide guidance for the next steps in the study of particular topics. The book, albeit not exhaustive, offers a good introduction to the vast (and continuously expanding) domain of data science.
data science
Python
descriptive statistics
exploratory data analysis
statistical inference
hypothesis testing
supervised learning
regression analysis
unsupervised learning
clustering
network analysis
graphs
recommender systems
natural language processing
sentiment analysis
text representation
parallel computing
multicore programming
distributed computing