Veridical Data Science

The Practice of Responsible Data Analysis and Decision Making

About

Using real-world data case studies, this innovative and accessible textbook introduces an actionable framework for conducting trustworthy data science.

Most textbooks present data science as a linear analytic process built from a set of statistical and computational techniques, without accounting for the challenges intrinsic to real-world applications. Veridical Data Science, by contrast, embraces the reality that most projects begin with an ambiguous domain question and messy data; it acknowledges that datasets are mere approximations of reality and that analyses are mental constructs.

Bin Yu and Rebecca Barter employ the innovative Predictability, Computability, and Stability (PCS) framework to assess the trustworthiness and relevance of data-driven results relative to three sources of uncertainty that arise throughout the data science life cycle: the human decisions and judgment calls made during data collection, cleaning, and modeling. By providing real-world data case studies, intuitive explanations of common statistical and machine learning techniques, and supplementary R and Python code, Veridical Data Science offers a clear and actionable guide for conducting responsible data science. Requiring little background knowledge, this lucid, self-contained textbook provides a solid foundation and principled framework for future study of advanced methods in machine learning, statistics, and data science.

  • Presents the Predictability, Computability, and Stability (PCS) methodology for producing trustworthy data-driven results
  • Teaches how a data science project should be conducted from beginning to end, including extensive discussion of the data scientist's decision-making process
  • Cultivates critical thinking throughout the entire data science life cycle
  • Provides practical examples and illuminating case studies of real-world data analysis problems with associated code, exercises, and solutions
  • Suitable for advanced undergraduate and graduate students, domain scientists, and practitioners

Authors

Bin Yu is Chancellor's Distinguished Professor and Class of 1936 Second Chair in Statistics, EECS, and Computational Biology at the University of California, Berkeley, a 2006 Guggenheim Fellow, and a member of the US National Academy of Sciences and the American Academy of Arts and Sciences.

Rebecca L. Barter is Research Assistant Professor in Epidemiology at the University of Utah.

Table of Contents

Contents
Acknowledgments
Preface
PART I: AN INTRODUCTION TO VERIDICAL DATA SCIENCE
1 An introduction to veridical data science
2 The Data Science Life Cycle
3 Setting up your data science project
PART II: PREPARING, EXPLORING, AND DESCRIBING DATA
4 Data Preparation
5 Exploratory Data Analysis
6 Principal component analysis
7 Clustering
PART III: PREDICTION
8 An introduction to prediction problems
9 Predicting continuous responses with Least Squares
10 Extending the Least Squares algorithm
11 Predicting binary responses and logistic regression
12 Decision trees and random forest
13 Producing the final prediction results
14 Conclusion
Answers to True or False exercises