Learn the art and science of predictive analytics techniques that get results
Predictive analytics is what translates big data into meaningful, usable business information. Written by a leading expert in the field, this guide examines the science of the underlying algorithms as well as the principles and best practices that govern the art of predictive analytics. It clearly explains the theory behind predictive analytics, teaches the methods, principles, and techniques for conducting predictive analytics projects, and offers tips and tricks that are essential for successful predictive modeling. Hands-on examples and case studies are included.
The ability to successfully apply predictive analytics enables businesses to effectively interpret big data, which is essential for competition today
This guide teaches not only the principles of predictive analytics, but also how to apply them to achieve real, pragmatic solutions
Explains methods, principles, and techniques for conducting predictive analytics projects from start to finish
Illustrates each technique with hands-on examples and includes a series of in-depth case studies that apply predictive analytics to common business scenarios
A companion website provides all the data sets used to generate the examples as well as a free trial version of software
Applied Predictive Analytics arms data analysts, business analysts, and business managers with the tools they need to interpret and capitalize on big data.
DEAN ABBOTT is President of Abbott Analytics, Inc. (San Diego). He is an internationally recognized data mining and predictive analytics expert with over two decades of experience in fraud detection, risk modeling, text mining, personality assessment, planned giving, toxicology, and other applications. He is also Chief Scientist of SmarterRemarketer, a company focusing on behaviorally- and data-driven marketing and web analytics.
Introduction

Chapter 1 Overview of Predictive Analytics
What Is Analytics?
What Is Predictive Analytics?
Supervised vs. Unsupervised Learning
Parametric vs. Non-Parametric Models
Business Intelligence
Predictive Analytics vs. Business Intelligence
Do Predictive Models Just State the Obvious?
Similarities between Business Intelligence and Predictive Analytics
Predictive Analytics vs. Statistics
Statistics and Analytics
Predictive Analytics and Statistics Contrasted
Predictive Analytics vs. Data Mining
Who Uses Predictive Analytics?
Challenges in Using Predictive Analytics
Obstacles in Management
Obstacles with Data
Obstacles with Modeling
Obstacles in Deployment
What Educational Background Is Needed to Become a Predictive Modeler?

Chapter 2 Setting Up the Problem
Predictive Analytics Processing Steps: CRISP-DM
Business Understanding
The Three-Legged Stool
Business Objectives
Defining Data for Predictive Modeling
Defining the Columns as Measures
Defining the Unit of Analysis
Which Unit of Analysis?
Defining the Target Variable
Temporal Considerations for Target Variable
Defining Measures of Success for Predictive Models
Success Criteria for Classification
Success Criteria for Estimation
Other Customized Success Criteria
Doing Predictive Modeling Out of Order
Building Models First
Early Model Deployment
Case Study: Recovering Lapsed Donors
Overview
Business Objectives
Data for the Competition
The Target Variables
Modeling Objectives
Model Selection and Evaluation Criteria
Model Deployment
Case Study: Fraud Detection
Overview
Business Objectives
Data for the Project
The Target Variables
Modeling Objectives
Model Selection and Evaluation Criteria
Model Deployment
Summary

Chapter 3 Data Understanding
What the Data Looks Like
Single Variable Summaries
Mean
Standard Deviation
The Normal Distribution
Uniform Distribution
Applying Simple Statistics in Data Understanding
Skewness
Kurtosis
Rank-Ordered Statistics
Categorical Variable Assessment
Data Visualization in One Dimension
Histograms
Multiple Variable Summaries
Hidden Value in Variable Interactions: Simpson's Paradox
The Combinatorial Explosion of Interactions
Correlations
Spurious Correlations
Back to Correlations
Crosstabs
Data Visualization, Two or Higher Dimensions
Scatterplots
Anscombe's Quartet
Scatterplot Matrices
Overlaying the Target Variable in Summary
Scatterplots in More Than Two Dimensions
The Value of Statistical Significance
Pulling It All Together into a Data Audit
Summary

Chapter 4 Data Preparation
Variable Cleaning
Incorrect Values
Consistency in Data Formats
Outliers
Multidimensional Outliers
Missing Values
Fixing Missing Data
Feature Creation
Simple Variable Transformations
Fixing Skew
Binning Continuous Variables
Numeric Variable Scaling
Nominal Variable Transformation
Ordinal Variable Transformations
Date and Time Variable Features
ZIP Code Features
Which Version of a Variable Is Best?
Multidimensional Features
Variable Selection Prior to Modeling
Sampling
Example: Why Normalization Matters for K-Means Clustering
Summary

Chapter 5 Itemsets and Association Rules
Terminology
Condition
Left-Hand-Side, Antecedent(s)
Right-Hand-Side, Consequent, Output, Conclusion
Rule (Item Set)
Support
Antecedent Support
Confidence, Accuracy
Lift
Parameter Settings
How the Data Is Organized
Standard Predictive Modeling Data Format
Transactional Format
Measures of Interesting Rules
Deploying Association Rules
Variable Selection
Interaction Variable Creation
Problems with Association Rules
Redundant Rules
Too Many Rules
Too Few Rules
Building Classification Rules from Association Rules
Summary

Chapter 6 Descriptive Modeling
Data Preparation Issues with Descriptive Modeling
Principal Component Analysis
The PCA Algorithm
Applying PCA to New Data
PCA for Data Interpretation
Additional Considerations before Using PCA
The Effect of Variable Magnitude on PCA Models
Clustering Algorithms
The K-Means Algorithm
Data Preparation for K-Means
Selecting the Number of Clusters
The Kohonen SOM Algorithm
Visualizing Kohonen Maps
Similarities with K-Means
Summary

Chapter 7 Interpreting Descriptive Models
Standard Cluster Model Interpretation
Problems with Interpretation Methods
Identifying Key Variables in Forming Cluster Models
Cluster Prototypes
Cluster Outliers
Summary

Chapter 8 Predictive Modeling
Decision Trees
The Decision Tree Landscape
Building Decision Trees
Decision Tree Splitting Metrics
Decision Tree Knobs and Options
Reweighting Records: Priors
Reweighting Records: Misclassification Costs
Other Practical Considerations for Decision Trees
Logistic Regression
Interpreting Logistic Regression Models
Other Practical Considerations for Logistic Regression
Neural Networks
Building Blocks: The Neuron
Neural Network Training
The Flexibility of Neural Networks
Neural Network Settings
Neural Network Pruning
Interpreting Neural Networks
Neural Network Decision Boundaries
Other Practical Considerations for Neural Networks
K-Nearest Neighbor
The k-NN Learning Algorithm
Distance Metrics for k-NN
Other Practical Considerations for k-NN
Naive Bayes
Bayes' Theorem
The Naive Bayes Classifier
Interpreting Naive Bayes Classifiers
Other Practical Considerations for Naive Bayes
Regression Models
Linear Regression
Linear Regression Assumptions
Variable Selection in Linear Regression
Interpreting Linear Regression Models
Using Linear Regression for Classification
Other Regression Algorithms
Summary

Chapter 9 Assessing Predictive Models
Batch Approach to Model Assessment
Percent Correct Classification
Rank-Ordered Approach to Model Assessment
Assessing Regression Models
Summary

Chapter 10 Model Ensembles
Motivation for Ensembles
The Wisdom of Crowds
Bias Variance Tradeoff
Bagging
Boosting
Improvements to Bagging and Boosting
Random Forests
Stochastic Gradient Boosting
Heterogeneous Ensembles
Model Ensembles and Occam's Razor
Interpreting Model Ensembles
Summary

Chapter 11 Text Mining
Motivation for Text Mining
A Predictive Modeling Approach to Text Mining
Structured vs. Unstructured Data
Why Text Mining Is Hard
Text Mining Applications
Data Sources for Text Mining
Data Preparation Steps
POS Tagging
Tokens
Stop Word and Punctuation Filters
Character Length and Number Filters
Stemming
Dictionaries
The Sentiment Polarity Movie Data Set
Text Mining Features
Term Frequency
Inverse Document Frequency
TF-IDF
Cosine Similarity
Multi-Word Features: N-Grams
Reducing Keyword Features
Grouping Terms
Modeling with Text Mining Features
Regular Expressions
Uses of Regular Expressions in Text Mining
Summary

Chapter 12 Model Deployment
General Deployment Considerations
Deployment Steps
Summary

Chapter 13 Case Studies
Survey Analysis Case Study: Overview
Business Understanding: Defining the Problem
Data Understanding
Data Preparation
Modeling
Deployment: What-If Analysis
Revisit Models
Deployment
Summary and Conclusions
Help Desk Case Study
Data Understanding: Defining the Data
Data Preparation
Modeling
Revisit Business Understanding
Deployment
Summary and Conclusions

Index