Learn a simpler and more effective way to analyze data and predict outcomes with Python
Machine Learning in Python shows you how to successfully analyze data using only two core machine learning algorithms, and how to apply them using Python. By focusing on two algorithm families that effectively predict outcomes, this book provides full descriptions of the mechanisms at work, along with examples that illustrate the machinery with specific, hackable code. The algorithms are explained in simple terms with no complex math and applied using Python, with guidance on algorithm selection, data preparation, and using the trained models in practice. You will learn a core set of Python programming techniques, various methods of building predictive models, and how to measure the performance of each model to ensure that the right one is used. The chapters on penalized linear regression and ensemble methods dive deep into each of the algorithms, and you can use the sample code in the book to develop your own data analysis solutions.
Machine learning algorithms are at the core of data analytics and visualization. In the past, these methods required a deep background in math and statistics, often in combination with the specialized R programming language. This book demonstrates how machine learning can be implemented using the more widely used and accessible Python programming language.
Predict outcomes using linear and ensemble algorithm families
Build predictive models that solve a range of simple and complex problems
Apply core machine learning algorithms using Python
Use sample code directly to build custom solutions
Machine learning doesn't have to be complex and highly specialized. Python makes this technology accessible to a much wider audience, using methods that are simple, effective, and well tested. Machine Learning in Python shows you how to do this, without requiring an extensive background in math or statistics.
MICHAEL BOWLES teaches machine learning at Hacker Dojo in Silicon Valley, consults on machine learning projects, and is involved in a number of startups in such areas as bioinformatics and high-frequency trading. Following an assistant professorship at MIT, Michael went on to found and run two Silicon Valley startups, both of which went public. His courses at Hacker Dojo are nearly always sold out and receive great feedback from participants.
Introduction

Chapter 1 The Two Essential Algorithms for Making Predictions
    Why Are These Two Algorithms So Useful?
    What Are Penalized Regression Methods?
    What Are Ensemble Methods?
    How to Decide Which Algorithm to Use
    The Process Steps for Building a Predictive Model
    Framing a Machine Learning Problem
    Feature Extraction and Feature Engineering
    Determining Performance of a Trained Model
    Chapter Contents and Dependencies
    Summary

Chapter 2 Understand the Problem by Understanding the Data
    The Anatomy of a New Problem
    Different Types of Attributes and Labels Drive Modeling Choices
    Things to Notice about Your New Data Set
    Classification Problems: Detecting Unexploded Mines Using Sonar
    Physical Characteristics of the Rocks Versus Mines Data Set
    Statistical Summaries of the Rocks versus Mines Data Set
    Visualization of Outliers Using Quantile-Quantile Plot
    Statistical Characterization of Categorical Attributes
    How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set
    Visualizing Properties of the Rocks versus Mines Data Set
    Visualizing with Parallel Coordinates Plots
    Visualizing Interrelationships between Attributes and Labels
    Visualizing Attribute and Label Correlations Using a Heat Map
    Summarizing the Process for Understanding Rocks versus Mines Data Set
    Real-Valued Predictions with Factor Variables: How Old Is Your Abalone?
    Parallel Coordinates for Regression Problems: Visualize Variable Relationships for Abalone Problem
    How to Use Correlation Heat Map for Regression: Visualize Pair-Wise Correlations for the Abalone Problem
    Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes
    Multiclass Classification Problem: What Type of Glass Is That?
    Summary

Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data
    The Basic Problem: Understanding Function Approximation
    Working with Training Data
    Assessing Performance of Predictive Models
    Factors Driving Algorithm Choices and Performance: Complexity and Data
    Contrast Between a Simple Problem and a Complex Problem
    Contrast Between a Simple Model and a Complex Model
    Factors Driving Predictive Algorithm Performance
    Choosing an Algorithm: Linear or Nonlinear?
    Measuring the Performance of Predictive Models
    Performance Measures for Different Types of Problems
    Simulating Performance of Deployed Models
    Achieving Harmony Between Model and Data
    Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size
    Using Forward Stepwise Regression to Control Overfitting
    Evaluating and Understanding Your Predictive Model
    Control Overfitting by Penalizing Regression Coefficients: Ridge Regression
    Summary

Chapter 4 Penalized Linear Regression
    Why Penalized Linear Regression Methods Are So Useful
    Extremely Fast Coefficient Estimation
    Variable Importance Information
    Extremely Fast Evaluation When Deployed
    Reliable Performance
    Sparse Solutions
    Problem May Require Linear Model
    When to Use Ensemble Methods
    Penalized Linear Regression: Regulating Linear Regression for Optimum Performance
    Training Linear Models: Minimizing Errors and More
    Adding a Coefficient Penalty to the OLS Formulation
    Other Useful Coefficient Penalties: Manhattan and ElasticNet
    Why Lasso Penalty Leads to Sparse Coefficient Vectors
    ElasticNet Penalty Includes Both Lasso and Ridge
    Solving the Penalized Linear Regression Problem
    Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression
    How LARS Generates Hundreds of Models of Varying Complexity
    Choosing the Best Model from the Hundreds LARS Generates
    Using Glmnet: Very Fast and Very General
    Comparison of the Mechanics of Glmnet and LARS Algorithms
    Initializing and Iterating the Glmnet Algorithm
    Extensions to Linear Regression with Numeric Input
    Solving Classification Problems with Penalized Regression
    Working with Classification Problems Having More Than Two Outcomes
    Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems
    Incorporating Non-Numeric Attributes into Linear Methods
    Summary

Chapter 5 Building Predictive Models Using Penalized Linear Methods
    Python Packages for Penalized Linear Regression
    Multivariable Regression: Predicting Wine Taste
    Building and Testing a Model to Predict Wine Taste
    Training on the Whole Data Set before Deployment
    Basis Expansion: Improving Performance by Creating New Variables from Old Ones
    Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines
    Build a Rocks versus Mines Classifier for Deployment
    Multiclass Classification: Classifying Crime Scene Glass Samples
    Summary

Chapter 6 Ensemble Methods
    Binary Decision Trees
    How a Binary Decision Tree Generates Predictions
    How to Train a Binary Decision Tree
    Tree Training Equals Split Point Selection
    How Split Point Selection Affects Predictions
    Algorithm for Selecting Split Points
    Multivariable Tree Training: Which Attribute to Split?
    Recursive Splitting for More Tree Depth
    Overfitting Binary Trees
    Measuring Overfit with Binary Trees
    Balancing Binary Tree Complexity for Best Performance
    Modifications for Classification and Categorical Features
    Bootstrap Aggregation: Bagging
    How Does the Bagging Algorithm Work?
    Bagging Performance: Bias versus Variance
    How Bagging Behaves on Multivariable Problem
    Bagging Needs Tree Depth for Performance
    Summary of Bagging
    Gradient Boosting
    Basic Principle of Gradient Boosting Algorithm
    Parameter Settings for Gradient Boosting
    How Gradient Boosting Iterates Toward a Predictive Model
    Getting the Best Performance from Gradient Boosting
    Gradient Boosting on a Multivariable Problem
    Summary for Gradient Boosting
    Random Forest
    Random Forests: Bagging Plus Random Attribute Subsets
    Random Forests Performance Drivers
    Random Forests Summary
    Summary

Chapter 7 Building Ensemble Models with Python
    Solving Regression Problems with Python Ensemble Packages
    Building a Random Forest Model to Predict Wine Taste
    Constructing a Random Forest Regressor Object
    Modeling Wine Taste with Random Forest Regressor
    Visualizing the Performance of a Random Forests Regression Model
    Using Gradient Boosting to Predict Wine Taste
    Using the Class Constructor for Gradient Boosting Regressor
    Using Gradient Boosting Regressor to Implement a Regression Model
    Assessing the Performance of a Gradient Boosting Model
    Coding Bagging to Predict Wine Taste
    Incorporating Non-Numeric Attributes in Python Ensemble Models
    Coding the Sex of Abalone for Input to Random Forest Regression in Python
    Assessing Performance and the Importance of Coded Variables
    Coding the Sex of Abalone for Gradient Boosting Regression in Python
    Assessing Performance and the Importance of Coded Variables with Gradient Boosting
    Solving Binary Classification Problems with Python Ensemble Methods
    Detecting Unexploded Mines with Python Random Forest
    Constructing a Random Forests Model to Detect Unexploded Mines
    Determining the Performance of a Random Forests Classifier
    Detecting Unexploded Mines with Python Gradient Boosting
    Determining the Performance of a Gradient Boosting Classifier
    Solving Multiclass Classification Problems with Python Ensemble Methods
    Classifying Glass with Random Forests
    Dealing with Class Imbalances
    Classifying Glass Using Gradient Boosting
    Assessing the Advantage of Using Random Forest Base Learners with Gradient Boosting
    Comparing Algorithms
    Summary

Index