A visual approach to data mining.
Data mining has been defined as the search for useful and previously unknown patterns in large datasets. Yet when faced with the task of mining a large dataset, it is not always obvious where to start or how to proceed. This book introduces a visual methodology for data mining and demonstrates its application through a sequence of exercises using VisMiner. Developed by the author, VisMiner is a visual data mining tool that lets readers see the data they are working on and visually evaluate the models created from that data.
Key features:
* Presents visual support for all phases of data mining, including dataset preparation.
* Provides a comprehensive set of non-trivial datasets and problems with accompanying software.
* Features 3-D visualizations of multi-dimensional datasets.
* Gives support for spatial data analysis with GIS-like features.
* Describes data mining algorithms, with guidance on when and how to use them.
* Accompanied by VisMiner, a visual software tool for data mining, developed specifically to bridge the gap between theory and practice.
Visual Data Mining: The VisMiner Approach is designed as a hands-on workbook that introduces these methodologies to students in data mining, advanced statistics, and business intelligence courses. It provides a set of tutorials, exercises, and case studies that support students in learning data mining processes.
In praise of the VisMiner approach:
"What we discovered among students was that the visualization concepts and tools brought the analysis alive in a way that was broadly understood and could be used to make sound decisions with greater certainty about the outcomes."
Dr. James V. Hansen, J. Owen Cherrington Professor, Marriott School, Brigham Young University, USA
"Students learn best when they are able to visualize relationships between data and results during the data mining process. VisMiner is easy to learn and yet offers great visualization capabilities throughout the data mining process. My students liked it very much and so did I."
Dr. Douglas Dean, Assoc. Professor of Information Systems, Marriott School, Brigham Young University, USA
Preface
Acknowledgments
1. Introduction
    Data Mining Objectives
    Introduction to VisMiner
    The Data Mining Process
    Initial Data Exploration
    Dataset Preparation
    Algorithm Selection and Application
    Model Evaluation
    Summary
2. Initial Data Exploration and Dataset Preparation Using VisMiner
    The Rationale for Visualizations
    Tutorial Using VisMiner
    Initializing VisMiner
    Initializing the Slave Computers
    Opening a Dataset
    Viewing Summary Statistics
    Exercise 2.1
    The Correlation Matrix
    Exercise 2.2
    The Histogram
    The Scatter Plot
    Exercise 2.3
    The Parallel Coordinate Plot
    Exercise 2.4
    Extracting Sub-populations Using the Parallel Coordinate Plot
    Exercise 2.5
    The Table Viewer
    The Boundary Data Viewer
    Exercise 2.6
    The Boundary Data Viewer with Temporal Data
    Exercise 2.7
    Summary
3. Advanced Topics in Initial Exploration and Dataset Preparation Using VisMiner
    Missing Values
    Missing Values: An Example
    Exploration Using the Location Plot
    Exercise 3.1
    Dataset Preparation: Creating Computed Columns
    Exercise 3.2
    Aggregating Data for Observation Reduction
    Exercise 3.3
    Combining Datasets
    Exercise 3.4
    Outliers and Data Validation
    Range Checks
    Fixed Range Outliers
    Distribution Based Outliers
    Computed Checks
    Exercise 3.5
    Feasibility and Consistency Checks
    Data Correction Outside of VisMiner
    Distribution Consistency
    Pattern Checks
    A Pattern Check of Experimental Data
    Exercise 3.6
    Summary
4. Prediction Algorithms for Data Mining
    Decision Trees
    Stopping the Splitting Process
    A Decision Tree Example
    Using Decision Trees
    Decision Tree Advantages
    Limitations
    Artificial Neural Networks
    Overfitting the Model
    Moving Beyond Local Optima
    ANN Advantages and Limitations
    Support Vector Machines
    Data Transformations
    Moving Beyond Two-dimensional Predictors
    SVM Advantages and Limitations
    Summary
5. Classification Models in VisMiner
    Dataset Preparation
    Tutorial: Building and Evaluating Classification Models
    Model Evaluation
    Exercise 5.1
    Prediction Likelihoods
    Classification Model Performance
    Interpreting the ROC Curve
    Classification Ensembles
    Model Application
    Summary
    Exercise 5.2
    Exercise 5.3
6. Regression Analysis
    The Regression Model
    Correlation and Causation
    Algorithms for Regression Analysis
    Assessing Regression Model Performance
    Model Validity
    Looking Beyond R²
    Polynomial Regression
    Artificial Neural Networks for Regression Analysis
    Dataset Preparation
    Tutorial
    A Regression Model for Home Appraisal
    Modeling with the Right Set of Observations
    Exercise 6.1
    ANN Modeling
    The Advantage of ANN Regression
    Top-Down Attribute Selection
    Issues in Model Interpretation
    Model Validation
    Model Application
    Summary
7. Cluster Analysis
    Introduction
    Algorithms for Cluster Analysis
    Issues with K-Means Clustering Process
    Hierarchical Clustering
    Measures of Cluster and Clustering Quality
    Silhouette Coefficient
    Correlation Coefficient
    Self-Organizing Maps (SOM)
    Self-Organizing Maps in VisMiner
    Choosing the Grid Dimensions
    Advantages of a 3-D Grid
    Extracting Subsets from a Clustering
    Summary
Appendix A: VisMiner Reference by Task
Appendix B: VisMiner Task/Tool Matrix
Appendix C: IP Address Look-up
Index