Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS

By: Sam R. Alapati (author), Paperback

Usually despatched within 2 weeks

£29.59 (RRP £36.99). You save £7.40 (20%). With FREE Saver Delivery.

Description

The Comprehensive, Up-to-Date Apache Hadoop Administration Handbook and Reference

"Sam Alapati has worked with production Hadoop clusters for six years. His unique depth of experience has enabled him to write the go-to resource for all administrators looking to spec, size, expand, and secure production Hadoop clusters of any size."
- Paul Dix, Series Editor

In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples.

Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You'll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run.

  • Understand Hadoop's architecture from an administrator's standpoint
  • Create simple and fully distributed clusters
  • Run MapReduce and Spark applications in a Hadoop cluster
  • Manage and protect Hadoop data and high availability
  • Work with HDFS commands, file permissions, and storage management
  • Move data, and use YARN to allocate resources and schedule jobs
  • Manage job workflows with Oozie and Hue
  • Secure, monitor, log, and optimize Hadoop
  • Benchmark and troubleshoot Hadoop

About Author

Sam R. Alapati has been working with various aspects of the Hadoop environment for the past six years. He is currently the principal Hadoop administrator at Sabre Corporation in Westlake, Texas, and works on a daily basis with multiple large Hadoop 2 clusters. In addition to being the point person for all Hadoop administration at Sabre, Sam manages multiple critical data-science and data-analysis Hadoop job flows and is also an expert Oracle Database Administrator. His vast knowledge of relational databases and SQL contributes to his work with Hadoop-related projects. Sam's recognition in the database and middleware area includes having published 18 well-received books over the past 14 years, mostly on Oracle Database Administration and Oracle WebLogic Server. His experience dealing with numerous configuration, architectural, and performance-related Hadoop issues over the years led him to the realization that many working Hadoop administrators and developers would appreciate having a handy reference such as this book to turn to when creating, managing, securing, and optimizing their Hadoop infrastructure.

Contents

Foreword
Preface
Acknowledgments
About the Author

Part I: Introduction to Hadoop - Architecture and Hadoop Clusters

Chapter 1: Introduction to Hadoop and Its Environment
  • Hadoop - An Introduction
  • Cluster Computing and Hadoop Clusters
  • Hadoop Components and the Hadoop Ecosphere
  • What Do Hadoop Administrators Do?
  • Key Differences between Hadoop 1 and Hadoop 2
  • Distributed Data Processing: MapReduce and Spark, Hive and Pig
  • Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
  • Key Areas of Hadoop Administration
  • Summary

Chapter 2: An Introduction to the Architecture of Hadoop
  • Distributed Computing and Hadoop
  • Hadoop Architecture
  • Data Storage - The Hadoop Distributed File System
  • Data Processing with YARN, the Hadoop Operating System
  • Summary

Chapter 3: Creating and Configuring a Simple Hadoop Cluster
  • Hadoop Distributions and Installation Types
  • Setting Up a Pseudo-Distributed Hadoop Cluster
  • Performing the Initial Hadoop Configuration
  • Operating the New Hadoop Cluster
  • Summary

Chapter 4: Planning for and Creating a Fully Distributed Cluster
  • Planning Your Hadoop Cluster
  • Going from a Single Rack to Multiple Racks
  • Creating a Multinode Cluster
  • Modifying the Hadoop Configuration
  • Starting Up the Cluster
  • Configuring Hadoop Services, Web Interfaces and Ports
  • Summary

Part II: Hadoop Application Frameworks

Chapter 5: Running Applications in a Cluster - The MapReduce Framework (and Hive and Pig)
  • The MapReduce Framework
  • Apache Hive
  • Apache Pig
  • Summary

Chapter 6: Running Applications in a Cluster - The Spark Framework
  • What Is Spark?
  • Why Spark?
  • The Spark Stack
  • Installing Spark
  • Spark Run Modes
  • Understanding the Cluster Managers
  • Spark and Data Access
  • Summary

Chapter 7: Running Spark Applications
  • The Spark Programming Model
  • Spark Applications
  • Architecture of a Spark Application
  • Running Spark Applications Interactively
  • Creating and Submitting Spark Applications
  • Configuring Spark Applications
  • Monitoring Spark Applications
  • Handling Streaming Data with Spark Streaming
  • Using Spark SQL for Handling Structured Data
  • Summary

Part III: Managing and Protecting Hadoop Data and High Availability

Chapter 8: The Role of the NameNode and How HDFS Works
  • HDFS - The Interaction between the NameNode and the DataNodes
  • Rack Awareness and Topology
  • HDFS Data Replication
  • How Clients Read and Write HDFS Data
  • Understanding HDFS Recovery Processes
  • Centralized Cache Management in HDFS
  • Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
  • Summary

Chapter 9: HDFS Commands, HDFS Permissions and HDFS Storage
  • Managing HDFS through the HDFS Shell Commands
  • Using the dfsadmin Utility to Perform HDFS Operations
  • Managing HDFS Permissions and Users
  • Managing HDFS Storage
  • Rebalancing HDFS Data
  • Reclaiming HDFS Space
  • Summary

Chapter 10: Data Protection, File Formats and Accessing HDFS
  • Safeguarding Data
  • Data Compression
  • Hadoop File Formats
  • Using Hadoop WebHDFS and HttpFS
  • Summary

Chapter 11: NameNode Operations, High Availability and Federation
  • Understanding NameNode Operations
  • The Checkpointing Process
  • NameNode Safe Mode Operations
  • Configuring HDFS High Availability
  • HDFS Federation
  • Summary

Part IV: Moving Data, Allocating Resources, Scheduling Jobs and Security

Chapter 12: Moving Data Into and Out of Hadoop
  • Introduction to Hadoop Data Transfer Tools
  • Loading Data into HDFS from the Command Line
  • Copying HDFS Data between Clusters with DistCp
  • Ingesting Data from Relational Databases with Sqoop
  • Ingesting Data from External Sources with Flume
  • Ingesting Data with Kafka
  • Summary

Chapter 13: Resource Allocation in a Hadoop Cluster
  • Resource Allocation in Hadoop
  • The FIFO Scheduler
  • The Capacity Scheduler
  • The Fair Scheduler
  • Comparing the Capacity Scheduler and the Fair Scheduler
  • Summary

Chapter 14: Working with Oozie to Manage Job Workflows
  • Using Apache Oozie to Schedule Jobs
  • Oozie Architecture
  • Deploying Oozie in Your Cluster
  • Understanding Oozie Workflows
  • How Oozie Runs an Action
  • Creating an Oozie Workflow
  • Running an Oozie Workflow Job
  • Oozie Coordinators
  • Managing and Administering Oozie
  • Summary

Chapter 15: Securing Hadoop
  • Hadoop Security - An Overview
  • Hadoop Authentication with Kerberos
  • Hadoop Authorization
  • Auditing Hadoop
  • Securing Hadoop Data
  • Other Hadoop-Related Security Initiatives
  • Summary

Part V: Monitoring, Optimization and Troubleshooting

Chapter 16: Managing Jobs, Using Hue and Performing Routine Tasks
  • Using the YARN Commands to Manage Hadoop Jobs
  • Decommissioning and Recommissioning Nodes
  • ResourceManager High Availability
  • Performing Common Management Tasks
  • Managing the MySQL Database
  • Backing Up Important Cluster Data
  • Using Hue to Administer Your Cluster
  • Implementing Specialized HDFS Features
  • Summary

Chapter 17: Monitoring, Metrics and Hadoop Logging
  • Monitoring Linux Servers
  • Hadoop Metrics
  • Using Ganglia for Monitoring
  • Understanding Hadoop Logging
  • Using Hadoop's Web UIs for Monitoring
  • Monitoring Other Hadoop Components
  • Summary

Chapter 18: Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking
  • How to Allocate YARN Memory and CPU
  • Configuring Efficient Performance
  • Tuning Map and Reduce Tasks - What the Administrator Can Do
  • Optimizing Pig and Hive Jobs
  • Benchmarking Your Cluster
  • Hadoop Counters
  • Optimizing MapReduce
  • Summary

Chapter 19: Configuring and Tuning Apache Spark on YARN
  • Configuring Resource Allocation for Spark on YARN
  • Dynamic Resource Allocation when Running Spark on YARN
  • Storage Formats and Compressing Data
  • Monitoring Spark Applications
  • Tuning Garbage Collection
  • Tuning Spark Streaming Applications
  • Summary

Chapter 20: Optimizing Spark Applications
  • Revisiting the Spark Execution Model
  • Shuffle Operations and How to Minimize Them
  • Partitioning and Parallelism (Number of Tasks)
  • Optimizing Data Serialization and Compression
  • Understanding Spark's SQL Query Optimizer
  • Caching Data
  • Summary

Chapter 21: Troubleshooting Hadoop - A Sampler
  • Space-Related Issues
  • Handling YARN Jobs That Are Stuck
  • JVM Memory-Allocation and Garbage-Collection Strategies
  • Handling Different Types of Failures
  • Troubleshooting Spark Jobs
  • Debugging Spark Applications
  • Summary

Chapter 22: Installing VirtualBox and Linux and Cloning the Virtual Machines
  • Installing Oracle VirtualBox
  • Installing Oracle Enterprise Linux
  • Cloning the Linux Server

Index

Product Details

  • ISBN13: 9780134597195
  • Format: Paperback
  • Number Of Pages: 848
  • ID: 9780134597195
  • Weight: 1326
  • ISBN10: 0134597192

Delivery Information

  • Saver Delivery: Yes
  • 1st Class Delivery: Yes
  • Courier Delivery: Yes
  • Store Delivery: Yes

Prices are for internet purchases only. Prices and availability in WHSmith stores may vary significantly.
