Apache Spark in 24 Hours, Sams Teach Yourself

By: Jeffrey Aven (author), Paperback

In Stock

£23.09 RRP £32.99  You save £9.90 (30%) With FREE Saver Delivery

Description

Apache Spark is a fast, scalable, and flexible open source distributed processing engine for big data systems and is one of the most active open source big data projects to date. In just 24 lessons of one hour or less, Sams Teach Yourself Apache Spark in 24 Hours helps you build practical Big Data solutions that leverage Spark's amazing speed, scalability, simplicity, and versatility. This book's straightforward, step-by-step approach shows you how to deploy, program, optimize, manage, integrate, and extend Spark, now and for years to come. You'll discover how to create powerful solutions encompassing cloud computing, real-time stream processing, machine learning, and more. Every lesson builds on what you've already learned, giving you a rock-solid foundation for real-world success. Whether you are a data analyst, data engineer, data scientist, or data steward, learning Spark will help you to advance your career or embark on a new career in the booming area of Big Data.

Learn how to

  • Discover what Apache Spark does and how it fits into the Big Data landscape
  • Deploy and run Spark locally or in the cloud
  • Interact with Spark from the shell
  • Make the most of the Spark Cluster Architecture
  • Develop Spark applications with Scala and functional Python
  • Program with the Spark API, including transformations and actions
  • Apply practical data engineering/analysis approaches designed for Spark
  • Use Resilient Distributed Datasets (RDDs) for caching, persistence, and output
  • Optimize Spark solution performance
  • Use Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra)
  • Leverage cutting-edge functional programming techniques
  • Extend Spark with streaming, R, and Sparkling Water
  • Start building Spark-based machine learning and graph-processing applications
  • Explore advanced messaging technologies, including Kafka
  • Preview and prepare for Spark's next generation of innovations

Instructions walk you through common questions, issues, and tasks; Q-and-As, quizzes, and exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Spark to solve a wide spectrum of Big Data problems.

About Author

Jeffrey Aven is a big data consultant and instructor based in Melbourne, Australia. Jeff has an extensive background in data management and several years of experience consulting and teaching in the areas of Hadoop, HBase, Spark, and other big data ecosystem technologies. Jeff has won accolades as a big data instructor and is also an accomplished consultant who has been involved in several high-profile, enterprise-scale big data implementations across different industries in the region.

Contents

Preface

PART I: GETTING STARTED WITH APACHE SPARK

Hour 1: Introducing Apache Spark
  • What Is Spark?
  • What Sort of Applications Use Spark?
  • Programming Interfaces to Spark
  • Ways to Use Spark
  • Q&A
  • Workshop

Hour 2: Understanding Hadoop
  • Hadoop and a Brief History of Big Data
  • Hadoop Explained
  • Introducing HDFS
  • Introducing YARN
  • Anatomy of a Hadoop Cluster
  • How Spark Works with Hadoop
  • Q&A
  • Workshop

Hour 3: Installing Spark
  • Spark Deployment Modes
  • Preparing to Install Spark
  • Installing Spark in Standalone Mode
  • Exploring the Spark Install
  • Deploying Spark on Hadoop
  • Q&A
  • Workshop
  • Exercises

Hour 4: Understanding the Spark Application Architecture
  • Anatomy of a Spark Application
  • Spark Driver
  • Spark Executors and Workers
  • Spark Master and Cluster Manager
  • Spark Applications Running on YARN
  • Local Mode
  • Q&A
  • Workshop

Hour 5: Deploying Spark in the Cloud
  • Amazon Web Services Primer
  • Spark on EC2
  • Spark on EMR
  • Hosted Spark with Databricks
  • Q&A
  • Workshop

PART II: PROGRAMMING WITH APACHE SPARK

Hour 6: Learning the Basics of Spark Programming with RDDs
  • Introduction to RDDs
  • Loading Data into RDDs
  • Operations on RDDs
  • Types of RDDs
  • Q&A
  • Workshop

Hour 7: Understanding MapReduce Concepts
  • MapReduce History and Background
  • Records and Key Value Pairs
  • MapReduce Explained
  • Word Count: The "Hello, World" of MapReduce
  • Q&A
  • Workshop

Hour 8: Getting Started with Scala
  • Scala History and Background
  • Scala Basics
  • Object-Oriented Programming in Scala
  • Functional Programming in Scala
  • Spark Programming in Scala
  • Q&A
  • Workshop

Hour 9: Functional Programming with Python
  • Python Overview
  • Data Structures and Serialization in Python
  • Python Functional Programming Basics
  • Interactive Programming Using IPython
  • Q&A
  • Workshop

Hour 10: Working with the Spark API (Transformations and Actions)
  • RDDs and Data Sampling
  • Spark Transformations
  • Spark Actions
  • Key Value Pair Operations
  • Join Functions
  • Numerical RDD Operations
  • Q&A
  • Workshop

Hour 11: Using RDDs: Caching, Persistence, and Output
  • RDD Storage Levels
  • Caching, Persistence, and Checkpointing
  • Saving RDD Output
  • Introduction to Alluxio (Tachyon)
  • Q&A
  • Workshop

Hour 12: Advanced Spark Programming
  • Broadcast Variables
  • Accumulators
  • Partitioning and Repartitioning
  • Processing RDDs with External Programs
  • Q&A
  • Workshop

PART III: EXTENSIONS TO SPARK

Hour 13: Using SQL with Spark
  • Introduction to Spark SQL
  • Getting Started with Spark SQL DataFrames
  • Using Spark SQL DataFrames
  • Accessing Spark SQL
  • Q&A
  • Workshop

Hour 14: Stream Processing with Spark
  • Introduction to Spark Streaming
  • Using DStreams
  • State Operations
  • Sliding Window Operations
  • Q&A
  • Workshop

Hour 15: Getting Started with Spark and R
  • Introduction to R
  • Introducing SparkR
  • Using SparkR
  • Using SparkR with RStudio
  • Q&A
  • Workshop

Hour 16: Machine Learning with Spark
  • Introduction to Machine Learning and MLlib
  • Classification Using Spark MLlib
  • Collaborative Filtering Using Spark MLlib
  • Clustering Using Spark MLlib
  • Q&A
  • Workshop

Hour 17: Introducing Sparkling Water (H2O and Spark)
  • Introduction to H2O
  • Sparkling Water: H2O on Spark
  • Q&A
  • Workshop

Hour 18: Graph Processing with Spark
  • Introduction to Graphs
  • Graph Processing in Spark
  • Introduction to GraphFrames
  • Q&A
  • Workshop

Hour 19: Using Spark with NoSQL Systems
  • Introduction to NoSQL
  • Using Spark with HBase
  • Using Spark with Cassandra
  • Using Spark with DynamoDB and More
  • Q&A
  • Workshop

Hour 20: Using Spark with Messaging Systems
  • Overview of Messaging Systems
  • Using Spark with Apache Kafka
  • Spark, MQTT, and the Internet of Things
  • Using Spark with Amazon Kinesis
  • Q&A
  • Workshop

PART IV: MANAGING SPARK

Hour 21: Administering Spark
  • Spark Configuration
  • Administering Spark Standalone
  • Administering Spark on YARN
  • Q&A
  • Workshop

Hour 22: Monitoring Spark
  • Exploring the Spark Application UI
  • Spark History Server
  • Spark Metrics
  • Logging in Spark
  • Q&A
  • Workshop

Hour 23: Extending and Securing Spark
  • Isolating Spark
  • Securing Spark Communication
  • Securing Spark with Kerberos
  • Q&A
  • Workshop

Hour 24: Improving Spark Performance
  • Benchmarking Spark
  • Application Development Best Practices
  • Optimizing Partitions
  • Diagnosing Application Performance Issues
  • Q&A
  • Workshop

Index

Product Details

  • ISBN13: 9780672338519
  • Format: Paperback
  • Number Of Pages: 592
  • ID: 9780672338519
  • Weight: 918
  • ISBN10: 0672338513

Delivery Information

  • Saver Delivery: Yes
  • 1st Class Delivery: Yes
  • Courier Delivery: Yes
  • Store Delivery: Yes

Prices are for internet purchases only. Prices and availability in WHSmith stores may vary significantly.
