Apache Spark in 24 Hours, Sams Teach Yo... | WHSmith Books
Apache Spark in 24 Hours, Sams Teach Yourself


By: Jeffrey Aven (author)
Format: Paperback

In Stock

£22.43 RRP £32.99  You save £10.56 (32%) With FREE Saver Delivery


Apache Spark is a fast, scalable, and flexible open source distributed processing engine for big data systems and is one of the most active open source big data projects to date. In just 24 lessons of one hour or less, Sams Teach Yourself Apache Spark in 24 Hours helps you build practical Big Data solutions that leverage Spark's amazing speed, scalability, simplicity, and versatility.

This book's straightforward, step-by-step approach shows you how to deploy, program, optimize, manage, integrate, and extend Spark, now and for years to come. You'll discover how to create powerful solutions encompassing cloud computing, real-time stream processing, machine learning, and more. Every lesson builds on what you've already learned, giving you a rock-solid foundation for real-world success. Whether you are a data analyst, data engineer, data scientist, or data steward, learning Spark will help you to advance your career or embark on a new career in the booming area of Big Data.

Learn how to:

  • Discover what Apache Spark does and how it fits into the Big Data landscape
  • Deploy and run Spark locally or in the cloud
  • Interact with Spark from the shell
  • Make the most of the Spark Cluster Architecture
  • Develop Spark applications with Scala and functional Python
  • Program with the Spark API, including transformations and actions
  • Apply practical data engineering/analysis approaches designed for Spark
  • Use Resilient Distributed Datasets (RDDs) for caching, persistence, and output
  • Optimize Spark solution performance
  • Use Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra)
  • Leverage cutting-edge functional programming techniques
  • Extend Spark with streaming, R, and Sparkling Water
  • Start building Spark-based machine learning and graph-processing applications
  • Explore advanced messaging technologies, including Kafka
  • Preview and prepare for Spark's next generation of innovations

Instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Spark to solve a wide spectrum of Big Data problems.

About the Author

Jeffrey Aven is a big data consultant and instructor based in Melbourne, Australia. Jeff has an extensive background in data management and several years of experience consulting and teaching in the areas of Hadoop, HBase, Spark, and other big data ecosystem technologies. Jeff has won accolades as a big data instructor and is also an accomplished consultant who has been involved in several high-profile, enterprise-scale big data implementations across different industries in the region.


Preface

PART I: GETTING STARTED WITH APACHE SPARK

Hour 1: Introducing Apache Spark
  • What Is Spark?
  • What Sort of Applications Use Spark?
  • Programming Interfaces to Spark
  • Ways to Use Spark
  • Q&A
  • Workshop

Hour 2: Understanding Hadoop
  • Hadoop and a Brief History of Big Data
  • Hadoop Explained
  • Introducing HDFS
  • Introducing YARN
  • Anatomy of a Hadoop Cluster
  • How Spark Works with Hadoop
  • Q&A
  • Workshop

Hour 3: Installing Spark
  • Spark Deployment Modes
  • Preparing to Install Spark
  • Installing Spark in Standalone Mode
  • Exploring the Spark Install
  • Deploying Spark on Hadoop
  • Q&A
  • Workshop
  • Exercises

Hour 4: Understanding the Spark Application Architecture
  • Anatomy of a Spark Application
  • Spark Driver
  • Spark Executors and Workers
  • Spark Master and Cluster Manager
  • Spark Applications Running on YARN
  • Local Mode
  • Q&A
  • Workshop

Hour 5: Deploying Spark in the Cloud
  • Amazon Web Services Primer
  • Spark on EC2
  • Spark on EMR
  • Hosted Spark with Databricks
  • Q&A
  • Workshop

PART II: PROGRAMMING WITH APACHE SPARK

Hour 6: Learning the Basics of Spark Programming with RDDs
  • Introduction to RDDs
  • Loading Data into RDDs
  • Operations on RDDs
  • Types of RDDs
  • Q&A
  • Workshop

Hour 7: Understanding MapReduce Concepts
  • MapReduce History and Background
  • Records and Key Value Pairs
  • MapReduce Explained
  • Word Count: The "Hello, World" of MapReduce
  • Q&A
  • Workshop

Hour 8: Getting Started with Scala
  • Scala History and Background
  • Scala Basics
  • Object-Oriented Programming in Scala
  • Functional Programming in Scala
  • Spark Programming in Scala
  • Q&A
  • Workshop

Hour 9: Functional Programming with Python
  • Python Overview
  • Data Structures and Serialization in Python
  • Python Functional Programming Basics
  • Interactive Programming Using IPython
  • Q&A
  • Workshop

Hour 10: Working with the Spark API (Transformations and Actions)
  • RDDs and Data Sampling
  • Spark Transformations
  • Spark Actions
  • Key Value Pair Operations
  • Join Functions
  • Numerical RDD Operations
  • Q&A
  • Workshop

Hour 11: Using RDDs: Caching, Persistence, and Output
  • RDD Storage Levels
  • Caching, Persistence, and Checkpointing
  • Saving RDD Output
  • Introduction to Alluxio (Tachyon)
  • Q&A
  • Workshop

Hour 12: Advanced Spark Programming
  • Broadcast Variables
  • Accumulators
  • Partitioning and Repartitioning
  • Processing RDDs with External Programs
  • Q&A
  • Workshop

PART III: EXTENSIONS TO SPARK

Hour 13: Using SQL with Spark
  • Introduction to Spark SQL
  • Getting Started with Spark SQL DataFrames
  • Using Spark SQL DataFrames
  • Accessing Spark SQL
  • Q&A
  • Workshop

Hour 14: Stream Processing with Spark
  • Introduction to Spark Streaming
  • Using DStreams
  • State Operations
  • Sliding Window Operations
  • Q&A
  • Workshop

Hour 15: Getting Started with Spark and R
  • Introduction to R
  • Introducing SparkR
  • Using SparkR
  • Using SparkR with RStudio
  • Q&A
  • Workshop

Hour 16: Machine Learning with Spark
  • Introduction to Machine Learning and MLlib
  • Classification Using Spark MLlib
  • Collaborative Filtering Using Spark MLlib
  • Clustering Using Spark MLlib
  • Q&A
  • Workshop

Hour 17: Introducing Sparkling Water (H2O and Spark)
  • Introduction to H2O
  • Sparkling Water: H2O on Spark
  • Q&A
  • Workshop

Hour 18: Graph Processing with Spark
  • Introduction to Graphs
  • Graph Processing in Spark
  • Introduction to GraphFrames
  • Q&A
  • Workshop

Hour 19: Using Spark with NoSQL Systems
  • Introduction to NoSQL
  • Using Spark with HBase
  • Using Spark with Cassandra
  • Using Spark with DynamoDB and More
  • Q&A
  • Workshop

Hour 20: Using Spark with Messaging Systems
  • Overview of Messaging Systems
  • Using Spark with Apache Kafka
  • Spark, MQTT, and the Internet of Things
  • Using Spark with Amazon Kinesis
  • Q&A
  • Workshop

PART IV: MANAGING SPARK

Hour 21: Administering Spark
  • Spark Configuration
  • Administering Spark Standalone
  • Administering Spark on YARN
  • Q&A
  • Workshop

Hour 22: Monitoring Spark
  • Exploring the Spark Application UI
  • Spark History Server
  • Spark Metrics
  • Logging in Spark
  • Q&A
  • Workshop

Hour 23: Extending and Securing Spark
  • Isolating Spark
  • Securing Spark Communication
  • Securing Spark with Kerberos
  • Q&A
  • Workshop

Hour 24: Improving Spark Performance
  • Benchmarking Spark
  • Application Development Best Practices
  • Optimizing Partitions
  • Diagnosing Application Performance Issues
  • Q&A
  • Workshop

Index

Product Details

  • ISBN13: 9780672338519
  • Format: Paperback
  • Number Of Pages: 592
  • ID: 9780672338519
  • Weight: 918
  • ISBN10: 0672338513

Delivery Information

  • Saver Delivery: Yes
  • 1st Class Delivery: Yes
  • Courier Delivery: Yes
  • Store Delivery: Yes

Prices are for internet purchases only. Prices and availability in WHSmith stores may vary significantly.