Frank Kane's Taming Big Data with Apache Spark and Python: real-world examples to help you analyze large datasets with Apache Spark

What You Will Learn:
  • Find out how you can identify Big Data problems as Spark problems
  • Install and run Apache Spark on your computer or on a cluster
  • Analyze large data sets across many CPUs using Spark's Resilient Distributed Datasets
  • Implement machine learning on Spark using the MLlib library...
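To give a flavor of the RDD-based analysis the book teaches, here is a minimal sketch in the spirit of its ratings histogram example. It assumes a local Spark installation with PySpark available, and a MovieLens u.data file (tab-separated userID, movieID, rating, timestamp) in the working directory; the file name and app name are illustrative.

    from pyspark import SparkConf, SparkContext

    # Run Spark locally on a single machine; the app name is arbitrary.
    conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
    sc = SparkContext(conf=conf)

    # Each line of u.data holds: userID, movieID, rating, timestamp.
    lines = sc.textFile("u.data")

    # Extract the rating (third field) from every line.
    ratings = lines.map(lambda line: line.split()[2])

    # countByValue() is an action: it returns a dict of rating -> count.
    result = ratings.countByValue()

    # Sort by rating value and print the histogram.
    for rating, count in sorted(result.items()):
        print("%s %i" % (rating, count))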


Bibliographic Details
Main Author: Kane, Frank
Format: eBook
Language: English
Published: Birmingham, UK: Packt Publishing, 2017
Collection: O'Reilly (see MPG.ReNa for collection details)
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • About the Author
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Chapter 1: Getting Started with Spark
  • Getting set up - installing Python, a JDK, and Spark and its dependencies
  • Installing Enthought Canopy
  • Installing the Java Development Kit
  • Installing Spark
  • Running Spark code
  • Installing the MovieLens movie rating dataset
  • Run your first Spark program - the ratings histogram example
  • Examining the ratings counter script
  • Running the ratings counter script
  • Summary
  • Chapter 2: Spark Basics and Spark Examples
  • What is Spark?
  • Spark is scalable
  • Spark is fast
  • Spark is hot
  • Spark is not that hard
  • Components of Spark
  • Using Python with Spark
  • The Resilient Distributed Dataset (RDD)
  • What is the RDD?
  • The SparkContext object
  • Creating RDDs
  • Transforming RDDs
  • Map example
  • RDD actions
  • Ratings histogram walk-through
  • Understanding the code
  • Setting up the SparkContext object
  • Loading the data
  • Extract (MAP) the data we care about
  • Perform an action - count by value
  • Sort and display the results
  • Looking at the ratings-counter script in Canopy
  • Key/value RDDs and the average friends by age example
  • Key/value concepts - RDDs can hold key/value pairs
  • Creating a key/value RDD
  • What Spark can do with key/value data?
  • Mapping the values of a key/value RDD
  • The friends by age example
  • Parsing (mapping) the input data
  • Counting up the sum of friends and number of entries per age
  • Compute averages
  • Collect and display the results
  • Running the average friends by age example
  • Examining the script
  • Running the code
  • Filtering RDDs and the minimum temperature by location example
  • What is filter()
  • The source data for the minimum temperature by location example
  • Parse (map) the input data
  • Filter out all but the TMIN entries
  • Create (station ID, temperature) key/value pairs
  • Find minimum temperature by station ID
  • Collect and print results
  • Running the minimum temperature example and modifying it for maximums
  • Examining the min-temperatures script
  • Running the script
  • Running the maximum temperature by location example
  • Counting word occurrences using flatmap()
  • Map versus flatmap
  • Map ()
  • Flatmap ()
  • Code sample - count the words in a book
  • Improving the word-count script with regular expressions
  • Text normalization
  • Examining the use of regular expressions in the word-count script
  • Running the code
  • Sorting the word count results
  • Step 1 - Implement countByValue() the hard way to create a new RDD
  • Step 2 - Sort the new RDD
  • Examining the script
  • Running the code
  • Find the total amount spent by customer
  • Introducing the problem
  • Strategy for solving the problem
  • Useful snippets of code
  • Check your results and sort them by the total amount spent
  • Check your sorted implementation and results against mine
  • Summary
  • Chapter 3: Advanced Examples of Spark Programs
  • Finding the most popular movie
  • Examining the popular-movies script
  • Getting results
  • Using broadcast variables to display movie names instead of ID numbers
  • Introducing broadcast variables
  • Examining the popular-movies-nicer.py script
  • Getting results
  • Finding the most popular superhero in a social graph
  • Superhero social networks
  • Input data format
  • Strategy
  • Running the script - discover who the most popular superhero is
  • Mapping input data to (hero ID, number of co-occurrences) per line
  • Adding up co-occurrence by hero ID
  • Flipping the (map) RDD to (number, hero ID)
  • Using max() and looking up the name of the winner
  • Getting results
  • Superhero degrees of separation - introducing the breadth-first search algorithm
  • Degrees of separation
  • How the breadth-first search algorithm works?
  • The initial condition of our social graph
  • First pass through the graph
  • Second pass through the graph
  • Third pass through the graph
  • Final pass through the graph
  • Accumulators and implementing BFS in Spark
  • Convert the input file into structured data
  • Writing code to convert Marvel-Graph.txt to BFS nodes
  • Iteratively process the RDD
  • Using a mapper and a reducer
  • How do we know when we're done?
  • Superhero degrees of separation - review the code and run it
  • Setting up an accumulator and using the convert to BFS function
  • Calling flatMap()
  • Calling an action
  • Calling reduceByKey
  • Getting results
  • Item-based collaborative filtering in Spark, cache(), and persist()
  • How does item-based collaborative filtering work?
  • Making item-based collaborative filtering a Spark problem
  • It's getting real
  • Caching RDDs
  • Running the similar-movies script using Spark's cluster manager
  • Examining the script
  • Getting results
  • Improving the quality of the similar movies example
  • Summary
  • Chapter 4: Running Spark on a Cluster
  • Introducing Elastic MapReduce
  • Why use Elastic MapReduce?
  • Warning - Spark on EMR is not cheap
  • Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
  • Partitioning
  • Using .partitionBy()
  • Choosing a partition size
  • Creating similar movies from one million ratings - part 1
  • Changes to the script
  • Creating similar movies from one million ratings - part 2
  • Our strategy
  • Specifying memory per executor
  • Specifying a cluster manager
  • Running on a cluster
  • Setting up to run the movie-similarities-1m.py script on a cluster
  • Preparing the script
  • Creating a cluster
  • Connecting to the master node using SSH
  • Running the code
  • Creating similar movies from one million ratings - part 3
  • Assessing the results
  • Terminating the cluster
  • Troubleshooting Spark on a cluster
  • More troubleshooting and managing dependencies
  • Troubleshooting
  • Managing dependencies
  • Summary
  • Chapter 5: SparkSQL, DataFrames, and DataSets
  • Introducing SparkSQL
  • Using SparkSQL in Python
  • More things you can do with DataFrames
  • Differences between DataFrames and DataSets
  • Shell access in SparkSQL
  • User-defined functions (UDFs)
  • Executing SQL commands and SQL-style functions on a DataFrame
  • Using SQL-style functions instead of queries
  • Using DataFrames instead of RDDs
  • Summary
  • Chapter 6: Other Spark Technologies and Libraries
  • Introducing MLlib
  • MLlib capabilities
  • Special MLlib data types
  • For more information on machine learning
  • Making movie recommendations
  • Using MLlib to produce movie recommendations
  • Examining the movie-recommendations-als.py script
  • Analyzing the ALS recommendations results
  • Why did we get bad results?
  • Using DataFrames with MLlib
  • Examining the spark-linear-regression.py script
  • Getting results
  • Spark Streaming and GraphX
  • What is Spark Streaming?
  • GraphX
  • Summary
  • Chapter 7: Where to Go From Here? - Learning More About Spark and Data Science
  • Index