Spark in Action: Covers Apache Spark 3 with Examples in Java, Python, and Scala

Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you'll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you'll discover Java, Python, and Scala code s...

Bibliographic Details
Main Author: Perrin, Jean-Georges
Format: eBook
Language: English
Published: New York, Manning Publications Co. LLC, 2020
Series: ITpro collection
Collection: O'Reilly - Collection details see MPG.ReNa
Table of Contents:
  • 12.2.3 Writing the transformation code
  • 12.2.4 Reviewing your data transformation to ensure a quality process
  • What about sorting?
  • Wrapping up your first Spark transformation
  • 12.3 Joining datasets
  • 12.3.1 A closer look at the datasets to join
  • 12.3.2 Building the list of higher education institutions per county
  • Initialization of Spark
  • Loading and preparing the data
  • 12.3.3 Performing the joins
  • Joining the FIPS county identifier with the higher ed dataset using a join
  • Joining the census data to get the county name
  • 12.4 Performing more transformations
  • Summary
  • 13. Transforming entire documents
  • 13.1 Transforming entire documents and their structure
  • 13.1.1 Flattening your JSON document
  • 13.1.2 Building nested documents for transfer and storage
  • 13.2 The magic behind static functions
  • 13.3 Performing more transformations
  • Summary
  • 14. Extending transformations with user-defined functions
  • 14.1 Extending Apache Spark
  • 14.2 Registering and calling a UDF
  • 14.2.1 Registering the UDF with Spark
  • 14.2.2 Using the UDF with the dataframe API
  • 14.2.3 Manipulating UDFs with SQL
  • 14.2.4 Implementing the UDF
  • 14.2.5 Writing the service itself
  • 14.3 Using UDFs to ensure a high level of data quality
  • 14.4 Considering UDFs' constraints
  • Summary
  • 15. Aggregating your data
  • 15.1 Aggregating data with Spark
  • 15.1.1 A quick reminder on aggregations
  • 15.1.2 Performing basic aggregations with Spark
  • Performing an aggregation using the dataframe API
  • Performing an aggregation using Spark SQL
  • 15.2 Performing aggregations with live data
  • 15.2.1 Preparing your dataset
  • 15.2.2 Aggregating data to better understand the schools
  • What is the average enrollment for each school?
  • What is the evolution of the number of students?
  • What is the higher enrollment per school and year?
  • 6.2.2 Setting up the environment
  • 6.3 Building your application to run on the cluster
  • 6.3.1 Building your application's uber JAR
  • 6.3.2 Building your application by using Git and Maven
  • 6.4 Running your application on the cluster
  • 6.4.1 Submitting the uber JAR
  • 6.4.2 Running the application
  • 6.4.3 Analyzing the Spark user interface
  • Summary
  • Part 2. Ingestion
  • 7. Ingestion from files
  • 7.1 Common behaviors of parsers
  • 7.2 Complex ingestion from CSV
  • 7.2.1 Desired output
  • 7.2.2 Code
  • 7.3 Ingesting a CSV with a known schema
  • 7.3.1 Desired output
  • 7.3.2 Code
  • 7.4 Ingesting a JSON file
  • 7.4.1 Desired output
  • 7.4.2 Code
  • 7.5 Ingesting a multiline JSON file
  • 7.5.1 Desired output
  • 7.5.2 Code
  • 7.6 Ingesting an XML file
  • 7.6.1 Desired output
  • 7.6.2 Code
  • 7.7 Ingesting a text file
  • 7.7.1 Desired output
  • 7.7.2 Code
  • 7.8 File formats for big data
  • 7.8.1 The problem with traditional file formats
  • 7.8.2 Avro is a schema-based serialization format
  • 7.8.3 ORC is a columnar storage format
  • 7.8.4 Parquet is also a columnar storage format
  • 7.8.5 Comparing Avro, ORC, and Parquet
  • 7.9 Ingesting Avro, ORC, and Parquet files
  • 7.9.1 Ingesting Avro
  • 7.9.2 Ingesting ORC
  • 7.9.3 Ingesting Parquet
  • 7.9.4 Reference table for ingesting Avro, ORC, or Parquet
  • Summary
  • 8. Ingestion from databases
  • 8.1 Ingestion from relational databases
  • 8.1.1 Database connection checklist
  • 8.1.2 Understanding the data used in the examples
  • 8.1.3 Desired output
  • 8.1.4 Code
  • 8.1.5 Alternative code
  • 8.2 The role of the dialect
  • 8.2.1 What is a dialect, anyway?
  • 8.2.2 JDBC dialects provided with Spark
  • 8.2.3 Building your own dialect
  • 8.3 Advanced queries and ingestion
  • 8.3.1 Filtering by using a WHERE clause
  • 8.3.2 Joining data in the database
  • 8.3.3 Performing ingestion and partitioning
  • 8.3.4 Summary of advanced features
  • 8.4 Ingestion from Elasticsearch
  • 8.4.1 Data flow
  • 8.4.2 The New York restaurants dataset digested by Spark
  • 8.4.3 Code to ingest the restaurant dataset from Elasticsearch
  • Summary
  • 9. Advanced ingestion: finding data sources and building your own
  • 9.1 What is a data source?
  • 9.2 Benefits of a direct connection to a data source
  • 9.2.1 Temporary files
  • 9.2.2 Data quality scripts
  • 9.2.3 Data on demand
  • 9.3 Finding data sources at Spark Packages
  • 9.4 Building your own data source
  • 9.4.1 Scope of the example project
  • 9.4.2 Your data source API and options
  • 9.5 Behind the scenes: Building the data source itself
  • 9.6 Using the register file and the advertiser class
  • 9.7 Understanding the relationship between the data and schema
  • 9.7.1 The data source builds the relation
  • 9.7.2 Inside the relation
  • 9.8 Building the schema from a JavaBean
  • 9.9 Building the dataframe is magic with the utilities
  • 9.10 The other classes
  • Summary
  • 10. Ingestion through structured streaming
  • 10.1 What's streaming?
  • 10.2 Creating your first stream
  • 10.2.1 Generating a file stream
  • 10.2.2 Consuming the records
  • 10.2.3 Getting records, not lines
  • 10.3 Ingesting data from network streams
  • 10.4 Dealing with multiple streams
  • 10.5 Differentiating discretized and structured streaming
  • Summary
  • Part 3. Transforming your data
  • 11. Working with SQL
  • 11.1 Working with Spark SQL
  • 11.2 The difference between local and global views
  • 11.3 Mixing the dataframe API and Spark SQL
  • 11.4 Don't DELETE it!
  • 11.5 Going further with SQL
  • Summary
  • 12. Transforming your data
  • 12.1 What is data transformation?
  • 12.2 Process and example of record-level transformation
  • 12.2.1 Data discovery to understand the complexity
  • 12.2.2 Data mapping to draw the process
  • Includes bibliographical references
  • Intro
  • Copyright
  • brief contents
  • contents
  • front matter
  • foreword
  • The analytics operating system
  • preface
  • acknowledgments
  • about this book
  • Who should read this book
  • What will you learn in this book?
  • How this book is organized
  • About the code
  • liveBook discussion forum
  • about the author
  • about the cover illustration
  • Part 1. The theory crippled by awesome examples
  • 1. So, what is Spark, anyway?
  • 1.1 The big picture: What Spark is and what it does
  • 1.1.1 What is Spark?
  • 1.1.2 The four pillars of mana
  • 1.2 How can you use Spark?
  • 1.2.1 Spark in a data processing/engineering scenario
  • 1.2.2 Spark in a data science scenario
  • 1.3 What can you do with Spark?
  • 1.3.1 Spark predicts restaurant quality at NC eateries
  • 1.3.2 Spark allows fast data transfer for Lumeris
  • 1.3.3 Spark analyzes equipment logs for CERN
  • 1.3.4 Other use cases
  • 1.4 Why you will love the dataframe
  • 1.4.1 The dataframe from a Java perspective
  • 1.4.2 The dataframe from an RDBMS perspective
  • 1.4.3 A graphical representation of the dataframe
  • 1.5 Your first example
  • 1.5.1 Recommended software
  • 1.5.2 Downloading the code
  • 1.5.3 Running your first application
  • Command line
  • Eclipse
  • 1.5.4 Your first code
  • Summary
  • 2. Architecture and flow
  • 2.1 Building your mental model
  • 2.2 Using Java code to build your mental model
  • 2.3 Walking through your application
  • 2.3.1 Connecting to a master
  • 2.3.2 Loading, or ingesting, the CSV file
  • 2.3.3 Transforming your data
  • 2.3.4 Saving the work done in your dataframe to a database
  • Summary
  • 3. The majestic role of the dataframe
  • 3.1 The essential role of the dataframe in Spark
  • 3.1.1 Organization of a dataframe
  • 3.1.2 Immutability is not a swear word
  • 3.2 Using dataframes through examples
  • 3.2.1 A dataframe after a simple CSV ingestion