Applied Data Science Using PySpark: Learn the End-to-End Predictive Model-Building Cycle

Discover the capabilities of PySpark and its application in the realm of data science. This comprehensive guide, with hand-picked examples of daily use cases, walks you through the end-to-end predictive model-building cycle using the latest techniques and tricks of the trade.


Bibliographic Details
Main Author: Kakarla, Ramcharan
Other Authors: Krishnan, Sundar; Alla, Sridhar
Format: eBook
Language: English
Published: Berkeley, CA: Apress, 2021
Collection: O'Reilly (for collection details, see MPG.ReNa)
Table of Contents:
  • Intro
  • Table of Contents
  • About the Authors
  • About the Technical Reviewer
  • Acknowledgments
  • Foreword 1
  • Foreword 2
  • Foreword 3
  • Introduction
  • Chapter 1: Setting Up the PySpark Environment
  • Local Installation using Anaconda
  • Step 1: Install Anaconda
  • Step 2: Conda Environment Creation
  • Step 3: Download and Unpack Apache Spark
  • Step 4: Install Java 8 or Later
  • Step 5: Mac & Linux Users
  • Step 6: Windows Users
  • Step 7: Run PySpark
  • Step 8: Jupyter Notebook Extension
  • Docker-based Installation
  • Why Do We Need to Use Docker?
  • What Is Docker?
  • Create a Simple Docker Image
  • Download PySpark Docker
  • Step-by-Step Approach to Understanding the Docker PySpark run Command
  • Databricks Community Edition
  • Create Databricks Account
  • Create a New Cluster
  • Create Notebooks
  • How Do You Import Data Files into the Databricks Environment?
  • Basic Operations
  • Upload Data
  • Access Data
  • Calculate Pi
  • Summary
  • Chapter 2: PySpark Basics
  • PySpark Background
  • PySpark Resilient Distributed Datasets (RDDs) and DataFrames
  • Data Manipulations
  • Reading Data from a File
  • Reading Data from Hive Table
  • Reading Metadata
  • Dropping Duplicates
  • Counting Records
  • Subset Columns and View a Glimpse of the Data
  • Missing Values
  • One-Way Frequencies
  • Sorting and Filtering One-Way Frequencies
  • Casting Variables
  • Descriptive Statistics
  • Unique/Distinct Values and Counts
  • Filtering
  • Creating New Columns
  • Deleting and Renaming Columns
  • Summary
  • Chapter 3: Utility Functions and Visualizations
  • Additional Data Manipulations
  • String Functions
  • Registering DataFrames
  • Window Functions
  • Other Useful Functions
  • Collect List
  • Sampling
  • Caching and Persisting
  • Saving Data
  • Pandas Support
  • Joins
  • Data Visualizations
  • Introduction to Machine Learning
  • Summary
  • Chapter 4: Variable Selection
  • Exploratory Data Analysis
  • Cardinality
  • Missing Values
  • Missing at Random (MAR)
  • Missing Completely at Random (MCAR)
  • Missing Not at Random (MNAR)
  • Code 1: Cardinality Check
  • Code 2: Missing Values Check
  • Step 1: Identify Variable Types
  • Step 2: Apply StringIndexer to Character Columns
  • Step 3: Assemble Features
  • Built-in Variable Selection Process: Without Target
  • Principal Component Analysis
  • Mechanics
  • Singular Value Decomposition
  • Built-in Variable Selection Process: With Target
  • ChiSq Selector
  • Model-based Feature Selection
  • Custom-built Variable Selection Process
  • Information Value Using Weight of Evidence
  • Monotonic Binning Using Spearman Correlation
  • How Do You Calculate the Spearman Correlation by Hand?
  • How Is Spearman Correlation Used to Create Monotonic Bins for Continuous Variables?
  • Custom Transformers
  • Main Concepts in Pipelines
  • Voting-based Selection
  • Summary
  • Chapter 5: Supervised Learning Algorithms
  • Basics
  • Regression
  • Classification
  • Loss Functions
  • Optimizers