Mastering Spark with R: the complete guide to large-scale analysis and modeling

"Authors Javier Luraschi, Kevin Kuo, and Edgar Ruiz show you how to combine R with Spark to analyze data at scale. This book covers relevant data science topics, cluster computing, and issues that will interest even the most advanced users."--Back cover

Bibliographic Details
Main Authors: Luraschi, Javier; Kuo, Kevin (Author); Ruiz, Edgar (Author)
Format: eBook
Language: English
Published: Sebastopol, CA: O'Reilly Media, 2019
Edition: First edition
Collection: O'Reilly - Collection details see MPG.ReNa
Table of Contents:
  • Intro; Copyright; Table of Contents; Foreword; Preface; Formatting; Acknowledgments; Conventions Used in This Book; Using Code Examples; O'Reilly Online Learning; How to Contact Us; Chapter 1. Introduction; Overview; Hadoop; Spark; R; sparklyr; Recap; Chapter 2. Getting Started; Overview; Prerequisites; Installing sparklyr; Installing Spark; Connecting; Using Spark; Web Interface; Analysis; Modeling; Data; Extensions; Distributed R; Streaming; Logs; Disconnecting; Using RStudio; Resources; Recap; Chapter 3. Analysis; Overview; Import; Wrangle; Built-in Functions; Correlations; Visualize
  • Using ggplot2; Using dbplot; Model; Caching; Communicate; Recap; Chapter 4. Modeling; Overview; Exploratory Data Analysis; Feature Engineering; Supervised Learning; Generalized Linear Regression; Other Models; Unsupervised Learning; Data Preparation; Topic Modeling; Recap; Chapter 5. Pipelines; Overview; Creation; Use Cases; Hyperparameter Tuning; Operating Modes; Interoperability; Deployment; Batch Scoring; Real-Time Scoring; Recap; Chapter 6. Clusters; Overview; On-Premises; Managers; Distributions; Cloud; Amazon; Databricks; Google; IBM; Microsoft; Qubole; Kubernetes; Tools; RStudio; Jupyter
  • Livy; Recap; Chapter 7. Connections; Overview; Edge Nodes; Spark Home; Local; Standalone; YARN; YARN Client; YARN Cluster; Livy; Mesos; Kubernetes; Cloud; Batches; Tools; Multiple Connections; Troubleshooting; Logging; Spark Submit; Windows; Recap; Chapter 8. Data; Overview; Reading Data; Paths; Schema; Memory; Columns; Writing Data; Copying Data; File Formats; CSV; JSON; Parquet; Others; File Systems; Storage Systems; Hive; Cassandra; JDBC; Recap; Chapter 9. Tuning; Overview; Graph; Timeline; Configuring; Connect Settings; Submit Settings; Runtime Settings; sparklyr Settings; Partitioning
  • Implicit Partitions; Explicit Partitions; Caching; Checkpointing; Memory; Shuffling; Serialization; Configuration Files; Recap; Chapter 10. Extensions; Overview; H2O; Graphs; XGBoost; Deep Learning; Genomics; Spatial; Troubleshooting; Recap; Chapter 11. Distributed R; Overview; Use Cases; Custom Parsers; Partitioned Modeling; Grid Search; Web APIs; Simulations; Partitions; Grouping; Columns; Context; Functions; Packages; Cluster Requirements; Installing R; Apache Arrow; Troubleshooting; Worker Logs; Resolving Timeouts; Inspecting Partitions; Debugging Workers; Recap; Chapter 12. Streaming
  • Overview; Transformations; Analysis; Modeling; Pipelines; Distributed R; Kafka; Shiny; Recap; Chapter 13. Contributing; Overview; The Spark API; Spark Extensions; Using Scala Code; Recap; Appendix A. Supplemental Code References; Preface; Formatting; Chapter 1; The World's Capacity to Store Information; Daily Downloads of CRAN Packages; Chapter 2; Prerequisites; Chapter 3; Hive Functions; Chapter 4; MLlib Functions; Chapter 6; Google Trends for On-Premises (Mainframes), Cloud Computing, and Kubernetes; Chapter 12; Stream Generator; Installing Kafka; Index; About the Authors; Colophon
  • Includes bibliographical references and index