Delta Lake: Up and Running: Modern Data Lakehouse Architectures with Delta Lake

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADL...

Bibliographic Details
Main Authors: Haelen, Bennie; Davis, Dan (Author)
Format: eBook
Language: English
Published: Sebastopol, California: O'Reilly Media, Inc., 2023
Edition: First edition
Collection: O'Reilly - Collection details see MPG.ReNa
Table of Contents:
  • Intro
  • Copyright
  • Table of Contents
  • Preface
  • How to Contact Us
  • Conventions Used in This Book
  • Using Code Examples
  • O'Reilly Online Learning
  • Acknowledgment
  • Chapter 1. The Evolution of Data Architectures
  • A Brief History of Relational Databases
  • Data Warehouses
  • Data Warehouse Architecture
  • Dimensional Modeling
  • Data Warehouse Benefits and Challenges
  • Introducing Data Lakes
  • Data Lakehouse
  • Data Lakehouse Benefits
  • Implementing a Lakehouse
  • Delta Lake
  • The Medallion Architecture
  • The Delta Ecosystem
  • Delta Lake Storage
  • Delta Sharing
  • Delta Connectors
  • Conclusion
  • Chapter 2. Getting Started with Delta Lake
  • Getting a Standard Spark Image
  • Using Delta Lake with PySpark
  • Running Delta Lake in the Spark Scala Shell
  • Running Delta Lake on Databricks
  • Creating and Running a Spark Program: helloDeltaLake
  • The Delta Lake Format
  • Parquet Files
  • Writing a Delta Table
  • The Delta Lake Transaction Log
  • How the Transaction Log Implements Atomicity
  • Breaking Down Transactions into Atomic Commits
  • The Transaction Log at the File Level
  • Scaling Massive Metadata
  • Conclusion
  • Chapter 3. Basic Operations on Delta Tables
  • Creating a Delta Table
  • Creating a Delta Table with SQL DDL
  • The DESCRIBE Statement
  • Creating Delta Tables with the DataFrameWriter API
  • Creating a Delta Table with the DeltaTableBuilder API
  • Generated Columns
  • Reading a Delta Table
  • Reading a Delta Table with SQL
  • Reading a Table with PySpark
  • Writing to a Delta Table
  • Cleaning Out the YellowTaxis Table
  • Inserting Data with SQL INSERT
  • Appending a DataFrame to a Table
  • Using the OverWrite Mode When Writing to a Delta Table
  • Inserting Data with the SQL COPY INTO Command
  • Partitions
  • User-Defined Metadata
  • Using SparkSession to Set Custom Metadata
  • Using the DataFrameWriter to Set Custom Metadata
  • Conclusion
  • Chapter 4. Table Deletes, Updates, and Merges
  • Deleting Data from a Delta Table
  • Table Creation and DESCRIBE HISTORY
  • Performing the DELETE Operation
  • DELETE Performance Tuning Tips
  • Updating Data in a Table
  • Use Case Description
  • Updating Data in a Table
  • UPDATE Performance Tuning Tips
  • Upsert Data Using the MERGE Operation
  • Use Case Description
  • The MERGE Dataset
  • The MERGE Statement
  • Analyzing the MERGE operation with DESCRIBE HISTORY
  • Inner Workings of the MERGE Operation
  • Conclusion
  • Chapter 5. Performance Tuning
  • Data Skipping
  • Partitioning
  • Partitioning Warnings and Considerations
  • Compact Files
  • Compaction
  • OPTIMIZE
  • ZORDER BY
  • ZORDER BY Considerations
  • Liquid Clustering
  • Enabling Liquid Clustering
  • Operations on Clustered Columns
  • Liquid Clustering Warnings and Considerations
  • Conclusion
  • Chapter 6. Using Time Travel
  • Delta Lake Time Travel
  • Restoring a Table
  • Restoring via Timestamp
  • Time Travel Under the Hood