Table of Contents:
  • Manageability; Available distributions; Cloudera Distribution of Hadoop (CDH); Hortonworks Data Platform (HDP); MapR; Pivotal HD; Summary; Chapter 2: Advanced MapReduce; MapReduce input; The InputFormat class; The InputSplit class; The RecordReader class; Hadoop's "small files" problem; Filtering inputs; The Map task; The dfs.blocksize attribute; Sort and spill of intermediate outputs; Node-local Reducers or Combiners; Fetching intermediate outputs - Map-side; The Reduce task; Fetching intermediate outputs - Reduce-side; Merge and spill of intermediate outputs; MapReduce output
  • The Replicated join; Skewed joins; The Merge join; User-defined functions; The evaluation functions; The aggregate functions; The filter functions; The load functions; The store functions; Pig performance optimizations; The optimization rules; Measurement of Pig script performance; Combiners in Pig; Memory for the Bag data type; Number of reducers in Pig; The multiquery mode in Pig; Best practices; The explicit usage of types; Early and frequent projection; Early and frequent filtering; The usage of the LIMIT operator; The usage of the DISTINCT operator; The reduction of operations
  • Speculative execution of tasks; MapReduce job counters; Handling data joins; Reduce-side joins; Map-side joins; Summary; Chapter 3: Advanced Pig; Pig versus SQL; Different modes of execution; Complex data types in Pig; Compiling Pig scripts; The logical plan; The physical plan; The MapReduce plan; Development and debugging aids; The DESCRIBE command; The EXPLAIN command; The ILLUSTRATE command; The advanced Pig operators; The advanced FOREACH operator; The FLATTEN operator; The nested FOREACH operator; The COGROUP operator; The UNION operator; The CROSS operator; Specialized joins in Pig
  • The usage of Algebraic UDFs; The usage of Accumulator UDFs; Eliminating nulls in the data; The usage of specialized joins; Compressing intermediate results; Combining smaller files; Summary; Chapter 4: Advanced Hive; The Hive architecture; The Hive metastore; The Hive compiler; The Hive execution engine; The supporting components of Hive; Data types; File formats; Compressed files; ORC files; The Parquet files; The data model; Dynamic partitions; Semantics for dynamic partitioning; Indexes on Hive tables; Hive query optimizers; Advanced DML; The GROUP BY operation
  • Cover; Copyright; Credits; About the Author; Acknowledgments; About the Reviewers; www.PacktPub.com; Table of Contents; Preface; Chapter 1: Hadoop 2.X; The inception of Hadoop; The evolution of Hadoop; Hadoop's genealogy; Hadoop-0.20-append; Hadoop-0.20-security; Hadoop's timeline; Hadoop 2.X; Yet Another Resource Negotiator (YARN); Architecture overview; Storage layer enhancements; High availability; HDFS Federation; HDFS snapshots; Other enhancements; Support enhancements; Hadoop distributions; Which Hadoop distribution?; Performance; Scalability; Reliability