Site reliability engineering: How Google runs production systems

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world.

Bibliographic Details
Other Authors: Beyer, Betsy (Editor); Jones, Chris; Petoff, Jennifer; Murphy, Niall Richard
Format: eBook
Language: English
Published: Sebastopol, CA: O'Reilly Media, 2016
Collection: O'Reilly - Collection details see MPG.ReNa
Table of Contents:
  • Includes bibliographical references and index
  • Cover; Copyright; Table of Contents; Foreword; Preface; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments; Part I. Introduction; Chapter 1. Introduction; The Sysadmin Approach to Service Management; Google's Approach to Service Management: Site Reliability Engineering; Tenets of SRE; Ensuring a Durable Focus on Engineering; Pursuing Maximum Change Velocity Without Violating a Service's SLO; Monitoring; Emergency Response; Change Management; Demand Forecasting and Capacity Planning; Provisioning; Efficiency and Performance
  • The End of the Beginning; Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE; Hardware; System Software That "Organizes" the Hardware; Managing Machines; Storage; Networking; Other System Software; Lock Service; Monitoring and Alerting; Our Software Infrastructure; Our Development Environment; Shakespeare: A Sample Service; Life of a Request; Job and Data Organization; Part II. Principles; Chapter 3. Embracing Risk; Managing Risk; Measuring Service Risk; Risk Tolerance of Services; Identifying the Risk Tolerance of Consumer Services
  • Identifying the Risk Tolerance of Infrastructure Services; Motivation for Error Budgets (an early version of this section appeared as an article in ;login:, August 2015, vol. 40, no. 4); Forming Your Error Budget; Benefits; Chapter 4. Service Level Objectives; Service Level Terminology; Indicators; Objectives; Agreements; Indicators in Practice; What Do You and Your Users Care About?; Collecting Indicators; Aggregation; Standardize Indicators; Objectives in Practice; Defining Objectives; Choosing Targets; Control Measures; SLOs Set Expectations; Agreements in Practice
  • Chapter 5. Eliminating Toil; Toil Defined; Why Less Toil Is Better; What Qualifies as Engineering?; Is Toil Always Bad?; Conclusion; Chapter 6. Monitoring Distributed Systems; Definitions; Why Monitor?; Setting Reasonable Expectations for Monitoring; Symptoms Versus Causes; Black-Box Versus White-Box; The Four Golden Signals; Worrying About Your Tail (or, Instrumentation and Performance); Choosing an Appropriate Resolution for Measurements; As Simple as Possible, No Simpler; Tying These Principles Together; Monitoring for the Long Term; Bigtable SRE: A Tale of Over-Alerting
  • Gmail: Predictable, Scriptable Responses from Humans; The Long Run; Conclusion; Chapter 7. The Evolution of Automation at Google; The Value of Automation; Consistency; A Platform; Faster Repairs; Faster Action; Time Saving; The Value for Google SRE; The Use Cases for Automation; Google SRE's Use Cases for Automation; A Hierarchy of Automation Classes; Automate Yourself Out of a Job: Automate ALL the Things!; Soothing the Pain: Applying Automation to Cluster Turnups; Detecting Inconsistencies with Prodtest; Resolving Inconsistencies Idempotently; The Inclination to Specialize