Post-incident reviews: learning from failure for improved incident response


Bibliographic Details
Main Author: Hand, Jason
Format: eBook
Language: English
Published: Sebastopol, CA: O'Reilly Media, 2017
Edition: First edition
Collection: O'Reilly - Collection details see MPG.ReNa
Description
Summary: Anyone who works with technology knows that eventually something will go wrong--even in today's complex, distributed, and highly available IT systems. In the battle to maintain uninterrupted service, your DevOps teams require updated methods for detecting and solving problems fast. In this report, author Jason Hand explains that effective post-incident reviews today encourage team members to play a key role in continuously improving the system. Traditional techniques for conducting post-incident analyses don't work well in modern IT organizations, mainly because the command-and-control approach offers team members no incentive to explore the system and detect flaws when they occur. This report presents an up-to-date approach to post-incident reviews that embraces the human element and adds more eyes for discovering system flaws and potential improvements.

Understand why sustained success depends on a core value of continuous improvement.
Examine why traditional post-incident approaches, such as Root Cause Analysis, do little to provide greater availability and reliability of IT services.
Understand the role that team members can play in discovering system flaws.
Learn why it's often difficult to determine the cause and effect of outages in complex systems.
Get a case study that examines the unique phases of an outage incident.
Explore post-incident analysis in depth by moving away from causes and going deeper into the phases of the incident lifecycle.
Physical Description: 1 volume, illustrations