Google makes the book "Site Reliability Engineering" available online for free: a new approach to improving infrastructure and operations through software

January 30, 2017

"Site Reliability Engineering" (SRE) is an approach, advocated by Google senior VP Ben Treynor, to building and continuously improving system infrastructure that delivers high reliability and performance.

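What sets SRE apart from the operations and improvement work done by traditional operations or infrastructure teams is that SREs actively write code and pursue their goals through software.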

Google's SRE website describes SRE as follows.

Like traditional operations groups, we keep important, revenue-critical systems up and running despite hurricanes, bandwidth outages, and configuration errors. Unlike traditional operations groups, we view software as the primary tool through which our systems are managed, maintained, and minded;

Google has announced that the O'Reilly book "Site Reliability Engineering," which could be called the bible of SRE, is now available online for free.

Site Reliability Engineering book is now available free online: https://t.co/el1vFcm5q4
- SRE Book (@srebook), January 27, 2017

Google maintains a dedicated site that brings together information on SRE, and the free edition of "Site Reliability Engineering" is published on that site.

The book runs to more than 500 pages, and its entire contents are available online. The table of contents is as follows.

Part I - Introduction
Chapter 1 - Introduction
Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE
Part II - Principles
Chapter 3 - Embracing Risk
Chapter 4 - Service Level Objectives
Chapter 5 - Eliminating Toil
Chapter 6 - Monitoring Distributed Systems
Chapter 7 - The Evolution of Automation at Google
Chapter 8 - Release Engineering
Chapter 9 - Simplicity
Part III - Practices
Chapter 10 - Practical Alerting
Chapter 11 - Being On-Call
Chapter 12 - Effective Troubleshooting
Chapter 13 - Emergency Response
Chapter 14 - Managing Incidents
Chapter 15 - Postmortem Culture: Learning from Failure
Chapter 16 - Tracking Outages
Chapter 17 - Testing for Reliability
Chapter 18 - Software Engineering in SRE
Chapter 19 - Load Balancing at the Frontend
Chapter 20 - Load Balancing in the Datacenter
Chapter 21 - Handling Overload
Chapter 22 - Addressing Cascading Failures
Chapter 23 - Managing Critical State: Distributed Consensus for Reliability
Chapter 24 - Distributed Periodic Scheduling with Cron
Chapter 25 - Data Processing Pipelines
Chapter 26 - Data Integrity: What You Read Is What You Wrote
Chapter 27 - Reliable Product Launches at Scale
Part IV - Management
Chapter 28 - Accelerating SREs to On-Call and Beyond
Chapter 29 - Dealing with Interrupts
Chapter 30 - Embedding an SRE to Recover from Operational Overload
Chapter 31 - Communication and Collaboration in SRE
Chapter 32 - The Evolving SRE Engagement Model