Google App Engine loses half its capacity in a cascading failure that engulfed every datacenter; recovery required a full restart

October 29, 2012

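In an October 26 entry titled "About today's App Engine outage", Google's Google App Engine Blog reported on the App Engine outage that occurred that day. Google wrote that, since launching the High Replication Datastore in January 2011, App Engine had not experienced a system outage of this scale.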

Google App Engine Blog: About today's App Engine outage

The outage was a major one: for about four hours on October 26, from roughly 7:30 am to 11:30 am, about half of all App Engine requests failed. Google describes the sequence of events as follows.

Router load spreads to all datacenters

4:00 am - Load begins increasing on traffic routers in one of the App Engine datacenters.

Load begins to rise on the traffic routers in one of the App Engine datacenters.

6:10 am - The load on traffic routers in the affected datacenter passes our paging threshold.

The load on the traffic routers in the affected datacenter exceeds the paging threshold.

6:30 am - We begin a global restart of the traffic routers to address the load in the affected datacenter.

To relieve the load in the affected datacenter, a global restart of the traffic routers begins.

The post does not spell out exactly what this "global restart" involved, but the failure ultimately spread to every datacenter.

7:30 am - The global restart plus additional load unexpectedly reduces the count of healthy traffic routers below the minimum required for reliable operation. This causes overload in the remaining traffic routers, spreading to all App Engine datacenters. Applications begin consistently experiencing elevated error rates and latencies.

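The global restart, combined with additional load, unexpectedly reduces the number of healthy routers below the minimum needed for reliable operation. The remaining routers become overloaded, and the problem spreads to every App Engine datacenter. Applications begin to see persistently elevated error rates and latencies.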
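The mechanism in this step, where too few healthy routers leave the survivors overloaded, can be illustrated with a toy model. The following is only a minimal sketch and not Google's actual system: the function names, the load and capacity numbers, and the rule that one overloaded router drops out per step are all assumptions made for illustration.

```python
import math


def min_routers_needed(total_load, capacity_per_router):
    """Smallest router count that can carry total_load without overload."""
    return math.ceil(total_load / capacity_per_router)


def simulate_cascade(total_load, capacity_per_router, healthy_routers):
    """Return the healthy-router count at each step until the system settles.

    Assumption (hypothetical model): load is spread evenly, and whenever the
    per-router share exceeds capacity, one router fails and its share is
    redistributed over the remaining routers.
    """
    history = [healthy_routers]
    while healthy_routers > 0:
        load_per_router = total_load / healthy_routers
        if load_per_router <= capacity_per_router:
            break  # the remaining routers can carry the load
        healthy_routers -= 1  # one overloaded router drops out
        history.append(healthy_routers)
    return history
```

With a hypothetical load of 1000 units and 100 units of capacity per router, `simulate_cascade(1000, 100, 10)` stays at ten routers, while starting from nine (one below the minimum) the count drains all the way to zero, because every failure raises the load on the routers that remain.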

Deciding there is no option but a full restart

With the problem spread across all App Engine datacenters and the routers failing in a cascade, Google decided on a full restart.

8:28 am - [email protected] is updated with notification that we are aware of the incident and working to repair it.

[email protected] is updated with a notice that Google is aware of the incident and working to fix it.

Here again, Google does not explain what a "full restart" means, but given that the failure had spread to all App Engine datacenters, it presumably means restarting every App Engine datacenter.

11:10 am - We determine that App Engine’s traffic routers are trapped in a cascading failure, and that we have no option other than to perform a full restart with gradual traffic ramp-up to return to service.

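Google determines that the traffic routers are trapped in a cascading failure and that the only way to restore service is a full restart followed by a gradual traffic ramp-up.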

11:45 am - Traffic ramp-up completes, and App Engine returns to normal operation.

The traffic ramp-up completes, and App Engine returns to normal operation.
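The idea of a full restart followed by a gradual traffic ramp-up can be sketched as below. Google's post gives no details of the procedure, so this is purely an illustrative assumption: routers come back in batches, and admitted traffic is capped at a fixed percentage of the capacity that is currently healthy. The function name, parameters, and the 80 percent headroom figure are all hypothetical.

```python
def ramp_up(full_load, capacity_per_router, total_routers, batch_size,
            headroom_percent=80):
    """Return (healthy_routers, admitted_load) pairs for each ramp-up step.

    Assumption (hypothetical model): routers are restarted batch_size at a
    time, and admitted traffic never exceeds headroom_percent of the
    capacity of the routers that are already healthy.
    """
    healthy = 0
    schedule = []
    while healthy < total_routers:
        healthy = min(total_routers, healthy + batch_size)
        safe_load = healthy * capacity_per_router * headroom_percent // 100
        admitted = min(full_load, safe_load)  # never admit more than demand
        schedule.append((healthy, admitted))
    return schedule
```

Holding admitted traffic below the healthy capacity at every step is what keeps restarted routers from being knocked over again before the rest of the fleet has returned.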

To prevent a recurrence, Google says it will add routing capacity and change its configuration to make the system less susceptible to cascading failures.