Computing Reviews

The calculus of service availability
Sloss B., Dahlin M., Rau V., Beyer B. Queue 15(2): 49-67, 2017. Type: Article
Date Reviewed: 05/21/20

The authors give a brief introduction to service availability and to the service-level objectives (SLOs) that set targets for availability and responsiveness. The article makes many references to Google’s Site reliability engineering: how Google runs production systems, which is available online [1].

Google strives for 99.999 percent availability of its services. The availability table in Appendix A of Site reliability engineering [1] can be used as a reference to better understand how service availability is defined. For example, roughly three days of downtime in a year still leaves a service at about 99 percent availability.
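
The arithmetic behind such an availability table can be sketched in a few lines of Python. This is only an illustration: the function names and the 365-day (non-leap) year are assumptions made here, not definitions taken from the article or from the book.

```python
# Minimal sketch of availability arithmetic.
# Assumption: a "year" is 365 days (8,760 hours); names are illustrative.
HOURS_PER_YEAR = 365 * 24


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Return the fraction of the period during which the service was up."""
    return 1.0 - downtime_hours / period_hours


def downtime_budget_hours(target, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime per period allowed by an availability target."""
    return (1.0 - target) * period_hours


if __name__ == "__main__":
    # Three days of downtime in one year.
    print(f"3 days down: {availability(3 * 24):.3%} available")

    # Yearly downtime allowed by a few common targets.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_budget_hours(target) * 60
        print(f"{target:.3%} target: {minutes:,.1f} minutes of downtime per year")
```

Under these assumptions, a 99 percent target allows 87.6 hours (3.65 days) of downtime per year, while a 99.999 percent target allows only a little over five minutes.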

A personal example occurred over ten years ago, when the heating and air conditioning system in a server room failed on a Friday night. The servers in the room were monitored by Nagios, an open-source monitoring program. The temperature in the room exceeded 110 degrees Fahrenheit and the machines automatically shut down. The room had no provision for monitoring the ambient temperature, and nothing monitored the Nagios server that watched all the machines in the room. With three days of downtime, service availability for the year was still about 99 percent.

The authors point out the need to identify single points of failure (SPOFs), which was not done in my example above.

This article gives an excellent introduction to service availability concepts. For an in-depth discussion of how Google handles site reliability engineering, see [1].


[1] Beyer, B.; Jones, C.; Petoff, J.; Murphy, N. R. (Eds.) Site reliability engineering: how Google runs production systems. O’Reilly, Sebastopol, CA, 2016, https://landing.google.com/sre/books/.

Reviewer: W. E. Mihalo. Review #: CR146975 (2009-0226)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™