Computing Reviews
The calculus of service availability
Sloss B., Dahlin M., Rau V., Beyer B. Queue 15(2): 49-67, 2017. Type: Article
Date Reviewed: May 21 2020

The authors give a brief introduction to service availability and to the service-level objectives (SLOs) that set targets for availability and responsiveness. The article makes many references to Google’s Site reliability engineering: how Google runs production systems, which is available online [1].

Google strives for 99.999 percent availability of its services. The availability table in Appendix A of Site reliability engineering [1] can be used as a reference to better understand how service availability is defined. For example, three days of downtime in a year still amounts to roughly 99 percent availability, since 99 percent allows about 3.65 days of downtime per year.
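The arithmetic behind such a table is simple enough to sketch. The following is my own minimal illustration, not code from the article or the book; the function names are hypothetical, and it assumes a 365-day year with downtime measured in hours.

```python
# Illustrative sketch only: names and the 365-day year are assumptions.
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability (in percent) given total downtime within a period."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Downtime budget (in hours) that a given availability target permits."""
    return period_hours * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    # Three days of downtime in one year.
    print(f"{availability_percent(3 * 24):.2f} percent")  # 99.18 percent
    # Yearly downtime budget at 99 percent, expressed in days.
    print(f"{allowed_downtime_hours(99.0) / 24:.2f} days")  # 3.65 days
    # Yearly downtime budget at 99.999 percent, expressed in minutes.
    print(f"{allowed_downtime_hours(99.999) * 60:.2f} minutes")  # 5.26 minutes
```

Under these assumptions, 99.999 percent leaves a little over five minutes of downtime per year, which shows how far apart the two targets are.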

A personal example occurred over ten years ago, when the heating and air conditioning system in a server room failed on a Friday night. The servers in the room were monitored by Nagios, an open-source monitoring program. The temperature in the room exceeded 110 degrees Fahrenheit and the machines automatically shut down. The room had no provision for monitoring the ambient temperature and no provision for monitoring the Nagios server that monitored all the machines in the room. With three days of downtime, service availability for the year was still about 99 percent.

The authors point out the need to identify single points of failure (SPOFs), which was not done in my example above.

This article gives an excellent introduction to service availability concepts. For an in-depth discussion of how Google handles site reliability engineering, see [1].

Reviewer:  W. E. Mihalo Review #: CR146975 (2009-0226)
1) Beyer, B.; Jones, C.; Petoff, J.; Murphy, N. R. (Eds.) Site reliability engineering: how Google runs production systems. O’Reilly, Sebastopol, CA, 2016, https://landing.google.com/sre/books/.
Distributed Systems (D.4.7 ... )
 
 
System Architectures (C.0 ... )
 
 
General (D.2.0 )
 
 
General (C.0 )
 