Why do Internet services fail, and what can be done about it?

Abstract

In 1986 Jim Gray published his landmark study of the causes of failures of Tandem systems and the techniques Tandem used to prevent such failures [6]. Seventeen years later, Internet services have replaced fault-tolerant servers as the new kid on the 24x7-availability block. Using data from three large-scale Internet services, we analyzed the causes of their failures and the (potential) effectiveness of various techniques for preventing and mitigating service failure. We find that (1) operator error is the largest cause of failures in two of the three services, (2) operator error is the largest contributor to time to repair in two of the three services, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service. Qualitatively we find that improvement in the maintenance tools and systems used by service operations staff would decrease time to diagnose and repair problems.

Cite

Oppenheimer, D., Ganapathi, A., & Patterson, D. A. (2003). Why do Internet services fail, and what can be done about it? In 4th USENIX Symposium on Internet Technologies and Systems, USITS 2003. USENIX Association.
