Recovery from concurrent failures in communication protocols

  • Al-Saqabi K
  • Saleh K
  • Ahmad I

Abstract

Communication software must be reliable to meet the availability requirements of the services it supports. Failures in the software implementing a communication protocol may result in unpredictable costs and penalties. Reliability and recovery techniques for communication software have been proposed in the past, but almost all of them rely on intrusive checkpointing procedures that are invoked periodically. We present an efficient recovery algorithm that, when incorporated into the protocol design, yields a protocol that can tolerate multiple concurrent failures. The algorithm achieves fault tolerance in two ways: it can recover from multiple concurrent failures with minimum penalty, and it can recover from failures that occur during the recovery process itself. The algorithm is based on the concepts of event indices and maximally reachable event tuples. During normal operation of the protocol, the state information required for recovery is piggybacked on the normal protocol messages, so executing the protocol incurs no checkpointing overhead. When a failure occurs, the algorithm requires only minimal rollback. A detailed discussion and an example are presented.
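
The abstract does not give the algorithm itself, so the following is only a rough illustration of the general idea of piggybacking event indices on ordinary protocol messages. It is not the authors' method. The class `Process`, the message layout, and the helper `last_observed_index` are all hypothetical names chosen for this sketch, and the bookkeeping is a plain vector of per-process event counters rather than the paper's maximally reachable event tuples.

```python
# Illustrative sketch only: NOT the algorithm from the paper.
# Assumption: each process numbers its own events and attaches the highest
# event index it knows for every process to each outgoing message.

class Process:
    def __init__(self, pid, n):
        self.pid = pid          # this process's position in the index vector
        self.index = 0          # local event index (count of local events)
        self.known = [0] * n    # highest event index observed per process

    def _tick(self):
        # Every send or receive counts as one local event.
        self.index += 1
        self.known[self.pid] = self.index

    def send(self, payload):
        # Piggyback the index vector on the normal protocol message.
        self._tick()
        return {"src": self.pid, "payload": payload, "indices": list(self.known)}

    def receive(self, msg):
        # Merge the piggybacked indices into the local view.
        self._tick()
        self.known = [max(a, b) for a, b in zip(self.known, msg["indices"])]


def last_observed_index(failed_pid, survivors):
    """Highest event index of the failed process that any survivor has seen
    (hypothetical helper for this sketch)."""
    return max(p.known[failed_pid] for p in survivors)


if __name__ == "__main__":
    p0, p1, p2 = (Process(i, 3) for i in range(3))
    p1.receive(p0.send("a"))     # p0 event 1 reaches p1
    p2.receive(p1.send("b"))     # p1 forwards its view to p2
    p0.send("never delivered")   # p0 event 2 is seen by nobody
    # If p0 now fails, the survivors can tell it which event they last saw.
    print(last_observed_index(0, [p1, p2]))
```

In this toy setting, a restarted process learns from its peers which of its events were visible to others, without any separate checkpoint messages having been exchanged. How the paper turns such information into a minimal rollback for several concurrent failures is described in the full article, not here.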
