Communication software must be reliable if the services built on it are to meet their availability requirements. Failures in the software implementing a communication protocol may result in unpredictable costs and penalties. Reliability and recovery techniques for communication software have been proposed in the past, but almost all of them rely on intrusive checkpointing procedures that are invoked periodically. We present an efficient recovery algorithm that, when incorporated in the protocol design, yields a protocol that can tolerate multiple concurrent failures. The algorithm achieves fault tolerance in two ways: it recovers from multiple concurrent failures with minimum penalty, and it recovers from failures that occur during the recovery process itself. The algorithm is based on the concepts of event indices and maximally reachable event tuples. During normal operation of the protocol, the state information required for recovery is piggybacked on the normal protocol messages, so executing the protocol incurs no checkpointing overhead. In case of failure, the algorithm requires only minimal rollback. A detailed discussion and an example are presented.
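The idea of piggybacking recovery state on ordinary protocol messages can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: the names `Message`, `Process`, `last_seen`, and `recover_from` are hypothetical, the event index is assumed to be a simple per-process counter, and maximally reachable event tuples are not modeled at all. The sketch only shows how a process that lost its state might learn, from what a peer recorded during normal operation, the latest of its own events that the peer observed, without any periodic checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sender: str
    event_index: int  # assumption: sender's event counter, piggybacked on the message


@dataclass
class Process:
    name: str
    event_index: int = 0                            # count of local send/receive events
    last_seen: dict = field(default_factory=dict)   # peer name -> highest index received

    def send(self, payload: str) -> Message:
        # A send is a local event; its index travels with the message.
        self.event_index += 1
        return Message(payload, self.name, self.event_index)

    def receive(self, msg: Message) -> None:
        # A receive is a local event; remember the sender's piggybacked index.
        self.event_index += 1
        previous = self.last_seen.get(msg.sender, 0)
        self.last_seen[msg.sender] = max(previous, msg.event_index)

    def crash(self) -> None:
        # Simulate a failure that wipes all volatile state.
        self.event_index = 0
        self.last_seen = {}

    def recover_from(self, peer: "Process") -> None:
        # Hypothetical recovery step: resume at the latest of our events
        # that the surviving peer is known to have observed.
        self.event_index = peer.last_seen.get(self.name, 0)


a, b = Process("A"), Process("B")
b.receive(a.send("request"))
a.receive(b.send("response"))
b.receive(a.send("data"))

a.crash()
a.recover_from(b)
print(a.event_index)
```

In this toy exchange, B never takes a checkpoint on A's behalf; the information A needs after the crash accumulates in B as a side effect of the normal message flow. Any events A performed after its last message to B would be lost, which corresponds only loosely to the rollback the abstract mentions.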
Al-Saqabi, K., Saleh, K., & Ahmad, I. (1996). Recovery from concurrent failures in communication protocols. Journal of Systems and Software, 35(1), 55–65. https://doi.org/10.1016/0164-1212(95)00085-2