Fault-tolerant distributed computing (Q1188893)

From MaRDI portal
scientific article

    Statements

    Fault-tolerant distributed computing (English)
    17 September 1992
    [The articles of this volume will not be indexed individually.] Continued and correct operation in the presence of failures is required today of an increasing number of computing systems that are embedded in safety-critical, real-time applications. A great number of these applications are inherently distributed. Thus, they are best realized through autonomous, concurrently running components, possibly geographically separated (according to the application) and equipped with communication hardware. These components consist of processors, each with associated memory and input/output devices. Recent advances in hardware technology have made distributed systems a cost-effective alternative to centralized systems. To carry out complex applications on distributed systems, one must rely on cooperation among processors. This cooperation cannot be achieved by shared memory, as is usual in non-distributed concurrent systems. Communication between processors and random communication delays are characteristics of distributed systems. Low-speed communication restricts the flow of information around the network, leading to uncertainty about the global state of the system. When, in addition to this uncertainty, processors are allowed to fail and the system is required to continue to function properly despite failures, i.e. if the system is expected to be fault-tolerant, this task becomes even more difficult. This book handles fault tolerance issues in distributed systems, considering many aspects of and interactions between theory and practice. To achieve this goal, a workshop was held at Asilomar, California in the spring of 1986. Most of the 22 chapters of the present book were written following the workshop and subsequently revised before the book was eventually published in 1990. Some chapters are transcriptions of recorded presentations. Some tutorials, exposés and essays are also included. 
This heterogeneous structure does not make the book easy to read. It is, however, a comprehensive volume on fault tolerance and distributed computing, perhaps the first to recapitulate the state of the art on the subject. The chapters of the book are arranged in the order of presentations at the workshop, i.e. not necessarily reflecting any coordination or interaction between the chapters that the organizers may have intended. This lack of coordination between the chapters is sometimes confusing: e.g. \textit{Michael J. Fischer} introduces ``failstop'' as a faulty process ceasing execution and notifying the other processes of the ``fault'' (page 7), while \textit{Daniel P. Siewiorek} claims that a ``failed-stop'' sequence is caused by a failure (stopping all execution) and that the other processors are not notified of the ``failure'' (page 244). \textit{Danny Dolev} also introduces this event on page 45, this time written ``fail-stop'' and causing a notification. Such inconsistencies in notation and definition may puzzle the reader, especially the students the editors wish to include in their readership. The theoretically oriented chapters (mostly grouped in the first part) deal primarily with the reliable broadcast problem, which is eventually matched with the Byzantine generals problem. \textit{J. Gray} compares this problem with the transaction commitment problem with respect to the degree of agreement (some/all) and fault tolerance (time/errors). \textit{Vassos Hadzilacos} compares (in the third part of the book, which is dedicated to system descriptions) the atomic commitment problem with the consensus problem. \textit{Flaviu Cristian et al.} view the principal problem from a specific point of view: ``Atomic broadcast in a real-time environment'', classifying the failures in distributed systems and describing protocols to handle them. \textit{Fred B. Schneider} outlines the state machine approach, focusing on Byzantine and fail-stop failure models (as expected, reintroducing ``fail-stop''(!) on page 22). An important time-related problem, namely clock synchronization, is handled in two consecutive chapters: the first models the problem and its solutions, the second gives some implementation aspects. The second and third parts of the book bear the titles ``Systems session I'' and ``Systems session II'', respectively, apparently claiming to describe some existing systems and ongoing project work. However, several papers in these parts are primarily theoretically oriented, e.g. handling impossibility proofs for distributed problems (by \textit{Michael J. Fischer et al.}) or modeling replica control problems in data management (by \textit{Dale Skeen et al.}). \textit{Özalp Babaoǧlu} likewise does not describe any particular existing system, but precisely summarizes the central theme of the present book: the ``Engineering'' of fault-tolerant distributed computing systems. The systems described in the last two parts include Argus, a programming language and environment that supports the development and execution of distributed programs (by \textit{Barbara Liskov}), and TABS, a distributed transaction management facility implemented at Carnegie-Mellon University under the direction of \textit{Alfred Z. Spector}. Some commercially available systems are also covered, e.g. the August system (by \textit{John Wensley}), the Sequoia system (by \textit{Phil Bernstein}), and a fault-tolerant, distributed version of UNIX (by \textit{Anita Borg et al.}). A comprehensive bibliography, neatly grouped in 19 categories, concludes the book. I recommend this book to all with some knowledge of and interest in both fault tolerance and distributed computing. It is a proceedings volume, not an introductory book or a textbook for classroom teaching. 
The critical reader will benefit from this book, which has a distinct impact on the major directions of research and development on the subject. Contents: \textit{M. J. Fischer}, A theoretician's view of fault tolerant distributed computing (pp. 1-9); \textit{J. Gray}, A comparison of the Byzantine agreement problem and the transaction commit problem (pp. 10-17); \textit{F. B. Schneider}, The state machine approach: A tutorial (pp. 18-41); \textit{D. Dolev} and \textit{R. Strong}, A simple model for agreement in distributed systems (pp. 42-50); \textit{F. Cristian}, \textit{D. Dolev}, \textit{R. Strong} and \textit{H. Aghili}, Atomic broadcast in a real-time environment (pp. 51-71); \textit{M. Ben-Or}, Randomized agreement protocols (pp. 72-83); \textit{B. Simons}, \textit{J. L. Welch} and \textit{N. Lynch}, An overview of clock synchronization (pp. 84-96); \textit{M. Beck}, \textit{T. K. Srikanth} and \textit{S. Toueg}, Implementation issues in clock synchronization (pp. 97-107); \textit{B. Liskov}, Argus (pp. 108-114); \textit{A. Z. Spector}, TABS (pp. 115-123); \textit{K. P. Birman} and \textit{T. A. Joseph}, Communication support for reliable distributed computing (pp. 124-137); \textit{S. J. Finkelstein}, Algorithms and system design in the highly available systems project (pp. 138-146); \textit{M. J. Fischer}, \textit{N. A. Lynch} and \textit{M. Merritt}, Easy impossibility proofs for distributed consensus problems (pp. 147-170); \textit{D. Skeen}, \textit{A. El Abbadi} and \textit{F. Cristian}, An efficient, fault-tolerant protocol for replicated data management (pp. 171-191); \textit{S. Cohn}, Arpanet routing (pp. 192-200); \textit{V. Hadzilacos}, On the relationship between the atomic commitment and consensus problems (pp. 201-208); \textit{J. Wensley}, The August system (pp. 209-216); \textit{P. Bernstein}, The Sequoia system (pp. 217-223); \textit{A. Borg}, \textit{W. Blau}, \textit{W. Oberle} and \textit{W. Graetsch}, Fault tolerance in distributed UNIX (pp. 224-243); \textit{D. P. Siewiorek}, Faults and their manifestation (pp. 244-261); \textit{Ö. Babaoǧlu}, The ``engineering'' of fault-tolerant distributed computing systems (pp. 262-273); \textit{B. A. Coan}, Bibliography for fault-tolerant distributed computing (pp. 274-298).
    Fault-tolerant distributed computing
    Byzantine agreement problem
    distributed systems
    protocols
    synchronization
    consensus problems