Breaking the Availability Barrier
Survivable Systems for Enterprise Computing
About the Book
As our daily lives and corporate well-being become more dependent upon computers, system reliability grows increasingly important. No longer are frequent system outages acceptable. In many cases, failure intervals must now be measured in centuries.
Even current fault-tolerant computing systems will fail once every five or ten years. This book is the first in a three-part series on active/active systems. It describes techniques that can be used today to extend the interval between system failures from years to centuries, often at little or no additional cost.
The techniques described include splitting a large system into smaller, cooperating independent nodes. Copies of the application’s database are distributed across the nodes. It is shown that these techniques significantly reduce the number of system failure modes and increase the level of sparing. As a result, the loss of a single node’s capacity occurs far less frequently than the loss of all capacity when the equivalent monolithic system fails. Furthermore, the loss of more than one node’s worth of capacity almost never occurs.
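The last claim above can be illustrated with a rough calculation. The following is a minimal sketch, not the book's own analysis: it assumes that nodes fail independently and that each node has the same availability, and the function name and the figures (four nodes, 0.999 availability per node) are hypothetical values chosen only for illustration.

```python
from math import comb


def prob_at_least_k_down(n: int, node_availability: float, k: int) -> float:
    """Probability that k or more of n nodes are down at the same instant.

    Assumes independent node failures and identical node availability
    (a simplifying assumption for this sketch, not a claim from the book).
    """
    f = 1.0 - node_availability  # probability that a single node is down
    # Sum the binomial probabilities of exactly j nodes being down, j = k..n.
    return sum(
        comb(n, j) * f**j * node_availability ** (n - j)
        for j in range(k, n + 1)
    )


if __name__ == "__main__":
    n, a = 4, 0.999  # hypothetical: four nodes, each available 99.9% of the time
    print("P(at least one node down): ", prob_at_least_k_down(n, a, 1))
    print("P(two or more nodes down): ", prob_at_least_k_down(n, a, 2))
```

Under these assumptions, the probability of losing two or more nodes at once is several orders of magnitude smaller than the probability of losing one, which is the sense in which the loss of more than one node's worth of capacity is rare. Correlated failures (a shared site, network, or operator error) would weaken this result.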
Central to these techniques is the requirement that all database copies distributed across the network be kept in synchronism. Several methods available today for maintaining synchronism are described: asynchronous data replication, synchronous data replication, and network transactions.
About the Authors
Dr. Bill Highleyman, Paul J. Holenstein, and Dr. Bruce Holenstein have over 90 years of combined experience in implementing fault-tolerant, highly available computing systems. This experience ranges from the early days of custom redundant systems to today’s fault-tolerant offerings from HP (NonStop) and Stratus.
Dr. Bill Highleyman has done extensive work on the effect of failure mode reduction on system availability. He has built fault-tolerant systems for train control, racetrack wagering, securities trading, message communication, and other applications. He is the Managing Editor of the Availability Digest (availabilitydigest.com).
Paul J. Holenstein and Dr. Bruce Holenstein have architected and implemented the various data replication techniques required for the availability enhancements described in this book. Their company, Gravic, provides the Shadowbase line of data replication products to the fault-tolerant community.