As our daily lives and corporate well-being become more dependent upon computers, system reliability grows increasingly important. No longer are frequent system outages acceptable. In many cases, failure intervals must now be measured in centuries.
Even current fault-tolerant computing systems can be expected to fail once every five to ten years. This book is the first in a three-part series on active/active systems. It describes techniques that can be used today to extend the interval between system failures from years to centuries, often at little or no additional cost.
The techniques described include splitting a large system into smaller, cooperating, independent nodes, with copies of the application’s database distributed across the nodes. The book shows that these techniques significantly reduce the number of system failure modes and increase the level of sparing. As a result, the loss of a single node’s capacity occurs far less frequently than the loss of all capacity when the equivalent monolithic system fails. Furthermore, the loss of more than one node’s worth of capacity almost never occurs.
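The failure-mode argument can be illustrated with a small counting sketch. Everything in it is an assumption made for illustration and is not taken from the text: a hypothetical 16-processor monolithic system, the same processors regrouped into four 4-processor nodes, each system surviving any single processor failure (one spare), and every failure mode treated as equally likely.

```python
from math import comb


def failure_modes(processors: int, spares: int = 1) -> int:
    """Count the combinations of simultaneous processor failures
    that exhaust the spares and take the system down.

    Assumption: the system fails when spares + 1 processors are down.
    """
    return comb(processors, spares + 1)


# Hypothetical sizes, chosen only for illustration.
MONOLITH_PROCESSORS = 16
NODE_COUNT = 4
NODE_PROCESSORS = MONOLITH_PROCESSORS // NODE_COUNT

monolith_modes = failure_modes(MONOLITH_PROCESSORS)  # ways to lose all capacity
node_modes = failure_modes(NODE_PROCESSORS)          # ways to lose one given node
any_node_modes = NODE_COUNT * node_modes             # ways to lose some single node

print(f"monolith failure modes:         {monolith_modes}")
print(f"failure modes for one node:     {node_modes}")
print(f"failure modes for any one node: {any_node_modes}")
print(f"monolith / any-node ratio:      {monolith_modes / any_node_modes:.1f}")
```

Under these assumptions the monolithic system has 120 failure modes, each of which removes all capacity, while the four-node system has 24, each of which removes only a quarter of the capacity. Real systems have failure modes that are not equally likely, so the ratio is indicative rather than a prediction.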
Central to these techniques is the requirement that all database copies that are distributed across the network must be kept in synchronism. Several methods available today for maintaining synchronism are described. They include asynchronous data replication, synchronous data replication, and network transactions.