Blueprints for High Availability: Designing Resilient Distributed Systems
Bestselling authors deliver an authoritative, hands-on book of tools for maintaining constant system availability. Reliability is not a quality that can simply be purchased; instead, it needs to be engineered into a system or product. Here is the only top-to-bottom guide available for the assessment, design, implementation, and testing of a system for 100% reliability. Renowned authors Evan Marcus and Hal Stern provide readers with a series of practical blueprints, disciplines, and processes for assessing risks of a distributed system, assigning costs and selecting appropriate reliability levels, and designing and testing solutions without excessive downtime.
Review By: Jon Duncan Hagar
06/23/2010
This book outlines design-based frameworks and architectural considerations for computer system networks that are “always on.” The book is based on the experiences and observations of the authors building these kinds of systems for the Internet world. The authors provide an up-front definition of their specific topic, and at the end of the book provide definitions of many of the terms that they use. After defining what availability is, the book identifies twenty key system design principles in chapter 3. The remainder of the book addresses issues such as data management, designing networks, system management, and implementation techniques. Techniques discussed include replication, recovery, backup/restores, redundancy of systems, data service reliability, and failover concepts (two separate chapters). The authors discuss the technical issues of implementing such systems, and they address issues of human operation including such things as disaster recovery.
Areas of availability are examined in chapters that include implementation information, diagrams, options, and "tales from the field." Topics are covered at the system and architecture level without going into electrical schematics or detailed computer logic. Each chapter closes with a "key point" paragraph that summarizes its most salient features. The primary focus of the book is how to design distributed systems that are available 24/7, and what the tradeoffs or issues are with these kinds of systems.
Architects and systems analysts will find the book of interest. The book is very readable for a subject that is highly specialized and of great interest to most companies doing business on the Internet, where if your system/software is not there when somebody comes "clicking," you do not get a second chance. The tales from the field and the examples make the technical information relevant. The content, while narrow in focus, is complete for the level at which the book is written. The book is easy to understand and suitable for a beginner. Pointers to the next level of detail and a bibliography could have made this a better reference, though there is a list of URLs for people looking for more detail.
Testing is covered only in passing, though it is pointed out as one of the "keys" for high availability. There is little coverage of, and few lessons learned from, areas such as the telecom industry or the system reliability field, which would seem to be related to the topic of the book. Space considerations must have reduced the coverage of these related topics in favor of discussing high-level design and architecture issues. Testers looking to understand design and basic concepts to aid them in setting up test plans would be well served by this book, but people looking for testing/QA solutions in the area of system availability will not get their questions fully answered.