System reliability, availability and robustness are often not well understood by system architects, engineers and developers. Many do not understand what drives customers' availability expectations, how to frame verifiable availability and robustness requirements, how to manage and budget availability and robustness, or how to methodically architect and design systems that meet robustness requirements. The book takes a pragmatic approach, framing reliability and robustness as a functional aspect of a system so that architects, designers, developers and testers can address it as a concrete, functional attribute rather than an abstract, non-functional notion.
ERIC BAUER is Reliability Engineering Manager in the Wireline Division of Alcatel-Lucent. After two decades of software development experience, he joined the Lucent reliability team to lead a reliability group, and has since worked on reliability engineering for a variety of wireless and wireline products and solutions. Mr. Bauer currently focuses on increasing the reliability of Alcatel-Lucent's IP Multimedia Subsystem (IMS) solution and the network elements that comprise the IMS solution. He has been awarded twelve U.S. patents, coauthored Practical System Reliability (Wiley), and has published several papers in the Bell Labs Technical Journal.
Figures. Tables. Preface. Acknowledgements.

PART ONE RELIABILITY BASICS.
1 Reliability and Availability Concepts. 1.1 Reliability and Availability. 1.2 Faults, Errors and Failures. 1.3 Error Severity. 1.4 Failure Recovery. 1.5 Highly Available Systems. 1.6 Quantifying Availability. 1.7 Outage Attributability. 1.8 Hardware Reliability. 1.9 Software Reliability. 1.10 Problems. 1.11 For Further Study.
2 System Basics. 2.1 Hardware and Software. 2.2 External Entities. 2.3 System Management. 2.4 System Outages. 2.5 Service Quality. 2.6 Total Cost of Ownership. 2.7 Problems.
3 What Can Go Wrong. 3.1 Failures in the Real World. 3.2 Eight-Ingredient Framework. 3.3 Mapping Ingredients to Error Categories. 3.4 Applying Error Categories. 3.5 Error Category: Field Replaceable Unit (FRU) Hardware. 3.6 Error Category: Programming Errors. 3.7 Error Category: Data Error. 3.8 Error Category: Redundancy. 3.9 Error Category: System Power. 3.10 Error Category: Network. 3.11 Error Category: Application Protocol. 3.12 Error Category: Procedures. 3.13 Summary. 3.14 Problems. 3.15 For Further Study.

PART TWO RELIABILITY CONCEPTS.
4 Failure Containment and Redundancy. 4.1 Units of Design. 4.2 Failure Recovery Groups. 4.3 Redundancy. 4.4 Summary. 4.5 Problems. 4.6 For Further Study.
5 Robust Design Principles. 5.1 Robust Design Principles. 5.2 Robust Protocols. 5.3 Robust Concurrency Controls. 5.4 Overload Control. 5.5 Process, Resource and Throughput Monitoring. 5.6 Data Auditing. 5.7 Fault Correlation. 5.8 Failed Error Detection, Isolation or Recovery. 5.9 Geographic Redundancy. 5.10 Security, Availability and System Robustness. 5.11 Procedural Considerations. 5.12 Problems. 5.13 For Further Study.
6 Error Detection. 6.1 Detecting Field Replaceable Unit (FRU) Hardware Faults. 6.2 Detecting Programming and Data Faults. 6.3 Detecting Redundancy Failures. 6.4 Detecting Power Failures. 6.5 Detecting Networking Failures. 6.6 Detecting Application Protocol Failures. 6.7 Detecting Procedural Failures. 6.8 Problems. 6.9 For Further Study.
7 Analyzing and Modeling Reliability and Robustness. 7.1 Reliability Block Diagrams. 7.2 Qualitative Model of Redundancy. 7.3 Failure Mode and Effects Analysis. 7.4 Availability Modeling. 7.5 Planned Downtime. 7.6 Problems. 7.7 For Further Study.

PART THREE DESIGN FOR RELIABILITY.
8 Reliability Requirements. 8.1 Background. 8.2 Defining Service Outages. 8.3 Service Availability Requirements. 8.4 Detailed Service Availability Requirements. 8.5 Service Reliability Requirements. 8.6 Triangulating Reliability Requirements. 8.7 Problems.
9 Reliability Analysis. 9.1 Step 1: Enumerate Recoverable Modules. 9.2 Step 2: Construct Reliability Block Diagrams. 9.3 Step 3: Characterize Impact of Recovery. 9.4 Step 4: Characterize Impact of Procedures. 9.5 Step 5: Audit Adequacy of Automatic Failure Detection and Recovery. 9.6 Step 6: Consider Failures of Robustness Mechanisms. 9.7 Step 7: Prioritizing Gaps. 9.8 Reliability of Sourced Modules and Components. 9.9 Problems.
10 Reliability Budgeting and Modeling. 10.1 Downtime Categories. 10.2 Service Downtime Budget. 10.3 Availability Modeling. 10.4 Update Downtime Budget. 10.5 Robustness Latency Budgets. 10.6 Problems.
11 Robustness and Stability Testing. 11.1 Robustness Testing. 11.2 Context of Robustness Testing. 11.3 Factoring Robustness Testing. 11.4 Robustness Testing in the Development Process. 11.5 Robustness Testing Techniques. 11.6 Selecting Robustness Test Cases. 11.7 Analyzing Robustness Test Results. 11.8 Stability Testing. 11.9 Release Criteria. 11.10 Problems.
12 Closing the Loop. 12.1 Analyzing Field Outage Events. 12.2 Reliability Roadmapping. 12.3 Problems.
13 Design for Reliability Case Study. 13.1 System Context. 13.2 System Reliability Requirements. 13.3 Reliability Analysis. 13.4 Downtime Budgeting. 13.5 Availability Modeling. 13.6 Reliability Roadmap. 13.7 Robustness Testing. 13.8 Stability Testing. 13.9 Reliability Review. 13.10 Reliability Report. 13.11 Release Criteria. 13.12 Field Data Analysis.
14 Conclusion. 14.1 Overview of Design for Reliability. 14.2 Concluding Remarks. 14.3 Problems.
15 Appendix: Assessing Design for Reliability Diligence. 15.1 Assessment Methodology. 15.2 Reliability Requirements. 15.3 Reliability Analysis. 15.4 Reliability Modeling and Budgeting. 15.5 Robustness Testing. 15.6 Stability Testing. 15.7 Release Criteria. 15.8 Field Availability. 15.9 Reliability Roadmap. 15.10 Hardware Reliability.

Abbreviations. References. Photo Credits. About the Author. Index.