The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2

By: Thomas A. Limoncelli (author), Strata R. Chalup (author), Christina J. Hogan (author)

Paperback

Special Order: item not currently available. We'll try to order it for you.

Description

"There's an incredible amount of depth and thinking in the practices described here, and it's impressive to see it all in one place." -Win Treese, coauthor of Designing Systems for Internet Commerce

The Practice of Cloud System Administration, Volume 2, focuses on "distributed" or "cloud" computing and brings a DevOps/SRE sensibility to the practice of system administration. Unsatisfied with books that cover either design or operations in isolation, the authors created this authoritative reference centered on a comprehensive approach.

Case studies and examples from Google, Etsy, Twitter, Facebook, Netflix, Amazon, and other industry giants are explained in practical ways that are useful to all enterprises. The new companion to the best-selling first volume, The Practice of System and Network Administration, Second Edition, this guide offers expert coverage of the following and many other crucial topics:

Designing and building modern web and distributed systems

  • Fundamentals of large system design
  • Understand the new software engineering implications of cloud administration
  • Make systems that are resilient to failure and grow and scale dynamically
  • Implement DevOps principles and cultural changes
  • IaaS/PaaS/SaaS and virtual platform selection

Operating and running systems using the latest DevOps/SRE strategies

  • Upgrade production systems with zero down-time
  • What and how to automate; how to decide what not to automate
  • On-call best practices that improve uptime
  • Why distributed systems require fundamentally different system administration techniques
  • Identify and resolve resiliency problems before they surprise you

Assessing and evaluating your team's operational effectiveness

  • Manage the scientific process of continuous improvement
  • A forty-page, pain-free assessment system you can start using today

About the Authors

Thomas A. Limoncelli is an internationally recognized author, speaker, and system administrator with more than twenty years of experience at companies like Google, Bell Labs, and StackExchange.com.

Strata R. Chalup has more than twenty-five years of experience in Silicon Valley, focusing on IT strategy, best practices, and scalable infrastructures at firms that include Apple, Sun, Cisco, McAfee, and Palm.

Christina J. Hogan has more than twenty years of experience in system administration and network engineering, from Silicon Valley to Italy and Switzerland. She has a master's degree in computer science and a doctorate in aeronautical engineering, and has been part of a Formula 1 racing team.

Contents

Preface
About the Authors
Introduction

Part I: Design: Building It

Chapter 1: Designing in a Distributed World
  1.1 Visibility at Scale
  1.2 The Importance of Simplicity
  1.3 Composition
  1.4 Distributed State
  1.5 The CAP Principle
  1.6 Loosely Coupled Systems
  1.7 Speed
  1.8 Summary
  Exercises

Chapter 2: Designing for Operations
  2.1 Operational Requirements
  2.2 Implementing Design for Operations
  2.3 Improving the Model
  2.4 Summary
  Exercises

Chapter 3: Selecting a Service Platform
  3.1 Level of Service Abstraction
  3.2 Type of Machine
  3.3 Level of Resource Sharing
  3.4 Colocation
  3.5 Selection Strategies
  3.6 Summary
  Exercises

Chapter 4: Application Architectures
  4.1 Single-Machine Web Server
  4.2 Three-Tier Web Service
  4.3 Four-Tier Web Service
  4.4 Reverse Proxy Service
  4.5 Cloud-Scale Service
  4.6 Message Bus Architectures
  4.7 Service-Oriented Architecture
  4.8 Summary
  Exercises

Chapter 5: Design Patterns for Scaling
  5.1 General Strategy
  5.2 Scaling Up
  5.3 The AKF Scaling Cube
  5.4 Caching
  5.5 Data Sharding
  5.6 Threading
  5.7 Queueing
  5.8 Content Delivery Networks
  5.9 Summary
  Exercises

Chapter 6: Design Patterns for Resiliency
  6.1 Software Resiliency Beats Hardware Reliability
  6.2 Everything Malfunctions Eventually
  6.3 Resiliency through Spare Capacity
  6.4 Failure Domains
  6.5 Software Failures
  6.6 Physical Failures
  6.7 Overload Failures
  6.8 Human Error
  6.9 Summary
  Exercises

Part II: Operations: Running It

Chapter 7: Operations in a Distributed World
  7.1 Distributed Systems Operations
  7.2 Service Life Cycle
  7.3 Organizing Strategy for Operational Teams
  7.4 Virtual Office
  7.5 Summary
  Exercises

Chapter 8: DevOps Culture
  8.1 What Is DevOps?
  8.2 The Three Ways of DevOps
  8.3 History of DevOps
  8.4 DevOps Values and Principles
  8.5 Converting to DevOps
  8.6 Agile and Continuous Delivery
  8.7 Summary
  Exercises

Chapter 9: Service Delivery: The Build Phase
  9.1 Service Delivery Strategies
  9.2 The Virtuous Cycle of Quality
  9.3 Build-Phase Steps
  9.4 Build Console
  9.5 Continuous Integration
  9.6 Packages as Handoff Interface
  9.7 Summary
  Exercises

Chapter 10: Service Delivery: The Deployment Phase
  10.1 Deployment-Phase Steps
  10.2 Testing and Approval
  10.3 Operations Console
  10.4 Infrastructure Automation Strategies
  10.5 Continuous Delivery
  10.6 Infrastructure as Code
  10.7 Other Platform Services
  10.8 Summary
  Exercises

Chapter 11: Upgrading Live Services
  11.1 Taking the Service Down for Upgrading
  11.2 Rolling Upgrades
  11.3 Canary
  11.4 Phased Roll-outs
  11.5 Proportional Shedding
  11.6 Blue-Green Deployment
  11.7 Toggling Features
  11.8 Live Schema Changes
  11.9 Live Code Changes
  11.10 Continuous Deployment
  11.11 Dealing with Failed Code Pushes
  11.12 Release Atomicity
  11.13 Summary
  Exercises

Chapter 12: Automation
  12.1 Approaches to Automation
  12.2 Tool Building versus Automation
  12.3 Goals of Automation
  12.4 Creating Automation
  12.5 How to Automate
  12.6 Language Tools
  12.7 Software Engineering Tools and Techniques
  12.8 Multitenant Systems
  12.9 Summary
  Exercises

Chapter 13: Design Documents
  13.1 Design Documents Overview
  13.2 Design Document Anatomy
  13.3 Template
  13.4 Document Archive
  13.5 Review Workflows
  13.6 Adopting Design Documents
  13.7 Summary
  Exercises

Chapter 14: Oncall
  14.1 Designing Oncall
  14.2 Being Oncall
  14.3 Between Oncall Shifts
  14.4 Periodic Review of Alerts
  14.5 Being Paged Too Much
  14.6 Summary
  Exercises

Chapter 15: Disaster Preparedness
  15.1 Mindset
  15.2 Individual Training: Wheel of Misfortune
  15.3 Team Training: Fire Drills
  15.4 Training for Organizations: Game Day/DiRT
  15.5 Incident Command System
  15.6 Summary
  Exercises

Chapter 16: Monitoring Fundamentals
  16.1 Overview
  16.2 Consumers of Monitoring Information
  16.3 What to Monitor
  16.4 Retention
  16.5 Meta-monitoring
  16.6 Logs
  16.7 Summary
  Exercises

Chapter 17: Monitoring Architecture and Practice
  17.1 Sensing and Measurement
  17.2 Collection
  17.3 Analysis and Computation
  17.4 Alerting and Escalation Manager
  17.5 Visualization
  17.6 Storage
  17.7 Configuration
  17.8 Summary
  Exercises

Chapter 18: Capacity Planning
  18.1 Standard Capacity Planning
  18.2 Advanced Capacity Planning
  18.3 Resource Regression
  18.4 Launching New Services
  18.5 Reduce Provisioning Time
  18.6 Summary
  Exercises

Chapter 19: Creating KPIs
  19.1 What Is a KPI?
  19.2 Creating KPIs
  19.3 Example KPI: Machine Allocation
  19.4 Case Study: Error Budget
  19.5 Summary
  Exercises

Chapter 20: Operational Excellence
  20.1 What Does Operational Excellence Look Like?
  20.2 How to Measure Greatness
  20.3 Assessment Methodology
  20.4 Service Assessments
  20.5 Organizational Assessments
  20.6 Levels of Improvement
  20.7 Getting Started
  20.8 Summary
  Exercises

Epilogue

Part III: Appendices

Appendix A: Assessments
  A.1 Regular Tasks (RT)
  A.2 Emergency Response (ER)
  A.3 Monitoring and Metrics (MM)
  A.4 Capacity Planning (CP)
  A.5 Change Management (CM)
  A.6 New Product Introduction and Removal (NPI/NPR)
  A.7 Service Deployment and Decommissioning (SDD)
  A.8 Performance and Efficiency (PE)
  A.9 Service Delivery: The Build Phase
  A.10 Service Delivery: The Deployment Phase
  A.11 Toil Reduction
  A.12 Disaster Preparedness

Appendix B: The Origins and Future of Distributed Computing and Clouds
  B.1 The Pre-Web Era (1985-1994)
  B.2 The First Web Era: The Bubble (1995-2000)
  B.3 The Dot-Bomb Era (2000-2003)
  B.4 The Second Web Era (2003-2010)
  B.5 The Cloud Computing Era (2010-present)
  B.6 Conclusion
  Exercises

Appendix C: Scaling Terminology and Concepts
  C.1 Constant, Linear, and Exponential Scaling
  C.2 Big O Notation
  C.3 Limitations of Big O Notation

Appendix D: Templates and Examples
  D.1 Design Document Template
  D.2 Design Document Example
  D.3 Sample Postmortem Template

Appendix E: Recommended Reading

Bibliography

Index

Product Details

  • ISBN13: 9780321943187
  • Format: Paperback
  • Number Of Pages: 560
  • ID: 9780321943187
  • weight: 892
  • ISBN10: 032194318X

Delivery Information

  • Saver Delivery: Yes
  • 1st Class Delivery: Yes
  • Courier Delivery: Yes
  • Store Delivery: Yes

Prices are for internet purchases only. Prices and availability in WHSmith stores may vary significantly.
