The book Patterns for Performance and Operability by Ford et al. is one of the few publications that directly addresses the operability of business software (which is partly why I am writing Software Operability: A Guide for Software Teams). Patterns for Performance and Operability (‘PPO’) is an excellent volume, containing many valuable insights into the ways we can improve the operability of software systems; this blog post explores a few of the key themes and ideas found in the book.
The performance and operability of software is often given scant attention by software development teams driven to deliver end-user features for a product owner motivated largely by functional requirements (“Is the feature present?” rather than “Does the feature work well with 500 users?”). The so-called ‘ilities’ and performance are aspects of software which are often called ‘non-functional requirements’ and tend to be de-scoped during delivery; this inevitably leads to problems in Production, including lost revenue, re-deployments, hasty bug-fixes, and much gnashing of teeth by operations people and development teams alike.
By treating functional and non-functional requirements alike, and adjusting the terminology to ‘end-user features’ and ‘operational features’, those aspects which help to make the software really work well can be worked on alongside the regular ‘user stories’ or visible functionality.
By articulating the nature of the operational aspects of software systems, the authors of PPO have helped to emphasise the need to take seriously the operability of software.
Much of the book is taken up with identifying and elaborating on different aspects of software system performance. Performance is a crucial aspect of operability; in fact, ‘many of the operability tests that you will need to conduct can only be executed under load’ (p. 171). Accordingly, capacity planning is outlined (p. 245); although flexible compute models (aka ‘the cloud’) reduce some of the complexity of capacity planning, they do not remove the need for it altogether, and PPO provides a useful introduction to capacity planning for on-premise, traditional off-premise, and cloud-hosting models.
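The kind of capacity planning the book introduces can be illustrated with some back-of-the-envelope arithmetic. This sketch is my own, not code from PPO; the peak rates and the 30% headroom figure are hypothetical:

```python
import math

def servers_needed(peak_rps, rps_per_server, headroom=0.3):
    """Estimate how many servers handle a peak request rate while
    keeping spare headroom (e.g. 30%) so bursts don't saturate capacity."""
    usable = rps_per_server * (1 - headroom)  # capacity we plan to actually use
    return math.ceil(peak_rps / usable)

# Hypothetical figures: 1,200 req/s at peak, 150 req/s per server
print(servers_needed(1200, 150))  # -> 12
```

Cloud autoscaling changes how quickly you can act on such numbers, but the arithmetic (and the need to know your per-node capacity from load testing) remains.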
Types of Testing
The various kinds of operational testing are explained, some of which will probably be an eye-opener to many software development teams. The need to identify and test specific boundary conditions is an important part of operational testing (p. 183). Soak testing, where non-punishing tests are run for a long time in order to flush out errors which occur only after the system is ‘soaked’ or carrying a lot of transactions, is a good example of the kind of testing which often gets missed out before going live. For instance, what happens to the software when its log file eventually fills the entire drive or file system? Such scenarios can be forced or simulated by using a small RAM disk during testing, in order to trigger the error condition quickly, without having to wait for the larger production-sized file system to fill up (which might normally take days).
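The same ‘disk full’ scenario can even be rehearsed in a fast unit test, without a RAM disk, by wrapping the log destination in a writer that behaves like a full disk after a byte budget is exhausted. This is a minimal sketch of my own, not from the book; the 32-byte cap and the drop-the-line handling are purely illustrative:

```python
import errno
import io

class CappedWriter(io.TextIOBase):
    """File-like object that raises ENOSPC once max_bytes is exceeded,
    simulating a full disk so error handling can be exercised quickly."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.written = 0

    def write(self, s):
        if self.written + len(s) > self.max_bytes:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.written += len(s)
        return len(s)

def log_safely(out, msg):
    """Application-level guard: degrade gracefully when the log volume fills."""
    try:
        out.write(msg + "\n")
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False  # e.g. drop the line, alert, or trigger rotation
        raise

out = CappedWriter(max_bytes=32)
results = [log_safely(out, "transaction ok") for _ in range(5)]
print(results)  # -> [True, True, False, False, False]
```

The point of the test is not the wrapper itself but forcing the question: does the application survive, alert, and recover when its logging target fills up?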
The way in which an organisation treats ‘failures’ can have a marked effect on the effectiveness of the software delivery effort. If every failure in Production leads to ever-increasing additional checks, tests, and (most destructively) blame, then future failures end up being more (not less) likely, as people retreat into the ‘safety’ of minimal effort and fear of change.
The PPO authors rightly urge us to treat failures as ‘canaries in a coal mine’, alerting us to bigger problems. W. Edwards Deming advised us to avoid a blame culture based on fear of failure, and so to set up our delivery processes and practices so that we treat failures as an opportunity for learning, not for retribution and blame. The REAL failure is not allowing teams to learn from incidents; the blameless post-mortem review is a crucial part of helping that organisational learning to take place (p. 272).
The DevOps movement puts a great deal of emphasis on monitoring; events such as Monitorama demonstrate the huge amount of interest and competence in the web-scale monitoring space. PPO covers some useful patterns for monitoring, including the importance of aggregating/grouping errors by type/message (p. 243) and the need for ‘end-user’ or synthetic monitoring (pp. 239-40):
End-user monitors may not tell you what is wrong, but they are unlikely to fail to alert when your system is experiencing problems.
The relationship between monitoring and trending (including capacity planning) is covered too.
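The ‘aggregate errors by type/message’ pattern can be sketched as follows: variable details in messages (ids, numbers) are normalised into placeholders so that repeats of the same underlying fault count as one error type. This is my own illustrative implementation, not code from PPO:

```python
import re
from collections import Counter

def normalise(message):
    """Collapse variable details (hex ids, numbers) into placeholders
    so repeats of the same error group together as one type."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    message = re.sub(r"\d+", "<n>", message)
    return message

errors = [
    "Timeout after 5000ms calling payment service",
    "Timeout after 5001ms calling payment service",
    "Connection refused to host 10.0.0.7",
    "Connection refused to host 10.0.0.9",
]
counts = Counter(normalise(e) for e in errors)
for msg, n in counts.most_common():
    print(n, msg)
```

Grouping this way turns a flood of seemingly distinct log lines into a short ranked list of error types, which is what an operations team actually needs when deciding what to investigate first.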
The Business Case for Operability
A valuable aspect of PPO is the focus it gives to real-world problems relating to operability found in the project process. Often, operability testing is de-scoped at the initiation phase of a project, making it very difficult to regain the initiative for operability later on; Chapter 2 outlines how to fight for an appropriate level of testing and operability features, justifying the investment in business terms. Chapter 9, Resisting Pressure from the Functional Requirements Stream, is effectively a ‘survival handbook’ for operability in the ‘jungle’ of functionality-focused budget holders; the section on pp. 214-23 is sound, pragmatic advice for software professionals who want to ‘Do The Right Thing’ for the operability of the software. Chapter 12 is devoted to Common Impediments to Good Design, acting as a kind of anti-checklist of things to avoid: Tight Timeframes, Constantly Changing Technology, Personal Design Preferences, etc.
The audience for PPO seems to be the ‘software architect’ or ‘technical lead’: someone with influence over budget but no budgetary decision-making power, and the person likely to be ‘measured’ (within analytically-minded organisations, at least) in terms of the operational effectiveness of the software system she is helping to build. The book therefore contains few code samples, focusing instead on detailed (and well-referenced) explanations of the core principles behind high-performing and operable systems. As such, the book is suited to those who have an overview of how the whole system will work together, rather than to individual software developers or software development teams: apart from the section on exception handling (pp. 87-90), which contains some useful worked examples, most sections of the book may be too abstract for many software developers.
PPO is written with the assumption that a broadly ‘waterfall’ delivery process is in place: organisations or teams using more agile methodologies might find some of the up-front activities a bit cumbersome. However, in my experience, many agile-esque software projects would do well to ‘bring the pain forward’ when it comes to operability, and adopt some of the suggestions and practices in PPO, even if these ‘front-load’ the project a little more than is familiar.
The book was published just before what is known as ‘cloud computing’ really became mainstream, so anyone looking for cloud-specific details will be disappointed; however, books like Scalable Internet Architectures by Theo Schlossnagle have since filled this gap. I did find the section on Exception Handling a little short, but – as the authors point out – an entire book could be written on that subject, and it has been covered well elsewhere (particularly in Release It! by Michael Nygard).
Finally, the authors do not elaborate much on the inter-team communication patterns needed for effective collaboration between ‘development’ and ‘operations’, but (to be fair) the book was published in 2008, a few years before the term ‘DevOps’ was coined as a way to characterise the cross-functional cooperation on which large-scale business systems increasingly rely today. Given the huge amount of high-quality material in PPO, I think we can be happy to look elsewhere for less ‘technical’ concerns.
Patterns for Performance and Operability is a particularly useful book for people wanting to make their software systems operate better in the Production environment. Aimed at software architects and technical leads, it presents guidance on making the business case for operability, planning testing, better software design, and how to evaluate trade-offs between different operational requirements. Highly recommended.