Operability can Improve if Developers Write a Draft Run Book

The run book (or system operation manual) is traditionally written by the IT operations (Ops) team after software development is considered complete. However, this typically leads to operability problems being discovered late, because operational concerns have been ignored, forgotten, or not fully addressed by the development (Dev) team. If the software development team writes a draft run book or draft operation manual, many of the operational problems typically found during pre-live system readiness testing can be caught and corrected much earlier. Because the development team needs to collaborate with the operations team in order to define and complete the various draft run book details, the operations team also gains early insight into the new software. Channels of communication, trust, and collaboration are established between the traditionally siloed Dev and Ops teams, which can help to establish and strengthen a DevOps approach to building and running software systems.

Clay tablet, Museum of Athens, Greece

Do not carve the run book in stone; focus instead on the collaboration needed to write the draft.

I will be talking about run book collaboration at DevOps Summit in Amsterdam on 15 November 2013.

Note: I actually agree with much of what Jeff Goldschrafe says on run books; if we rely on run books to help us actually operate a system in 2013 or later, we have likely not automated or monitored enough. The key point about run book collaboration is to get Dev and Ops talking to each other about operational features during development, rather than leaving those conversations to a failure-laden ‘Production-ization’ phase.

Address Operational Concerns from the Start of the Project or Programme

Many software projects or programmes tend to leave consideration of operational concerns until close to launch or ‘go live’. This is often due to a software development team (or budget holder) driven primarily by end-user features; the operability of the software is considered to be a ‘problem for the Ops team’ to be addressed during a so-called ‘Production-ization’ phase (a terrible term). Even where software has undergone some level of capacity or load testing before being handed to the Operations team, crucial differences between test environments and the Production environment are often assumed to be irrelevant. Almost invariably this leads to a last-minute rush of bug fixes, hacks, and workarounds by both Dev and Ops, along with much gnashing of teeth and complaints about ‘stupid developers’ from Ops and ‘I just want it to work’ from Dev.

What Happens if We Consider Operational Concerns Only After Development is Complete

In extreme cases, the software may have to be substantially or entirely re-written to address operability problems, suggesting that a buffer is needed between the first operability test and the go-live date. At one software consultancy/outsourcing company I worked at, we used to insist that substantial technical testing should be conducted half-way through the project, so that we still had half the project timeline to radically change the performance characteristics of our software if needed; this strategy helped us on more than one occasion when the load characteristics of our software were unexpected.

What Happens if We Address Operational Concerns Halfway Through

By starting technical testing earlier in the cycle, we expose some operational problems sooner, so rather than a major problem occurring just before the launch date, we typically see several smaller, more tractable problems arising. However, the “50% timeline” approach still suffers from not treating operational concerns as first-class features alongside end-user features, and can really only give us a single chance to correct our ‘best guesses’ about operability. A much better approach to making our software truly operable is to consider operational features from the very start of the programme of work (or project); operational features (typically seen as ‘non-functional requirements’) should be included in the product backlog alongside end-user features (typically seen as ‘functional requirements’). In this way, the Dev team has a better chance of understanding and addressing these crucial and often project-critical aspects before they derail or delay a software launch.

The Best Approach – Consider Operational Features from the Start

As operational concerns are identified and addressed throughout the duration of the development phase, both the Ops team and the Dev team become more confident that the software will work well in Production on the go-live date, gaining each other’s trust. Dev teams have not traditionally included people with much operational experience, although this is changing as DevOps approaches demonstrate the value of greater cross-functional working methods. Whether our Dev team has embedded Ops people or not, many Dev folk have had little exposure to operational issues, and so it can be difficult for them to anticipate the software changes needed to make the software operable. This is where the draft run book can play a vital role.

The Draft Run Book as a Collaboration Tool for Dev and Ops

The ‘run book’ (sometimes called the ‘system operation manual’, or just ‘operation manual’) is a collection of procedures and steps for operations teams to follow (either manually, or through run book automation) in order to enable the software to run effectively in Production. A run book includes details of how the operations team should deal with things like daylight saving time changes, data cleardown, recovery from failover, server patching, troubleshooting, and so on. Historically, it was the Ops team that wrote the run book, based on chance conversations with the Dev team, sketchy documentation, and much trial and error. However, by turning around this situation and giving responsibility for the first few drafts of the run book to the Dev team, substantial improvements to the operability of the software system can result.
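One way to picture a draft run book is as a checklist of operational topics that Dev and Ops fill in together. The sketch below is a hypothetical illustration, not a template from this article: the topic names come from the examples above, while `RunBookSection`, `unanswered`, and the sample procedure text are assumptions invented for the example. It lists the topics that nobody has answered yet, which are the conversations Dev and Ops still need to have.

```python
from dataclasses import dataclass


@dataclass
class RunBookSection:
    """One operational topic in a draft run book (hypothetical structure)."""
    topic: str
    procedure: str = ""  # left empty until Dev and Ops have agreed on an answer


def unanswered(sections):
    """Return the topics that still have no agreed procedure."""
    return [s.topic for s in sections if not s.procedure.strip()]


# Topics are taken from the examples in the text; the procedure text is invented.
draft = [
    RunBookSection("Daylight saving time changes"),
    RunBookSection("Data cleardown", "Nightly job removes session rows older than 30 days."),
    RunBookSection("Recovery from failover"),
    RunBookSection("Server patching"),
    RunBookSection("Troubleshooting", "Start with the application log."),
]

open_topics = unanswered(draft)
print("Still to discuss with Ops:", ", ".join(open_topics))
```

The value here is in the list of gaps rather than in the finished document: each unanswered topic is a prompt for a Dev and Ops conversation during development.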

I have found that many developers (I include myself here) are often surprised at the ‘stuff’ which is needed to make software work in Production (content switch rulesets, SSL offloading, a separate management NIC, data cleardown, etc.) and appreciate the opportunity to make their code better. Also, if operational features are written as familiar agile stories, it becomes as ‘natural’ for the Dev team to address operational features as it is to address end-user features, identifying the Ops team as a set of users with real needs and requirements.
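As a rough illustration of operational features written as agile stories, the sketch below puts Ops stories and an end-user story in one backlog and orders them by priority alone. Everything in it is assumed for the example: the `Story` class, the story wording, and the convention that a lower priority number is scheduled sooner.

```python
from dataclasses import dataclass


@dataclass
class Story:
    """A backlog item in the usual 'As ..., I want ..., so that ...' form."""
    role: str
    want: str
    benefit: str
    priority: int  # lower number = scheduled sooner (assumed convention)

    def text(self):
        return f"As {self.role}, I want {self.want}, so that {self.benefit}."


# Invented example stories: one end-user feature and two operational features.
backlog = [
    Story("a shopper", "to save my basket", "I can finish my order later", 2),
    Story("an Ops engineer", "log files rotated daily", "disks do not fill up in Production", 1),
    Story("an Ops engineer", "a health-check URL", "the load balancer can detect a failed node", 3),
]

# Operational and end-user stories are ordered together, by priority alone.
ordered = sorted(backlog, key=lambda s: s.priority)
for story in ordered:
    print(story.priority, story.text())
```

Nothing in the ordering distinguishes an Ops story from an end-user story, which is the point: the Ops team is treated as one more set of users.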

UserFriendly – SysAdmin Day – from http://ars.userfriendly.org/cartoons/?id=20130724

Pre-Requisites for Success with the Draft Run Book

It is important to recognise two related but distinct benefits of run book collaboration:

  1. The software becomes more operable and so works better in Production.
  2. The Dev and Ops teams have collaborated on the design and execution of the software.

Of these two, the second is arguably more important than the first, because building a lasting trust between Dev and Ops will help to achieve better software over a longer period of time, with quality built in, whereas a one-off push for operability might reduce problems prior to go-live but not address fundamental aspects of operational readiness within Dev teams.

Rubbish bins / trash cans in London

Throw away the draft run book to emphasise the importance of collaboration over documentation

In fact, I recommend to organisations that they throw away the draft run book in order to emphasise that the purpose of the run book collaboration is to increase communication and trust between Dev and Ops, not to produce a giant document which might replace or prevent automation and monitoring.

There are a few other pre-requisites for success with run book collaboration:

  • Talk about operational features, not Non-Functional Requirements
  • Make the Dev team (or better, the Product Owner) partly responsible for the operational effectiveness of the software system
  • Encourage and persuade Ops teams to engage with Dev teams during the development phase. This means Ops folk doing CRAZY STUFF like pairing on logging implementations, and attending stand-ups, planning meetings, and retrospectives
  • Have technical management sufficiently savvy to see value in the act of collaboration even if the artefact of collaboration (the draft run book) is discarded at the end of the process

In practice, a decent alignment in goals between Dev and Ops is also going to be needed; if Dev is rewarded largely on user story delivery, and Ops rewarded largely on uptime of the existing systems, the space and time to collaborate is going to be difficult to find.

Summary

To reduce operational problems with our software systems, we should begin to address operational concerns from the very start of the project or programme of work and treat non-functional requirements as operational features, scheduling them alongside end-user features. By having the software development team write a draft run book (operation manual), and by updating the draft run book alongside feature development (during every iteration), many typical operational problems can be avoided. In addition, the Dev and Ops teams can develop a framework for collaboration around the run book, building trust as they work together on operational features, rather than discovering operational problems close to the ‘go live’ date.

Sign up to receive notification when my forthcoming book Software Operability is ready. The book will include an example run book template for typical web-based software systems.
