The DevOps movement has had many positive effects on IT systems engineering as a discipline. Direct work with software development teams has led many infrastructure engineers to adopt practices that have been standard operating procedure in software development for decades. Much of this has centered on adopting and evolving the technology that lets Agile development teams change things quickly and confidently. Infrastructure-as-code, whether as traditional configuration management or containerization, allows complicated platforms to be expressed as versioned artifacts, without the bureaucratic overhead of an ITIL-style CMDB and manual release management processes. At the same time, continuous integration systems let us routinely test for regressions in functionality, performance, and security.

Significantly less attention has been paid to the ways that Agile teams manage schedules. This is critical for technologists to understand, because according to a well-known study by McKinsey and the University of Oxford, large IT projects run on average 45% over budget and 7% over schedule. Practitioners already know these risks intuitively. One major reason for the growth of cloud computing is that most organizations share critical personnel between operations and infrastructure work, so operational issues can create unforeseen bottlenecks on key project staff. (The Phoenix Project covered this concept at length.) However, even with distractions removed, we must also keep pace by using estimation methods compatible with Agile engineering styles.

Particularly in Scrum methodology, the Definition of Done is integral to this process. Peter Stevens summarizes the concept quite succinctly:

At its most basic level, a definition of Done creates a shared understanding of what it means to be finished, so everybody in the project means the same thing when they say “it’s done”. More subtly, the definition of Done is an expression of the team’s quality standards. A more rigorous definition of Done will be associated with higher quality software. Generally the team will become more productive (“have a higher velocity”) as their definition of Done becomes more stringent, because they will spend less time fixing old problems. Rework all but disappears.

Old problems? Rework? These are far from foreign to anyone who has built any kind of technology infrastructure. Acceptance criteria, even loosely defined ones, ensure everyone is on the same page about project progress.

Stevens’ sample Definition of Done looked like this:

  1. Potentially releasable build available for download
  2. Summary of changes updated to include newly implemented features
  3. Inactive/unimplemented features hidden or greyed out (not executable)
  4. Unit tests written and green
  5. Source code committed on server
  6. Jenkins built version and all tests green
  7. Code review completed (or pair-programmed)
  8. How to Demo verified before presentation to Product Owner
  9. Ok from Product Owner

Infrastructure has some different requirements. A Definition of Done for an infrastructure task might have some of the following:

  1. Service Level Agreement determined
  2. Infrastructure repeatable through code
  3. Continuous integration tests for (2) written and passing
  4. Metrics and logs aggregated for rapid problem diagnosis
  5. Automated monitoring alerts for availability and performance problems
  6. Documentation and architecture diagrams completed
  7. Run books written for investigating outages
  8. Automated backups of service data
    1. Automated verification of backups
  9. Guidelines established for capacity planning and scaling
    1. Launch-day capacity plan completed
  10. Full and partial service failure behaviors tested
  11. Operations staff provided basic training on the service
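
One way to keep a checklist like this visible is to store it next to the infrastructure code and check tasks against it. The following is only a minimal sketch in Python: the names `INFRA_DEFINITION_OF_DONE`, `outstanding`, and `is_done` are hypothetical, and the short labels are paraphrases of the list above rather than any established schema.

```python
# Hypothetical encoding of the infrastructure checklist above.
# The labels are paraphrases chosen for illustration, not a standard.
INFRA_DEFINITION_OF_DONE = [
    "SLA determined",
    "Infrastructure repeatable through code",
    "CI tests for infrastructure code written and passing",
    "Metrics and logs aggregated",
    "Automated monitoring alerts in place",
    "Documentation and architecture diagrams completed",
    "Run books written",
    "Automated backups, with automated verification",
    "Capacity and scaling guidelines, with launch-day plan",
    "Full and partial failure behaviors tested",
    "Operations staff trained",
]


def outstanding(definition, completed):
    """Return the criteria not yet satisfied, in checklist order."""
    return [item for item in definition if item not in completed]


def is_done(definition, completed):
    """A task counts as Done only when no criterion is outstanding."""
    return not outstanding(definition, completed)


if __name__ == "__main__":
    # Example: a task where everything except staff training is finished.
    completed = set(INFRA_DEFINITION_OF_DONE[:-1])
    for item in outstanding(INFRA_DEFINITION_OF_DONE, completed):
        print("Still open:", item)
```

Keeping the list ordered means the remaining work prints in the same order the team agreed on, which makes it easy to read out during a sprint review.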

(Alternatively, depending on just how closely your development and operations teams work together, you might work together directly on the same sprint goals and share a single Definition of Done that takes these operations-oriented facets into account.)

The concept of a checklist is far from new; Tom Limoncelli even wrote an entire book about how to improve individual productivity by making effective use of them. But the Definition of Done’s emphasis on team communication and understanding makes it a crucial concept for high-performing DevOps organizations. A good Definition of Done should include input from infrastructure, product owners, development, security, and risk management teams, as well as higher-level layers of the business. In a post very much worth reading, Mitch Lacey outlines a clear process for arriving at a mutually understood Definition of Done.

When the team discusses these items together, everyone understands that each of these facets has an impact on the schedule, and discussions happen around what those impacts are. All stakeholders have agreed on the value of each of these aspects of the deliverable, and have discussed how much work is actually appropriate to arrive at Done.

Christian Vos actually proposes writing two Definitions of Done: one for minimum acceptance, and one for continued maturity of the project. (In other words: it’s okay to ship without batteries included, as long as everyone involved is aware.) Particularly in Lean shops where the uptake of new features is not known until those features are deployed and observed, this can be valuable to avoid building unnecessary resiliency, instrumentation, or scale into the system before it’s needed.
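
As a rough illustration of the two-tier idea, each criterion can be tagged with the tier it belongs to, so a task can be reported as not done, shippable, or mature. This is a sketch under stated assumptions: the `Tier` enum, the `TIERED_CRITERIA` mapping, and the `status` function are hypothetical names, and the assignment of criteria to tiers is invented for the example. Each team would negotiate its own split.

```python
from enum import Enum


class Tier(Enum):
    MINIMUM = "minimum acceptance"
    MATURITY = "continued maturity"


# Hypothetical split of a few infrastructure criteria into two tiers.
# Which criterion belongs to which tier is an assumption for illustration.
TIERED_CRITERIA = {
    "Infrastructure repeatable through code": Tier.MINIMUM,
    "Automated backups of service data": Tier.MINIMUM,
    "Run books written for investigating outages": Tier.MATURITY,
    "Guidelines established for capacity planning and scaling": Tier.MATURITY,
}


def missing(completed, tier):
    """List the criteria in the given tier that are not yet completed."""
    return [
        name
        for name, criterion_tier in TIERED_CRITERIA.items()
        if criterion_tier is tier and name not in completed
    ]


def status(completed):
    """Classify a task against both Definitions of Done."""
    if missing(completed, Tier.MINIMUM):
        return "not done"
    if missing(completed, Tier.MATURITY):
        return "shippable"
    return "mature"
```

A task that reports "shippable" has met the minimum Definition of Done, and the output of `missing(completed, Tier.MATURITY)` is the explicit list of batteries not yet included.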

The Definition of Done is a powerful tool that helps team members arrive at a shared understanding. Working together to arrive at the Definition of Done from an infrastructure perspective allows organizations to understand system operability, coordinate and resolve conflicting priorities, and schedule features completely and correctly so they do not need to be revisited in future sprints.