Introduction
My Perspective
Capabilities
Metrics
Outcome-Oriented Organisation
Beware of the Kool-Aid
Accelerate’s Methodology
Use Metrics With Caution
Summary
Further Reading

Introduction

This article looks at the book Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations by Forsgren, Humble and Kim, 2018. The work started in this book/research project is now continued by Google Cloud’s DevOps Research and Assessment (DORA) team.

The book addresses three aspects of the software development process which influence software quality and performance:

culture
capabilities
metrics

Culture is probably the least interesting aspect of this book, and not discussed much here. In short, where IT staff can freely access and exchange information on project status, and can discuss and experiment with new ideas, productivity is high. The authors take about 20% of the book to discuss this - rather overdone in my opinion. Yes, the right culture contributes to productivity - but I think most of us already know that.

Capabilities discusses how software should be developed and deployed, ie what needs to be done or needs to be available in order to be productive.

Metrics are proposed as a way for teams and organisation management to know whether productivity is increasing or decreasing (particularly after introducing process changes), and which topics hold the most potential for improvement (ie should be the primary focus). They can also be used, with care, to compare an organisation against other companies in general.

This book looks at two distinct phases of software development:

requirements and development (Lean Product Management)
testing and deployment (Software Delivery aka DevOps)

In particular it looks at what environment and processes should be in place, and what culture should be encouraged, in order to create an organisation that is efficient, and which has a drive to improve itself further.

The authors assert that Lean Product Management and good Software Delivery practices work well together. Lean product management with poor software delivery is not effective; lots of small products are requested and customer feedback is organised, but the organisation cannot release the software in a timely manner. Good software delivery with traditional product management is similarly ineffective; the organisation is ready to deliver software in iterations but customer feedback loops are not available to take advantage of that.

What makes this book somewhat different from others is that it is not a new methodology or a personal opinion, but instead describes the results of statistical analysis over a large number of IT companies. It is also interesting in that the authors assert that their evidence shows that software delivery performance is a significant influence on the performance of the organisation overall. Given that IT is now so significant (I recently saw a quote describing a bank as an IT department with a marketing division), this is no big surprise, but it is still nice to have statistics show it.

One argument the authors make is that “cross-functional outcome-oriented teams” are the best way to organise software development. This topic is better covered in the book “Agile IT Organisation Design” by Sriram Narayan.

My personal summary up front: I like the capabilities discussed in the book, generally agree with the lean product management approach (though the discussion is rather vague), find the culture aspects unsurprising, and have significant concerns about the proposed metrics. The book is definitely worth reading - but not necessarily swallowing whole.

And if you don’t have time to read the whole book, a convenient summary is actually hidden in appendix B.

My Perspective

I’m certainly not as well qualified as either the authors of this book, or the people who provided forewords for this book.

I have however been “in the trenches” of software architecture and development for several decades, have experienced many different management styles, and survived many different waves of hype for various methodologies. I therefore feel entitled to an opinion, and this article presents mine. I hope this review gives you at least some inspiration for developing your own opinion.

As someone who works within the structures and rules that an organisation has designed, this topic is relevant to me. It’s certainly a major thing to consider when looking to change jobs - is the new position embedded in an organisational structure that I agree with? And as someone whose role has become somewhat more “organisational” over the years, it is sometimes now my responsibility to establish (or at least encourage) good software development processes. This book certainly gives food-for-thought for all these perspectives.

One thing that Martin Fowler notes very well in his foreword is that each reader should be aware of “confirmation bias” - the temptation to cherry-pick the results that confirm the reader’s existing opinion while rejecting results that challenge it. The nice thing about this book is that results are backed by statistics; I have therefore tried to take seriously results that don’t initially fit with my existing opinions. On the other hand, statistics are exceedingly tricky things. A small change in how something is measured, or a poorly worded survey question, can lead to incorrect conclusions; comparing results against “common sense” (another word for our pre-existing opinions) can at least give us a warning when this might be happening. It is important to note that all survey results are self-reported - including the productivity of each participating organisation.

Capabilities

The Proposed Capability List

The book defines 24 capabilities that the authors have identified as being statistically important for effective software delivery. This is an excellent list, and I’ve replicated it below. Most of these terms should be self-explanatory; if not then I recommend either searching for them on the internet or reading Accelerate yourself.

A few of these items are either not self-explanatory in my opinion, or I have some qualifications to make; these are marked in italics with comment “(see later)” and are addressed below. Any item not marked in this way is something that I (and many others) regard as “accepted best practice” for which no further comment is needed.

Continuous Delivery Capabilities
- Use version control for all production artifacts
- Automate your deployment process
- Implement continuous integration
- Use trunk-based development methods (see later)
- Implement test automation
- Support test data management
- Shift left on security (see later)
- Implement continuous delivery (CD)
Architecture Capabilities
- Use a loosely coupled architecture
- Architect for empowered teams (see later)
Product and Process Capabilities
- Gather and implement customer feedback
- Make the flow of work visible through the value stream (see later)
- Work in small batches
- Foster and enable team experimentation
Lean Management and Monitoring Capabilities
- Have a lightweight change approval process (see later)
- Monitor across application and infrastructure to inform business decisions
- Check system health proactively
- Improve processes and manage work with work-in-progress (WIP) limits
- Visualize work to monitor quality and communicate throughout the team
Cultural Capabilities
- Support a generative culture (as outlined by Westrum) (see later)
- Encourage and support learning
- Support and facilitate collaboration among teams
- Provide resources and tools that make work meaningful
- Support or embody transformational leadership

Cultural Capability: Support a Generative Culture

As far as I can tell, this primarily means:

encourage people to report problems (don’t blame the messenger)
encourage people to treat mistakes as a chance to learn
encourage the attitude “we’re in this together” rather than “that’s not my problem”
and generally encourage the free flow of information

That’s all pretty obvious - though not necessarily trivial to achieve.

The authors do provide some effective questions that your organisation can ask its staff to determine how they perceive the organisation. They further claim that their statistics show that implementing Continuous Delivery and Lean Management (effectively the capabilities listed above) is effective in changing organisation culture.

Lean Management and Monitoring Capability: Have a Lightweight Change Approval Process

The authors seem to imply that code-review from a single colleague with whom the primary code author regularly works together is sufficient. That certainly is the fastest way to get code into production, but I’m not sure it is a great idea. Good code reviews:

bring in fresh ideas from outside the code author’s “thought bubble”
share information and ideas from the code author with others in the organisation
ensure documentation is adequate
and encourage consistency across teams

These goals are not well served by having reviews done by “insiders”. A review by someone already familiar with the relevant code-base is more likely to spot errors in the details (eg possible race conditions), and will certainly review faster. However errors in the design concept, possible problems with scalability, better ways of doing things, and insufficient documentation are generally more likely to be spotted by someone with a little “distance” to the code being reviewed.

However code review does need to be timely; having changes wait days (or even hours) for review is definitely a problem. Giving developers a feeling of independence is also important - ie they should not be given the feeling that each change requires “approval from a supervisor”.

It’s tricky to get the balance right. The more chance external experts are given to provide feedback during the feature design phase, the less need there is for such input during the change-approval (code review) step.

This is one of the points, however, where the statistics need to be taken seriously. The authors claim to have proven that having no review process is more effective than having a somewhat bureaucratic one. The thought of no review at all certainly makes me nervous. For my own code, I like having it reviewed. I also like reviewing code from others; it provides me a chance to learn, a chance to share, and a chance to remind people to provide good documentation so if in future I need to work on this code, it is comprehensible. Giving that up and expecting all to be well is a big step - and probably one I would be very cautious about putting my support behind, regardless of the statistics. Reduced shared learning, poorer documentation, etc. will also probably not show up on any of the proposed metrics within a time period that allows tracing back to the root cause of reduced review.

Product and Process Capability: Make the Flow of Work Visible through the Value Stream

Welcome to buzzword city..

As far as I can tell, this means that every team involved in the overall delivery of some feature with business value should be aware of which other teams are working on it, and the relation between the pieces of work. Or in other words, cross-team information-flow is important.

Architecture Capability: Architect for Empowered Teams

This simply means that those responsible for implementing a feature should be able to make most of the necessary decisions themselves. While this is again a balancing act to correctly define “most”, the general concept seems pretty well accepted by now: teams with the right to make decisions are more motivated to produce quality work, and have specialised knowledge that a central decision-making team does not. I certainly agree.

One significant aspect of empowerment is the ability to choose the tools with which the work is done. There are advantages to centralized/standardized tooling, but the authors argue that autonomy is more important in most cases.

Technical aspects of the software design can make it possible to grant more rights to teams; a “loosely coupled architecture” (for example a “microservice architecture”) provides more freedom to teams than a monolithic system.

Further aspects that the authors identified as being important for productivity (which all seem right to me) are:

ability to change code without permission from outside the team (autonomy)
ability to change code without needing to coordinate with other teams (API stability)
ability to deploy to production on demand
ability to deploy during business hours
ability to test without requiring a complex integrated test environment

The ability to change code is not necessarily unlimited; there is benefit to organisation-wide standards. However any such standards should be kept to a reasonable minimum.

Continuous Delivery Capability: Shift Left on Security

This just means encouraging thinking about security of software early in the development lifecycle. Building something, then thinking about security as an “add-on” is not the most efficient or safe way to implement software. Building software then handing it off to a different group who are responsible for security is far worse.

However, in my experience, doing security “early” isn’t trivial. Building secure software takes a lot of knowledge that not all teams will have, but holding up development until experts can be involved in early design decisions is also not optimal. Good training and communications seem to be the most effective measures but it’s still tricky.

Continuous Delivery Capability: Use Trunk Based Development Methods

The behaviour being encouraged here is to develop in short cycles, eg 1 day. Changes should be merged into a “releasable branch” after each cycle, and ideally released into production.

This contrasts with the approach of creating a “feature branch” on which work is implemented for weeks, possibly by a team of developers, isolated from other changes happening to the system.

The authors use wording in their presentation of this concept with which I am not 100% in agreement; I suspect none of them are people working on code on a regular basis and that their grasp of modern development and version control is not quite as good as it could be. That’s no surprise considering the authors are primarily management experts, not developers. Unfortunately it means that in this case some interpretation of their wording is needed.

This unclear terminology with respect to version control also (IMO) affects the definition of “Lead Time for Change” and unfortunately leads to some confusion there.

The authors use the specific terms “trunk/master” when as far as I can tell, they mean “the branch into which features are merged and on which releases are based” ie the place where integration of features occurs; the specific name is not relevant.

Supposedly, well-performing teams had “fewer than three active branches”. No further information is provided, but that seems very dependent on team size. Assuming that in general survey respondents are following the “no team too large to do a standup” rule, then teams are likely to be in the range of 3-10 people. Is a team of 10 really going to have just 3 active branches? It might be possible for teams using Pair Programming (which halves the number of work-in-progress branches), and cross-functional teams might be even lower (testers and SREs might often not have any “active branches”). However this statistic still seems a little dubious.

They further claim that “teams that did well” typically had branches which lasted less than 1 day (how they measured “did well” is not described inline). As an experienced developer myself, I find an average branch duration of 1 day somewhat hard to believe. Some developers writing or modifying simple UI screen layouts as their primary work might have branch lifetimes that are relatively short, but that is not typical. Among other things, where is the time for code review and testing? And what about work done on Friday which is merged on Monday? Fridays are 20% of the work week after all, so that is not an uncommon circumstance. I would agree that work should be done in the smallest increments that provide customer value - definitely. And sometimes even intermediate steps that don’t provide customer value should still be merged (disabled) and deployed to production in order to minimise integration conflicts due to long-running work. An average branch time of 3 days seems a not unreasonable goal, but I find it hard to believe that the kind of work I regularly do can be merged into production every day.

The authors also spend a few words on the “GitHub Flow” and appear to consider it somewhat different than “trunk based development”. I see no such conflict; the branch names are different but there is no reason why a “GitHub Flow” is incompatible with short-lived feature branches. Other flows such as the similarly-named “Git Flow” in which features are developed on a branch while ignoring bugfixes (and other features) which are merged into the integration branch can cause problems. The trunk based development site has some useful information on this.

There is also the possibility to resolve integration conflicts within the feature branch, rather than deal with integration conflicts after change approval. Git provides two ways to resolve conflicts between a feature branch and the “integration branch”: merge the integration branch into the feature branch, or rebase the feature branch onto the integration branch. In either case, any integration conflicts become the issue of the feature developer, and not a concern of the software delivery pipeline. This still doesn’t mean long-lived feature branches are a good idea, but does appear to be something that the authors have not considered.

The old practices of “code freezes” and “stabilisation periods” are of course a bad idea, but these are pretty rarely encountered these days. At least I hope so!

Metrics

This book is probably most famous for its definition of four metrics that are supposedly very strong indicators of the quality of the software development process:

Change Fail Rate (CFR)
Deployment Frequency (DF)
Mean Time To Recovery (MTTR)
Lead Time for Change (LTC)

These are known as the DORA (DevOps Research and Assessment) metrics.

These are actually externally-visible indicators of internal processes - meaning that good practices result in “good” values for these metrics and vice versa. Unfortunately, the book does not really look into the underlying processes - although these processes are linked to the “capabilities” discussed elsewhere. Below is my personal analysis of the worth of these metrics.

The authors don’t simply present logical arguments for these metrics; they sent questionnaires to many large companies regarding their development practices, and how they rate themselves in terms of productivity. The answers were then analysed to determine which (self-reported) practices correlate with which (self-reported) productivity ratings. Unfortunately, at least in this book, the exact steps used in the data analysis are not well described.

Interestingly, one of the things the book does is criticise previous efforts at metrics regarding software development, and in particular:

counting lines of code
measuring team utilization (ratio of work-time to idle-time)
velocity as defined in the Scrum process

The reasons for disregarding lines-of-code and utilization as metrics are fair but not new. The criticism of Scrum’s velocity measurement is one I agree with; I’ve never been a fan of velocity as a useful measure. It is interesting however that each new methodology criticises the previous one; it does leave the thought: in 5 years will we be reading a book that describes why the metrics in this book are not effective? That’s not meant as a defeatist “don’t bother” comment, just an encouragement to remain cautious and skeptical (see Beware of the Kool-Aid).

These metrics do have interesting (and deliberate) interactions - eg improving deployment frequency without adequate automated testing will lead to a poorer change fail rate. However I would be cautious in assuming that this counter-balance is really effective in driving software delivery effectiveness directly; that requires the software delivery team to recognise the link between changes made in one aspect of their development process, and consequences that become visible in a different metric possibly months later. In particular, poor architectural decisions, poor documentation, and poor automated tests can have their full impact long after the relevant event occurred. Common sense and experience can often see potential problems long before they are revealed by metrics.

As noted in the introduction, the book addresses two topics:

requirements and development (Lean Product Management)
testing and deployment (Software Delivery aka DevOps)

The Change Fail Rate (CFR) and Deployment Frequency (DF) metrics are primarily development-centric metrics, ie are affected by processes related to writing code. As discussed below, the Mean Time To Recovery (MTTR) metric can be interpreted in multiple ways; in its most obvious definition it relates primarily to deployment processes and not to development. Lead Time for Change (LTC) is also sadly rather ambiguous, but its most obvious definition also reflects only deployment-time processes and therefore is also not related to development processes. CFR/DF and MTTR/LTC therefore appear to be quite independent, ie the “counter-balance effect” from the previous paragraph does not apply between these groups. It seems possible to have very poor development processes (eg waterfall and poor testing) while still having very good automated deployment processes - and therefore poor CFR/DF but excellent MTTR/LTC.

Change Fail Rate (CFR)

The Change Fail Rate is defined as the number of releases which contained a critical issue divided by the total number of releases over some time period. It is an indicator of:

the competence of the software team (in design, implementation, and testing)
the quality of the automated tests
the size of the changes being merged

I generally agree with this metric: it is clearly defined and easy to interpret. The actions needed to reduce the Change Fail Rate are also obvious, as are the costs.
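As a minimal sketch of the definition above: the snippet below divides the releases that contained a critical issue by the total number of releases in a period. The `Release` type, its `had_critical_issue` flag and the sample data are hypothetical names of my own, not anything taken from the book.

```python
from dataclasses import dataclass


@dataclass
class Release:
    version: str
    had_critical_issue: bool  # hypothetical flag, set by whoever triages incidents


def change_fail_rate(releases: list[Release]) -> float:
    """Releases with a critical issue divided by all releases in the period."""
    if not releases:
        raise ValueError("no releases in the period")
    failed = sum(1 for r in releases if r.had_critical_issue)
    return failed / len(releases)


# Illustrative data only: four releases, one of which contained a critical issue.
releases = [
    Release("1.0.0", False),
    Release("1.0.1", True),
    Release("1.1.0", False),
    Release("1.2.0", False),
]
cfr = change_fail_rate(releases)
print(f"CFR: {cfr:.0%}")  # CFR: 25%
```

What counts as a “critical issue” still has to be decided by each organisation; the arithmetic is the easy part.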

An interesting online comment stated that a CFR of zero might not actually be optimal; it can be an indication of too much caution and that more rapid development could be possible. Obviously for safety-critical systems, CFR of zero is indeed the goal!

Deployment Frequency (DF)

Deployment frequency is defined as:

“how often their organisation deploys code for the primary service or application they work on”

The book states that the underlying process they are trying to measure is “batch size”, ie how complex is each “feature” that gets released into production?

My experience and instincts agree with their understanding of “batch size” as important: frequently releasing simple features results in more stable software and faster progress than infrequently releasing complex features. And I would agree that measuring this directly is difficult.

However I do not agree that “deployment frequency” is a “good proxy” for the underlying process - at least not for everyone. For an organisation that produces only one monolithic service or application, the metric is clear. However in many cases there are significant complications - in particular for organisations that have no “primary service or application”.

One danger to be aware of is the “average of averages” problem. Assume that an organisation has a single product that they release 10 times a day. This product is then split into 10 individually deployable components. Given that the rate of work is the same, each component has a deployment frequency of 1 time per day. Averaging the deployment frequencies would then give a release rate of 1 time per day (new architecture) instead of 10 times per day (old architecture) even though the amount of work being done and the “batch size” is identical.
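The arithmetic of this example can be sketched as follows. The numbers are the ones from the paragraph above; the function and variable names are my own.

```python
# Deployments per day, before and after splitting one product into 10 components.
monolith = {"product": 10}
split = {f"component-{i}": 1 for i in range(1, 11)}


def mean_frequency(deploys_per_day: dict[str, int]) -> float:
    """Average of the per-component deployment frequencies."""
    return sum(deploys_per_day.values()) / len(deploys_per_day)


def total_frequency(deploys_per_day: dict[str, int]) -> int:
    """Total deployments per day across all components."""
    return sum(deploys_per_day.values())


print(mean_frequency(monolith), total_frequency(monolith))  # 10.0 10
print(mean_frequency(split), total_frequency(split))        # 1.0 10
```

The total is unchanged by the split, while the averaged figure drops by a factor of ten.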

Another concern that does not appear to be addressed in the book is the scale of the company. Given an organisation with 10 developers which is releasing 2 times per day, what happens when the organisation doubles its staff? It is now releasing 4 times a day - doubling its metric although the process has not become any better. Clearly, such raw values cannot be compared over time, or between organisations - the value must somehow be normalised. However I haven’t found anything in the book that describes how such normalisation occurs.
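To illustrate the staffing example, here is one possible normalisation: deployments per developer per day. This is purely my own assumption for the sake of the example - as noted, the book does not describe any such step - and the function name is hypothetical.

```python
def deploys_per_developer_per_day(deploys_per_day: float, developers: int) -> float:
    """Hypothetical normalisation: deployments per day divided by headcount."""
    if developers <= 0:
        raise ValueError("developers must be positive")
    return deploys_per_day / developers


# Numbers from the example above: staff doubles, raw deployment frequency doubles.
before = deploys_per_developer_per_day(2, 10)
after = deploys_per_developer_per_day(4, 20)
print(before, after)  # 0.2 0.2
```

The raw metric doubles while the normalised value is unchanged, which matches the intuition that the process itself has not improved. Whether headcount is the right denominator is of course debatable.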

A further point: when deploying to the Apple AppStore, deployment frequency is not under the control of the organisation. And even for other app-stores, multiple versions per day of a native app is probably not advisable. It also is not really an appropriate measure for a shrink-wrap product that customers purchase online or in-store - particularly for commercial customers; there a few feature-releases per year is probably the maximum that is reasonable. I have no idea how the recipients of their survey dealt with this, or how “statistical clustering” works when the dataset includes responses for such companies which otherwise have very “high performance” software development processes. As I mentioned earlier, statistics can be tricky..

And yet another point: configurable software may gain “features” without a deployment - something that may be superior to hard-coded behaviour, despite having a lower deployment frequency.

Given all this uncertainty, I’m not quite sure how the authors got consistent and reliable answers to their survey.

All this uncertainty doesn’t mean deployment frequency is useless. It certainly works as long as:

you are measuring only a specific component over time, not comparing components
you are not averaging over multiple components
you are not comparing between companies
you discard your historical data if the component is split or the development team size changes

Mean Time to Recovery (MTTR)

The authors make a good argument that the traditional “mean time between failures” metric is not really a fair measure of modern systems. They propose “mean time to recovery” as a better measure.

Mean Time to Recovery can be split into multiple parts (my interpretation):

operations: how long does it take to get a broken infrastructure running?
deployment: how long does it take to get a broken feature reverted?
development: how long does it take to get a broken feature correctly implemented?
design: how long does it take to get an incorrectly designed feature correctly implemented?

Measuring, monitoring, and striving to reduce these things seems like a good idea.

Unfortunately, the book does not clearly distinguish these quite different cases.

Given the potential for multiple interpretations of this metric, it is clearly not possible to compare companies. If you choose to define the metric as including more of the above steps, then compare against another organisation that defines the metric as including fewer steps, you’ll just look bad.

Within a single company, combining the first two seems reasonable. It is a direct measure of the efficiency of the deployment chain.

Including the third point, together with putting pressure on teams who have an “above-average MTTR”, is dangerous; it is a direct incentive for teams to take shortcuts in fixing incorrectly implemented features in order to satisfy management. The argument can be made that a team which does take such shortcuts will see a rise in the CFR. However it is not clear that such a rise would occur immediately, or that anyone would be able to link the rise in CFR to pressure to “quick fix” code due to the way MTTR is defined.

Including the fourth point simply strengthens the problems listed in the previous paragraph. It would take a very perceptive team to notice that their MTTR values can be improved by doing more design.

I would therefore recommend including only the first two points in MTTR measurements, ie reduce it to a pure ops-centric metric that measures how long it takes to fix broken infrastructure and to roll-back broken releases.
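A rough sketch of how the choice of included parts changes the number is shown below. The four labels follow my own split of MTTR listed earlier; the `Incident` type, the timestamps and the durations are invented for illustration and do not come from the book.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    kind: str  # hypothetical labels: "operations", "deployment", "development", "design"
    detected: datetime
    recovered: datetime


def mean_time_to_recovery(incidents, kinds):
    """Mean recovery time over the incidents whose kind is in `kinds`."""
    durations = [i.recovered - i.detected for i in incidents if i.kind in kinds]
    if not durations:
        raise ValueError("no incidents of the requested kinds")
    return sum(durations, timedelta()) / len(durations)


t0 = datetime(2024, 1, 8, 9, 0)
incidents = [
    Incident("operations", t0, t0 + timedelta(minutes=30)),  # broken infrastructure
    Incident("deployment", t0, t0 + timedelta(minutes=10)),  # release rolled back
    Incident("development", t0, t0 + timedelta(days=2)),     # feature re-implemented
]

ops_only = mean_time_to_recovery(incidents, {"operations", "deployment"})
everything = mean_time_to_recovery(
    incidents, {"operations", "deployment", "development", "design"}
)
print(ops_only)    # 0:20:00
print(everything)  # 16:13:20
```

With identical incidents, the broader definition reports a figure almost fifty times worse, which is why comparing MTTR between organisations that draw the line differently is not meaningful.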

Lead Time for Change (LTC)

The book defines LTC as “time it takes to go from code committed to code successfully running in production” - unfortunately a rather ambiguous description.

Obviously, responding quickly to requests for new features is desirable. The book states that it takes the LTC concept from the manufacturing industry and adapts it for IT - quite radically. In manufacturing terms, LTC includes the whole request lifecycle from customer request to product delivery (though whether requirement analysis is included is not clear). When applying to IT, the authors recommend dividing it into two parts:

analysing the requirement and designing the corresponding feature (hard to measure and very variable)
rolling that feature out to users

However the book is very vague with respect to exactly which tasks belong in which of these categories. It isn’t even clear whether LTC is to be applied at the level of an Agile user story or a subtask (though it certainly is not talking about an entire product).

The authors do state that LTC is difficult to measure due to its “fuzzy front end” and that only the “delivery” part of the lead time should be measured. They then describe “delivery” as being “implemented, tested, delivered”. But what exactly is “implemented”? This book has a focus on DevOps, ie the border of development and operations. For a developer, “implementation” has a specific meaning. For a tester, the term is undefined, and for an operator it is meaningless as far as I know.

The authors then provide a table indicating that “product delivery” includes build/test/deployment - but not coding.

And finally, their survey on this topic provides the option of “less than one hour” as the average value for LTC. Apparently this is the answer reported most commonly in the survey response cluster they label “high performance” (though at a later time they changed “high performance” to be one day or less).

This is all rather confusing and contradictory. If their definition of LTC is interpreted literally as “code committed” meaning the timestamp associated with a commit to a version control repository, then the question is: which commit? In a feature branch consisting of multiple commits to implement a single change (including fixes applied as a result of code review feedback), does LTC start:

at the first commit in the series?
at the last commit in the series?
at the first commit pushed to a company repo (in the case of a decentralised system such as Git)? (note that Git doesn’t record this anywhere)
at the “merge commit” that results from merging a feature branch into an integration branch?

If a change consists of several commits made over a period of a few days, it clearly would make no sense to use the first commit-timestamp, ie ignoring the “thinking time” involved before the first part of the change was committed to version-control, but including the “thinking time” for later parts.

The authors mention that “code review” can be done before or after the actual commit. Starting LTC at any of the commit timestamps might therefore include code-review time, or not include code-review time, depending on how the review was done.

The only consistent interpretation I can find is that:

LTC should be measured from the time at which code is “approved for release” - ie code implementation is complete, code review is complete, and initial testing (both automated and manual) is complete against the “feature branch” in which this change has been stored.
LTC then includes the following and only the following:
- merging of the feature branch into the integration branch (eg ‘git merge …’)
- compilation of the integration branch to produce release artifacts
- execution of tests against the release artifacts (to ensure the merge of the feature into the integration branch was successful)
- deployment of the release artifacts to production

Or in other words, LTC should start at the “merge commit” timestamp, or the change-request-approval timestamp.

This would be consistent with the term “code committed” having a business meaning rather than a technical meaning - the point in time where the decision has been made to move the change to production.
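The interpretation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the timestamps are invented, the function name `lead_time_for_change` is hypothetical, and it assumes you can obtain the merge-commit time (for example via `git log --merges --format=%cI` on the integration branch) and the deployment time from whatever deployment tooling you use.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time_for_change(merged_at: str, deployed_at: str) -> timedelta:
    """Time from merge into the integration branch until running in production."""
    merged = datetime.fromisoformat(merged_at)
    deployed = datetime.fromisoformat(deployed_at)
    if deployed < merged:
        raise ValueError("deployment cannot precede the merge")
    return deployed - merged


# Hypothetical (merge timestamp, deployment timestamp) pairs in ISO 8601 form.
changes = [
    ("2024-03-04T10:00:00+00:00", "2024-03-04T10:40:00+00:00"),
    ("2024-03-05T09:15:00+00:00", "2024-03-05T11:15:00+00:00"),
    ("2024-03-06T16:30:00+00:00", "2024-03-06T17:20:00+00:00"),
]

lead_times = [lead_time_for_change(merged, deployed) for merged, deployed in changes]

# The median is less sensitive than the mean to one unusually slow deployment.
median_ltc = median(lead_times)
print(median_ltc)
```

Measured this way, development and code-review time never enter the calculation; only the merge, build, test and deploy steps do.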

In Martin Fowler’s foreword to this book, he clearly states that LTC is about getting code from “committed to mainline” to “running in production”, and that an effective organisation can achieve this within 1 hour. He also reminds the reader that “their book focuses on IT delivery, that is, the journey from commit to production, not the entire software development process”. This seems consistent with the above interpretation.

Note also that any other interpretation of LTC (ie choosing an earlier point as the start of the LTC measurement) will make your organisation look worse in this statistic; unless you are a masochist it seems only sensible to choose the most positive interpretation - the “integration merge” point. Alternatively, if using Git then do either of the following before submitting code for review, in order to improve your LTC:

git rebase --onto master ...
git rebase -i HEAD~N (where N is the number of commits in the change) then make a trivial change to the comment associated with the oldest commit

Given that “gaming the system” can be done in this way (and that similar rebasing is actually valid to provide a “clean history”), measuring LTC from the merge-commit seems even more sensible.

It is a shame that such a critical point needs such delicate interpretation.

Outcome-Oriented Organisation

One topic that the authors repeatedly mention, but which is not a primary focus of the book, is the concept of “outcomes”.

As an IT organisation, an “outcome” is something that is visible to people outside of IT and valued by them. Ideally it is something that increases income, decreases cost, increases market share, improves brand value, improves customer satisfaction or similar.

An “initiative” is something that is necessary to achieve an outcome, but which does not interest anyone outside of IT. Examples include:

automating the deployment process (increases the rate at which features are deployed, thus improving income/cost/satisfaction and many other outcomes)
improving the security infrastructure (can avoid a major decrease in brand value and customer satisfaction)

This is a mildly controversial topic; many IT departments are still organised along skill lines (sysadmins together, DBAs together, back-end devs together, front-end devs together, testers, etc). There are two plausible reasons for restructuring along “outcome” lines (aka multi-functional teams):

motivational
- staff are more motivated when they can see a business outcome from their work
- staff are inhibited from implementing projects that do not produce a business outcome
efficiency
- development crosses fewer team boundaries during journey from design to deployment (ie less bureaucracy)

In the end, a business exists to make a profit, and an organisation exists to achieve a goal. We all know that IT staff love to play with new toys, and many are happy polishing code to perfection; a focus on outcomes relevant to the organisation reduces these tendencies.

Whether IT staff are really motivated by business outcomes is less clear.

See the book “Agile IT Organisation Design” for more details.

Beware of the Kool-Aid

Management trends come and go. Things on the cover of Best Management Weekly get promoted as the next silver bullet then fade again.

The current attention to Accelerate and in particular its four metrics has something of that feel for me. I don’t mean that it is useless - but rather it is no silver bullet. And more specifically, correctly gathering and correctly interpreting statistics are both non-trivial. I am almost certain that different organisations are measuring these values in different ways, and therefore direct comparisons are often misleading. There may also be categories of organisation for which the metrics are misleading, and where too much focus on them will damage the organisation rather than help it.

My advice is to check that the measurements match your common sense feeling regarding the state of your organisation before taking action based on them - and before passing the values on up to a management layer that might not use the necessary caution when interpreting them.

Or in other words, think before drinking the Kool-Aid.

Accelerate’s Methodology

The authors talk about using a survey with “Likert scale” questions (“strongly agree” to “strongly disagree”), but in some places they show graphs which clearly use other statistics (eg raw numbers of deployments-per-day). The method through which they obtained this additional data is not documented anywhere in the book; the section on methodology discusses Likert surveys only.

This book is based upon statistical analysis of 4 years of surveys - though the survey was not identical each year. That’s a substantial base of data, but at the same time just a moment-in-time snapshot of the state of software development. One thing that gives me some concern is that while their statistics seem to be stable for their self-named “high performer” cluster, the results vary wildly for answers in other groupings. Something appears to be going on that cannot be explained; see for example pages 21 and 22 in which the statistics for the cluster that they label as “low performers” show no logical trend or consistency at all.

The authors claim (from page 24) that effective software delivery is strongly linked to good market performance of the organisation overall. It is worth noting that this link is made through the survey results: respondents who rated their organisation as good at software delivery also rated their organisation as successful in market share and profitability. This presumably supports their labelling of specific clusters as high/medium/low performance (ie presumably the clusters of respondents whose answers show high deployment frequency, low change fail rate, etc. also typically report good market performance) - although the book does not specifically state that.

The authors often claim “predictive power” for statistical results, ie state not only that two measurable results are correlated, but that one is a function of the other. Two relations between values are well understood:

correlated - when one changes, so does the other
caused - a change in one causes a change in the other

And in-between there is this concept that a change in one “predicts” a change in the other. Unfortunately, the term “predicts” is not well explained in the book. In fact, I have been unable to find any good explanation anywhere on the internet. However one thing is clear: X predicts Y does not mean that X causes Y.

Cluster analysis is described as grouping the survey results into “high/medium/low” performer categories. In fact, cluster analysis does not do that; it just results in N groups of entities where the group members give similar answers. It is the authors who chose to apply labels “high/medium/low” to these groupings; they could just as well have been called group A, B and C.

For one year (2014) the authors did an analysis of stock market prices for 355 companies whose employees were included in the survey, and concluded that there was a correlation between the clusters and market capitalization growth. However (if my rusty statistics haven’t failed me), this correlation is not necessarily causation. One possible explanation is that having the metrics in the cluster labeled “high performer” leads to stock market success. Another would be that having stock market success leads to having the metrics labeled “high performer” (possibly because the extra cash allows companies to spend on IT consultants). Yet another explanation would be that a common factor lies behind both effects (eg management keen on implementing the latest trends).

Nevertheless, the most likely explanation is that these metrics do indeed indicate underlying IT processes which (on average) lead to profitability in the markets and year for which this analysis was done - and quite likely in general.
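The “common factor” possibility can be illustrated with a small simulation. This is a sketch using made-up numbers, not data from the book: the variable names (`hidden`, `delivery`, `market`) and the noise levels are assumptions chosen purely for illustration. Two scores are each generated from the same hidden factor; neither influences the other, yet they come out strongly correlated.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical hidden factor, eg "management quality", one value per company.
hidden = [rng.gauss(0, 1) for _ in range(5000)]

# Both scores depend on the hidden factor plus independent noise.
# By construction, delivery never feeds into market or vice versa.
delivery = [h + rng.gauss(0, 0.5) for h in hidden]
market = [h + rng.gauss(0, 0.5) for h in hidden]

r = pearson(delivery, market)
print(f"correlation between delivery and market scores: {r:.2f}")
```

With these noise levels the correlation should land somewhere around 0.8, even though the simulated delivery score has no effect at all on the simulated market score. A survey alone cannot distinguish this situation from a genuinely causal one.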

Use Metrics With Caution

The science fiction writer Isaac Asimov imagined in his Foundation series the concept of a “social drive” which causes a society to develop iteratively in a specific direction. This book in some ways addresses how to implant a “social drive” in a software development organisation that leads it to iteratively improve its processes over time. This isn’t new: the Agile sprint review process does this too. Encouraging an organisation (or team within an organisation) to monitor the “accelerate metrics” (CFR/DF/MTTR/LTC) and attempt to improve them is a specifically guided form of this. If you do this, however, make sure the way you measure the metrics really does encourage the behaviour you want; the definitions of the metrics in the Accelerate book are vague and can be interpreted in multiple ways. When automating the measurement, there is also potential for simple “implementation bugs” to produce statistics that aren’t quite what you expect - and therefore will encourage/reward behaviours that might not be desired. See also the comment below on Goodhart’s law.

The authors make a very good point regarding metrics on page 27: “In pathological and bureaucratic organisational cultures .. people hide information that challenges existing rules, strategies, and power structures. .. whenever there is fear, you get the wrong numbers”.

There is another issue with metrics, usually called Goodhart’s Law, although the most common phrasing is by Marilyn Strathern:

When a measure becomes a target, it ceases to be a good measure.

This means that rewarding (or penalising) people based upon a metric can be dangerous and counter-productive. It is safer to use metrics as warning signs that certain projects need further attention, or measure them before and after an experiment to see if the experiment had a useful effect.
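As a rough sketch of the “before and after an experiment” usage, the following compares median lead times for two periods. The numbers are invented and the helper name `relative_change` is hypothetical; a real comparison would also need enough samples, and some judgement about whether anything else changed during the same period.

```python
from statistics import median


def relative_change(before, after):
    """Fractional change in the median, negative when the metric went down."""
    median_before = median(before)
    median_after = median(after)
    return (median_after - median_before) / median_before


# Hypothetical lead times in minutes, before and after a pipeline change.
before = [95, 120, 80, 150, 110]
after = [60, 75, 90, 55, 70]

change = relative_change(before, after)
print(f"median lead time changed by {change:+.0%}")
```

The result is treated here as a signal about the experiment, not as a target for anyone to hit.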

Summary

I believe the authors have convincingly argued that there is no tradeoff between code quality and development speed. There is no tradeoff between “change fail rate” and “deployment frequency”; instead the two are strongly correlated. The faster you deploy features (meaning the features are small), the less likely there is to be a problem.

I also believe that the core capabilities that the authors list as significant are indeed good goals for any software organisation. Most of these points are not new, but the authors provide statistical evidence that exactly these capabilities are strongly correlated with other desirable high-level organisational outcomes.

I’m not entirely convinced that the metrics that they define (CFR, DF, MTTR, LTC) are really good proxies for the underlying processes they want to measure - and in particular am not convinced they can be used for comparisons between organisations. There are just too many ways to interpret their definitions (fault of the authors), and too many ways to interpret the statistics (for mathematically unsophisticated readers). Unfortunately in many companies reporting these statistics is now mandatory; as the way these statistics are defined is ambiguous, I can only recommend choosing the interpretation that is most favourable for you. Then get on with implementing the described capabilities (sensible) and improving the underlying processes (well described) regardless of the metric values. In particular, when measuring LTC, I recommend interpreting the vague words “commit time” as being the timestamp associated with the merge of a feature branch into the “integration branch” from which releases will then be made; this reduces LTC to being a measure of deployment-pipeline efficiency and excludes all development and code review steps. That’s almost certainly what all the “high performers” are doing.
