Last week I finished reading the book The Phoenix Project by Gene Kim and I really enjoyed it. It describes the adventures of a guy named Bill who almost single-handedly (although he had the help of demigod Erik) saves the company from bankruptcy by “industrializing” the IT department and in particular the delivery of new software from development to operations (in short DevOps).
Having worked in a similar context for the past years, I could easily identify with Bill, and the book has therefore given me great insight into the various solution patterns that exist around this topic.
I have to agree, however, with the IT Skeptic (Twitter: @ITSkeptic) on some of the criticisms he wrote in his review of the book:
Fictionalising allows you to paint an idealised picture, and yet make it seem real, plausible. Skeptics know that humans give disproportionate weight to anecdote as evidence.
It is pure Hollywood, all American feel-good happy endings.
The Phoenix Project wants you to believe that within weeks the staff are complying with the new change process, that business managers are taking no for an answer, that there are simple fixes to complex problems.
So basically he doesn’t believe that the book does a good job of representing reality.
That’s when I reasoned that it might be interesting for others if I published my own experiences in this area, in an attempt to fill the gap between Gene’s fairy tale and the IT Skeptic’s reality.
Let me start by giving you my view on this whole subject of DevOps (highly influenced by the excellent presentation of John Allspaw and Paul Hammond of Flickr):
- The business – whether that be the sponsor or the end-user – wants the software to change, to adapt it to the changing world it represents, and it wants this to happen fast (most of the time as fast as possible)
- At the same time, the business wants the existing IT services to remain stable, or at least not to be disrupted by the introduction of changes
The problem with the traditional software delivery process (or the lack thereof) is that it is not well adapted to support these two requirements simultaneously. So companies have to choose between either delivering changes fast and ending up with a messy production environment or keeping a stable but outdated environment.
This doesn’t work very well. Most of the time they will still want both and therefore put pressure on the developers to deliver fast and on the ops guys to keep their infrastructure stable. It is no wonder that dev and ops will start fighting with each other to protect their objectives and as a result they will gradually start drifting away from each other, leaving the Head of IT somewhere stuck inside the gap that arises between the two departments.
This picture pretty much summarizes the position of the Head of IT:
You can already guess what will happen to the poor man when the business sends the two horses in different directions.
The solution is to somehow redefine the software delivery process in order to enable it to support these two requirements – fast and stable – simultaneously.
But how exactly should we redefine it? Let us have a look at this question from the point of view of the three layers that make up the software delivery process: the process itself, the tooling to support it and the culture (i.e. the people who use it):
First of all, the process should be logical and efficient. It’s funny how sometimes big improvements can be made by just drawing the high-level process on a piece of paper and removing the obvious inconsistencies.
The process should initially take into account the complexities of the particular context (like the company’s application landscape, its technologies, team structure, etc) but in a later stage it should be possible to adapt the context in order to improve the process wherever the effort is worth it (e.g. switch from a difficult to automate development technology to an easy to automate one).
Secondly, especially if the emphasis is on delivering the changes fast, the process should be automated where possible. This has the added benefit that the produced data has a higher level of confidence (and therefore will more easily be used by whoever has an interest) and that the executed workflows are consistent with one another and not dependent on the mood or fallibility of a human being.
Automation should also be non-intrusive with regard to human intervention. By this I mean that whenever a situation occurs that is not supported by the automation (e.g. because it is too new, too complex to automate and/or happens only rarely), it should be possible for humans to take over from there and, when the job is done, give control back to the automation. This ability doesn’t come for free but must be explicitly designed for.
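To make this hand-over idea a bit more concrete, here is a minimal sketch in Python. It is not taken from any tool we used: the class names (`Pipeline`, `Step`, `NeedsHuman`), the step names and the operator name are all made up for illustration. The point is only that the automation records where it stopped, lets a human finish that step by hand, and then carries on from the next step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class NeedsHuman(Exception):
    """Raised by a step that meets a situation the automation does not cover."""


@dataclass
class Step:
    name: str
    run: Callable[[], None]


@dataclass
class Pipeline:
    steps: List[Step]
    position: int = 0                  # index of the next step to execute
    waiting_on: Optional[str] = None   # step currently handed over to a human
    log: List[str] = field(default_factory=list)

    def advance(self) -> str:
        """Run steps until the end, or until one of them asks for a human."""
        if self.waiting_on is not None:
            return "waiting"
        while self.position < len(self.steps):
            step = self.steps[self.position]
            try:
                step.run()
            except NeedsHuman as reason:
                # Remember where we stopped so control can be given back later.
                self.waiting_on = step.name
                self.log.append(f"handover:{step.name}:{reason}")
                return "waiting"
            self.log.append(f"auto:{step.name}")
            self.position += 1
        return "done"

    def resume_after_manual(self, operator: str) -> str:
        """A human completed the paused step by hand; continue with the rest."""
        if self.waiting_on is None:
            raise RuntimeError("no step is waiting for a human")
        self.log.append(f"manual:{self.waiting_on}:{operator}")
        self.waiting_on = None
        self.position += 1
        return self.advance()


# Hypothetical steps: the middle one cannot be automated in this release.
def build() -> None:
    pass


def migrate_db() -> None:
    raise NeedsHuman("schema change too unusual to script")


def deploy() -> None:
    pass


pipeline = Pipeline([
    Step("build", build),
    Step("migrate-db", migrate_db),
    Step("deploy", deploy),
])
first = pipeline.advance()                             # stops at migrate-db
final = pipeline.resume_after_manual("ops-engineer")   # human is done, go on
```

The log keeps both the automated and the manual steps, so the data produced by the process stays complete even when a human had to step in.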
And then there is the cultural part: everyone involved in the process should get at least a high-level understanding of the end-to-end flow and in particular a sufficient knowledge of their own box and all interfacing boxes.
It is also well known that people resist change. It should therefore not come as a surprise that they will resist changes to the software delivery process, a process that has a huge impact on the way they work day-to-day. There can also be a trade-off: some changes are good for the whole (the IT department or the company) but bad for a unit (a developer, a team, …). An example would be a developer who has to switch to Oracle because that’s what the other teams have experience with, while the developer wants to stick with MySQL because that’s what he knows best. Such changes are particularly difficult to introduce, so they should be treated with care. But most people are reasonable by nature and will adapt as soon as they understand the bigger picture.
OK so now we know how we should upgrade the process.
But are we sure that it will bring us the expected results? Is there any evidence? At least the major tech companies like Flickr (see the link above), Facebook (here), Etsy (here), Amazon (here) etc. show us that it works in their particular case. But what about the traditional enterprises, with their legacy technologies that lack support for automation and testability, their workforce that doesn’t necessarily know or embrace these modern concepts, and their heavy usage of frameworks that are known to have bureaucratic top-down implementations (CMMI, ITIL, TOGAF, …)? Can we apply the same patterns in such an extremely different context and hope for the same result? Or are these traditional companies doomed, waiting to be replaced by modern companies that were built with agility and automation in mind? In other words, is there a trail from Mount Traditional to Mount Modern?
I don’t know the answer but I don’t see why it would not work and there is not really the option of not at least trying it, is there? See also Jez Humble’s talk and Charles Betz’s post on this subject.
OK let us leave the theory behind us and switch our focus to my own experience in introducing DevOps in a traditional enterprise in the financial sector.
First a bit about myself: in this traditional enterprise I was responsible for the development community. My role was to take care of all cross-functional stuff like the administration of the version control tools, build and deployment automation, defining the configuration management process, etc. I also was the representative of the developers towards the infrastructure, operations and release management teams.
Configuration and release management had always remained mostly a manual and labor-intensive activity that was known to be one of the most challenging steps in the whole process of building and delivering their software. Due to the steady growth of the development community and the increasing complexity of technologies the problems surrounding these two activities grew exponentially over time causing a major bottleneck in the delivery of software. At one point it became clear to me that we had to solve this problem by doing what we are good at: automation. Not for the business this time but for ourselves: the IT department. So I set out on a great adventure to get this issue fixed …
It was clear very soon that without bringing configuration management under control, it would not be possible to bring release management under control, because the latter relies so heavily on the former. And thanks to this we would also finally be able to solve a number of other long-standing issues that all required proper configuration management but didn’t seem important enough on their own to initiate it.
In the end, the project consisted of the following actions:
- getting a mutual agreement between all teams on the global software delivery process to be used within the company (including the definition of the configuration management structure), using ITIL as a guide
- the implementation of a release management tool
- the implementation of a configuration management system (CMS) and a software repository (which is called a definitive media library or DML in ITIL terms)
- the integration with the existing build and deployment automations and the change management tool
Note that I use the term configuration management in the broader sense as used by Wikipedia (here) and ITIL and not in the sense of server provisioning or version control of source code: the definition of the configuration items, their versions and their dependencies. This information is typically contained in a CMS.
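For readers who prefer code to definitions, here is a rough sketch of what I mean by configuration items, their versions and their dependencies. It is a toy model in Python, not the schema of any real CMS: the class names and the example items (`customer-db-schema`, `payment-service`, `web-frontend`) are invented for illustration. The only thing it shows is that once dependencies between versions are recorded, questions such as "what has to be in place before this version can be deployed?" can be answered automatically.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class ItemVersion:
    item: str      # name of the configuration item
    version: str   # one specific version of that item


class ConfigurationStore:
    """Toy stand-in for a CMS: remembers which version depends on which."""

    def __init__(self) -> None:
        self._depends_on: Dict[ItemVersion, List[ItemVersion]] = {}

    def register(self, item: str, version: str,
                 depends_on: Iterable[ItemVersion] = ()) -> ItemVersion:
        key = ItemVersion(item, version)
        self._depends_on[key] = list(depends_on)
        return key

    def deployment_order(self, target: ItemVersion) -> List[ItemVersion]:
        """Return the target and everything it needs, dependencies first."""
        ordered: List[ItemVersion] = []
        visiting: Set[ItemVersion] = set()

        def visit(node: ItemVersion) -> None:
            if node in ordered:
                return
            if node in visiting:
                raise ValueError(f"dependency cycle at {node.item} {node.version}")
            if node not in self._depends_on:
                raise KeyError(f"unknown item version {node.item} {node.version}")
            visiting.add(node)
            for dependency in self._depends_on[node]:
                visit(dependency)
            visiting.discard(node)
            ordered.append(node)   # appended only after all its dependencies

        visit(target)
        return ordered


# Hypothetical application landscape.
store = ConfigurationStore()
schema = store.register("customer-db-schema", "3.1")
payments = store.register("payment-service", "2.4.0", depends_on=[schema])
frontend = store.register("web-frontend", "5.0.2", depends_on=[payments, schema])

order = store.deployment_order(frontend)
names = [f"{v.item} {v.version}" for v in order]
```

In this example `names` lists the database schema first, then the payment service, then the front end, which is exactly the kind of information a release management team otherwise has to collect by hand.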
With release management I mean the review, planning and coordination of the deployment requests as they are delivered from the developers to the operations teams.
This is my mental model of these concepts:
Fiddling with the software delivery process in big and heterogeneous environments, as can be found in the traditional enterprises, is a huge undertaking that has impacts in many areas, within many departments and on many levels. It was therefore out of the question to transition from a mostly manual process to a fully automated one in one step. Instead we needed a tool that could support the release management team in their work, but that could also gradually take on more responsibilities as time goes by and help the transition to a more end-to-end automated approach.
During my research I was quite surprised that so few tools were available that aim to support release management teams, and I now believe that the market was just not yet ready for it. Also, in the traditional enterprises development has generally become more complex, more diverse, more service-oriented etc. in recent years, and the supporting tools now need some time to adapt to this new situation.
(For those interested: after a proof of concept with the few vendors that claimed to have a solution we chose Streamstep’s SmartRelease, which was acquired by BMC and renamed to RPM during the project.)
The culture also had to be adapted to the new process and tooling. This part involves changing people’s habits and has proven to be the most challenging and time-consuming part of the job. But these temporary frictions were dwarfed by the tremendous gains we got from the automation: less manual work, simplified processes, better visibility, trustworthy metrics, etc.
By implementing a clear process that was unavoidably more restrictive – and above all enforceable by the tooling – I noticed that in general developers resisted it (because they lose their freedom to do things at their own will and believe me there were some exotic habits in the wild) and ops guys embraced it (because their job becomes easier).
As I already mentioned, I don’t know if it is possible to DevOps-ify traditional enterprises the way web 2.0 companies do it (namely by almost completely automating the software delivery process and installing a culture of continuous delivery from the very beginning). But it may make sense not to look too far ahead into the future, and instead to start by tackling the biggest and simplest problems first, optimizing and automating where appropriate, and in that way iterating towards a better situation. And to hope that in the meantime a web 2.0 company has not run away with the business.
Can traditional enterprises adapt to the modern world, or will they become the equivalent of the dinosaurs and go extinct?
In my next post I will zoom one level deeper into the reasoning behind these decisions. I will first give an overview of the situation that existed before the project was initiated, looking at the specifics of each team. Then I will focus on the problems that derived from it and the solutions that were followed.