image_pdfimage_print

Google was among the pioneering adopters of the DevOps SDLC methodology. The company scaled its business rapidly, setting an example for organizations pursuing similar growth trajectory by leveraging the DevOps framework.

In recent years, however, Google has expanded its approach to service management, calling it site reliability engineering (SRE). Let’s look at the differences between DevOps and SRE.

BMC Helix - The Future of Service and Operations Management

BMC Helix is the first and only end-to-end service and operations platform that’s integrated with 360-degree intelligence. Built for the cloud, this reimagined service and operations experience is unrivaled, giving you:

  • BMC Helix ITSM optimized for ITIL® 4
  • Enterprise-wide service including IT, HR, Facilities, and Procurement
  • An omni-channel experience across Slack, Chatbot, Skype, and more
  • Automation with conversational bots and RPA bots
  • Learn more about BMC Helix

Understanding DevOps

DevOps is an ITSM framework that defines the mindset, culture, and philosophy of working on IT projects as a collaboration among developers, operations, and QA teams. The collaboration spans the SDLC pipeline with no proverbial walls or silos that produce hard separation between the three roles. Instead of using tools or processes as descriptors, DevOps is best described in terms of its characteristics, both interpersonal and cultural, necessary to achieve:

  • Continuous application delivery
  • Shorter release cycles
  • Reduced waste processes
  • Lower expenses
  • Reduced friction among the workforce

However, DevOps doesn’t outline specific practical guidelines to apply those concepts in real-world environments, partly because DevOps principles are designed to be followed with organizational context and use cases. Organizations have the freedom to optimize tradeoffs in following the DevOps principles. For example, DevOps requires IT teams to accept failure as a normal. According to DevOps, there’s no specific definition of ‘failure’ and ’normal’. This leaves potential gaps in translating DevOps principles into concrete and practical steps that an IT organization can adopt and replicate Google’s DevOps strategies.

What is SRE?

Site reliability engineering (SRE) fuses the software engineering and operations disciplines. SRE professionals spend about half their time in development tasks and half on operations tasks. The SRE role enables collaboration and information sharing between Dev and Ops departments, similar to the DevOps principles but for additional specific objectives.

SRE satisfies DevOps principles

If you’re asking which is better, you’re asking the wrong question. Instead, consider Google’s analogy of DevOps as a programming language interface and SRE as a programming class used to implement DevOps. The philosophy of DevOps may define the overall behavior of a service management framework, with the specific implementation strategy left up to the author. SRE, then, is a prescriptive approach to implement, measure, and achieve DevOps objectives.

There are five key ways SRE complies with DevOps:

1. Reducing organizational silos

SRE treats Ops as a software engineering problem. Engineers are required to spend their efforts in solving issues that were previously thrown across the proverbial wall between Dev and Ops. The shared sense of responsibility leads to practical initiatives such as reducing bugs early during the development sprints and using similar tools between Dev and Ops. These issues may be more focused on the engineering design and applications, instead of business logic problems. For example, the SRE may have competencies to improve application performance and latency instead of managing infrastructure resources. Like DevOps, there is less focus on specific tools used in the Dev or Ops environment, but adequate expertise are available to use similar technologies in either department.

2. Accepting failure as normal

Similar to DevOps, SREs don’t pass the blame for failures and production incidents between the IT teams. SRE guidelines encourage radical changes (within limits) that potentially lead to failure. A measured risk budget is allowed for SREs to test these limits and potentially innovate faster. SREs also assume that 100% availability and performance targets are not viable to facilitate growth. Strong collaboration between business and IT is required from an SRE perspective to evaluate optimal targets for service level objectives (SLOs) and service level indicators (SLIs). Any violation funnels a feedback look to the IT teams, targets are re-evaluated and optimized for the changing business and IT circumstances. Blameless postmortem of incidents and failures is mandated as part of the SRE framework.

3. Implementing gradual change

Like DevOps, SRE also encourages continuous improvement through change. SRE requires the changes to be small and frequent. As a result, any negative repercussions are less impactful and low-risk improvements can be readily tested and implemented. Automated testing of such changes is readily used in the change management strategies, including CI/CD as adopted in the DevOps framework. An objective measurement of change needs to be defined such that the cost of failure reduces at the same time. For example, reducing the mean time to repair (MTTR) for production incidents allows more time for devs to invest in feature improvements. As a result, the product quality improves on a gradual but continuous basis.

4. Leveraging tooling and automation

While DevOps encourages automation and technology adoption, SRE is focused on embracing consistent technologies and information access across the IT teams. Google pioneered this concept by unifying its codebase, albeit not specifically for its SRE framework. SRE requires all teams working on the same service to adopt the same technology solutions. Incompatibility and integration issues between technologies from different vendors, era and use cases can create unnecessary silos even in the DevOps environment. In essence, the adopted tooling also determines the skillset necessary for each particular team and service area. However, SRE is not entirely about using a specific and well-defined set of technologies to fulfil certain IT tasks, but focused on the API orientation of the tooling and how the associated ITSM activities are served.

5. Measuring everything

Both DevOps and SRE encourage measurement. DevOps is primarily focused on the process performance and results achieved with the feedback loop to realize continuous improvement. SRE requires measurement of SLOs as the dominant metrics, since the framework observes Ops problems as software engineering problems. SRE also requires budgeting for toil activities and risk. It is assumed that the definition of ‘normal’ will evolve on a continuous basis. In the spirit of gradual change and improvements, IT teams must observe some slack that allows them to perform repeatable and automatable tasks manually. Once the ITSM model of the organization matures, these tasks can be automated, redefining the new ‘normal’ including a lower budget for risk through the improved ITSM capability. Both DevOps and SREs follow a data-driven approach. Evaluating appropriate targets however, remains to be a contextual challenge is less prescriptive in nature for either ITSM practices.

Additional resources

Free Trial: BMC Helix ITSM

Take a free personalized demo of BMC Helix, the future of IT Service Management software
Free Trial › Learn More ›
Last updated: 10/29/2019

These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.

About the author

Muhammad Raza

Muhammad Raza

Muhammad Raza is a Stockholm-based technology consultant working with leading startups and Fortune 500 firms on thought leadership branding projects across DevOps, Cloud, Security and IoT.