When it emerged within Google, the original idea of Site Reliability Engineering (SRE) was that about 50% of the work of site reliability engineers (SREs) would focus on operations and the other 50% on development. However, a recent report by Catchpoint looking at data from early 2020 shows that this expectation is no longer the reality.
It is now apparent that SREs perform a wider distribution of tasks that have a heavier focus on infrastructure operations than application development. The report goes into more detail about this distribution of responsibilities, the recent effects of COVID-19, and the future of SRE.
SRE Report Overview
The Catchpoint survey was conducted at a unique time for the world and for IT. The report consists of two sets of survey data: the first from February 2020, before the global COVID-19 pandemic changed the workplace landscape, and the second from May, around the time when most SREs and other IT professionals had started to work from home.
The timing of this study is valuable in determining not only the current state of SREs but also their likely future. For example, while the 2018 survey from Catchpoint suggested that SRE was not conducive to remote work, this year’s survey shows that the post-pandemic environment is likely to result in about half of SREs working from home.
Additionally, as many IT companies downsize to adapt to the pandemic, they are looking for SREs who can perform the widest possible range of tasks. This change in responsibilities means that the term SRE is diverging from its original meaning and might continue to do so.
The next sections discuss in more detail these four main takeaways of the report:
- Observability is not prevalent in SRE
- Focusing on operations comes at a cost
- Remote working brings opportunities and challenges
- The future of SRE is both remote and bright
Shifting Responsibilities and Observability in SRE
Respondents to the survey outlined their activities both before and after the onset of the COVID-19 pandemic. Pre-COVID, 75% of the SREs stated that their jobs focus more on operations activities, and 25% stated that the focus is on development activities. Similarly, only 14% of respondents stated that more than 50% of their total workload consisted of development activities. In addition to this already unbalanced task load, a net 10% of the SREs stated that their activities had further shifted during the stay-at-home portion of the pandemic to include more operations work.
The development activities covered by the survey include tasks such as writing software to help operations and developing applications. Operations activities include conducting incident response and managing trouble tickets. When asked about specific tool categories relevant to their roles, the SREs listed monitoring and alerting, dashboarding, and infrastructure as code as the top three areas.
Catchpoint highlights that observability is not one of the areas that the respondents indicated as being most relevant to their work. Only 53% included it as a relevant category, whereas 93% chose monitoring and alerting. The same pattern appeared when the survey asked about the SREs’ key responsibilities, where the majority did not list observability pillars such as metrics, logs, and tracing.
The usefulness of monitoring and alerting depends heavily on the observability of the system. Without observability, monitoring and alerting cannot deliver useful results, because monitoring alone is not adequate for uncovering unknown system problems. Observability, on the other hand, asks questions from outside the system and combines the answers with the metrics, logs, and traces produced within it. Metrics help identify failures, logs provide more detail about the event, and tracing shows how the program executed during that time.
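To make the three pillars concrete, here is a minimal sketch in Python using only the standard library. It is not taken from the Catchpoint report. The service name "checkout", the `handle_request` and `span` helpers, and the in-memory `metrics` and `spans` stores are all hypothetical, and a real system would more likely send this data to a dedicated metrics, logging, and tracing backend.

```python
import json
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory stores, for illustration only.
metrics = Counter()  # metrics: counters that show *that* something failed
spans = []           # traces: timed steps that show *how* a request executed

logger = logging.getLogger("checkout")  # logs: detail about a specific event


@contextmanager
def span(trace_id, name):
    """Record how long one named step of a request took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # The span is recorded even if the step raises an exception.
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "seconds": time.perf_counter() - start,
        })


def handle_request(payload):
    """Process one request and emit a metric, a log line, and spans."""
    trace_id = uuid.uuid4().hex  # ties the log line and spans together
    metrics["requests_total"] += 1
    try:
        with span(trace_id, "validate"):
            if "item" not in payload:
                raise ValueError("missing item")
        with span(trace_id, "charge"):
            time.sleep(0.001)  # stand-in for real work
        return trace_id, True
    except ValueError as exc:
        metrics["requests_failed"] += 1
        # A structured log entry carrying the detail behind the failure.
        logger.error(json.dumps({"trace_id": trace_id, "error": str(exc)}))
        return trace_id, False
```

In this sketch, a rising `requests_failed` counter signals that something is wrong, the log entry with the same `trace_id` explains what went wrong, and the recorded spans show which steps ran before the failure.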
Consequences of Focusing on Operations
These shifts in development vs. operations work do more than change the meaning of SRE; they also affect costs and business outcomes. Before the pandemic, the survey data indicated that the heavier focus on operations work was resulting in increased costs for owning and maintaining systems.
Focusing on operations also means focusing on a reactive approach with low observability. Shifting to a more mature preventative approach with high observability capabilities would improve business performance and lower costs. The steps of such a process would involve leaving behind reactive approaches in favor of first proactive and then preventative methods.
Larger portions of the respondents to the survey indicated practicing more reactive activities, including postmortem analysis of problems via planned activities (80% of participants), responding to system-generated alert messages (75%), analysis of metrics (72%), and repairing infrastructure problems (68%).
Conversely, the proportion of respondents who indicated spending moderate or large amounts of time on proactive activities was considerably lower. Only 61% listed task automation, 56% listed monitoring and analyzing system metrics for trends, 54% listed support of post-deployment operations, and 53% listed SRE-specific systems planning.
Optimally, this balance should change such that SREs can focus more on high-level engagements while automation takes care of service operations. Catchpoint emphasizes that a huge opportunity remains to shift the workload more towards development through using output feedback as an input to improve the observability of the products and services that a company offers. Furthermore, this possibility exists for organizations of all sizes.
Opportunities Brought by COVID
In terms of employee experience, morale, and work-life balance, the COVID pandemic has brought both challenges and opportunities. The report suggests recognizing the challenges and addressing them so that they become chances for strategic differentiation of assets and for encouraging an employee-first mentality.
The pre-COVID challenges that most respondents identified include delayed involvement in the application life cycle, too much time spent debugging, lack of support from other teams, and inadequate budget for training. On the other hand, issues at the top of the list since working at home include work-life balance, team communication, focus/clarity, and facilities such as equipment and broadband access.
Combinations of these issues can lead to burnout, which is what companies should be aiming to avoid. In particular, the lack of automation mentioned previously can combine with the lack of budget and training and lead to employees feeling that much of their work is “toil”, ultimately resulting in burnout. Problems of work-life balance and focus/clarity exacerbate this issue, meaning that working from home might be increasing burnout in workers.
The goals of improving the balance of development and operations coincide with the goals of improving the wellbeing of workers. Shifting to proactive, preventative measures that increase observability and automation not only helps the business side, but also reduces toil and stress for employees. The pandemic pushes companies to take this opportunity to kill two birds with one stone, and not doing so is likely to result in further challenges down the line.
The Future of SRE
The outlook that the report gives for SRE outlines likely positive trends as well as some potential caveats. One of the main factors contributing to both ends of the spectrum is the increased likelihood of more employees continuing to work from home in the long term.
This change brings some benefits: over 9% of respondents stated that they can manage incidents more effectively from home. However, some respondents also noted an increase in the total number of incidents, primarily due to increases in traffic and capacity issues. The report hypothesizes that there could be a relationship between better incident management and the fact that companies have slowed their software releases.
Overall, considering the distributed nature of SRE and acknowledging challenges that may have previously been ignored offers the perfect opportunity to make improvements in the field. Taking the initiative now to boost observability, reliability, operations-development balance, and workforce distribution can help companies better prepare for the post-COVID world.