Are DevOps and SRE the Same? (Part 2)

In the last post, we described DevOps and its various principles, or pillars. Let’s now look at Google SRE, which focuses primarily on service management.


From Google’s own SRE site: “A primary building block of Google’s approach to service management is the composition of each SRE team. As a whole, SREs can be broken down into 2 main categories”, i.e. Engineering and Ops. Engineering covers “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning”.

Per Google, it is by design that SRE teams are focused on engineering. Without engineering, the Ops load increases and teams need more people to handle the workload. A few more characteristics:

1) Service Code : The team tasked with running a service also writes code for it. The goal is a service that basically runs and repairs itself: the systems are automatic, not just automated. In practice, scale and new features keep SREs on their toes.

2) Development : Google’s rule of thumb: Ops work is capped at 50% of an SRE team’s time, and the team must spend the remaining 50% actually doing development.

3) PSR Focus : As noted above, an SRE team is responsible for the PSR (i.e. Performance, Scalability and Reliability) of its service(s), which encompasses availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. This keeps the focus on engineering work, as opposed to Ops work.

4) Engineering Focused : The Ops work consists of app monitoring and analysis that help Devs build systems that don’t require manual intervention. When an incident occurs, the Ops engineers usually handle the event quickly, clean up and restore normal service, and then conduct a root cause analysis (RCA). The RCA should uncover the logs, sequence of events, timing and actions needed to prevent the problem or address it better next time.

5) 99.99% Availability : The Dev team’s 99.99% availability target is based on the assumption that features should be launched as soon as possible with a phased rollout, i.e. release by release or on demand. The other assumption is that 1% of the time or budget should be spent on error management and fixes.

6) Cost of Failure : Roll back early, roll back often. When an error is discovered or even suspected in a release, the team rolls back first and investigates the problem afterwards. This approach reduces the Mean Time to Recovery (MTTR), the average time needed to recover the service from a failure. Regular measurement is key to maintaining system uptime.

7) Canary Release : A canary release makes the rollout process quicker. Any change is first introduced to a small portion of users or a focused group. It is tested and feedback is provided. After all required changes are made, the release is made available to everybody. Canary releases cut the Mean Time to Detect (MTTD), which shows how long it takes the team to detect an issue. They also reduce the number of customers affected by system failures. A good example is an e-commerce system.

8) Playbooks : Playbooks (or runbooks) are documents that describe the procedures and steps for responding to automated logs/alerts. They reduce the Mean Time to Repair (MTTR). Entries in playbooks are out of date as soon as the environment changes, so with daily releases these guides would need daily updates. Since good documentation is hard, SREs promote creating only general instructions that change slowly. For agility, keep the documentation light.
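
To make the availability and recovery measurements in points 5 to 7 a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `error_budget`, `Incident`, `mean_time_to_detect` and `mean_time_to_recovery` are hypothetical and do not come from Google's SRE material, and the incident fields assume the team records when a failure started, when it was detected and when service was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


def error_budget(availability_target: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` for a target such as 0.9999 (99.99%)."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return window * (1.0 - availability_target)


@dataclass
class Incident:
    # Hypothetical record; a real incident tracker will have its own schema.
    started: datetime    # when the failure began
    detected: datetime   # when the team noticed it
    recovered: datetime  # when normal service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """MTTD: average gap between the start of a failure and its detection."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """MTTR: average gap between the start of a failure and recovery."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.recovered - i.started for i in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Made-up incidents, purely for illustration.
    t0 = datetime(2024, 1, 1, 12, 0)
    history = [
        Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=10)),
        Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=20)),
    ]
    print("30-day budget at 99.99%:", error_budget(0.9999, timedelta(days=30)))
    print("MTTD:", mean_time_to_detect(history))
    print("MTTR:", mean_time_to_recovery(history))
```

A 99.99% target leaves 0.01% of the window for downtime, which over 30 days is roughly 4.3 minutes. Rolling back first and using canary releases are two ways to keep the recovery and detection averages small enough to stay inside that allowance.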

Toolset Similarities : Let’s check the similarities between SRE and DevOps.

1) Containers and microservices : Containers and microservices help in creating a scalable system. Thus Docker, for building and deploying containerised apps, and Kubernetes, for container orchestration, are integral parts of SRE/DevOps toolchains.

2) CI/CD : CI/CD tools like Jenkins, GitHub, Azure DevOps Server etc. promote the idea of gradual change, enabling teams to build, test, and deploy code faster.

3) Infrastructure as Code (IaC) : These tools promote the “automate everything” concept. Chef, Puppet, Ansible, CloudFormation, Terraform etc. are the most widely used tools for automating infrastructure deployments and configurations.

4) Automated tests : In Prod, they can be performed with the help of many open source tools, such as Selenium with JavaScript for UI testing. Some tools use Unix or built-in languages to automate. Another approach is to automate testing of the product itself using the Pyramid Approach.

5) Monitoring : This plays a crucial role in SRE and DevOps frameworks. Services delivered by Splunk, Dynatrace, Broadcom, Datadog, and many other platforms allow for metrics-based continuous monitoring of network and application performance across cloud environments.
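
As a rough illustration of what metrics-based monitoring boils down to, the Python sketch below checks a batch of request metrics against thresholds. It is not the API of Splunk, Dynatrace, Datadog or any other platform: the `Thresholds` class, the `percentile` and `evaluate` functions, and the example limits are all assumptions made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical limits; real values depend on the service.
    max_error_rate: float      # fraction of failed requests, e.g. 0.01 for 1%
    max_p95_latency_ms: float  # 95th percentile latency limit in milliseconds


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of `values` for an integer `pct` in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def evaluate(latencies_ms: list[float], failed_requests: int,
             thresholds: Thresholds) -> list[str]:
    """Return one alert message per threshold that was breached."""
    alerts = []
    error_rate = failed_requests / len(latencies_ms)
    if error_rate > thresholds.max_error_rate:
        alerts.append(f"error rate {error_rate:.2%} above limit")
    p95 = percentile(latencies_ms, 95)
    if p95 > thresholds.max_p95_latency_ms:
        alerts.append(f"p95 latency {p95} ms above limit")
    return alerts


if __name__ == "__main__":
    # Made-up sample: 100 requests with latencies of 1..100 ms, 2 failures.
    sample = [float(ms) for ms in range(1, 101)]
    limits = Thresholds(max_error_rate=0.01, max_p95_latency_ms=90.0)
    for alert in evaluate(sample, failed_requests=2, thresholds=limits):
        print(alert)
```

In practice a monitoring platform collects these metrics continuously and routes the alerts to the on-call engineer, who then follows the matching playbook entry.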

Finally, some comparison below:

Final Thoughts : The term “DevOps” was coined in late 2008. Its core principles (involving both IT and Devs in each phase of a system’s overall design and development, a high level of automation instead of human effort, and the application of engineering practices and tools to operations tasks) are consistent with many of SRE’s principles and practices. One could view DevOps as a generalisation of several core SRE principles to a wider range of organisations, management structures, and toolsets. Equally, in short, one could view SRE as a specific implementation of DevOps with some idiosyncratic extensions.




Not a geek but interest include one , i write on practicing work that genuinely reflects the experience | Runner | Avid Walker

Kumar Anil
