Are DevOps and SRE the Same? (Part 2)
In the last post, we described DevOps and its various principles or pillars. Let’s now look at Google SRE, which focuses primarily on service management.
From Google’s own SRE site: “A primary building block of Google’s approach to service management is the composition of each SRE team. As a whole, SREs can be broken down into 2 main categories”, i.e. Engineering and Ops. Engineering covers “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning”.
Per Google, it’s by design that SRE teams are focused on engineering. Without engineering, the Ops load increases and teams need more people to handle the workload. A few more characteristics:
1) Service Code: The team tasked with a service also writes code for it. The goal is a service that basically runs and repairs itself: the systems are automatic, not just automated. In practice, scale and new features keep SREs on their toes.
2) Development: Google’s rule of thumb caps operational work at 50% of an SRE team’s time; the team must spend the remaining 50% actually doing development.
3) PSR Focus: As noted, an SRE team is responsible for PSR (Performance, Scalability, and Reliability), which encompasses the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). This keeps the focus on engineering work, as opposed to Ops work.
4) Engineering Focused: Ops work consists of application monitoring and analysis that help developers build systems that don’t require manual intervention. When an incident occurs, the Ops engineers manage the event quickly, clean up and restore normal service, and then conduct a root cause analysis (RCA). The RCA should uncover the logs, sequence, timing, and actions needed to improve or address the problem next time.
5) 99.99% Availability: The Dev team’s 99.99% availability target assumes that features should be launched as soon as possible with a phased rollout, i.e. release by release or on demand. The other assumption is an error budget: the share of time or budget that can be spent on errors and fixes. Note that a 99.99% target leaves a budget of 0.01%; a 1% budget corresponds to a 99% target.
6) Cost of Failure: Roll back early, roll back often. When an error is discovered or even suspected in a release, the team rolls back first and investigates the problem afterwards. This approach reduces the Mean Time to Recovery (MTTR), the average time needed to recover the service from a failure. Regular measurement is key to maintaining system uptime.
7) Canary Release: A canary release makes the rollout process quicker. Any change is first introduced to a small portion of users or a focused group. It’s tested and feedback is provided. After all required changes are made, the release is made available to everybody. Canary releases cut the Mean Time to Detect (MTTD), which shows how long it takes the team to detect an issue. They also reduce the number of customers affected by system failures. A good example is an e-commerce system.
8) Playbooks: Playbooks (or runbooks) are documents that describe the procedures and steps for responding to automated logs and alerts. They reduce the Mean Time to Repair (MTTR). Entries in playbooks go out of date as soon as the environment changes, so with daily releases these guides would need daily updates. Since good documentation is hard to maintain, SREs promote writing only general instructions that change slowly. For agility, keep the documentation light.
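The numbers behind items 5 to 7 are simple arithmetic, and a rough Python sketch may make them concrete. Everything here is an assumption for illustration: the function names, the two sample incidents, and the hash-based way of picking canary users are hypothetical, and MTTR is measured from the start of the failure (some teams measure it from detection instead).

```python
import hashlib
from datetime import datetime, timedelta


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability target permits per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def mean_minutes(durations: list[timedelta]) -> float:
    """Average of a list of durations, in minutes."""
    total_seconds = sum(d.total_seconds() for d in durations)
    return total_seconds / len(durations) / 60


# Hypothetical incidents: (failure started, failure detected, service recovered).
INCIDENTS = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 5), datetime(2024, 1, 3, 10, 25)),
    (datetime(2024, 1, 17, 22, 0), datetime(2024, 1, 17, 22, 15), datetime(2024, 1, 17, 22, 55)),
]


def mttd_minutes(incidents) -> float:
    """Mean Time to Detect: average gap between failure start and detection."""
    return mean_minutes([detected - started for started, detected, _ in incidents])


def mttr_minutes(incidents) -> float:
    """Mean Time to Recovery: average gap between failure start and recovery."""
    return mean_minutes([recovered - started for started, _, recovered in incidents])


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets; low buckets get the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


if __name__ == "__main__":
    print(f"99.99% over 30 days allows {error_budget_minutes(99.99):.2f} minutes of downtime")
    print(f"MTTD: {mttd_minutes(INCIDENTS):.0f} min, MTTR: {mttr_minutes(INCIDENTS):.0f} min")
    print("user-42 sees the canary at 5%:", in_canary("user-42", 5))
```

A 99.99% target over 30 days leaves about 4.3 minutes of downtime, while a 99% target leaves 432 minutes. Hashing the user ID keeps each user on the same side of the canary across requests, which is one common choice but not the only one.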
Toolset Similarities: Let’s check the similarities between the SRE and DevOps toolsets.
1) Containers and microservices: Containers and microservices help create scalable systems. Thus Docker, for building and deploying containerised apps, and Kubernetes, for container orchestration, are integral parts of SRE/DevOps toolchains.
2) CI/CD: CI/CD tools such as Jenkins, GitHub, and Azure DevOps Server promote the idea of gradual change, enabling teams to build, test, and deploy code faster.
3) Infrastructure as Code (IaC): These tools promote the “automate everything” concept. Chef, Puppet, Ansible, CloudFormation, and Terraform are among the most widely used tools for automating infrastructure deployments and configurations.
4) Automated tests: In Prod, these can be performed with many open source tools, such as Selenium with JavaScript for the UI. Some tools use Unix shell or built-in languages to automate. Another approach is to automate testing of the product itself using the test pyramid approach.
5) Monitoring: This plays a crucial role in both SRE and DevOps frameworks. Services delivered by Splunk, Dynatrace, Broadcom, Datadog, and many other platforms allow for metrics-based continuous monitoring of network and application performance across cloud environments.
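To make the “automate everything” idea from the IaC item a little more concrete, here is a minimal Python sketch of the declarative pattern: describe the desired state, compare it with what actually exists, and derive the changes. It only illustrates the general idea and is not how any of the tools above is implemented; the `plan` function and the resource names are hypothetical.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired and actual resources and list what must change."""
    to_create = sorted(name for name in desired if name not in actual)
    to_delete = sorted(name for name in actual if name not in desired)
    to_update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": to_create, "update": to_update, "delete": to_delete}


# Hypothetical infrastructure description kept in version control.
DESIRED = {"web": {"replicas": 10}, "cache": {"size_gb": 4}}

# Hypothetical snapshot of what is currently running.
ACTUAL = {"web": {"replicas": 1}, "old-worker": {"replicas": 2}}

if __name__ == "__main__":
    for action, names in plan(DESIRED, ACTUAL).items():
        print(f"{action}: {names}")
```

Because the desired state is just data, it can be reviewed, versioned, and reapplied, so running the comparison again after the changes are made should report nothing left to do.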
Finally, a comparison is shown below:
Final Thoughts: The term “DevOps” was coined in late 2008. Its core principles (involving both IT and Devs in each phase of a system’s overall design and development, a high level of automation instead of human effort, and the application of engineering practices and tools to operations tasks) are consistent with many of SRE’s principles and practices. One could view DevOps as a generalisation of several core SRE principles to a wider range of organisations, management structures, and toolsets. In short, one could view SRE as a specific implementation of DevOps with some idiosyncratic extensions.