Site Reliability Engineering (SRE) Primer

You may have heard the term Site Reliability Engineering, or SRE, come up a lot lately. So what is SRE anyway? Here is my attempt to answer a few basic questions about it.

History of SRE

Site Reliability Engineering (SRE) is a discipline, practiced by a dedicated group of engineers (much like DevOps or DevSecOps), that combines software engineering with IT infrastructure and operations to focus on the reliability, performance, and scalability of large-scale software systems. The term was coined at Google by Ben Treynor Sloss, who founded the first SRE team there in 2003. By 2016, Google had over 1,000 site reliability engineers. Firms such as Airbnb, Dropbox, IBM, LinkedIn, Netflix, and Wikimedia have since successfully adopted the same model.

Now that we know the definition of SRE at a high level – let’s dive into more details!

Various Aspects of SRE

Focus on Automation:
- SREs use software engineering principles to automate tasks that SysAdmins traditionally performed manually. These tasks include system management, change management, incident response, and emergency response.
SRE Roles and Responsibilities:
- SREs are developers with IT operations experience. They understand both coding and the maintenance of large-scale IT environments.
- SREs are expected to spend part of their time analyzing logs, responding to incidents, and conducting postmortems. The rest of their time is spent developing code that automates manual tasks.
SRE Success Criteria:
- How to measure SRE success is covered in more detail later in this post. Here is a quick introduction:
  - Service Level Indicators (SLIs): Metrics such as the availability and latency of the service.
  - Service Level Objectives (SLOs): Agreed-upon targets for service level indicators.
  - Error Budgets: The amount of unreliability (downtime or errors) a service may accumulate without breaching its SLO.
Typical SRE Tasks:
- Reducing repetitive and inefficient system maintenance.
- Developing scalable solutions for complex problems.
- Allowing room for innovation in a stable technological context.
- Designing for and implementing observability.
- Defining, testing, and running an incident management process.
- Ensuring optimal resource allocation (e.g., CPU, memory, and so on).
- Implementing effective change and release management.

A few SRE aspects

While SRE and DevOps are related, there are a few key differences between them, academically speaking.

What is the difference between DevOps and SRE?

SRE is a specific approach to reliability engineering, while DevOps is a broader cultural movement that encompasses collaboration, automation, and continuous improvement. Both aim to enhance software delivery and system reliability, but they emphasize different aspects of the development and operations lifecycle.

Focus:
- SRE: Primarily focuses on reliability, high availability, performance, and resiliency.
- DevOps: Focuses on collaboration between development (Dev) and operations (Ops) teams.
Origins:
- SRE: Was introduced by Google to manage its large-scale services, such as Gmail.
- DevOps: Emerged as a cultural movement to address the silos between development and operations.
Responsibilities:
- SRE: SREs are primarily software engineers with a focus on operations.
- DevOps: DevOps is a collaborative culture that encourages shared responsibilities. DevOps engineers focus on automation, CI/CD, and infrastructure as code.
Metrics:
- SRE: Uses Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- DevOps: Uses deployment frequency, lead time, and mean time to recovery.
Tooling:
- SRE: Focuses on observability, incident response, and capacity planning, using tools such as Prometheus and Application Insights.
- DevOps: Leverages tools for CI/CD, configuration management, and infrastructure automation, such as Azure DevOps, Jenkins, Ansible, and Terraform.
Culture:
- SRE: Encourages learning from incidents and consistent monitoring.
- DevOps: Promotes collaboration, empathy, and shared responsibility, and values continuous learning and experimentation.

DevOps vs SRE

Benefits of SRE

Enhanced Reliability and Performance:
- SRE promotes a relentless focus on reliability. By continually refining processes and leveraging automation, organizations witness a significant improvement in system reliability and application performance.
Operational Efficiency and Cost Savings:
- SRE enhances operational efficiency by minimizing downtime and automating tasks.
Efficiency through Automation:
- By automating repetitive tasks, SRE reduces the likelihood of human error, accelerates incident resolution, and frees up valuable resources to focus on strategic initiatives.
Improved Collaboration and Communication:
- SRE bridges the gap between development and operations teams. It fosters a culture of collaboration, transparency, and continuous improvement.
Meeting Reliability Targets:
- SRE ensures that software systems meet specific reliability targets and service level objectives (SLOs).
Customer Satisfaction:
- Reduced downtime and improved reliability lead to higher customer satisfaction, a better user experience, and greater overall trust in the product.

How to measure the success of the SRE team?

The success of an SRE implementation can be quantified using relevant metrics and by assessing its impact on factors like system reliability, availability, performance, and overall efficiency.

Using Golden Signals of Monitoring:
- Google's SRE book defines four golden signals: latency, traffic, errors, and saturation. A simplified view for evaluating system health tracks three areas:
  - Availability: Measure uptime and service accessibility.
  - Performance: Monitor latency and response times.
  - Errors: Track error rates and failures.
Using Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs):
- Make sure that clear SLIs representing critical aspects of the service are in place (e.g., API response time, service error rate).
- Set SLOs as targets for these SLIs (e.g., 99.9% availability). Ensure that SLOs are aligned with business goals and customer expectations.
Using Error Budgets:
- An error budget represents the permissible downtime or error rate. Quantify how much of the budget is consumed by production incidents.
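
To make the relationship between SLIs, SLOs, and error budgets concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (good requests divided by total requests) and a 30-day window; the names `SloReport`, `evaluate_slo`, and `allowed_downtime_minutes` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    budget_total: float      # failed requests the SLO allows in this window
    budget_used: float       # failed requests actually observed
    budget_remaining: float  # allowance left (negative when the SLO is breached)


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    sli = (total_requests - failed_requests) / total_requests
    budget_total = (1.0 - slo_target) * total_requests
    return SloReport(sli, budget_total, float(failed_requests), budget_total - failed_requests)


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view: minutes of downtime the SLO tolerates per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


# Illustrative numbers only (not real measurements).
report = evaluate_slo(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(f"SLI: {report.sli:.4%}, budget used: {report.budget_used / report.budget_total:.0%}")
# prints: SLI: 99.9600%, budget used: 40%
print(f"Downtime allowed at 99.9% over 30 days: {allowed_downtime_minutes(0.999):.1f} min")
# prints: Downtime allowed at 99.9% over 30 days: 43.2 min
```

In this example the service is within its SLO and has consumed 40% of its budget. A team might use the remaining 60% to decide how much release risk it can still take on in the window.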
Using Utilization Metrics:
- Ensure the right monitors are in place to capture resource utilization (e.g., CPU, memory, disk, network) and prevent saturation.
Using Change Management Failure Rate:
- Measure how often changes (deployments, updates) result in failures.
Using Incident Response Metrics:
- Evaluate the mean time to detect (MTTD) and the mean time to resolve (MTTR) for production incidents.
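
As a rough sketch of how the change failure rate and the incident response metrics could be computed, the Python below works from simple records. The `Incident` fields and the function names are hypothetical, and MTTR is measured here from the start of the fault to resolution; some teams measure it from detection instead, so treat that as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between the start of a fault and its detection."""
    return timedelta(seconds=mean((i.detected - i.started).total_seconds() for i in incidents))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average gap between the start of a fault and its resolution (an assumed definition)."""
    return timedelta(seconds=mean((i.resolved - i.started).total_seconds() for i in incidents))


def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Fraction of changes (deployments, updates) that resulted in a failure."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes


# Illustrative data only.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=65)),
]
print("MTTD:", mean_time_to_detect(incidents))    # prints: MTTD: 0:10:00
print("MTTR:", mean_time_to_resolve(incidents))   # prints: MTTR: 0:50:00
print(f"Change failure rate: {change_failure_rate(40, 3):.1%}")  # prints: Change failure rate: 7.5%
```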
Using Workflow Visibility and Collaboration:
- Assess how well SRE practices encourage collaboration between development, operations, and business teams.

SRE using Microsoft Azure

For a Solution Architect aiming to implement Site Reliability Engineering (SRE) principles using Microsoft Azure, here are some patterns and best practices to consider:

Retry with Exponential Backoff Pattern: Retries operations to handle transient failures gracefully (e.g., network issues, timeouts), waiting longer between successive attempts.
Circuit Breaker Pattern: Prevents cascading failures by temporarily blocking requests to a failing service.
Health Endpoint Monitoring: Regularly check service health to detect issues early.
Bulkhead Pattern: Isolates components to prevent one failure from affecting others.
Leader Election: Coordinates work across instances in a distributed system by electing one as the leader, which helps keep state consistent.
Graceful Shutdown: Handles application/service shutdown without disrupting users.
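
The first two patterns in this list can be sketched in a few lines of plain Python. This is a generic illustration under stated assumptions, not an Azure SDK API: `TransientError`, `CircuitOpenError`, `retry_with_backoff`, and `CircuitBreaker` are hypothetical names, and the delays and thresholds are arbitrary defaults. The breaker is also not thread-safe as written.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying TransientError with capped, jittered exponential delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # random jitter spreads out retries
    raise ValueError("max_attempts must be at least 1")


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: half-open, so let a trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

A caller would typically wrap each call to a remote dependency, for example `breaker.call(lambda: retry_with_backoff(fetch))`, where `fetch` stands in for whatever function talks to the dependency. The `sleep` and `clock` parameters exist only so the behaviour can be tested without waiting in real time.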

Closing

This primer aims simply to introduce the concept and cover the basic angles of SRE. Feel free to dig deeper, learn, and share.

References

It’s what happens when you ask a software engineer to design an operations team!

Organizations need to invest in training or hire experienced SREs

Educate and communicate the benefits of SRE to gain buy-in

Teams work together to ensure reliability and customer satisfaction

Users experience fewer disruptions, leading to increased trust in the service.

Monitor compliance with SLAs to ensure contractual commitments are met.


Published by Girish
