reliability | The fastest way for busy developers to keep up with technologies 🚀

Story

@squadcast shared a post, 1 year, 1 month ago

Striking a Balance: Reliability Management for Innovation-Driven Companies

#reliabi... #inciden... #reliabi...

This blog post dives into the world of reliability management for SRE teams. It emphasizes the importance of achieving a balance between innovation and system stability. The article explores various frameworks and best practices that SRE teams can leverage to achieve this equilibrium. Some of the key takeaways include implementing SLOs and error budgets, adopting DevOps practices, and utilizing Infrastructure as Code (IaC). The blog also highlights the importance of fostering a culture of collaboration and learning within the SRE team.
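
As a rough illustration of the SLO and error budget idea mentioned in this summary, here is a minimal sketch. It is not taken from the linked post: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the 30-day window is an assumed default.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window.

    slo_target is a fraction such as 0.999 (hypothetical parameter name).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    print(error_budget_minutes(0.999))                  # minutes of budget in 30 days
    print(budget_remaining(0.999, 999_500, 1_000_000))  # share of budget left
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43.2 minutes of budget, and 500 failed requests out of one million spend about half of it.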

937 views

Story

@boldlink shared a post, 2 years, 11 months ago

AWS DevOps Consultancy, Boldlink

An Overview of AWS Well-Architected Framework

#Perform... #Securit... #reliabi... #cost #aws

Thinking of getting started with AWS cloud computing or migrating your existing workloads to AWS? Here is a quick guide on how the 5 pillars of the AWS Well-Architected Framework will help you build a secure, high-performing, resilient and efficient cloud infrastructure for your workloads. So basically..

2k views

Story

@yair_stark shared a post, 3 years, 4 months ago

Error Budget Is All You Need - Part 2

#monitor... #reliabi... #slo

In part 1, I proposed a simple modification to Google’s Multi-Window Multi-Burn Rate alerting setup and showed how it handles services with varying traffic as well as typical latency SLOs.

2k views

Story

@yair_stark shared a post, 3 years, 4 months ago

Error Budget Is All You Need - Part 1

#reliabi... #slo

One of the great chapters in Google’s second Site Reliability Engineering (SRE) book is chapter 5, Alerting on SLOs (Service Level Objectives). The chapter walks through several SLO alerting setups, starting with the simplest, non-optimized one and iterating toward a final setup that is optimized with respect to the four main alerting attributes: recall, precision, detection time and reset time.
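
The multi-window, multi-burn-rate idea from that chapter can be sketched roughly as follows. This is a simplified illustration, not the author's modified setup: `burn_rate` and `should_alert` are hypothetical names, and the 14.4 threshold is the figure usually paired with a 1-hour long window and a 5-minute short window for a 30-day SLO.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1 spends exactly the whole budget over the SLO period.
    """
    return error_ratio / (1.0 - slo_target)


def should_alert(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float,
) -> bool:
    """Fire only when both windows exceed the burn-rate threshold.

    The long window provides precision; the short window keeps reset
    time low by clearing the alert soon after the errors stop.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )


if __name__ == "__main__":
    slo = 0.999  # assumed 99.9% availability target
    # Sustained 2% errors in both windows: the alert fires.
    print(should_alert(0.02, 0.02, slo, threshold=14.4))
    # Errors already stopped in the short window: the alert resets.
    print(should_alert(0.02, 0.0005, slo, threshold=14.4))
```

Requiring both windows to breach the threshold is what lets the alert clear quickly once the error rate drops, even though the long window still carries the earlier errors.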

2k views

Story

@tharunshiv shared a post, 3 years, 5 months ago

Site Reliability Engineer, PhonePe

#1 What's Site Reliability Engineering [SRE] | Roles & Responsibilities | Technologies involved

#SRE #enginee... #enginee... #site #reliabi...

Site Reliability Engineering, popularly referred to as SRE, is a role in computer science engineering whose main purpose is to provision, maintain, monitor, and manage infrastructure in order to maximize application uptime and reliability. SRE is an emerging role, but the tasks an SRE performs have existed ever since the first application was developed. A software developer's scope ends with writing the code for the application. Everything after that is the duty of an SRE: setting up the infrastructure and the services that run on it, providing the required network connectivity and a platform for the application to run on, and making sure every part of the application is up and running reliably 24x7. In fact, Site Reliability Engineers can be considered a strong bridge between the users and a reliable application.

2k views

Link

@prathamesh-sonpatki shared a link, 1 year, 11 months ago

SRE, Last9.io

MTTF vs. MTBF vs. MTTD vs. MTTR

#MTTR #reliabi... #Softwar... #observa...