
The Essentiality Of SRE When Cloud Providers Fail: Outages, Mitigation, and Postmortems

Sayan Mondal

from Harness (India)

About speaker

Senior Software Engineer II at Harness, Maintainer & LFX Mentor of LitmusChaos

Sayan Mondal is a Senior Software Engineer II at Harness, where he builds the company's Chaos Engineering platform and helps it shape the customer experience market.

About speaker's company

Harness is a modern software delivery platform that focuses on automating and simplifying Continuous Integration (CI) and Continuous Delivery (CD) pipelines, with a strong emphasis on reliability, efficiency, and ease of use. Known for its powerful suite of DevOps tools, Harness helps engineering teams streamline deployments, manage costs, and improve overall developer productivity. The platform includes modules for feature flagging, cloud cost management, chaos engineering, and more, making it a comprehensive solution for organizations looking to enhance their software delivery practices with automation and intelligence.

Abstracts

specific

With business-critical functions now heavily reliant on the cloud, addressing potential cloud outages is vital yet challenging. While multi-cloud and multi-regional architectures boost resilience, they also bring trade-offs like increased complexity, costs, and latency. Achieving the ideal balance between delivery speed and operational safety requires a strategic approach. This talk focuses on leveraging observability, chaos engineering, and postmortem analysis to navigate these complexities, guiding teams to prioritize the right cloud and DevOps capabilities for resilient systems that don’t compromise agility.


In a world where a growing number of companies are entrusting the cloud with critical business functions, dealing with significant outages from public cloud providers is essential.

Several methods can be used to remediate failures, but each has its drawbacks. A multi-cloud, multi-vendor setup amplifies complexity. Multi-regional architectures boost resilience but entail cost and latency considerations. Targeting 100% reliability leads to diminishing returns: beyond a certain threshold, each further gain in reliability requires significantly more effort.
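To make the diminishing-returns point concrete, the short Python sketch below converts availability targets into a yearly downtime budget and estimates the combined availability of several regions. It is an illustration only, not part of the talk: the function names are hypothetical, a year is taken as 365 days, and regional failures are assumed to be independent, which real provider-wide outages often violate.

```python
# Illustrative arithmetic only; names and assumptions are not from the talk.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget, in minutes, for an availability target in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single_region: float, regions: int) -> float:
    """Availability of N redundant regions, ASSUMING independent failures.

    The service is down only when every region is down at the same time.
    """
    return 1.0 - (1.0 - single_region) ** regions


if __name__ == "__main__":
    # Each extra "nine" cuts the downtime budget by a factor of ten.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        budget = downtime_minutes_per_year(target)
        print(f"{target:.3%} availability allows {budget:.1f} minutes of downtime per year")

    # Redundancy raises availability, at the price of cost and complexity.
    for n in (1, 2, 3):
        print(f"{n} region(s) at 99% each: {combined_availability(0.99, n):.6f}")
```

Under these assumptions, a 99% target leaves roughly 3.65 days of downtime per year, while 99.999% leaves only about 5 minutes, so each additional nine shrinks the budget tenfold while the engineering effort to stay within it keeps growing.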

This talk is about striking the right balance between speed and safety in delivery. It will shed light on three main capabilities that support that balance: observability, chaos engineering, and postmortem analysis, combined with the cloud and DevOps capabilities companies should emphasize.

The Program Committee has not yet taken a decision on this talk

Other talks on this topic

Actionable Observability
Lesley Cordero
The New York Times
broad

Delivering SaaS on-prem with Cloud-native tools
George Hantzaras
MongoDB
specific

The Balancing Act of Reliability
Yusuf Aytas
Workday
broad

AI for Next-Gen Security: OpenAI and Copilot for Security Synergy
Sergey Chubarov
Independent consultant
specific

CNCF sandbox project k8up under the hood
Aarno Aukia
VSHN - The DevOps Company
specific

K8s load testing at scale with k6-operator
Ant(on) Weiss
PerfectScale
specific

Guarding the ML Galaxy: Beyond Accuracy to Privacy and Security
Rishabh Misra
Attentive Mobile Inc
broad

Platform Engineering for a Greener Future
Pini Reznik
re:cinq
broad

Autonomous Agents and Their Role in Incident Management
Yoseph Reuveni
Not Affiliated
specific

Reduce Alert Fatigue with AIOps
Birol Yildiz
ilert GmbH
broad

An Intro to Kubernetes Hardening
Ayesha Kaleem
MBition GmbH
broad

Empowering Developers: Building an Application Catalogue with Crossplane
Aarno Aukia
VSHN - The DevOps Company
specific

Knowledge Discovery Efficiency: The FeedHenry Case Study
Benjamin Igna
Stellar Work GmbH
specific

DevOps for AI: running LLMs in production with Kubernetes and KubeFlow
Aarno Aukia
VSHN - The DevOps Company
specific

DevOps done right: RBAC
Daniel Drack
FullStackS GmbH
specific

Securing K8s: back and forth to RBAC Enforce
Roman Levkin
Exness
specific

Pentesting Kubernetes Services in the Cloud
Sergey Chubarov
Independent consultant
specific

How to Measure PromQL/MetricsQL Expression Complexity
Roman Khavronenko
VictoriaMetrics
specific

How do we deliver Agile Service Management?
Cristan Massey
Pearson Education
specific

Behind the curtain of PowerShell cmdlets
Sergey Chubarov
Independent consultant
specific

CRaCing Java Snapshots
Pasha Finkelshteyn
BellSoft
specific