Tech Internals Conf is the largest conference for developers of complex and high-load systems

The Essentiality Of SRE When Cloud Providers Fail: Outages, Mitigation, and Postmortems

Sayan Mondal, from Harness (India)

About speaker

Senior Software Engineer II at Harness, Maintainer & LFX Mentor of LitmusChaos

Sayan Mondal is a Senior Software Engineer II at Harness, where he builds the company's Chaos Engineering platform and helps shape the customer experience market.

About speaker's company

Harness is a modern software delivery platform that focuses on automating and simplifying Continuous Integration (CI) and Continuous Delivery (CD) pipelines, with a strong emphasis on reliability, efficiency, and ease of use. Known for its powerful suite of DevOps tools, Harness helps engineering teams streamline deployments, manage costs, and improve overall developer productivity. The platform includes modules for feature flagging, cloud cost management, chaos engineering, and more, making it a comprehensive solution for organizations looking to enhance their software delivery practices with automation and intelligence.

Abstracts

With business-critical functions now heavily reliant on the cloud, addressing potential cloud outages is vital yet challenging. While multi-cloud and multi-regional architectures boost resilience, they also bring trade-offs like increased complexity, costs, and latency. Achieving the ideal balance between delivery speed and operational safety requires a strategic approach. This talk focuses on leveraging observability, chaos engineering, and postmortem analysis to navigate these complexities, guiding teams to prioritize the right cloud and DevOps capabilities for resilient systems that don’t compromise agility.

As a growing number of companies entrust the cloud with critical business functions, being prepared for significant outages at public cloud providers is essential.

Several methods can be used to remediate failures; however, each has its drawbacks. A multi-cloud, multi-vendor setup amplifies complexity. Multi-regional architectures boost resilience but entail cost and latency considerations. Targeting 100% reliability leads to diminishing returns: beyond a certain threshold, each further gain in reliability requires significantly more effort.
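The diminishing-returns point can be made concrete with some simple arithmetic. The following is a minimal sketch, not material from the talk: the availability targets and the helper name `allowed_downtime_minutes` are assumptions chosen for illustration. It converts an availability target into a yearly downtime budget.

```python
# Illustrative only: the availability targets below are assumed examples,
# not figures from the talk.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Each additional "nine" cuts the downtime budget by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```

Under these assumptions, a 99% target leaves roughly 3.65 days of downtime per year, while 99.999% leaves only about five minutes. Each extra "nine" shrinks the budget tenfold, which is why the effort to reach it tends to grow so steeply.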

This talk explores how to strike the right balance between speed and safety in delivery. It sheds light on three main capabilities that support this balance: observability, chaos engineering, and postmortem analysis, along with the cloud and DevOps capabilities companies should emphasize.

The Program Committee has not yet taken a decision on this talk