Operational Excellence in Large-Scale Systems: Ensuring Performance and Stability in High-Load Env

Vamsi Krishna Rao

from Salesforce (USA)

About speaker

Principal Storage Engineer, Salesforce

Vamsi is an enterprise infrastructure architect with over 20 years of experience specializing in SAN, NAS, cloud, and distributed storage technologies.

Abstracts

broad

Ensuring performance and stability in large-scale, high-load systems takes more than reactive measures: it demands a proactive approach built on Site Reliability Engineering (SRE) and operational excellence. This session offers key insights into maintaining such systems, with a focus on multi-petabyte storage, observability, and automation to prevent downtime and improve reliability.

Running large-scale systems in production is a balancing act between performance, stability, and operational efficiency. In this session, I will explore the key principles of Site Reliability Engineering (SRE) and DevOps that ensure the smooth operation of high-load systems, focusing on the management of multi-petabyte storage and cloud environments. From managing large data pipelines to automating incident response, this talk will provide insights into how to create reliable systems that minimize downtime and improve performance through observability and automation. Attendees will learn how to implement best practices in monitoring, alerting, and stress testing, ensuring that their systems remain resilient under heavy loads. Real-world examples will highlight the importance of proactive problem-solving and the lessons learned from addressing operational bottlenecks in distributed systems.

The Program Committee has not yet taken a decision on this talk

Other talks on this topic

Actionable Observability

Lesley Cordero

The New York Times

broad

Delivering SaaS on-prem with Cloud-native tools

George Hantzaras

MongoDB

specific

DevOps done right: RBAC

Daniel Drack

FullStackS GmbH

specific

How to Measure PromQL/MetricsQL Expression Complexity

Roman Khavronenko

VictoriaMetrics

specific

Securing K8s: back and forth to RBAC Enforce

Roman Levkin

Exness

specific

Platform Engineering for a Greener Future

Pini Reznik

re:cinq

broad

Reduce Alert Fatigue with AIOps

Birol Yildiz

ilert GmbH

broad

CRaCing Java Snapshots

Pasha Finkelshteyn

BellSoft

specific

The Balancing Act of Reliability

Yusuf Aytas

Workday

broad

Pentesting Kubernetes Services in the Cloud

Sergey Chubarov

Independent consultant

specific

An Intro to Kubernetes Hardening

Ayesha Kaleem

MBition GmbH

broad

Behind the curtain of PowerShell cmdlets

Sergey Chubarov

Independent consultant

specific

Empowering Developers: Building an Application Catalogue with Crossplane

Aarno Aukia

VSHN - The DevOps Company

specific

Autonomous Agents and Their Role in Incident Management

Yoseph Reuveni

Not Affiliated

specific

How do we deliver Agile Service Management?

Cristan Massey

Pearson Education

specific

CNCF sandbox project k8up under the hood

Aarno Aukia

VSHN - The DevOps Company

specific

K8s load testing at scale with k6-operator

Ant(on) Weiss

PerfectScale

specific

DevOps for AI: running LLMs in production with Kubernetes and KubeFlow

Aarno Aukia

VSHN - The DevOps Company

specific

AI for Next-Gen Security: OpenAI and Copilot for Security Synergy

Sergey Chubarov

Independent consultant

specific

Guarding the ML Galaxy: Beyond Accuracy to Privacy and Security

Rishabh Misra

Attentive Mobile Inc

broad

Knowledge Discovery Efficiency: The FeedHenry Case Study

Benjamin Igna

Stellar Work GmbH

specific