Ensuring performance and stability in large-scale, high-load systems requires more than reactive measures: it demands a proactive approach grounded in Site Reliability Engineering (SRE) and operational excellence. This session offers key insights into maintaining such systems, with a focus on multi-petabyte storage, observability, and automation to prevent downtime and improve reliability.
Running large-scale systems in production is a balancing act between performance, stability, and operational efficiency. In this session, I will explore the key principles of Site Reliability Engineering (SRE) and DevOps that keep high-load systems running smoothly, focusing on the management of multi-petabyte storage and cloud environments. From managing large data pipelines to automating incident response, the talk shows how observability and automation help build reliable systems that minimize downtime and improve performance. Attendees will learn how to apply best practices in monitoring, alerting, and stress testing so that their systems remain resilient under heavy load. Real-world examples will highlight the importance of proactive problem-solving and the lessons learned from addressing operational bottlenecks in distributed systems.