Imagine a system where incidents are detected and addressed autonomously, reducing downtime and enabling teams to stay ahead of potential failures. Our paper, "Autonomous Agents and Their Role in Incident Management," describes a new approach to operational resilience. Machine learning-driven agents help Site Reliability Engineers pinpoint root causes, prioritize critical events, and take immediate corrective action, often before human intervention is needed. The framework offers proactive incident management for high-stakes, high-load systems and helps teams keep services reliable at scale.
The increasing complexity of high-load systems has pushed Site Reliability Engineering (SRE) practices to evolve rapidly, with a growing focus on proactive measures for managing incidents and maintaining system stability. In traditional incident management, engineers rely heavily on reactive processes: manually sifting through monitoring dashboards, interpreting alerts, and deploying remediation. As systems scale, this approach becomes increasingly inefficient and error-prone. Our paper, "Autonomous Agents and Their Role in Incident Management," introduces a framework in which machine learning-driven autonomous agents streamline the incident management process, reducing downtime and enhancing operational resilience.
This framework leverages autonomous agents designed to monitor complex infrastructure in real time, identifying potential incidents and executing preemptive actions based on both historical data and contextual insights. Each agent specializes in a specific infrastructure component—such as databases, network switches, load balancers, or application metrics—enabling focused, deep-dive analysis without overwhelming engineers with redundant alerts. Agents are equipped with machine learning algorithms trained on historical incident data, which allows them to detect patterns and anomalies that could indicate impending system issues. These agents can also correlate data across different system layers, giving a holistic view that improves the accuracy of alerts and helps prioritize high-impact events.
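As a rough illustration of a specialized agent, the sketch below watches a single metric for one component and flags values that deviate sharply from recent history. The abstract does not say which detection algorithm the framework uses, so the rolling z-score here is only a simple stand-in for the trained models it describes. The class name `ComponentAgent`, its parameters, and the sample latencies are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class ComponentAgent:
    """Hypothetical agent watching one metric of one infrastructure component."""

    def __init__(self, component, window=30, threshold=3.0):
        self.component = component
        self.threshold = threshold           # z-score above which a value is flagged
        self.history = deque(maxlen=window)  # recent values treated as "normal"

    def observe(self, value):
        """Return True if `value` is anomalous relative to recent history."""
        anomalous = False
        if len(self.history) >= 5:  # wait for a few samples before judging
            mu = mean(self.history)
            sigma = stdev(self.history)
            # A perfectly flat history (sigma == 0) never flags in this sketch.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        if not anomalous:
            # Keep flagged values out of the baseline so a spike does not
            # immediately become the new normal.
            self.history.append(value)
        return anomalous


# Made-up DNS response times in milliseconds; the last one is a spike.
agent = ComponentAgent("dns", window=30, threshold=3.0)
latencies_ms = [20, 21, 19, 22, 20, 21, 20, 19, 95]
flags = [agent.observe(v) for v in latencies_ms]
```

In this example only the final sample is flagged. A production agent would presumably use a model trained on historical incidents, as the abstract describes, rather than a fixed statistical rule.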
A key feature of the autonomous agents is their ability to conduct root cause analysis independently. By using causality algorithms and event correlation techniques, agents can trace the origin of an issue through complex dependency chains, saving valuable time for SRE teams. For instance, an agent monitoring DNS may detect a sudden spike in response times and correlate this with recent network changes, identifying a configuration issue as the likely root cause. Once identified, the agent can either alert the engineering team or execute predefined remediation steps—such as reverting configurations, restarting services, or scaling resources. This capability empowers teams to mitigate incidents swiftly, reducing Mean Time to Recovery (MTTR) and preserving system stability even under high-load conditions.
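The DNS example can be sketched as a walk over a dependency graph. This is a simplified assumption, not the causality algorithm from the paper: it follows dependencies from the component that raised the alert, stays on chains where every component looks anomalous, and picks the anomalous component farthest upstream that also has a recent change. The function names, the graph, and the remediation table are hypothetical.

```python
from collections import deque


def find_root_cause(start, depends_on, anomalous, recent_changes):
    """Breadth-first walk over dependencies of `start`.

    Returns the anomalous component with a recent change that is the most
    hops away from `start`, or None if no such component is reachable.
    """
    best, best_depth = None, -1
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node not in anomalous:
            continue  # healthy component: do not follow this chain further
        if node in recent_changes and depth > best_depth:
            best, best_depth = node, depth
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, depth + 1))
    return best


# Hypothetical table of predefined remediation steps per component.
REMEDIATIONS = {"network": "revert_config"}


def decide(root, auto_remediate):
    """Either execute a predefined remediation or alert the engineers."""
    action = REMEDIATIONS.get(root)
    if action is not None and auto_remediate:
        return ("execute", action)
    return ("alert", root)


# Made-up topology: the API depends on DNS and a database; DNS depends on the network.
depends_on = {"checkout-api": ["dns", "db"], "dns": ["network"]}
anomalous = {"checkout-api", "dns", "network"}
recent_changes = {"network": "config push"}

root = find_root_cause("checkout-api", depends_on, anomalous, recent_changes)
decision = decide(root, auto_remediate=True)
```

Here the walk ends at `network`, the only anomalous component with a recent change, and the decision is to revert its configuration. A component with no predefined remediation falls back to alerting the team.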
Our framework also includes a feedback loop powered by reinforcement learning, enabling agents to learn from each incident. Engineers can provide feedback on false positives or low-impact anomalies, which the agent then uses to improve its future performance, reducing alert fatigue and refining incident classification. By building intelligence over time, these autonomous agents become more adept at distinguishing between actionable incidents and minor fluctuations, thereby minimizing the noise in incident management.
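One very small form of such a feedback loop is a multi-armed bandit over alert thresholds. The abstract does not describe the actual reinforcement learning setup, so the sketch below is an assumption for illustration only. Each candidate threshold is an action, engineer feedback is the reward (+1 for an actionable alert, -1 for a false positive), and the agent gradually prefers the threshold with the best average reward. `ThresholdBandit` and its parameters are hypothetical.

```python
import random


class ThresholdBandit:
    """Epsilon-greedy bandit that learns which alert threshold works best."""

    def __init__(self, thresholds, epsilon=0.1, seed=0):
        self.thresholds = list(thresholds)
        self.epsilon = epsilon          # probability of exploring a random threshold
        self.rng = random.Random(seed)  # seeded for reproducible exploration
        self.value = {t: 0.0 for t in self.thresholds}  # running mean reward
        self.count = {t: 0 for t in self.thresholds}

    def best(self):
        """Threshold with the highest estimated reward so far."""
        return max(self.thresholds, key=lambda t: self.value[t])

    def choose(self):
        """Pick a threshold for the next alerting period."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.thresholds)
        return self.best()

    def feedback(self, threshold, actionable):
        """Record engineer feedback for an alert raised under `threshold`."""
        reward = 1.0 if actionable else -1.0
        self.count[threshold] += 1
        n = self.count[threshold]
        # Incremental update of the mean reward for this threshold.
        self.value[threshold] += (reward - self.value[threshold]) / n


bandit = ThresholdBandit([2.0, 3.0, 4.0])
bandit.feedback(2.0, actionable=False)  # noisy alert, marked as a false positive
bandit.feedback(2.0, actionable=False)
bandit.feedback(3.0, actionable=True)
bandit.feedback(3.0, actionable=False)
bandit.feedback(4.0, actionable=True)   # alert led to real remediation
preferred = bandit.best()
```

After this feedback the bandit prefers the strictest threshold, which produced the fewest false positives. This sketch rewards only alerts that fired, so it cannot penalize missed incidents; a real system would need a signal for those as well.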
The paper provides an in-depth examination of real-world case studies where autonomous agents were deployed to monitor critical services in high-load environments, including insights into deployment challenges, edge cases, and operational outcomes. We detail the benefits of this autonomous framework in terms of reliability, operational efficiency, and engineer productivity. Finally, we discuss future directions, such as enhancing multi-agent collaboration and expanding machine learning models to handle increasingly complex system architectures.
By transforming incident management from a reactive to a proactive paradigm, autonomous agents play a crucial role in modern SRE, enabling systems to self-manage incidents at scale. This approach not only reduces downtime and operational costs but also frees engineers to focus on strategic improvements, thereby advancing the resilience and reliability of high-load systems.
The Program Committee has not yet taken a decision on this talk
Vamsi Krishna Rao
Salesforce
Hoang Dinh Nguyen
Viettel Group
Gursimar Singh
DevOps Consultant
Christoph Menzel
inovex GmbH
Sayan Mondal
Harness
Grzegorz Sztandera
Commerzbank
Aarno Aukia
VSHN - The DevOps Company
Edgar Mikayelyan
Qrator Labs
Rishabh Misra
Attentive Mobile Inc
Sergey Chubarov
Independent consultant
Tech Internals Conf is the leading conference for developers of complex and highly loaded systems