
The Balancing Act of Reliability

Yusuf Aytas

from Workday (Ireland)

About speaker

Yusuf Aytas, a Bilkent University graduate, is a seasoned software engineer, leader, and author. He started his career at startups and later held technical and leadership positions at renowned companies such as Amazon, Workday, and TripAdvisor.

About speaker's company

Workday unites finance and HR on a single AI-driven platform, empowering people, enabling fast decisions, and ensuring flawless operations to drive business

Abstracts

broad

Building software is just the beginning; ensuring its reliability is essential. Unreliable systems frustrate users and disrupt operations. The talk will cover principles including MTTR, MTTD, SLIs, SLOs, and more to enhance your application's dependability.


Once software is deployed, keeping it reliable is the real work. Unreliable systems lead to user dissatisfaction, operational disruptions, and damage to your reputation. In today's competitive market, customers won't hesitate to switch to competitors if your system fails them. This session delves into why reliability matters and how to measure and improve it effectively.

We'll explore key principles of reliability engineering, including:

Service Level Indicators (SLIs): Metrics that reflect your system's performance and health, such as latency, error rates, and availability.

Service Level Objectives (SLOs): Targets set for SLIs that define acceptable performance levels to meet user expectations.

Error Budgets: The amount of unreliability an SLO permits over a given period (a 99.9% availability SLO, for example, leaves a 0.1% budget), helping balance new features with system stability.

Mean Time to Detect (MTTD): Measures how quickly your monitoring systems identify issues, emphasizing the importance of effective detection.

Mean Time to Repair (MTTR): Assesses how swiftly you can recover from issues, highlighting the efficiency of your incident response.
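As a rough illustration of how the metrics above relate to each other, here is a minimal Python sketch. It is not taken from the talk: the `Incident` record, the function names, and the sample numbers are all hypothetical, and it assumes MTTR is measured from the start of the fault to its resolution (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when the fault began
    detected: datetime   # when monitoring (or a person) noticed it
    resolved: datetime   # when service was restored


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = 1.0 - slo   # unreliability the SLO permits
    spent = 1.0 - sli     # unreliability actually observed
    return 1.0 - spent / allowed


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from fault start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from fault start to resolution (an assumed definition)."""
    return mean_delta([i.resolved - i.started for i in incidents])


# Made-up sample data for illustration.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget_left = error_budget_remaining(sli, slo=0.999)

incidents = [
    Incident(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 4),
             datetime(2024, 5, 1, 10, 34)),
    Incident(datetime(2024, 5, 2, 14, 0), datetime(2024, 5, 2, 14, 10),
             datetime(2024, 5, 2, 14, 50)),
]

print(f"SLI: {sli:.4%}, error budget left: {budget_left:.0%}")
print(f"MTTD: {mttd(incidents)}, MTTR: {mttr(incidents)}")
```

With these sample numbers, a 99.95% measured availability against a 99.9% SLO leaves roughly half of the error budget, and the two incidents average 7 minutes to detect and 42 minutes to resolve.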

Attendees will learn practical tools and strategies to build systems that detect issues promptly, recover quickly, and maintain functionality during failures. We'll discuss real-life scenarios and best practices, and examine tools that help developers create resilient systems capable of graceful degradation and rapid recovery.

This talk is ideal for software engineers, architects, DevOps professionals, and anyone involved in software development who is keen on enhancing application dependability and maintaining customer trust.

The Program Committee has not yet made a decision on this talk.

Other talks on this topic

How do we deliver Agile Service Management?

Cristan Massey

Pearson Education

specific
Empowering Developers: Building an Application Catalogue with Crossplane

Aarno Aukia

VSHN - The DevOps Company

specific
How to Measure PromQL/MetricsQL Expression Complexity

Roman Khavronenko

VictoriaMetrics

specific
Behind the curtain of PowerShell cmdlets

Sergey Chubarov

Independent consultant

specific
DevOps done right: RBAC

Daniel Drack

FullStackS GmbH

specific
An Intro to Kubernetes Hardening

Ayesha Kaleem

MBition GmbH

broad
Actionable Observability

Lesley Cordero

The New York Times

broad
Delivering SaaS on-prem with Cloud-native tools

George Hantzaras

MongoDB

specific
Knowledge Discovery Efficiency: The FeedHenry Case Study

Benjamin Igna

Stellar Work GmbH

specific
Autonomous Agents and Their Role in Incident Management

Yoseph Reuveni

Not Affiliated

specific
CRaCing Java Snapshots

Pasha Finkelshteyn

BellSoft

specific
Platform Engineering for a Greener Future

Pini Reznik

re:cinq

broad
Pentesting Kubernetes Services in the Cloud

Sergey Chubarov

Independent consultant

specific
Securing K8s: back and forth to RBAC Enforce

Roman Levkin

Exness

specific
DevOps for AI: running LLMs in production with Kubernetes and KubeFlow

Aarno Aukia

VSHN - The DevOps Company

specific
CNCF sandbox project k8up under the hood

Aarno Aukia

VSHN - The DevOps Company

specific
AI for Next-Gen Security: OpenAI and Copilot for Security Synergy

Sergey Chubarov

Independent consultant

specific
K8s load testing at scale with k6-operator

Ant(on) Weiss

PerfectScale

specific
Guarding the ML Galaxy: Beyond Accuracy to Privacy and Security

Rishabh Misra

Attentive Mobile Inc

broad
Reduce Alert Fatigue with AIOps

Birol Yildiz

ilert GmbH

broad