SLOs for Critical User Interactions
Anderson Carneiro · September 3, 2024 · 9 min read
AutoScout24 has developed a distinct approach to implementing Service Level Objectives (SLOs), focusing on what matters most for our users and the business. The shape we gave to SLOs connects Business and Engineering. This blog post walks you through our journey, the challenges we faced, and the strategies we employed to create effective and actionable SLOs.
What is Critical for AutoScout24’s Users?
SLOs are measurable goals designed to ensure a service’s reliability and performance from the user’s perspective. While the concept is straightforward, implementing SLOs can be challenging, particularly in defining the appropriate scope — an important aspect in a microservices environment. Unlike simple monitors, SLOs require teams to invest considerable effort in improving telemetry, creating meaningful indicators, and aligning on objectives. Therefore, it is neither practical nor necessary to implement SLOs for every technical service; they should be reserved for those services that are most critical to the user experience.
AutoScout24 serves three distinct types of users, each with different objectives:
- The Car Buyer: An individual looking for a car on our website.
- The Dealer: A business that sells cars.
- The Private Seller: An individual trying to sell a car.
Each user follows a unique journey on AutoScout24, consisting of various user interactions. Critical User Interactions (CUIs) are those essential steps that must work flawlessly for users to achieve their goals. For example, Buyers must be able to search for cars, and Dealers must receive emails triggered by the platform.
The Buyer’s journey begins even before they arrive at AutoScout24, often triggered by life events such as starting a new family or beginning a new job. From the moment they identify the need for a car to receiving a proposal from a car Dealer, AutoScout24 offers various services to assist the Buyer. It is crucial that these critical services have SLOs configured to ensure a seamless and reliable experience for the Buyer.
The Need for Actionable SLOs
This is not the first time we have invested in SLOs. Our previous attempts were scattered and often failed to engage business stakeholders in questions of service reliability. Despite introducing tools to support teams in creating SLOs, we encountered several challenges:
- Overly Technical and Non-Actionable SLOs: Many SLOs were designed with a focus on technical metrics that did not reflect user experience, lacking clear triggers for alarms and actionable steps when breaches occurred.
- Lack of scope: The scope of SLOs was not well-defined, leading to the creation of SLOs across numerous services. This resulted in a significant implementation effort with unclear benefits.
- Insufficient Product Manager Engagement: Product managers were not adequately involved in the SLO creation and management process, leading to misalignment with business objectives.
- Knowledge Deficiencies: There was a lack of understanding regarding the proper handling and management of SLOs.
- Neglected and Abandoned SLOs: After initial implementation, many SLOs never became part of the teams' regular routine and became obsolete due to a lack of maintenance.
Important: SLOs answer a fundamental question: Are our services reliable enough to meet user expectations? Without data and a standard way to measure reliability, new features could inadvertently cause incidents, and over-investing in reliability could increase costs unnecessarily.
To address these issues, we developed a comprehensive plan to make SLOs useful:
1. Awareness through training
We introduced courses through our internal learning tool and live workshops, ensuring engineers and product managers understand the importance and implementation of SLOs.
2. Product and Management Buy-in
We secured commitment from product managers and senior management to support the SLO initiative.
3. Tooling for SLO Implementation
We developed tools to simplify the implementation and organization of SLOs.
4. Defining Critical User Interactions
With the support of UX and product teams, we identified CUIs for each user journey. Examples include:
- Buyer browses and filters cars.
- Dealer is contacted by a Buyer about a car.
- Private Seller books an appointment to sell a car.
For each CUI, we defined:
- Description: What the interaction involves.
- Failure Modes: Potential issues that could occur.
- User Impact: Consequences for the user if the interaction fails.
- Business Impact: How key performance indicators (KPIs) are affected.
- Technical and Business Owners: Responsible parties for each CUI.
This metadata defines a CUI and is made available in dashboards and documentation so that the organization knows what each interaction means and how it affects users and the business.
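As a minimal sketch, the CUI metadata listed above could be captured in a small record type. The class name, field names, and the example values (failure modes, owner names) are hypothetical and only illustrate the shape; they are not taken from our actual tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CriticalUserInteraction:
    """Metadata for one CUI; fields mirror the list above (names are assumptions)."""
    name: str
    journey: str                      # e.g. "buyer", "dealer", "private-seller"
    description: str
    failure_modes: Tuple[str, ...]
    user_impact: str
    business_impact: str
    technical_owner: str
    business_owner: str

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [key for key, value in vars(self).items() if not value]


# Hypothetical example entry for the "Buyer browses and filters cars" CUI.
search = CriticalUserInteraction(
    name="buyer-browses-and-filters-cars",
    journey="buyer",
    description="Buyer browses and filters cars.",
    failure_modes=("search returns errors", "results load too slowly"),
    user_impact="The Buyer cannot find a suitable car.",
    business_impact="",  # not yet agreed with the business owner
    technical_owner="hypothetical-search-team",
    business_owner="hypothetical-product-lead",
)

print(search.missing_fields())  # ['business_impact']
```

A completeness check like `missing_fields` is one way to spot CUIs whose definition is not finished before they are published in dashboards or documentation.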
Having owners that are part of high-level management brings importance to the topic and supports decisions.
5. Translating Error Budgets into KPI Impact
We created a framework to translate error budgets into potential KPI impacts, helping teams make data-driven decisions.
I’ll illustrate the concept using a hypothetical company that connects Buyers and Sellers of toy cars. At “ToyCarScout24”, part of the revenue comes from lead generation. Toy Car Sellers pay for our service because we connect Buyers to them, and the more Buyers contact a Seller, the more valuable our service is to them. This can be expressed as a monetary value.
Framed this way, the Error Budget is not merely described in terms of minutes of downtime. It also encompasses the business impact, which is key for informed decision-making. By integrating the Error Budget with key performance indicators (KPIs), we can translate technical downtime into tangible business consequences, aligning with the language and priorities of management for company-wide decisions.
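The translation from error budget to monetary value can be sketched with simple arithmetic. The following is an illustration for the hypothetical ToyCarScout24 only: the function names, the 30-day window, the lead rate, and the revenue per lead are all assumptions, and the real framework may weigh KPIs differently.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed failure in the window for a time-based availability SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60


def error_budget_as_revenue(
    slo_target: float,
    leads_per_minute: float,
    revenue_per_lead: float,
    window_days: int = 30,
) -> float:
    """Revenue at risk if the whole error budget is spent (assumes a constant lead rate)."""
    minutes = error_budget_minutes(slo_target, window_days)
    return minutes * leads_per_minute * revenue_per_lead


# Hypothetical numbers: 99.9% target, 2 leads per minute, 5.00 in revenue per lead.
budget = error_budget_minutes(0.999)
at_risk = error_budget_as_revenue(0.999, leads_per_minute=2, revenue_per_lead=5.0)

print(f"{budget:.1f} minutes of budget")   # 43.2 minutes of budget
print(f"{at_risk:.1f} revenue at risk")    # 432.0 revenue at risk
```

Expressing the same budget in both units lets engineers reason in minutes while management sees the corresponding exposure in revenue.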
Consider a comparison between iOS Push and Web Push indicators. In this scenario, Web Push Availability has a higher SLO than iOS Push Availability, even though the potential revenue generated by iOS Pushes is significantly greater. Several factors could contribute to this imbalance, such as the team’s focus on developing new features for both Web and mobile apps, the challenges of maintaining legacy code, or a lack of understanding of the business impact. This raises a crucial question: Will developing a new feature generate more revenue than improving reliability and increasing the iOS Push SLO to 99%?
By framing the discussion around business impact, we can guide more strategic decisions that effectively balance innovation with reliability, ensuring that resources are allocated where they provide the most value.
Workshops for Implementation
We organized workshops with teams and product managers responsible for customer-facing APIs. These workshops produced:
- Diagrams of user flows and technical service dependencies.
- Identification of dependencies on other teams or external services (e.g., AWS).
- Detailed definitions of SLOs covering all failure modes.
Teams then planned the implementation of their SLOs based on these specifications.
Ensuring Quality
We tracked the implementation across all involved teams, providing technical and conceptual support, while keeping management informed of our progress. To avoid past mistakes where SLOs were implemented but not used, we established a quality score for SLO implementation. Each CUI is evaluated based on criteria like:
- Setting Sensible SLOs:
  - Latency and availability SLOs.
  - Reflecting the user experience.
  - Knowing the business impact.
  - Realistic targets aligned with business goals.
  - Covering all failure modes.
- Acting on SLOs:
  - Configured alerts.
  - Defined escalation paths for breaches.
  - Runbooks for incident response.
  - Usage in decision-making processes.
High-quality SLOs meet at least 50% of the criteria in both categories, ensuring they are not only implemented but also actively used. Teams must work to reach this quality goal.
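One way to compute such a score is sketched below. The criterion identifiers are hypothetical names for the items listed above, and weighting every criterion equally is an assumption; the post only states the 50% threshold per category.

```python
from typing import Dict, Set

# Criteria per category, taken from the list above (identifiers are assumptions).
CRITERIA = {
    "setting_sensible_slos": [
        "latency_and_availability_slos",
        "reflects_user_experience",
        "business_impact_known",
        "realistic_targets",
        "covers_all_failure_modes",
    ],
    "acting_on_slos": [
        "alerts_configured",
        "escalation_paths_defined",
        "runbooks_exist",
        "used_in_decision_making",
    ],
}


def quality_scores(met: Set[str]) -> Dict[str, float]:
    """Fraction of criteria met in each category, weighting all criteria equally."""
    return {
        category: sum(item in met for item in items) / len(items)
        for category, items in CRITERIA.items()
    }


def is_high_quality(met: Set[str], threshold: float = 0.5) -> bool:
    """True when every category reaches the threshold (50% by default)."""
    return all(score >= threshold for score in quality_scores(met).values())


# A CUI with well-defined SLOs that nobody acts on yet.
met_criteria = {
    "latency_and_availability_slos",
    "reflects_user_experience",
    "realistic_targets",
    "alerts_configured",
}

print(quality_scores(met_criteria))   # {'setting_sensible_slos': 0.6, 'acting_on_slos': 0.25}
print(is_high_quality(met_criteria))  # False
```

Requiring the threshold in both categories, rather than on the overall average, is what prevents a well-specified but unused SLO from counting as high quality.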
Measuring Reliability
We monitor error budget consumption to gauge service reliability:
- Unreliable: When the error budget is depleted.
- Outage: When the burn rate is very high.
We provide an overview of the company’s reliability using SLOs. High burn rates indicate potential incidents, prompting teams to investigate and address issues. Depleted error budgets signal the need for maintenance work or target adjustments.
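The two signals can be sketched as follows. Burn rate is computed here in the usual way, as the observed error rate divided by the error rate the SLO allows. The function names, the "reliable" label for the healthy case, and the threshold of 10 for a "very high" burn rate are assumptions for illustration, not our production values.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window; higher values exhaust it sooner.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo_target)


def classify(
    budget_remaining: float,
    recent_burn_rate: float,
    outage_burn_rate: float = 10.0,  # hypothetical threshold for "very high"
) -> str:
    """Map error budget state to the labels used above."""
    if recent_burn_rate >= outage_burn_rate:
        return "outage"
    if budget_remaining <= 0:
        return "unreliable"
    return "reliable"


# 50 failed requests out of 1000 against a 99% target burns budget 5x too fast.
rate = burn_rate(bad_events=50, total_events=1000, slo_target=0.99)
print(f"{rate:.1f}")                                            # 5.0
print(classify(budget_remaining=0.4, recent_burn_rate=rate))    # reliable
print(classify(budget_remaining=0.4, recent_burn_rate=12.0))    # outage
print(classify(budget_remaining=-0.1, recent_burn_rate=1.2))    # unreliable
```

Checking the burn rate first reflects the idea that an ongoing incident needs attention immediately, whereas a depleted budget calls for planned maintenance work or a target adjustment.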
Tooling
Our technical solution leverages Terraform and Datadog to create and manage SLOs. The tool simplifies complexity and ensures critical alerts reach the teams. We organize SLOs using tags (e.g., cui, journey, team, service, system) to extract precise data on reliability issues and affected users or services.
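To illustrate the tagging idea only, the sketch below uses plain Python rather than our Terraform and Datadog setup. It builds tags in the `key:value` form that Datadog uses and selects SLOs by tag. The helper names and the example SLO entries are hypothetical.

```python
from typing import Dict, List

# Tag keys mentioned above; anything else is rejected to keep tagging consistent.
TAG_KEYS = ("cui", "journey", "team", "service", "system")


def build_tags(**values: str) -> List[str]:
    """Build `key:value` tags in a fixed key order, rejecting unknown keys."""
    unknown = set(values) - set(TAG_KEYS)
    if unknown:
        raise ValueError(f"unknown tag keys: {sorted(unknown)}")
    return [f"{key}:{values[key]}" for key in TAG_KEYS if key in values]


def slos_with_tag(slos: List[Dict], tag: str) -> List[str]:
    """Return the names of all SLOs carrying the given tag."""
    return [slo["name"] for slo in slos if tag in slo["tags"]]


# Hypothetical SLO inventory.
slos = [
    {
        "name": "search-availability",
        "tags": build_tags(cui="buyer-browses-and-filters-cars", journey="buyer", team="search"),
    },
    {
        "name": "lead-email-delivery",
        "tags": build_tags(cui="dealer-is-contacted-by-buyer", journey="dealer", team="messaging"),
    },
]

print(slos_with_tag(slos, "journey:buyer"))  # ['search-availability']
```

Restricting tags to a fixed set of keys is a design choice that keeps queries such as "all SLOs for the Buyer journey" reliable across teams.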
Outcomes
As teams are refining their SLOs and the process is still very recent, it is hard to quantify the success of our investment in SLOs for Critical User Interactions. However, we can list some important achievements after reintroducing SLOs:
- By defining the Critical User Interactions, we have gained a clearer understanding of which technical services are essential. This definition now serves as a foundation for various initiatives, including Quality Assurance.
- Everything critical for users and the business has undergone a thorough review and received sensible monitoring, which is continuously improved.
- The organization has a clear, data-driven language that shows how reliable we are.
- SLOs on CUIs sparked reliability alignment between teams that own dependencies.
- SLOs were used to identify a lack of telemetry on hidden and important external resources.
Challenges
Since this is the first time that the majority of the teams are acting on SLOs, some alarms are sounding too frequently. This frequency indicates that some teams, lacking proper data, set targets that were too ambitious. Understanding the real state of our services is valuable, and the refinement process is fundamental. It will provide clarity on the exact level of reliability our services can achieve and supply vital data to help teams prioritize their backlogs.
Enforcing a policy based on data is another area that requires time. It is unrealistic to expect that decision-making around reliability, which used to be done differently, will change in a matter of months. It takes time for the concepts to settle and for everyone to be aligned.
Conclusion
Implementing SLOs is not a one-time initiative but a cultural shift towards data-driven decision-making. SLOs are not just a technical solution for engineers; they are a framework for decision-making that involves engineers, product managers, and management. Despite its simplicity, SLO implementation can become complex and unusable without proper tools and management buy-in. Success requires continuous progress tracking, quality assurance, and alignment with business and user expectations. Ultimately, this approach leads to more reliable services, better decision-making, and improved user and business outcomes.
By focusing on what matters most for our users and the business, AutoScout24 has created a robust framework for implementing and using SLOs effectively. This ensures our platform remains reliable, meets user expectations, and supports our business goals.