Beyond the 3 Pillars of Observability – InApps 2022

What Observability Is Not

Today, there are many who define observability as a collection of data types — the three pillars: logs, metrics, and distributed traces. Rather than focusing on the outcome, this siloed approach to observability is overly focused on technical instrumentation and underlying data formats.

Simply having systems emit all three data types doesn’t guarantee better outcomes. What’s more, many companies find little correlation between the amount of observability data produced and the value derived from this data.

Break Observability Down into 3 Phases

We’re not the first to criticize the three pillars. We agree with much of the critique that others — like Charity Majors and Ben Sigelman — have put out there. Instead of the three pillars of observability, we’ve developed an approach to observability that is focused on the outcomes instead of the inputs, and we call it the three phases. The phases are focused on positive observability outcomes and the steps teams can take to achieve these goals.

The traditional three pillars of observability (logs, metrics, and distributed traces) are outdated: they focus too heavily on technical instrumentation and underlying data formats rather than on outcomes.

During each phase, the focus is on alleviating the customer impact — or remediating the problem — as fast as possible. Remediation is the act of alleviating the customer pain and restoring the service to acceptable levels of availability and performance. At each phase, the engineer is looking for enough information to remediate the issue, even if they don’t yet understand the root cause.

Phase 1: Know about the Problem

Knowing an issue is occurring is enough to trigger a remediation. For example, if you deploy a new version of a service and an alert triggers for that service, rolling back the deployment is the quickest path to remediating the issue without needing to understand the full impact or diagnose the root cause during the incident. Introducing changes to a system is the largest source of production issues, so knowing about problems as these changes are introduced is key.
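As a rough illustration of the rollback example above, the decision can be reduced to a single check: did the alert fire shortly after the deployment? This is a minimal sketch, not a prescribed policy. The function name `should_roll_back` and the 15-minute window are assumptions made for the example.

```python
from datetime import datetime, timedelta


def should_roll_back(deployed_at, alert_fired_at, window=timedelta(minutes=15)):
    """Return True when the alert fired within `window` after the deployment.

    The 15-minute default is an arbitrary, hypothetical value.
    """
    elapsed = alert_fired_at - deployed_at
    return timedelta(0) <= elapsed <= window


deployed_at = datetime(2022, 3, 1, 10, 0)

# Alert five minutes after the deploy: the deploy is the likely trigger.
soon_after = should_roll_back(deployed_at, datetime(2022, 3, 1, 10, 5))

# Alert an hour later, or before the deploy: rolling back is less obviously the fix.
much_later = should_roll_back(deployed_at, datetime(2022, 3, 1, 11, 0))
before_deploy = should_roll_back(deployed_at, datetime(2022, 3, 1, 9, 55))
```

The point of the sketch is that no root cause is needed to act: the timing of the change and the alert is enough to justify the rollback.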

Keys to success:

Fast alerting: Shrink the time between a problem occurring and a notification firing.
Scope notifications to just the teams that need to act: Scope the problem and route it to the right teams from the start.
Improve signal-to-noise ratio: Ensure that alerts are actionable.
Automate alert setup: Automated or templatized alerting can help engineers know about problems without a complicated setup process.
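One way to picture templatized alerting and team-scoped routing is sketched below. It assumes a hypothetical `AlertRule` type, made-up default thresholds, and invented service and team names. A real alerting system would have its own rule format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    service: str
    metric: str
    threshold: float
    owner_team: str  # the only team notified when this rule fires


def rules_from_template(service: str, owner_team: str) -> List[AlertRule]:
    """Stamp out a standard set of rules for a service (hypothetical defaults)."""
    defaults = {"error_rate": 0.05, "p99_latency_seconds": 1.0}
    return [AlertRule(service, metric, threshold, owner_team)
            for metric, threshold in defaults.items()]


def evaluate(rules: List[AlertRule],
             samples: Dict[Tuple[str, str], float]) -> List[Tuple[str, str]]:
    """Return (team, message) pairs for every rule whose sample exceeds its threshold."""
    notifications = []
    for rule in rules:
        value = samples.get((rule.service, rule.metric))
        if value is not None and value > rule.threshold:
            message = f"{rule.service}: {rule.metric}={value} exceeds {rule.threshold}"
            notifications.append((rule.owner_team, message))
    return notifications


# Two services get the same alert coverage without any hand-written rules.
rules = (rules_from_template("checkout", "payments-team")
         + rules_from_template("search", "discovery-team"))

# Latest metric samples, keyed by (service, metric); the values are invented.
samples = {("checkout", "error_rate"): 0.12, ("search", "error_rate"): 0.01}

notifications = evaluate(rules, samples)
```

Only the team that owns the failing service is notified, and each notification names the service, metric, and threshold, so it is actionable on its own.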

Tools and data:

Alerts
Metrics (native metrics as well as metrics generated from logs and traces)
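To make "metrics generated from logs" concrete, here is a minimal sketch that turns log lines into a per-minute error count for each service. The log format (timestamp, service, level, message separated by spaces) and the sample lines are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical log lines in the form: "<ISO timestamp> <service> <level> <message>"
log_lines = [
    "2022-03-01T10:00:05Z checkout ERROR payment gateway timeout",
    "2022-03-01T10:00:41Z checkout INFO order placed",
    "2022-03-01T10:00:59Z checkout ERROR payment gateway timeout",
    "2022-03-01T10:01:10Z search ERROR index unavailable",
]


def error_counts_per_minute(lines):
    """Count ERROR lines per (service, minute), producing a metric an alert can watch."""
    counts = Counter()
    for line in lines:
        timestamp, service, level, _message = line.split(" ", 3)
        if level == "ERROR":
            minute = timestamp[:16]  # keep "YYYY-MM-DDTHH:MM", drop the seconds
            counts[(service, minute)] += 1
    return counts


error_metric = error_counts_per_minute(log_lines)
```

The resulting counter is far smaller than the raw logs, which is what makes it practical to evaluate alert thresholds against it continuously.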

Phase 2: Triage the Problem

Understanding the scope of an issue can lead to remediation. For example, if you determine that only customers in one experiment group are impacted, turning off that experiment would likely remediate the issue.

To triage issues, engineers need to quickly put the alert in context: how many customers or systems are impacted, and to what degree. Great observability allows engineers to pivot the data and shine a spotlight on the contextualized data to diagnose issues.

Keys to success:

Contextualized dashboards: Alerts link directly to dashboards that show not only the source of the alert but also related contextual data.
High cardinality pivots: Letting engineers slice the data along many dimensions helps them isolate the problem further.
Leverage existing instrumentation: It’s not practical to assume that every use case is perfectly instrumented, so it’s important to make use of the instrumentation that already exists and to link those signals together as closely as possible for context.
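The experiment-group example from this phase can be sketched as a pivot over request events. The event fields (`region`, `experiment_group`, `ok`) and the sample data are hypothetical. The idea is to compute the same error rate along different dimensions and see which one separates the failures.

```python
from collections import defaultdict

# Hypothetical request events; each carries the dimensions we may pivot on.
requests = [
    {"region": "us-east", "experiment_group": "control", "ok": True},
    {"region": "us-east", "experiment_group": "new-checkout", "ok": False},
    {"region": "eu-west", "experiment_group": "control", "ok": True},
    {"region": "eu-west", "experiment_group": "new-checkout", "ok": False},
    {"region": "us-east", "experiment_group": "control", "ok": True},
    {"region": "eu-west", "experiment_group": "new-checkout", "ok": True},
]


def error_rate_by(events, dimension):
    """Group events by one dimension and return the error rate for each value."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for event in events:
        key = event[dimension]
        totals[key] += 1
        if not event["ok"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


by_region = error_rate_by(requests, "region")           # same rate in both regions
by_group = error_rate_by(requests, "experiment_group")  # failures sit in one group
```

In this made-up data the region pivot shows nothing unusual, while the experiment pivot points at a single group, which suggests turning that experiment off as the remediation.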

Tools and data:

Phase 3: Understand the Problem

Doing a post mortem on an incident is often an exercise in navigating a twisted web of dependencies and trying to determine which service owner you need to work with.

Great observability gives engineers a direct line of sight linking their metrics and alerts to the potential culprits. Additionally, it provides insights that can help fix underlying problems to prevent the recurrence of incidents.

Keys to success:

Easy understanding of service dependencies: Identifying the direct upstream and downstream dependencies of the service experiencing the active issue.
Ability to jump between tools and data types: For complex issues, you need to move repeatedly between the details given by logs and traces and the trends and outliers given by metrics on dashboards, ideally in a single tool.
Time to root cause: Sometimes root cause analysis during an incident is unavoidable. In those situations, having probable causes surface in alert notifications, or on dashboards during triage, reduces the time to root cause.
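Finding the direct dependencies of a service can be sketched with a simple call graph. The service names and the `calls` mapping below are invented for illustration. In practice this data would typically be derived from traces. Teams differ on which direction they call "upstream" and which "downstream", so the sketch uses the neutral terms callers and callees.

```python
# Hypothetical call graph: each service maps to the services it calls directly.
calls = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def callees_of(service):
    """Services that `service` calls directly."""
    return sorted(calls.get(service, []))


def callers_of(service):
    """Services that call `service` directly."""
    return sorted(caller for caller, callees in calls.items() if service in callees)


# If "checkout" is alerting, these are the first places to look.
checkout_callees = callees_of("checkout")
checkout_callers = callers_of("checkout")

# If "inventory" is unhealthy, these services are likely to feel it.
inventory_callers = callers_of("inventory")
```

Restricting the lookup to direct neighbors keeps the list short enough to act on during an incident. The full transitive graph can be explored later in the post mortem.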

Tools and data:

Traces
Logs
Metrics
Dashboards

Conclusion

Great observability can lead to competitive advantage, world-class customer experiences, faster innovation, and happier developers. But organizations can’t achieve great observability by just focusing on the input and data (three pillars). By focusing on the three phases and the outcomes outlined here, teams can achieve the promise of great observability.

Source: InApps.net

Phu Nguyen
