Observability is the ability to measure the internal state of a system by examining its outputs. Monitoring incoming calls, error scenarios, and traces is critical to ensuring the system is running properly.

Dashboards

Several dashboards exist for monitoring BSR health:

  • slo-http - For monitoring all HTTP traffic, with only cluster groupings.
  • slo-rpc - For monitoring RPC traffic, with cluster, rpc_service, and rpc_method groupings.
  • slo-registry - For monitoring BSR registry endpoints (maven, go, npm, swift, other), with cluster and request_type groupings.

These dashboards, included with the BSR, expose its overall health and can aid in identifying and diagnosing operational issues. Several of their charts and components are described in more detail below.

Service Level Objectives

The dashboards are primarily defined in terms of service level objectives (SLOs).

Each dashboard has a success rate and a latency objective, and the dashboards keep track of when those objectives are met or missed. The success rate objective is currently 99.5% for all dashboards, and the latency objectives can be viewed in the Config panel of each dashboard. Generally, the dashboards consider a request failed if it returns an unsuccessful error code (e.g. 5xx or the Connect RPC equivalent) or if it exceeds its latency target.
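As an illustration of how these objectives combine, here is a minimal Python sketch. The status codes, latencies, and latency target are made up for the example; the dashboards derive these numbers from recorded metrics, not from an in-process list of requests.

  SUCCESS_RATE_OBJECTIVE = 0.995    # 99.5% for all dashboards
  LATENCY_TARGET_SECONDS = 1.0      # hypothetical; real targets live in the Config panel

  def is_failed(status_code: int, latency_seconds: float) -> bool:
      # A request counts against the SLO if it errors or is too slow.
      return status_code >= 500 or latency_seconds > LATENCY_TARGET_SECONDS

  observed = [(200, 0.12), (503, 0.05), (200, 1.40), (200, 0.30)]  # (status, latency)
  failed = sum(is_failed(code, latency) for code, latency in observed)
  availability = 1 - failed / len(observed)
  print(f"availability={availability:.1%}, objective met: {availability >= SUCCESS_RATE_OBJECTIVE}")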

Filters

Filters can be set to restrict the set of data the dashboard operates on. Setting cluster is important for sites with multiple BSR clusters. Other filters can be set based on which SLO dashboard is being used:

  • slo-http: cluster
  • slo-rpc: cluster, rpc_method, and rpc_service
  • slo-registry: cluster and request_type

High Level Information

The first two rows of the dashboard contain high level information to quickly provide the current operational status of the BSR.

First Row Boxes

The Availability, Error Budget, Slow Requests, Failed Requests, and Requests boxes display aggregate counts that give a high-level overview of what is happening with the BSR over the trailing SLO evaluation window.

SLO Table

This panel lists each observed grouping and its availability. The lowest-availability groupings are listed first to call out any parts of the BSR that may be having issues.

Error Budget Percentage

This panel shows the aggregate (based on filters) error budget burn down for the BSR. Each point on this chart shows the error budget used up over the trailing SLO evaluation window.
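The underlying arithmetic is straightforward: the error budget is the fraction of requests allowed to fail (1 minus the objective), and the chart tracks how much of it has been spent. The sketch below uses hypothetical request counts and is a general burn-down formula, not necessarily the exact query the panel runs.

  OBJECTIVE = 0.995
  ALLOWED_FAILURE_RATE = 1 - OBJECTIVE          # 0.5% of requests may fail

  total_requests = 200_000                      # hypothetical trailing-window totals
  failed_requests = 450

  failure_rate = failed_requests / total_requests
  budget_used = failure_rate / ALLOWED_FAILURE_RATE
  print(f"error budget used: {budget_used:.1%}")  # -> 45.0%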

Other Dashboard Components

Most of the other dashboard panels have titles that explain exactly what they display; those that are less obvious are clarified below.

Firing Alerts

Any alerts that are firing will be visible in this panel.

Config

This panel shows internal configuration for the dashboard. It currently cannot be changed by end users.

Alerts Overview

Alerts are configured for every grouping on the dashboard in each deployed BSR cluster. Each grouping has an error budget based on an availability objective of 99.5% over the previous four weeks. An alert fires if a given method returns errors at a rate that threatens the 99.5% objective. There are two classes of alerts, defined by how rapidly they respond to errors.

High Priority Alerts respond swiftly, signalling immediate threats to the system. For these alerts to fire, 50% of the error budget must be consumed in the last hour and the method needs to have received at least 10 requests in that period.

Low Priority Alerts are designed for longer durations, capturing potential issues without responding to minor fluctuations. For these alerts to fire, 10% of the error budget must be consumed in the last 24 hours and the method needs to have had at least 10 requests in that period.

Simply put, alerts are triggered if errors accumulate too quickly over the past hour or the past 24 hours.
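Restated as code, the two conditions look roughly like this. The threshold values come from the descriptions above; the function names and inputs are hypothetical.

  MIN_REQUESTS = 10

  def high_priority_fires(budget_used_last_hour: float, requests_last_hour: int) -> bool:
      # 50% of the error budget consumed in the last hour, with enough traffic to matter.
      return budget_used_last_hour >= 0.50 and requests_last_hour >= MIN_REQUESTS

  def low_priority_fires(budget_used_last_24h: float, requests_last_24h: int) -> bool:
      # 10% of the error budget consumed in the last 24 hours, with enough traffic to matter.
      return budget_used_last_24h >= 0.10 and requests_last_24h >= MIN_REQUESTS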

Diagnosing Alerts

When an alert fires, it's accompanied by specific labels to aid in your investigation: cluster, rpc_method, rpc_service, and request_type, depending on which dashboard is being viewed. These labels highlight the affected areas, helping you pinpoint the problem's origin.

Once an alert is triggered:

  1. Firing Alerts Panel: Here, you can view all active alerts, including the one that notified you.
  2. Error Rate Diagram: Look for spikes in this chart, giving you a visual representation of when and where issues arose.
  3. Drill Down: Use the provided labels (cluster, rpc_method, rpc_service, and request_type) to refine the dashboard filters and focus on the affected areas.

By following these steps and utilizing the dashboard, you can swiftly identify, understand, and address any issues in your system.

Grafana Import

Steps for importing the dashboard in Grafana:

  1. From the Dashboards menu, select Import dashboard
  2. Upload the dashboard .json file, e.g. grafana-slo-dashboards.json
  3. Select a datasource for the dashboard
  4. Click the "Import (Overwrite)" button
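If you prefer to script the import rather than use the UI, a sketch along the following lines works against Grafana's dashboard API. The URL and token are placeholders, the payload assumes the JSON file contains a single dashboard definition, and, unlike the UI flow, the API does not prompt for a datasource, so datasource references in the JSON may need adjusting.

  import json
  import requests  # third-party HTTP client

  GRAFANA_URL = "https://grafana.example.com"    # placeholder
  API_TOKEN = "<grafana-service-account-token>"  # placeholder

  with open("grafana-slo-dashboards.json") as f:
      dashboard = json.load(f)

  # POST /api/dashboards/db creates or updates a dashboard; overwrite=True
  # mirrors the "Import (Overwrite)" button in the UI.
  response = requests.post(
      f"{GRAFANA_URL}/api/dashboards/db",
      headers={"Authorization": f"Bearer {API_TOKEN}"},
      json={"dashboard": dashboard, "overwrite": True},
  )
  response.raise_for_status()
  print(response.json())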