Observability is the ability to infer the internal state of a system by examining its outputs. Monitoring incoming calls, error scenarios, and traces is critical to ensuring that the system is running properly.
Dashboards
Several dashboards exist for monitoring BSR health:
- slo-http - For monitoring all HTTP traffic, with only cluster groupings.
- slo-rpc - For monitoring RPC traffic, with cluster, rpc_service, and rpc_method groupings.
- slo-registry - For monitoring BSR registry endpoints. Registry traffic (maven, go, npm, swift, other) can be shown in this dashboard.
These dashboards, which are included with the BSR, expose its overall health and can aid in identifying and diagnosing operational issues. Several charts and components on the dashboards are described in more detail below.
Service Level Objectives
The dashboards are primarily defined in terms of service level objectives (SLOs).
Each dashboard has a success rate objective and a latency objective, and the dashboards keep track of when those objectives are met or missed. The success rate objective is currently 99.5% for all dashboards, and the latency objectives can be viewed in the config panel of the dashboards. Generally, the dashboards consider a request failed if it returns an error status code (e.g. 5xx or the Connect RPC equivalent) or if it exceeds its latency target.
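The failure rule described above can be illustrated with a minimal Python sketch. It is not the dashboards' actual implementation: the Request type, the function names, and the 1.0 second latency target used in the usage note are hypothetical, and only HTTP 5xx codes are handled (the Connect RPC equivalents are left out). Treating an empty window as fully successful is also an assumption.

```python
from dataclasses import dataclass

SUCCESS_RATE_OBJECTIVE = 0.995  # 99.5%, the objective stated in this section


@dataclass
class Request:
    status_code: int        # HTTP status code of the response
    latency_seconds: float  # observed request duration


def is_failed(request: Request, latency_target_seconds: float) -> bool:
    """Return True if the request counts against the SLO."""
    server_error = 500 <= request.status_code <= 599
    too_slow = request.latency_seconds > latency_target_seconds
    return server_error or too_slow


def success_rate(requests: list[Request], latency_target_seconds: float) -> float:
    """Share of requests that neither errored nor exceeded the latency target."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    failed = sum(is_failed(r, latency_target_seconds) for r in requests)
    return 1 - failed / len(requests)
```

For example, with a hypothetical 1.0 second target, a 503 response and a 200 response that took 2.0 seconds would both count as failed, while a fast 404 would not. Comparing success_rate(...) against SUCCESS_RATE_OBJECTIVE then tells you whether the objective was met for that set of requests.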
Filters
Filters can be set to restrict the set of data the dashboard operates on. Setting cluster is important for sites with multiple BSR clusters. Other filters can be set based on which SLO dashboard is being used:
- slo-http: cluster
- slo-rpc: cluster, rpc_method, and rpc_service
- slo-registry: cluster and request_type
High Level Information
The first two rows of the dashboard contain high-level information that quickly conveys the current operational status of the BSR.
First Row Boxes
The Availability, Error Budget, Slow Requests, Failed Requests, and Requests boxes display aggregate counts that give a high-level overview of what is happening with the BSR over the trailing SLO evaluation window.
SLO Table
This panel lists each grouping that has been observed and its availability. The groupings with the lowest availability are listed first to call out any parts of the BSR that may be having issues.
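The ordering used by this table can be sketched in a few lines of Python. This is only an illustration: the slo_table function, the input shape, and the service and method names in the usage note are hypothetical, and availability is assumed to be the share of requests that did not fail.

```python
def slo_table(
    counts: dict[tuple[str, ...], tuple[int, int]],
) -> list[tuple[tuple[str, ...], float]]:
    """Map each grouping to its availability, lowest availability first.

    `counts` maps a grouping (for example cluster, rpc_service, rpc_method)
    to a (total_requests, failed_requests) pair.
    """
    rows = []
    for grouping, (total, failed) in counts.items():
        if total == 0:
            continue  # a grouping with no traffic has no availability to report
        rows.append((grouping, 1 - failed / total))
    # Sort ascending so the least available groupings appear at the top.
    return sorted(rows, key=lambda row: row[1])
```

Calling it with counts such as {("c1", "svc.A", "Get"): (1000, 1), ("c1", "svc.B", "Put"): (200, 20)} would put the svc.B grouping first, since it has the lower availability.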
Error Budget Percentage
This panel shows the aggregate (based on filters) error budget burn down for the BSR. Each point on this chart shows the error budget used up over the trailing SLO evaluation window.
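As a rough worked example of what "error budget used" can mean, the sketch below assumes the budget is the number of failures the 99.5% objective permits for the traffic seen in the window. The function name and this exact definition are assumptions for illustration, not a description of the dashboard's internal query.

```python
def error_budget_used(
    total_requests: int,
    failed_requests: int,
    objective: float = 0.995,
) -> float:
    """Fraction of the error budget consumed over a window (1.0 means fully spent)."""
    if total_requests == 0:
        return 0.0  # assumption: no traffic consumes no budget
    # With a 99.5% objective, 0.5% of requests are allowed to fail.
    allowed_failures = (1 - objective) * total_requests
    return failed_requests / allowed_failures
```

Under these assumptions, 1,000,000 requests in the window allow about 5,000 failures, so 2,500 failed requests would use up roughly half of the budget.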
Other Dashboard Components
Most of the other dashboard panels have self-explanatory titles that describe exactly what they display. The rest are clarified below.
Firing Alerts
Any alerts that are firing will be visible in this panel.
Config
Internal configuration for the dashboard. It currently cannot be changed by end users.
Alerts Overview
Alerts are configured for every grouping for the dashboard in each deployed BSR cluster. Each grouping has an error budget based on an availability objective of 99.5% over the previous four weeks. An alert will fire if a given method returns errors at a rate that threatens the 99.5% objective. There are two classes of alerts defined by how rapidly they notify of errors.
High Priority Alerts respond swiftly, signalling immediate threats to the system. For these alerts to fire, 50% of the error budget must be consumed in the last hour and the method needs to have received at least 10 requests in that period.
Low Priority Alerts are designed for longer durations, capturing potential issues without responding to minor fluctuations. For these alerts to fire, 10% of the error budget must be consumed in the last 24 hours and the method needs to have had at least 10 requests in that period.
Simply put, if errors accumulate too quickly within the past one hour or 24 hours, alerts are triggered.
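The two alert classes can be expressed as a small Python sketch. The thresholds (50% in one hour, 10% in 24 hours, at least 10 requests) come from the text above. Everything else is an assumption made for illustration: the AlertRule type, the should_fire function, and in particular the interpretation that "budget consumed" means failures in the alert window divided by the failures allowed over the four-week window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    window_hours: int       # how far back the rule looks
    budget_fraction: float  # share of the four-week error budget
    min_requests: int       # minimum traffic in the window for the rule to apply


HIGH_PRIORITY = AlertRule("high", window_hours=1, budget_fraction=0.50, min_requests=10)
LOW_PRIORITY = AlertRule("low", window_hours=24, budget_fraction=0.10, min_requests=10)


def should_fire(
    rule: AlertRule,
    window_requests: int,
    window_failures: int,
    four_week_requests: int,
    objective: float = 0.995,
) -> bool:
    """Return True if the rule's budget threshold is reached in its window."""
    if window_requests < rule.min_requests:
        return False  # too little traffic to alert on
    # Assumed budget: failures permitted by the objective over four weeks.
    budget = (1 - objective) * four_week_requests
    if budget <= 0:
        return False
    return window_failures / budget >= rule.budget_fraction
```

For instance, with 100,000 requests over four weeks the assumed budget is about 500 failures, so 300 failures in the last hour would trip the high priority rule, and 60 failures in the last 24 hours would trip the low priority rule.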
Diagnosing Alerts
When an alert fires, it's accompanied by specific labels to aid in your investigation: cluster, rpc_method, rpc_service, and request_type, depending on which dashboard is being viewed. These labels highlight the affected areas, helping you pinpoint the problem's origin.
Navigating the Dashboard
Once an alert is triggered:
- Firing Alerts Panel: Here, you can view all active alerts, including the one that notified you.
- Error Rate Diagram: Look for spikes in this chart, giving you a visual representation of when and where issues arose.
- Drill Down: Use the provided labels (cluster, rpc_method, request_type, and rpc_service) to refine your search in the dashboard and focus on the affected areas.
By following these steps and utilizing the dashboard, you can swiftly identify, understand, and address any issues in your system.
Grafana Import
Steps for importing the dashboard in Grafana:
- From the Dashboards menu, select Import dashboard
- Upload the dashboard .json file, e.g. grafana-slo-dashboards.json
- Select a datasource for the dashboard
- Click the "Import (Overwrite)" button
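If you prefer to script the import rather than use the UI, one possible approach is Grafana's HTTP API endpoint for creating or updating a dashboard (POST /api/dashboards/db). The Python sketch below only builds the request. The build_import_request function is hypothetical, and it assumes the JSON file holds a single dashboard object that already references a valid datasource, because this endpoint does not perform the datasource selection step from the UI flow.

```python
import json
import urllib.request


def build_import_request(
    grafana_url: str,
    api_token: str,
    dashboard: dict,
) -> urllib.request.Request:
    """Build a POST request that creates the dashboard or overwrites an existing one."""
    payload = {
        # Clearing "id" lets Grafana match an existing dashboard by uid or create a new one.
        "dashboard": {**dashboard, "id": None},
        # Mirrors the "Import (Overwrite)" button in the UI steps.
        "overwrite": True,
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
```

To send it, load the dashboard with json.load from a file such as grafana-slo-dashboards.json, pass the result to build_import_request along with your Grafana base URL and a service account token, and hand the returned request to urllib.request.urlopen.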