
Announcing SLO alerting

Today we are extremely excited to announce support for Service Level Objective (SLO) alerts in Capture. SLO alerts are an industry best practice for operating reliable systems and they are now available as part of the bitdrift Capture dynamic observability system.

SLOs are a method of applying rigor to an organization's reliability objectives. An example of an SLO is: 99.9% of app start Time To First Interaction (TTFI) should be less than 5s. SLOs operate over windows (the most common being 7 or 30 days) and effectively allow for a certain amount of failure as part of normal operation. This allowed failure is known as the error budget. For a much deeper dive on what SLOs are and how to use them, see the chapter on implementing SLOs in the Google SRE book.

Once SLOs are defined, they need to be alerted on. This may sound simple, but it is deceptively complicated, as evidenced by the length of the accompanying chapter on alerting on SLOs in the Google SRE book.

Up to now, Capture has only supported “basic” alerts on synthetic metric time series data. For example: alert when a synthetic counter is greater than 100 for 10 consecutive aggregation periods. With this release, it is now possible to define SLOs within Capture and then alert on them. The following functionality is supported:
  1. 7 or 30 day SLOs.
  2. SLOs are supported on synthetic rate charts where the rate is assumed to be good events / total events. For example, the rate of TTFI < 5s as described in the example above.
  3. We have implemented Multiwindow, Multi-Burn-Rate Alerts (MWMBR) as defined in part 6 of the Google SRE book chapter on alerting on SLOs. This is the optimal method of alerting, as it balances good precision (how many alerts are significant) with good recall (how many significant events are detected).
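
To make the error budget and MWMBR ideas concrete, here is a minimal Python sketch of the arithmetic. It is not Capture's implementation or API: every name in it is hypothetical, and the window table uses the example values for a 30 day, 99.9% SLO from the Google SRE book chapter, which may differ from the defaults Capture ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    """One MWMBR rule: a long window paired with a shorter confirmation window."""
    long_hours: float
    short_hours: float
    burn_rate: float  # alert threshold, as a multiple of the "even" burn rate


# Example values for a 30 day SLO from the Google SRE book chapter.
# Assumption: these are NOT necessarily the defaults Capture provides.
EXAMPLE_30_DAY_WINDOWS = (
    BurnRateWindow(long_hours=1, short_hours=5 / 60, burn_rate=14.4),
    BurnRateWindow(long_hours=6, short_hours=0.5, burn_rate=6.0),
    BurnRateWindow(long_hours=72, short_hours=6, burn_rate=1.0),
)


def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 spends it exactly over the SLO period."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = 1.0 - good / total
    return error_rate / error_budget(slo_target)


def budget_consumed(rate: float, window_hours: float, slo_days: int) -> float:
    """Fraction of the whole error budget spent by burning at `rate` for `window_hours`."""
    return rate * window_hours / (slo_days * 24)


def should_alert(long_burn: float, short_burn: float, window: BurnRateWindow) -> bool:
    """Fire only when BOTH windows exceed the threshold.

    The long window establishes that the burn is significant; the short
    window confirms that it is still happening right now.
    """
    return long_burn > window.burn_rate and short_burn > window.burn_rate


if __name__ == "__main__":
    # 150 of 10,000 app starts missed the 5s TTFI target in the last hour.
    hourly = burn_rate(good=9850, total=10000, slo_target=0.999)
    # Assume the last 5 minutes look the same as the last hour.
    print(should_alert(hourly, hourly, EXAMPLE_30_DAY_WINDOWS[0]))
```

In this sketch, a burn rate of 14.4 sustained for one hour spends 14.4 × 1 / 720 = 2% of a 30 day error budget, which is why that threshold is paired with the shortest window. Requiring the short window to agree is what lets the alert stop firing soon after the problem is fixed.
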
The downside of MWMBR is that there are a lot of parameters to specify and if you are not an SLO expert it can be hard to reason about what parameters to use. To help with this, we provide default MWMBR windows for both 7 and 30 day SLOs that should be a good starting point for most users. The alerts can be modified from there if need be.

Capture is changing the mobile observability game by adding a control plane and local storage on every mobile device, providing extremely detailed telemetry when you need it, and none when you don’t. If the lack of SLO alerting was keeping you away, now is the time to give us a try!

Interested in learning more? Check out the sandbox to get a hands-on feel for what working with Capture is like or get in touch with us for a demo. Please join us in Slack as well to ask questions and give feedback!

